Data Statement

Dataset Name: coheretext-{filtered, unfiltered}
Dataset Version: 0.1
Dataset Creation Date: March 2021
Dataset Developers: Cohere Infrastructure Team
Data Statement Author: Cohere Safety Team & Responsibility Council
Size: ~200GB filtered, ~3TB unfiltered

Overview of Training Datasets

The unfiltered dataset is used to train Representation models that reflect the world, including its current harms and biases. This enables them to be effective for harmful and toxic text detection, content moderation, and more.

The filtered dataset is used to train the Generation models to complete sentences based on a prompt while minimizing harmful generations. Our use of filtered training data is motivated by the observation of Bender et al. (2021) that uncurated data used to train language models encodes the dominant view, further “harming people at the margins.” Cohere continues to invest significant resources in dataset curation to prevent harm.

Document Collection

Our models are trained on the Google Books dataset, CommonCrawl, and text from the internet scraped by the Cohere infrastructure team.

  • The top ten domains scraped by our team include: wordpress.com, medium.com, stackexchange.com, tumblr.com, elsevier.com, genius.com, bbc.co.uk, libsyn.com, yahoo.com, nytimes.com

Source Demographics

The scraped data is similar in composition to many other large, Internet-sourced language modeling datasets, and hence reflects perspectives that skew young, white, and male (Bender et al., 2021). Language models trained on such data encode the hegemonic viewpoint; Jo and Gebru (2021) detail issues and solutions around this topic in depth. Investigating and addressing these concerns in later iterations of our training data is a top priority.

Document Curation

Filtering harmful, biased, or otherwise noisy documents from training data can improve language model performance (Raffel et al., 2020) and reduce the chances of the model perpetuating harm. However, doing so with precision is critical so that we do not silence marginalized voices (Bender et al., 2021).

With these considerations in mind, we designed a document curation process which aims to minimize undesirable text within our training data. The best way to do this is an active area of research within Cohere and the broader machine learning research community (Sharoff, 2020). As Cohere learns more about the types of harm large language models exhibit, it will adapt the composition of its datasets accordingly.

More Nuanced than Blockwords

We recognize the dangers of using a blockword list (i.e., removing any document that contains a word from a list of selected words). Our filtration techniques are designed to retain counterspeech by taking language and context into account in a nuanced way. For example, we do not want to remove documents addressing racism, but we do want to filter out racist texts.
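
To make the contrast concrete, the toy sketch below illustrates the naive blockword approach we avoid. The blockword list, example document, and function name are hypothetical illustrations only and do not describe Cohere's filtration technique.

```python
# Illustrative sketch only: a naive blockword filter of the kind described
# above, which our curation process deliberately avoids. The blockword list
# and document are hypothetical examples, not part of Cohere's pipeline.

BLOCKWORDS = {"racism", "racist"}  # hypothetical blockword list


def naive_blockword_filter(document: str) -> bool:
    """Return True if a blockword filter would drop this document."""
    tokens = {token.strip(".,;:!?\"'").lower() for token in document.split()}
    return bool(tokens & BLOCKWORDS)


# A document *addressing* racism (counterspeech) is dropped alongside
# genuinely harmful text -- exactly the failure mode that context-aware
# filtration is meant to avoid.
counterspeech = "This essay explains how readers can recognize and challenge racism."
print(naive_blockword_filter(counterspeech))  # True: counterspeech would be lost
```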

Note: Our current harm filtration technique is under review for publication, and will be described here once it has been published.

Language Filtration

We currently train our language models on English documents only, and model performance is evaluated on English benchmarks. The heuristics we use to detect non-English text during document curation are imperfect and other languages may still remain in the dataset. Multilingual datasets and benchmarks will be incorporated into future iterations of our data and evaluation pipelines.
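
As a general illustration of heuristic language filtration (not a description of our production heuristics), the sketch below uses the open-source langdetect package to keep only documents labeled English; short or ambiguous text shows one reason such heuristics are imperfect.

```python
# A minimal sketch of heuristic language filtration, assuming the open-source
# `langdetect` package (pip install langdetect). Illustrative only; this is
# not Cohere's production heuristic.

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException


def keep_english_only(documents):
    """Yield documents that the language-ID heuristic labels as English."""
    for doc in documents:
        try:
            if detect(doc) == "en":
                yield doc
        except LangDetectException:
            # Very short or ambiguous text cannot be classified reliably --
            # one reason non-English text can slip through such heuristics.
            continue


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
]
print(list(keep_english_only(docs)))  # keeps only the English sentence
```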