Dataset Developers: Cohere Infrastructure Team
Data Statement Authors: Cohere Safety Team & Responsibility Council
Size: ~200GB filtered, ~3TB unfiltered
The unfiltered dataset is used to train Representation models that reflect the world, including its current harms and biases. This enables them to be effective for use cases such as content moderation.
The filtered dataset is used to train the Generation models to complete sentences based on a prompt while minimizing harmful generations. Our use of filtered training data is motivated by Bender et al.'s (2021) observation that uncurated data used to train language models encodes the dominant view, further harming people at the margins. Cohere continues to invest significant resources in dataset curation to prevent harm.
This model is trained on the Google Books dataset, CommonCrawl, and text from the internet scraped by the Cohere infrastructure team. The top ten domains scraped by our team include:
wordpress.com, medium.com, stackexchange.com, tumblr.com, elsevier.com, genius.com, bbc.co.uk, libsyn.com, yahoo.com, nytimes.com
The scraped data is similar in composition to many other large, Internet-sourced language modeling datasets, and hence reflects perspectives that skew young, white, and male (Bender et al., 2021). Language models trained on such data encode the hegemonic viewpoint; Jo and Gebru (2021) detail issues and solutions around this topic in depth. Enhancing the diversity of our training data is a top priority as we continue to iterate on our data collection process.
Filtering harmful, biased, or otherwise undesirable documents from training data can improve language model performance (Raffel et al., 2020) and reduce the chances of the model perpetuating harm. However, doing so with precision is critical so that we do not silence marginalized voices (Bender et al., 2021).
With these considerations in mind, we designed a document curation process which aims to minimize undesirable text within our training data. The best way to do this is an active area of research within Cohere and the broader machine learning research community (Sharoff, 2020). As Cohere learns more about the types of harm large language models exhibit, it will adapt the composition of its datasets accordingly.
We recognize the dangers of using a blockword list (i.e., removing any document that contains a word from a list of selected terms). Our filtration techniques are designed to retain counterspeech by taking language and context into account in a nuanced way. For example, we do not want to remove documents addressing racism, but we do want to filter out racist texts. An example of a harm filtration technique we use has been published on arXiv.
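To make the blockword failure mode concrete, the following is a minimal, purely illustrative sketch of naive blocklist filtering; it is not our production filtration system, and the term list, function name, and tokenization are hypothetical:

```python
import re

# Hypothetical, illustrative blocklist -- not a real production list.
BLOCKLIST = {"racist"}

def keep_document(doc: str) -> bool:
    """Return True if the document passes the naive blockword filter.

    The filter drops any document containing a listed term, with no
    regard for how the term is used.
    """
    tokens = set(re.findall(r"[a-z']+", doc.lower()))
    return BLOCKLIST.isdisjoint(tokens)

# Counterspeech that discusses racism is rejected alongside racist text,
# which is exactly the failure mode context-aware filtration avoids.
print(keep_document("A guide to identifying and countering racist rhetoric."))  # False
print(keep_document("A recipe for sourdough bread."))  # True
```

Because the filter matches surface forms rather than meaning, it cannot distinguish a document about racism from a racist document; this is why context-aware techniques are needed.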
We currently train our language models on English documents only, and model performance is evaluated on English benchmarks. The heuristics we use to detect non-English text during document curation are imperfect and other languages may still remain in the dataset. Multilingual datasets and benchmarks will be incorporated into future iterations of our data and evaluation pipelines.
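As a hypothetical illustration of why such heuristics are imperfect (this is not our actual detection pipeline; the function and threshold are invented for the example), consider a crude script-based check:

```python
def is_probably_english(text: str, threshold: float = 0.9) -> bool:
    """Crude heuristic: the fraction of alphabetic characters that are ASCII.

    Non-Latin scripts (Cyrillic, CJK, ...) fail quickly, but many
    Latin-script languages pass, so non-English text can slip through --
    the kind of imperfection noted above.
    """
    alpha = [c for c in text if c.isalpha()]
    if not alpha:
        return False
    ascii_alpha = sum(c.isascii() for c in alpha)
    return ascii_alpha / len(alpha) >= threshold

print(is_probably_english("The quick brown fox."))         # True
print(is_probably_english("Это предложение на русском."))  # False
print(is_probably_english("Le renard brun est rapide."))   # True -- a false positive
```

The French sentence passing the check shows how Latin-script languages evade simple heuristics, which is why some non-English documents remain in the dataset.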