Model architecture: Masked Language Model
Model release dates: See release notes
Model sizes: Shrimp, Seal
Model Card Author(s): Cohere Safety Team & Responsibility Council
We report the average performance metric across SentEval's Similarity, Probing, and Downstream Tasks categories.
|Model|Similarity Tasks|Downstream Tasks|Probing Tasks|
Model performance is currently reported only on English benchmarks. Multilingual benchmarks will be reported in the future.
Embeddings may be used for purposes such as estimating semantic similarity between two sentences, choosing a sentence which is most likely to follow another sentence, sentiment analysis, topic extraction, or categorizing user feedback. Performance of embeddings will vary across use cases depending on the language, dialect, subject matter, and other qualities of the represented text.
Example: Embeddings are used in the Similarity function to determine that "Hello! How are you?" is more similar to "Hey, how’s it going?" than it is to "It is nice to meet you" or "Goodbye!". For another Representation model example, see this tutorial on how to use Similarity for sentiment analysis.
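The ranking in the example above can be sketched in a few lines. This is an illustration only: the source does not state which metric the Similarity function uses, so cosine similarity is an assumption here, and the three-dimensional vectors are made-up stand-ins for real embeddings. The names `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, not part of the Cohere API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(anchor, targets):
    """Return target texts sorted from most to least similar to the anchor."""
    return sorted(
        targets,
        key=lambda text: cosine_similarity(anchor, targets[text]),
        reverse=True,
    )

# Made-up toy vectors; real embeddings would come from the model.
anchor_vec = [0.9, 0.1, 0.2]  # stands in for "Hello! How are you?"
toy_targets = {
    "Hey, how's it going?": [0.8, 0.2, 0.3],
    "It is nice to meet you": [0.4, 0.7, 0.1],
    "Goodbye!": [0.1, 0.2, 0.9],
}

ranking = rank_by_similarity(anchor_vec, toy_targets)
print(ranking)
```

With these toy values the greeting ranks first and the farewell last, mirroring the example; real embeddings may order texts differently.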
Always refer to the Usage Guidelines for guidance on using the Cohere API responsibly. Additionally, please consult the following model-specific usage notes:
There is extensive research into the social biases learned by language model embeddings (Bolukbasi et al., 2016; Manzini et al., 2019; Kurita et al., 2019; Zhao et al., 2019). We recommend that developers using the Representation model take this into account when building downstream text classification systems. Embeddings may inadvertently capture inaccurate associations between groups of people and attributes such as sentiment or toxicity. Using embeddings in downstream text classifiers may lead to biased systems that are sensitive to demographic groups mentioned in the inputs. For example, it is dangerous to use embeddings or Similarity outputs in CV ranking systems due to known gender biases in the representations (Kurita et al., 2019).
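One way a developer might probe a downstream classifier for this kind of sensitivity is a substitution check: score the same sentence with only the group mention changed and compare the results. The sketch below assumes a hypothetical scoring function; `toy_score` is a deliberately skewed stand-in for a real classifier, and `sensitivity_gaps`, the template, and the 0.05 tolerance are illustrative choices rather than anything prescribed by the source.

```python
def sensitivity_gaps(score_fn, template, group_terms):
    """Score one template per group term and return each score's gap from the mean."""
    scores = {term: score_fn(template.format(group=term)) for term in group_terms}
    mean = sum(scores.values()) / len(scores)
    return {term: score - mean for term, score in scores.items()}

def toy_score(text):
    # Hypothetical stand-in for a downstream classifier built on embeddings.
    # It is skewed on purpose so the check has something to find.
    return 0.7 if "group b" in text.lower() else 0.5

gaps = sensitivity_gaps(
    toy_score,
    "Feedback from {group}: the service was fine.",
    ["group A", "group B"],
)

# Flag any term whose score differs from the mean by more than the tolerance.
flagged = [term for term, gap in gaps.items() if abs(gap) > 0.05]
print(flagged)
```

A check like this can only surface sensitivity for the templates and terms that were tried; passing it does not show that a system is unbiased.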
- English only: The model provides meaningful representations for English text only.
- Distributional shift: Embeddings capture the state of the training data at the time it was scraped. Downstream classifiers will need to be validated or retrained upon release of new embedding models to ensure that they are still serving their intended purpose.
- Longer texts: embed outputs are the aggregation of contextualized word embeddings; hence, the embeddings of longer inputs may not capture the meaning accurately across the entire sequence length.
- Varying text length: Similarity performance may vary when using targets spanning a wide range of lengths.
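The "Longer texts" limitation can be illustrated with a toy calculation. The source says only that outputs are an aggregation of contextualized word embeddings; the sketch below assumes that aggregation is a simple mean, which may not match the actual model. `mean_pool` and the two-dimensional vectors are hypothetical.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

signal = [1.0, 0.0]  # toy vector for one distinctive token
filler = [0.0, 1.0]  # toy vector shared by every other token

short_text = mean_pool([signal] + [filler] * 3)   # 4 tokens in total
long_text = mean_pool([signal] + [filler] * 99)   # 100 tokens in total

print(short_text)  # [0.25, 0.75]
print(long_text)   # [0.01, 0.99]
```

Under this assumption, the distinctive token's share of the pooled vector falls from 0.25 to 0.01 as the input grows from 4 to 100 tokens, which is one reason a long input's embedding may under-represent parts of its content.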
Guided by the NAACL Ethics Review Questions, we describe below the model-specific concerns around misuse of the Representation model. By documenting adverse use cases, we aim to encourage Cohere and its customers to prevent adversarial actors from leveraging our models to the following malicious ends.
- Extraction of identity and demographic information: Using embeddings to classify the group identity or demographics of text authors or persons mentioned in a text. Group identification and private information should be consensually provided by individuals and not inferred by any automatic system.
- Building purposefully opaque text classification systems: Algorithmic decisions that significantly affect people should be explainable to the persons affected; however, text classifications made using representations may not be explainable. A malicious actor may take advantage of this opacity to shield themselves from accountability for algorithmic decisions that may have disparate impact across demographic groups (Campolo and Crawford, 2020).
- Human-outside-the-loop: Building downstream classifiers that serve as automated decision-making systems with real-world consequences for people, where those decisions are made without a human in the loop.
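As a contrast to the human-outside-the-loop pattern, a downstream system could route predictions through a review gate. The following is a minimal sketch under stated assumptions: `route_decision`, the `high_stakes` flag, and the 0.9 confidence threshold are hypothetical and would need to be set by the team responsible for the system.

```python
def route_decision(confidence, high_stakes, threshold=0.9):
    """Send a prediction to a person unless it is both low-stakes and confident."""
    if high_stakes or confidence < threshold:
        return "human_review"
    return "auto"

# Decisions that significantly affect people always go to a reviewer.
print(route_decision(confidence=0.99, high_stakes=True))
# Low-stakes but uncertain predictions are also reviewed.
print(route_decision(confidence=0.50, high_stakes=False))
# Only low-stakes, confident predictions are handled automatically.
print(route_decision(confidence=0.95, high_stakes=False))
```

The gate treats classifier confidence as a secondary signal: in this sketch no confidence level is high enough to bypass review when the decision affects a person significantly.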