Representation

Model Description#

The model outlined in this card provides embedding representations of text. It powers the Embed and Similarity endpoints.

Model architecture: Masked Language Model
Model release date: April 2021
Model version: 0.1
Model sizes: Shrimp, Otter
Model Card Author(s): Cohere Safety Team & Responsibility Council

Training Dataset: coheretext-unfiltered

View the API documentation.

Performance#

Performance has been evaluated on the SentEval research benchmark. We report the average score for each of the Similarity, Downstream, and Probing Tasks categories.

Model     Similarity Tasks    Downstream Tasks    Probing Tasks
Shrimp    0.82                0.81                0.71
Otter     0.82                0.82                0.67

Model performance is currently reported only on English benchmarks. Multilingual benchmarks will be reported in the future.

Intended Use Case#

Embeddings may be used for purposes such as estimating the semantic similarity of two sentences, choosing the sentence most likely to follow another, sentiment analysis, topic extraction, categorizing user feedback, or flagging messages. Embedding performance will vary across use cases depending on the language, dialect, subject matter, and other qualities of the represented text.

Example usage of Similarity: embeddings are used by the Similarity endpoint to determine that “Hello! How are you?” is more similar to “Hey, how’s it going?” than it is to “Goodbye!”.
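
To make this concrete, the sketch below embeds the three texts and compares cosine similarities. It is a minimal illustration, not an official snippet: it assumes the Cohere Python SDK's co.embed call, and the model name passed to it is a placeholder.

```python
# Minimal sketch: embed three texts and compare cosine similarities, mirroring
# the Similarity example above. Assumes the Python SDK's co.embed call; the
# model name is a placeholder, not a documented identifier.
import numpy as np
import cohere

co = cohere.Client("YOUR_API_KEY")

texts = ["Hello! How are you?", "Hey, how's it going?", "Goodbye!"]
embeddings = np.array(co.embed(texts=texts, model="otter").embeddings)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

greeting, paraphrase, farewell = embeddings
print(cosine(greeting, paraphrase))  # expected to be higher ...
print(cosine(greeting, farewell))    # ... than this score
```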

Usage Notes#

Always refer to the Usage Guidelines for guidance on using the Cohere API responsibly. Additionally, please consult the following model-specific usage notes:

Model Bias#

There is extensive research into the social biases learned by word embeddings (Bolukbasi et al., 2016; Manzini et al., 2019) and contextual word embeddings (Kurita et al., 2019; Zhao et al., 2019). The use of Representation embeddings may reinforce or amplify social biases. We recommend that developers using the Representation model take model bias into account and design applications carefully to avoid the following (a simple association probe is sketched after this list):

  • Inaccurate and biased text classification systems: Embeddings may inadvertently capture spurious associations between groups of people, and using them to classify or infer attributes about people may produce inaccurate labels for individuals based on mischaracterizations of the groups they belong to.
  • Reinforcing historical social biases: Embeddings capture problematic associations and stereotypes prominent on the internet and in society at large. They should not be used to make decisions about individuals or the groups they belong to. For example, it is dangerous to use embeddings and Similarity outputs in CV ranking systems due to known biases in the representations (Kurita et al., 2019).
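
As a starting point for the bias checks recommended above, the sketch below shows a simple WEAT-style association probe over embeddings. It is illustrative only, not a complete audit, and it assumes the vectors have already been fetched from the Embed endpoint.

```python
# Illustrative association probe (a sketch, not a complete bias audit).
# `vectors` is assumed to map words or phrases to embeddings already fetched
# from the Embed endpoint.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def association(target, attrs_a, attrs_b, vectors):
    """Mean similarity to attribute set A minus mean similarity to set B."""
    sim_a = np.mean([cosine(vectors[target], vectors[w]) for w in attrs_a])
    sim_b = np.mean([cosine(vectors[target], vectors[w]) for w in attrs_b])
    return float(sim_a - sim_b)

# Example (with vectors supplied by the caller):
#   association("nurse", ["she", "her"], ["he", "him"], vectors)
# A score far from zero signals a learned association that a downstream
# classifier built on these embeddings could act on.
```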

Technical Notes#

  • English only: The model provides meaningful representations for English text only.
  • Distributional shift: Embeddings capture the state of the training data at the time it was scraped. Downstream classifiers will need to be validated or retrained upon release of new embeddings to ensure that they are still serving their intended purpose.
  • Longer texts: Sentence embeddings are the average of contextualized word embeddings, so the embedding of a longer text may not accurately capture its meaning across the entire sequence length (see the pooling sketch after this list).
  • Varying text length: Similarity performance may vary when using Targets spanning a wide range of lengths.
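
Because sentence embeddings are the average of contextualized word embeddings (see the Longer texts note above), the sketch below illustrates that mean pooling with placeholder vectors; the 768-dimensional hidden size is an assumption for illustration, not a documented property of Shrimp or Otter.

```python
# Mean pooling sketch: the sentence embedding is the mean of the contextualized
# token embeddings, so each token's contribution is diluted as the sequence
# grows. Random vectors stand in for model outputs; 768 dims is an assumption.
import numpy as np

rng = np.random.default_rng(0)

def sentence_embedding(token_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool a (num_tokens, hidden_dim) matrix into one sentence vector."""
    return token_embeddings.mean(axis=0)

short_tokens = rng.normal(size=(8, 768))    # a short sentence
long_tokens = rng.normal(size=(512, 768))   # a long passage

print(sentence_embedding(short_tokens).shape)  # (768,)
print(sentence_embedding(long_tokens).shape)   # (768,) -- same size vector,
# but any single token carries far less weight in the longer sequence.
```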

Potential for Misuse#

Guided by the NAACL Ethics Review Questions, we describe the potential for misuse of the Representation model. By documenting adverse use cases, we aim to hold our team accountable for addressing them. Our goal is to prevent adversarial actors from leveraging the model for the following malicious ends.

Note: The examples in this section are not comprehensive and are only meant to illustrate our understanding of potential harms. The examples are meant to be more model-specific and tangible than those in the Usage Guidelines. Each of these malicious use cases violates our usage guidelines and Terms of Use, and Cohere reserves the right to restrict API access at any time.

  • Extraction of identity and demographic information: Using embeddings to classify the group identity or demographics of text authors or persons mentioned in a text. Group identification and private information should be consensually provided by individuals and not inferred by any automatic system.
  • Building purposefully opaque text classification systems: Algorithmic decisions that significantly affect people should be explainable to the persons affected; however, text classifications made using representations may not be explainable. A malicious actor may take advantage of this opacity to shield themselves from accountability for algorithmic decisions that may have disparate impact across demographic groups (Campolo and Crawford, 2020).