Multilingual Embedding Models

At Cohere, we are committed to breaking down barriers and expanding access to cutting-edge NLP technologies that power projects across the globe. By making our innovative multilingual language models available to all developers, we continue to move toward our goal of empowering developers, researchers, and innovators with state-of-the-art NLP technologies that push the boundaries of Language AI.

Our Multilingual Model maps text to a semantic vector space, positioning text with a similar meaning in close proximity. This process unlocks a range of valuable use cases for multilingual settings. For example, one can map a query to this vector space during a search to locate relevant documents nearby. This often yields search results that are several times better than keyword search.
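As a minimal sketch of the search use case described above, the snippet below ranks documents by their dot product with a query vector. The three-dimensional vectors are made-up stand-ins for real embeddings, and `rank_by_dot_product` is a hypothetical helper written for illustration, not part of the Cohere SDK.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot_product(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, score) pairs for the top_k highest-scoring documents."""
    scores = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy vectors standing in for the embeddings of three documents and one query.
doc_vecs = [
    [0.9, 0.1, 0.0],  # document 0
    [0.1, 0.8, 0.2],  # document 1
    [0.7, 0.3, 0.1],  # document 2
]
query_vec = [1.0, 0.2, 0.0]

results = rank_by_dot_product(query_vec, doc_vecs)
print(results)  # documents 0 and 2 score highest for this query
```

In a real application the vectors would come from the embedding model, and a vector index would typically replace the exhaustive loop once the collection grows large.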

Differences Between English and Multilingual Embedding Models

Unlike our English language embedding model, our multilingual model was trained using dot product calculations. The dot product yields a non-normalized similarity score that reflects the magnitudes of the two vectors being compared, not only the angle between them. When magnitude is taken into account in this way, the multilingual embeddings perform better. For more information on how our English language model works (using cosine similarity), see our introductory guide to the Cohere platform. Our multilingual embeddings have 768 dimensions.
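To illustrate the difference between the two scores, here is a small sketch using toy two-dimensional vectors (assumed for illustration, not real embeddings). The two vectors point in the same direction but differ in length: cosine similarity treats them as identical, while the dot product grows with magnitude.

```python
import math


def dot(u, v):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: the dot product of the length-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


short_vec = [1.0, 2.0]
long_vec = [2.0, 4.0]  # same direction as short_vec, twice the magnitude

# The dot product doubles when one vector doubles in length.
print(dot(short_vec, short_vec), dot(short_vec, long_vec))  # 5.0 10.0

# Cosine similarity is approximately 1 in both cases (up to floating-point rounding).
print(cosine(short_vec, short_vec), cosine(short_vec, long_vec))
```

For the multilingual model, this suggests comparing embeddings with the dot product directly rather than normalizing them first.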

Use Cases

Get Started

To get started using the multilingual embedding models, you can either query our endpoints or install our SDK to use the model within Python:

import cohere

api_key = "YOUR_API_KEY"  # placeholder: replace with your own Cohere API key
co = cohere.Client(api_key)

# The same greeting in several languages
texts = [
    'Hello from Cohere!', 'مرحبًا من كوهير!', 'Hallo von Cohere!',
    'Bonjour de Cohere!', '¡Hola desde Cohere!', 'Olá do Cohere!',
    'Ciao da Cohere!', '您好,来自 Cohere!', 'कोहेरे से नमस्ते!'
]

response = co.embed(texts=texts, model='multilingual-22-12')
embeddings = response.embeddings  # One embedding per input text
print(embeddings[0][:5])  # Print the first five values of the first text's embedding

Model Performance

| Model | Clustering | Search - English | Search - Multilingual | Cross-lingual Classification |
|---|---|---|---|---|
| Cohere: multilingual-22-12 | 51.0 | 55.8 | 51.4 | 64.6 |
| Sentence-transformers: paraphrase-multilingual-mpnet-base-v2 | 46.7 | 44.4 | 15.3 | 56.1 |
| Google: LaBSE | 41.0 | 20.9 | 13.2 | 59.2 |
| Google: Universal Sentence Encoder | 40.1 | 14.3 | 3.4 | 59.8 |

List of Supported Languages

Our multilingual embedding model supports over 100 languages, including Chinese, Spanish, and French. For a full list of languages we support, please reference this page.