Embeddings

Embeddings are a way to represent the meaning of text as a list of numbers. This is useful because once text is in this form, it can be compared to other text for similarity. Using a simple comparison function, we can calculate a similarity score for two embeddings to figure out whether two texts are talking about similar things.

In the example below, the embeddings for two similar phrases have a high similarity score, and the embeddings for two unrelated phrases have a low similarity score:

import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")

# get the embeddings
phrases = ["i love soup", "soup is my favorite", "london is far away"]
(soup1, soup2, london) = co.embed(phrases).embeddings

# compare them
def calculate_similarity(a, b):
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

calculate_similarity(soup1, soup2) # 0.9 - very similar!
calculate_similarity(soup1, london) # 0.3 - not similar!
Turning text into embeddings.Turning text into embeddings.

Turning text into embeddings.

Applications

  • Build a Frequently Asked Questions bot that compares the customer question for similarity to an existing collection of frequently asked questions.
  • Efficiently cluster large amounts of text, using k-means clustering, for example. The embeddings can also be visualized using projection techniques such as PCA, UMAP, or t-SNE. This can be helpful when trying to visualize large amounts of unstructured text.
  • Perform semantic search over text in a database
  • Pair with a downstream classifier like a random forest or an SVM to perform binary or multi-class classification or tasks such as sentiment classification or toxicity detection.

How Embeddings are Obtained

For short texts (shorter than 512 tokens), we return embeddings obtained by averaging the embeddings of each token in the text, following Reimers and Gurevych. The final embedding thus captures semantic information about the entirety of the text. For texts longer than 512 tokens, we first splice the text into 512-token chunks, and average the resulting embeddings of each chunk.