An embedding is a list of floating-point numbers that our models use to represent text. Each token in a piece of text has its own embedding. As these embeddings flow through the model's layers, they are transformed and refined. By the end of the model, each token embedding contains information about the semantics of the text it was part of, as well as some amount of world knowledge (Jawahar et al.; Petroni et al.). Deeper models produce more information-rich embeddings, both because they have more layers in which to transform and refine them, and because their embeddings have a higher dimensionality (i.e. contain more numbers).
For short texts (shorter than 512 tokens), we return a single embedding obtained by averaging the embeddings of each token in the text, following Reimers and Gurevych. The final embedding thus captures semantic information about the entirety of the text. For texts longer than 512 tokens, we first split the text into 512-token chunks, and then average the resulting chunk embeddings.
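The pooling described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the actual implementation: `embed_tokens` is a hypothetical stand-in for the model (it takes a chunk of tokens and returns one vector per token), and the chunk embeddings are averaged with equal weight, which the text does not specify.

```python
from statistics import fmean

CHUNK_SIZE = 512  # chunk length in tokens, as stated in the text


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    # zip(*vectors) iterates column-wise, so each dimension is averaged separately.
    return [fmean(dimension) for dimension in zip(*vectors)]


def embed_text(tokens, embed_tokens, chunk_size=CHUNK_SIZE):
    """Return one embedding for a token sequence of any length.

    `embed_tokens` is a hypothetical callable standing in for the model:
    it maps a list of tokens to a list of per-token embedding vectors.
    """
    if not tokens:
        raise ValueError("expected at least one token")

    # Split the token sequence into consecutive chunks of at most chunk_size.
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

    # Embed each chunk on its own and mean-pool its token embeddings.
    chunk_vectors = [mean_pool(embed_tokens(chunk)) for chunk in chunks]

    # Assumption: chunks are weighted equally, even if the last one is shorter.
    # A text that fits in one chunk has one chunk vector, which is returned unchanged.
    return mean_pool(chunk_vectors)
```

With equal weighting, a short final chunk influences the result as much as a full one. Weighting each chunk by its token count would be a reasonable alternative if that matters for your data.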
- Embeddings can be used to efficiently cluster large amounts of text, using k-means clustering, for example. The embeddings can also be visualised using projection techniques such as PCA, UMAP, or t-SNE, which is helpful when exploring large amounts of unstructured text.
- Embeddings can also be used to rapidly match a query sentence with semantically similar sentences whose embeddings are stored in a database. This is done by taking the dot product between the embedding of the query sentence and the matrix representing the other sentences.
- Embeddings can be paired with a downstream classifier, such as a random forest or an SVM, to perform binary or multi-class classification for tasks such as sentiment classification or toxicity detection.
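As an illustration of the query-matching use case above, the sketch below scores a query embedding against a small matrix of stored embeddings with a dot product and returns the best matches. The function names and toy vectors are hypothetical. Normalising the vectors to unit length first is an assumption added here so that the dot product equals cosine similarity; the text itself only mentions the dot product.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


def most_similar(query, stored, top_k=3):
    """Return (row_index, score) pairs for the top_k rows most similar to query.

    `stored` is a matrix given as a list of rows, one embedding per sentence.
    """
    unit_query = normalize(query)
    # One dot product per stored row: this is the matrix-vector product
    # between the stored embeddings and the query embedding.
    scores = [
        sum(q * x for q, x in zip(unit_query, normalize(row)))
        for row in stored
    ]
    # Sort row indices by score, highest first.
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return [(index, scores[index]) for index in ranked[:top_k]]


# Toy two-dimensional "embeddings"; real embeddings have many more dimensions.
stored_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matches = most_similar([2.0, 1.0], stored_embeddings, top_k=2)
```

Here `matches` lists row 2 first and row 0 second, since `[1.0, 1.0]` points in the direction closest to the query. For a large database, the same computation is usually done as a single matrix-vector product in a numerical library, with the stored rows normalised once ahead of time.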