Language models give computers the ability to search by meaning, going beyond matching keywords. This capability is called semantic search.
In this article, we'll build a simple semantic search engine. The applications of semantic search go beyond building a web search engine: it can power a private search engine for internal documents or records, or a feature like Stack Overflow's "similar questions". We'll follow four steps:
- Get the archive of questions
- Embed the archive
- Search using an index and nearest neighbor search
- Visualize the archive based on the embeddings
We'll use the TREC dataset, which is made up of questions and their categories.
| | label_coarse | label_fine | question |
| --- | --- | --- | --- |
| 0 | 0 | 0 | How did serfdom develop in and then leave Russia ? |
| 1 | 1 | 1 | What films featured the character Popeye Doyle ? |
| 2 | 0 | 0 | How can I find a list of celebrities ' real names ? |
| 3 | 1 | 2 | What fowl grabs the spotlight after the Chinese Year of the Monkey ? |
| 4 | 2 | 3 | What is the full form of .com ? |
| 5 | 3 | 4 | What contemptible scoundrel stole the cork from my lunch ? |
| 6 | 3 | 5 | What team did baseball 's St. Louis Browns become ? |
| 7 | 3 | 6 | What is the oldest profession ? |
| 8 | 0 | 7 | What are liver enzymes ? |
| 9 | 3 | 4 | Name the scar-faced bounty hunter of The Old West . |
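Loading the dataset might look like the sketch below, assuming the Hugging Face `datasets` library is installed (the dataset id and column names are assumptions; check the dataset's page on the hub):

```python
def preview(rows, n=10):
    """Return the first n question texts, as shown in the table above."""
    return [r["text"] for r in rows[:n]]

if __name__ == "__main__":
    from datasets import load_dataset  # third-party: pip install datasets

    # "trec" and the "text" column name are assumptions about the hub dataset
    trec = load_dataset("trec", split="train")
    rows = [dict(r) for r in trec]
    print(preview(rows))
```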
Let's now embed the text of the questions. Getting a thousand embeddings of this length should take only a few seconds.
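A minimal sketch of the embedding step, assuming the Cohere Python SDK with an API key in the `COHERE_API_KEY` environment variable (the batch size and the `batched` helper are illustrative assumptions, not part of the SDK):

```python
import os

def batched(items, size=96):
    """Split the texts into batches, since embed endpoints cap texts per request."""
    return [items[i:i + size] for i in range(0, len(items), size)]

if __name__ == "__main__":
    import cohere  # third-party: pip install cohere

    co = cohere.Client(os.environ["COHERE_API_KEY"])
    questions = ["How did serfdom develop in and then leave Russia ?",
                 "What films featured the character Popeye Doyle ?"]
    embeds = []
    for batch in batched(questions):
        # co.embed returns one embedding vector per input text
        embeds.extend(co.embed(texts=batch).embeddings)
```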
Let's now use Annoy to build an index that stores the embeddings in a way that is optimized for fast search. This approach scales well to a large number of texts (other options include Faiss, ScaNN, and PyNNDescent).
After building the index, we can use it to retrieve the nearest neighbors either of existing questions (section 3.1), or of new questions that we embed (section 3.2).
If we're only interested in measuring the similarity between the questions in the dataset (no outside queries), a simple approach is to calculate the similarity between every pair of embeddings we have.
| id | question | distance |
| --- | --- | --- |
| 614 | What animals do you find in the stock market ? | 0.896121 |
| 137 | What are equity securities ? | 0.970260 |
| 601 | What is `` the bear of beers '' ? | 0.978348 |
| 307 | What does NASDAQ stand for ? | 0.997819 |
| 683 | What is the rarest coin ? | 1.027727 |
| 112 | What are the world 's four oceans ? | 1.049661 |
| 864 | When did the Dow first reach ? | 1.050362 |
| 547 | Where can stocks be traded on-line ? | 1.053685 |
| 871 | What are the Benelux countries ? | 1.054899 |
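To make the idea concrete, here is a plain-Python sketch of pairwise cosine similarity over a toy set of two-dimensional "embeddings" (real embeddings have hundreds or thousands of dimensions, and the vectors here are made up):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy "embeddings"
# Similarity of every pair: O(n^2), fine for small archives
sims = [[cosine_similarity(a, b) for b in vectors] for a in vectors]
```

The first two vectors point in nearly the same direction, so their similarity is close to 1; the third is orthogonal to the first, so their similarity is 0. This quadratic all-pairs computation is exactly what the approximate index lets us avoid at scale.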
We're not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.
| id | question | distance |
| --- | --- | --- |
| 236 | What is the name of the tallest mountain in the world ? | 0.431913 |
| 670 | What is the highest mountain in the world ? | 0.436290 |
| 907 | What mountain range is traversed by the highest railroad in the world ? | 0.715265 |
| 435 | What is the highest peak in Africa ? | 0.717943 |
| 354 | What ocean is the largest in the world ? | 0.762917 |
| 412 | What was the highest mountain on earth before Mount Everest was discovered ? | 0.767649 |
| 109 | Where is the highest point in Japan ? | 0.784319 |
| 114 | What is the largest snake in the world ? | 0.789743 |
| 656 | What 's the tallest building in New York City ? | 0.793982 |
| 901 | What 's the longest river in the world ? | 0.794352 |
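The query flow reduces to: embed the query, then rank the archive by distance to it. A plain-Python sketch with toy vectors (the archive entries and the `angular_distance` helper are illustrative, not the actual Cohere or Annoy calls):

```python
import math

def angular_distance(a, b):
    """Annoy's 'angular' metric: Euclidean distance between normalized vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return math.sqrt(sum((x / norm_a - y / norm_b) ** 2 for x, y in zip(a, b)))

archive = {"tallest mountain": [0.9, 0.1],
           "largest ocean": [0.6, 0.4],
           "longest river": [0.2, 0.8]}  # toy embeddings
query_vec = [1.0, 0.0]  # would come from embedding the query text

# Rank archive entries by distance to the query, nearest first
ranked = sorted(archive, key=lambda q: angular_distance(archive[q], query_vec))
```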
This concludes this introductory guide to semantic search using sentence embeddings. As you continue down the path of building a search product, additional considerations arise (like dealing with long texts, or fine-tuning the embeddings for a specific use case).
We can’t wait to see what you start building! Share your projects or find support at community.cohere.ai.