Semantic Search

Language models give computers the ability to search by meaning, going beyond keyword matching. This capability is called semantic search.

Searching an archive using sentence embeddings

In this article, we'll build a simple semantic search engine. The applications of semantic search go beyond building a web search engine: the same technique can power a private search engine for internal documents or records, or features like StackOverflow's "similar questions".

You can find the code in the accompanying notebook and Colab.

Contents

  1. Get the archive of questions
  2. Embed the archive
  3. Search using an index and nearest neighbor search
  4. Visualize the archive based on the embeddings

1. Get the archive of questions

We'll use the trec dataset which is made up of questions and their categories.

from datasets import load_dataset
import pandas as pd

# Get dataset
dataset = load_dataset("trec", split="train")
# Import into a pandas dataframe, take only the first 1000 rows
df = pd.DataFrame(dataset)[:1000]
# Preview the data to ensure it has loaded correctly
df.head(10)
   label-coarse  label-fine  text
0             0           0  How did serfdom develop in and then leave Russia ?
1             1           1  What films featured the character Popeye Doyle ?
2             0           0  How can I find a list of celebrities ' real names ?
3             1           2  What fowl grabs the spotlight after the Chinese Year of the Monkey ?
4             2           3  What is the full form of .com ?
5             3           4  What contemptible scoundrel stole the cork from my lunch ?
6             3           5  What team did baseball 's St. Louis Browns become ?
7             3           6  What is the oldest profession ?
8             0           7  What are liver enzymes ?
9             3           4  Name the scar-faced bounty hunter of The Old West .

2. Embed the archive

Let's now embed the text of the questions.

Embedding these thousand questions should only take a few seconds.

import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")  # your Cohere API key
# Get the embeddings as a numpy array
embeds = np.array(co.embed(texts=list(df['text']),
                           model='large',
                           truncate='LEFT').embeddings)
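
As a quick sanity check, we can inspect the shape of the resulting array; each row is one question and each column is one embedding dimension (the exact dimensionality depends on the model you choose):

# Each row is one question, each column one embedding dimension
print(embeds.shape)  # (1000, <the model's embedding dimension>)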

3. Search using an index and nearest neighbor search

Building the search index from the embeddings

Let's now use Annoy to build an index that stores the embeddings in a way that is optimized for fast search. This approach scales well to a large number of texts (other options include Faiss, ScaNN, and PyNNDescent).

After building the index, we can use it to retrieve the nearest neighbors either of existing questions (section 3.1), or of new questions that we embed (section 3.2).

from annoy import AnnoyIndex

# Create the search index, pass the size of embedding
search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])
search_index.build(10) # 10 trees
search_index.save('test.ann')
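
Because the index is saved to disk, a separate process can load it later without re-embedding or rebuilding anything. A minimal sketch, assuming the same embedding dimensionality used above:

from annoy import AnnoyIndex

# Load the saved index; the dimensionality must match the one it was built with
loaded_index = AnnoyIndex(embeds.shape[1], 'angular')
loaded_index.load('test.ann')  # Annoy memory-maps the file, so loading is fast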

3.1. Find the neighbors of an example from the dataset

If we're only interested in measuring the similarities between the questions in the dataset (no outside queries), a simple way is to calculate the similarities between every pair of embeddings we have.
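
The code below retrieves neighbors from the Annoy index we just built, but for an archive this small you could equally compute every pairwise cosine similarity directly in numpy. A minimal sketch, assuming embeds is the (1000, n_dims) array from step 2:

import numpy as np

# Normalize each row to unit length; dot products between rows
# are then exactly the cosine similarities
normed = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
similarities = normed @ normed.T  # shape: (1000, 1000)

# The ten questions most similar to question 92 (the top hit is itself)
top_ids = np.argsort(-similarities[92])[:10]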

# Choose an example (we'll retrieve others similar to it)
example_id = 92
# Retrieve nearest neighbors
similar_item_ids = search_index.get_nns_by_item(example_id, 10,
                                                include_distances=True)
# Format and print the text and distances
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'],
                             'distance': similar_item_ids[1]}).drop(example_id)
print(f"Question:'{df.iloc[example_id]['text']}'\nNearest neighbors:")
results
# Output:
Question:'What are bear and bull markets ?'
Nearest neighbors:
                                               texts  distance
614  What animals do you find in the stock market ?  0.896121
137                     What are equity securities ?  0.970260
601               What is `` the bear of beers '' ?  0.978348
307                     What does NASDAQ stand for ?  0.997819
683                        What is the rarest coin ?  1.027727
112             What are the world 's four oceans ?  1.049661
864                   When did the Dow first reach ?  1.050362
547            Where can stocks be traded on-line ?  1.053685
871                 What are the Benelux countries ?  1.054899

3.2. Find the neighbors of a user query

We're not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.

query = "What is the tallest mountain in the world?"
# Get the query's embedding
query_embed = embedder.batch_embed([query])
# Retrieve the nearest neighbors
similar_item_ids = search_index.get_nns_by_vector(query_embed[0],10,
include_distances=True)
# Format the results
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'],
'distance': similar_item_ids[1]})
results
                                                                            texts  distance
236                      What is the name of the tallest mountain in the world ?  0.431913
670                                  What is the highest mountain in the world ?  0.436290
907      What mountain range is traversed by the highest railroad in the world ?  0.715265
435                                          What is the highest peak in Africa ?  0.717943
354                                     What ocean is the largest in the world ?  0.762917
412  What was the highest mountain on earth before Mount Everest was discovered ?  0.767649
109                                        Where is the highest point in Japan ?  0.784319
114                                     What is the largest snake in the world ?  0.789743
656                              What 's the tallest building in New York City ?  0.793982
901                                     What 's the longest river in the world ?  0.794352

This concludes our introductory guide to semantic search using sentence embeddings. As you continue building a search product, additional considerations arise (like dealing with long texts, or finetuning the embeddings for a specific use case).
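
For example, one common way to handle texts longer than the model's input limit (a sketch under assumed names, not part of the guide's code) is to split each document into chunks, embed and index every chunk, and keep a pointer from each chunk back to its source document:

# Hypothetical helper: split a long document into fixed-size word chunks
# (the 100-word chunk size is an arbitrary choice for illustration)
def chunk_text(text, chunk_size=100):
    words = text.split()
    return [' '.join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

# Index every chunk, remembering which document it came from
chunks, sources = [], []
for doc_id, doc in enumerate(long_documents):  # long_documents: your own corpus
    for chunk in chunk_text(doc):
        chunks.append(chunk)
        sources.append(doc_id)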

We can’t wait to see what you start building! Share your projects or find support at community.cohere.ai.