News Article Recommender

News Article Recommender

Often, we run across a news article that sparks our interest in a particular topic or theme, leaving us curious, or even hungry, to learn more. We want to build on that momentum by reading more articles on the topic right away. Article recommendation apps make this easier by surfacing similar articles to the one we are reading, so we can expand our understanding as much as we want.

Our News Article Recommender app is based on the concept of text embeddings, which capture the meaning of a piece of text beyond simple keyword matching. It compares the embeddings similarity between a selected article to all other available articles. The app refines the list of candidate articles by using text classification to only recommend articles within the same category (e.g., sports or business). It then extracts a list of tags from each recommended article and displays them alongside the article to help the reader scan and choose the next article to read.

All of this is done via three Cohere API endpoints stacked together: Embed, Classify, and Generate. The steps to build the News Article Recommender are:

Step 1: Get a list of articles
Step 2: Embed articles
Step 3: Create article selector
Step 4: Find the most similar articles
Step 5: Classify each article’s category
Step 6: Extract tags from these articles
Step 7: Put everything together

Read on for more details on each of these steps.

Building an Article Recommender App

Step 1: Get a List of Articles

First, we’ll need to load the articles. In this example, we’re using the BBC news article dataset [source], which consists of articles from a few categories: business, politics, tech, entertainment, and sports, and we’re focusing on a subset of this dataset containing 100 articles.

# Load the dataset to a dataframe
@st.cache
def load_df(csv_file):
    return pd.read_csv(csv_file, delimiter=',')

df = load_df('bbc_news_test.csv')
articles = df["news"].tolist()
titles = df["title"].tolist()

Step 2: Embed Articles

Next, we’ll turn each article's text into embeddings. An embedding is a list of numbers that a model uses to represent a piece of text, capturing its context and meaning. We do this by calling the Cohere Embed endpoint, which takes in texts as input and returns embeddings as output.

# Get text embeddings via the Embed endpoint
def embed_text(texts):
    embeddings = co.embed(
                model='large',
                texts = texts)
    embeddings = np.array(embeddings.embeddings)
    return embeddings

Step 3: Create Article Selector

We’ll then create a Streamlit selectbox where a reader can select the article that he or she is currently reading (let's call this the target).

# Drop down selectbox
col1, col2 = st.columns(2)
with col1:
   current_id = st.selectbox(
   "Select an Article",
   range(len(titles)),
   format_func=lambda x: titles[x])

Step 4: Find the Most Similar Articles

Next, we’ll create a function to find other articles with the most similar embeddings (let's call these candidates) using cosine similarity.

Cosine similarity is a metric that measures how similar two sequences of numbers are (embeddings in our case), and we compute it for each target-candidate pair.

# Calculate cosine similarity between the target and candidate articles
def get_similarity(target,candidates):
   # Turn list into array
   candidates = np.array(candidates)
   target = np.expand_dims(np.array(target),axis=0)
 
   # Calculate cosine similarity
   similarity_scores = cosine_similarity(target,candidates)
   similarity_scores = np.squeeze(similarity_scores).tolist()
 
   # Sort by descending order in similarity
   similarity_scores = list(enumerate(similarity_scores))
   similarity_scores = sorted(similarity_scores, key=lambda x:x[1], reverse=True)
 
   # Return similarity scores
   return similarity_scores
# Get the similarity between the target and candidate articles
embs = rec.embed_text(articles)
current_emb = embs[current_id]
similarity_scores = rec.get_similarity(current_emb,embs)

Step 5: Classify Each Article’s Category

In the context of news articles, it’s important to note that two articles may have similar meanings but can fall under totally different categories. For example, an article about a player’s anger over a racism fine (Sports) can appear to be similar to an article about a politician’s disagreement over an apology (Politics) because both articles revolve around confrontations and the clash of individuals.

In our case, we don’t want to recommend articles from a different category even if they appear to be similar to the current article. For this, we need to build a news category classifier that will only select articles from the same category.

We’ll use the Cohere Classify endpoint to build a news category classifier that classifies articles into five classes: Business, Politics, Tech, Entertainment, and Sports. For this app, we’ll train a custom model using another subset of the BBC news article. Find more information in the training a representation model documentation.

def classify_text(input_text):
   text = truncate_text(input_text)
   classifications = co.classify(
   model='504b1b30-4927-464d-9d4c-412f9771775b-ft',
   inputs=[text]
   )
   return classifications.classifications[0].prediction
  
current_class = rec.classify_text(articles[current_id])

Step 6: Extract Tags from These Articles

Next, we’ll want to enrich the recommended articles with more information to help users scan the list for key information and discover content. The outcome: each recommended article will be displayed with a list of tags containing the key topics mentioned in the article.

We’ll build this tags extractor via the Cohere Generate endpoint. First, we create a prompt with a few examples of text and its tags. Then we call the Generate endpoint using the following prompt, appending each recommended article to get the recommended tags.

Refer to the prompt engineering documentation if you’d like to know more about creating prompts for the Generate endpoint.

# Create the prompt for tags extraction
def create_prompt(text):
   return f"""Given a news article, this program returns the list tags containing keywords of that article.
Article: japanese banking battle at an end japan s sumitomo mitsui financial has withdrawn its takeover offer for rival bank ufj holdings  enabling the latter to merge with mitsubishi tokyo.  sumitomo bosses told counterparts at ufj of its decision on friday  clearing the way for it to conclude a 3 trillion
Tags: sumitomo mitsui financial, ufj holdings, mitsubishi tokyo, japanese banking
--
Article: france starts digital terrestrial france has become the last big european country to launch a digital terrestrial tv (dtt) service.  initially  more than a third of the population will be able to receive 14 free-to-air channels. despite the long wait for a french dtt roll-out
Tags: france, digital terrestrial
--
Article: apple laptop is  greatest gadget  the apple powerbook 100 has been chosen as the greatest gadget of all time  by us magazine mobile pc.  the 1991 laptop was chosen because it was one of the first  lightweight  portable computers and helped define the layout of all future notebook pcs.
Tags: apple, apple powerbook 100, laptop
--
Article: {text}
Tags:"""
 
# Get extractions via the Generate endpoint
def extract_tags(complete_prompt):
   # Truncate the complete prompt if too long
   token_check = co.tokenize(text=complete_prompt)
   if len(token_check.tokens) > 2000:
       complete_prompt = co.detokenize(token_check.tokens[:2000]).text
 
   prediction = co.generate(
       model='xlarge',
       prompt=complete_prompt,
       max_tokens=10,
       temperature=0.3,
       k=0,
       p=1,
       frequency_penalty=0,
       presence_penalty=0,
       stop_sequences=["--"],
       return_likelihoods='NONE')
   tags_raw =  prediction.generations[0].text
 
   if "\n" in tags_raw:
       tags_clean = re.search(".+?(?=\\n)",tags_raw).group()
   else:
       tags_clean = tags_raw
 
   if tags_clean:
       tags = tags_clean.split(",")
       tags = list(dict.fromkeys(tags)) # remove duplicates
       tags = [tag.strip() for tag in tags] # remove empty spaces
       tags = [tag for tag in tags if tag] # remove empty tags
       tags = [tag for tag in tags if len(tag) > 3] # remove short tags
       tags = [f"`{tag}`" for tag in tags] # format tag string
       tags = ",".join(tags)
   else:
       tags = "None"
 
   return tags

tags = rec.extract_tags(complete_prompt)

Step 7: Put Everything Together

Finally, we’ll tie together all the steps that we’ve completed so far: embedding, finding the most similar articles, news category classification, and tags extraction.

We then render the Streamlit components to display the current and recommended articles.

# Current article
with col3:
   st.markdown("***You are currently reading...***")
   current_class = rec.classify_text(articles[current_id])
   st.markdown(f"###### {current_class.capitalize()}")
   st.subheader(f"{titles[current_id]}")
   current_article = articles[current_id]
   st.write(current_article.replace("$","\$"))
 
# Recommended article
with col4:
   st.markdown("***You might also like...***")
   SHOW_TOP = 3
   count = 0
   for candidate_id,_ in similarity_scores:
       candidate_article = articles[candidate_id]
       candidate_title = titles[candidate_id]
 
       candidate_class = rec.classify_text(candidate_article)
 
       if candidate_class ==  current_class and candidate_id != current_id:
           # Show title
           st.markdown(f"###### {candidate_class.capitalize()}")
           st.subheader(candidate_title)
 
           # Show article
           expander = st.expander("Read article")
           expander.write(candidate_article.replace("$","\$"))
          
           # Show tags
           complete_prompt = rec.create_prompt(candidate_article)
           tags = rec.extract_tags(complete_prompt)
           st.markdown(tags)
           st.write("------")
       
           count += 1
 
       if count == SHOW_TOP:
           break

Conclusion

And that concludes the process of creating our News Article Recommender app, built via three Cohere API endpoints stacked together: Embed, Classify, and Generate. To get started building your own version, create a free Cohere account.