Text Summarization

📘

This Guide Uses the Generate Endpoint.

You can find more information about the endpoint here.

This article demonstrates a simple way of using Cohere's generation models to summarize text.

You can find the code in the notebook and Colab.

1. Install the Cohere SDK

# Let's first install Cohere's python SDK
!pip install cohere pandas

1a. Import Cohere and the Dependencies

import cohere
import time
import pandas as pd
# Paste your API key here. Remember not to share it publicly
api_key = ''
co = cohere.Client(api_key)

We will use a simple prompt that includes two examples, each following this format:

"<input phrase>"
In summary: "<summary>"

Our prompt is geared toward paraphrasing: simplifying an input sentence. It contains two examples that demonstrate the task to the model. The sentence we want the model to summarize is:

Killer whales have a diverse diet, although individual populations often specialize in particular types of prey.

prompt = '''"The killer whale or orca (Orcinus orca) is a toothed whale
belonging to the oceanic dolphin family, of which it is the largest member"
In summary: "The killer whale or orca is the largest type of dolphin"
"It is recognizable by its black-and-white patterned body"
In summary: "Its body has a black and white pattern"
"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey"
In summary: "'''
print(prompt)

2. Call the Cohere API to Get Several Completions

We get several completions from the model via a single API call.

n_generations = 5
prediction = co.generate(
    model='large',
    prompt=prompt,
    return_likelihoods='GENERATION',
    stop_sequences=['"'],
    max_tokens=50,
    temperature=0.7,
    num_generations=n_generations,
    k=0,
    p=0.75)

2b. Calculate the Sum of Likelihoods for Each Generated Paragraph

We take the token likelihoods of each generated paragraph and sum them to get a total "paragraph" score.

# Get the list of generations and their likelihood scores
gens = []
likelihoods = []
for gen in prediction.generations:
    gens.append(gen.text)
    # Sum the token likelihoods to get a total score for this generation
    sum_likelihood = 0
    for t in gen.token_likelihoods:
        sum_likelihood += t.likelihood
    likelihoods.append(sum_likelihood)
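One caveat worth noting (an aside, not part of the original notebook): token log-likelihoods are negative, so a plain sum tends to favor shorter generations. Dividing by the token count gives an average per-token score instead. A minimal sketch on made-up stand-in data:

```python
# Sketch with made-up stand-in scores (real values would come from
# gen.token_likelihoods in the Cohere SDK). Token log-likelihoods are
# negative; dividing the sum by the token count length-normalizes the score.
token_likelihoods = {
    'Killer whales have a diverse diet"': [-0.5, -0.4, -0.6, -0.7, -1.0],
    'Their diet consists of a variety of marine life"':
        [-1.2, -1.1, -1.4, -1.3, -1.5, -1.2, -1.1, -1.0, -1.0],
}

for text, lls in token_likelihoods.items():
    total = sum(lls)
    avg = total / len(lls)  # average per-token log-likelihood
    print(f"sum={total:.2f}  avg={avg:.2f}  {text}")
```

Whether to rank by sum or by average depends on whether you want generation length to factor into the score; this notebook uses the sum.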

We then rank each paragraph by its likelihood score.

pd.options.display.max_colwidth = 200
# Create a dataframe for the generated sentences and their likelihood scores
df = pd.DataFrame({'generation':gens, 'likelihood': likelihoods})
# Drop duplicates
df = df.drop_duplicates(subset=['generation'])
# Sort by highest sum likelihood
df = df.sort_values('likelihood', ascending=False, ignore_index=True)
print('Candidate summaries for the sentence: \n"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey."')
df

The model suggests the following candidate summaries for the sentence:

"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey."

   generation                                        likelihood
0  Killer whales have a diverse diet"                 -3.208850
1  Its diet is diverse"                               -3.487236
2  Their diet is diverse"                             -3.761171
3  Different populations have different diets"        -6.415764
4  Their diet consists of a variety of marine life"  -11.764865
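With the candidates ranked, selecting the final summary is just a matter of taking the top row. A small sketch using hypothetical scores for two of the candidates above:

```python
import pandas as pd

# Hypothetical scores mirroring two rows of the table above
df = pd.DataFrame({
    'generation': ['Their diet is diverse"', 'Killer whales have a diverse diet"'],
    'likelihood': [-3.761171, -3.208850],
})
# Sort by highest sum likelihood and take the first row; strip the
# trailing quote that the stop sequence leaves at the end of the text
df = df.sort_values('likelihood', ascending=False, ignore_index=True)
best_summary = df.iloc[0]['generation'].rstrip('"')
print(best_summary)  # Killer whales have a diverse diet
```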

In many cases, better results can be reached by creating multiple generations, then ranking and filtering them. Here, we rank the generations by the sum of their token likelihoods.

Hyperparameters

It's worth spending some time learning the various hyperparameters of the generation endpoint. For example, temperature tunes the degree of randomness in the generations. Other parameters include top-k and top-p, as well as frequency_penalty and presence_penalty, which can reduce repetition in the model's output. See the API reference of the Generate endpoint for details on all the parameters.
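To build intuition for temperature: it divides the logits before the softmax, so low values sharpen the distribution toward the most likely token, while high values flatten it toward uniform. A self-contained sketch of the underlying math (not Cohere code):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale logits by 1/temperature, then apply a numerically stable softmax
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.3))  # low T: top token dominates
print(softmax_with_temperature(logits, 1.0))  # unscaled distribution
print(softmax_with_temperature(logits, 5.0))  # high T: close to uniform
```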