
Text Summarization

This article demonstrates a simple way of using Cohere's generation models to summarize text. You can find the code in the notebook and Colab.

Provided with the right prompt, a language model can generate multiple candidate summaries.

We will use a simple prompt that includes two examples and a task description:

"<input phrase>"
In summary: "<summary>"

Our prompt is geared toward paraphrasing, simplifying an input sentence. It contains two examples that demonstrate the task to the model. The sentence we want it to summarize is:

Killer whales have a diverse diet, although individual populations often specialize in particular types of prey.

prompt = '''"The killer whale or orca (Orcinus orca) is a toothed whale
belonging to the oceanic dolphin family, of which it is the largest member"
In summary: "The killer whale or orca is the largest type of dolphin"
"It is recognizable by its black-and-white patterned body"
In summary: "Its body has a black and white pattern"
"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey"
In summary: "'''
print(prompt)
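The same few-shot prompt can also be assembled programmatically from (sentence, summary) pairs, which makes it easy to swap in new examples. Here is a minimal sketch; `make_prompt` is a hypothetical helper, not part of the Cohere SDK:

```python
def make_prompt(examples, query):
    """Build a few-shot paraphrase prompt from (sentence, summary) pairs.

    Each example is rendered as the quoted sentence followed by
    'In summary: "<summary>"'; the query ends with an open quote
    so the model completes the summary.
    """
    lines = []
    for sentence, summary in examples:
        lines.append(f'"{sentence}"\nIn summary: "{summary}"')
    lines.append(f'"{query}"\nIn summary: "')
    return "\n".join(lines)

examples = [
    ("The killer whale or orca (Orcinus orca) is a toothed whale "
     "belonging to the oceanic dolphin family, of which it is the largest member",
     "The killer whale or orca is the largest type of dolphin"),
    ("It is recognizable by its black-and-white patterned body",
     "Its body has a black and white pattern"),
]

prompt = make_prompt(
    examples,
    "Killer whales have a diverse diet, although individual populations "
    "often specialize in particular types of prey")
print(prompt)
```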

We get several completions from the model via the API.

import cohere
import pandas as pd

co = cohere.Client('YOUR_API_KEY')  # paste your API key here

n_generations = 5

prediction = co.generate(
    model='large',
    prompt=prompt,
    return_likelihoods='GENERATION',
    stop_sequences=['"'],
    max_tokens=50,
    temperature=0.7,
    num_generations=n_generations,
    k=0,
    p=0.75)

# Get list of generations
gens = []
likelihoods = []
for gen in prediction.generations:
    gens.append(gen.text)

    # Get sum of likelihoods
    sum_likelihood = 0
    for t in gen.token_likelihoods:
        sum_likelihood += t.likelihood
    likelihoods.append(sum_likelihood)
pd.options.display.max_colwidth = 200
# Create a dataframe for the generated sentences and their likelihood scores
df = pd.DataFrame({'generation':gens, 'likelihood': likelihoods})
# Drop duplicates
df = df.drop_duplicates(subset=['generation'])
# Sort by highest sum likelihood
df = df.sort_values('likelihood', ascending=False, ignore_index=True)
print('Candidate summaries for the sentence: \n"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey."')
df

The model suggests the following candidate summaries for the sentence:

"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey."

   generation                                        likelihood
0  Killer whales have a diverse diet"                -3.208850
1  Its diet is diverse"                              -3.487236
2  Their diet is diverse"                            -3.761171
3  Different populations have different diets"       -6.415764
4  Their diet consists of a variety of marine life"  -11.764865

In many cases, better outputs can be obtained by creating multiple generations and then ranking and filtering them. Here we rank the generations by the sum of their token likelihoods.
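Note that ranking by the sum of token log-likelihoods tends to favor shorter generations, since every extra token adds a negative term; a common alternative is to rank by the average likelihood per token, which normalizes for length. A self-contained sketch with illustrative numbers (not actual API output):

```python
# Toy (generation, per-token log-likelihoods) pairs; values are illustrative
candidates = [
    ('Killer whales have a diverse diet"',
     [-0.5, -0.4, -0.6, -0.3, -0.7, -0.7]),
    ('Their diet consists of a variety of marine life"',
     [-1.2, -1.1, -1.3, -1.0, -1.4, -1.2, -1.3, -1.1, -1.2]),
]

def avg_likelihood(token_likelihoods):
    # Length-normalized score: mean log-likelihood per token
    return sum(token_likelihoods) / len(token_likelihoods)

# Sort candidates from highest (least negative) to lowest average likelihood
ranked = sorted(candidates, key=lambda c: avg_likelihood(c[1]), reverse=True)
best = ranked[0][0]
print(best)
```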

Hyperparameters

It's worth spending some time learning the various hyperparameters of the generation endpoint. For example, temperature tunes the degree of randomness in the generations. Other parameters include top-k and top-p as well as frequency_penalty and presence_penalty which can reduce the amount of repetition in the output of the model. See the API reference of the generate endpoint for more details on all the parameters.