Text Summarization
This guide uses the Generate endpoint.
You can find more information about the endpoint here.
This article demonstrates a simple way of using Cohere's generation models to summarize text.
You can find the code in the notebook and Colab.
1. Install the Cohere API
# Let's first install Cohere's python SDK
!pip install cohere pandas
1a. Import Cohere and the Dependencies
import cohere
import time
import pandas as pd
# Paste your API key here. Remember to not share it publicly
api_key = ''
co = cohere.Client(api_key)
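To keep the key out of the notebook itself, one option is to read it from an environment variable. The sketch below assumes a variable named COHERE_API_KEY; that name and the load_api_key helper are chosen here for illustration and are not something this guide or the SDK prescribes.

```python
import os

def load_api_key(var_name="COHERE_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# Possible usage in the cell above:
#   api_key = load_api_key()
#   co = cohere.Client(api_key)
```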
We will use a simple prompt that includes two examples and a task description:
"<input phrase>"
In summary: "<summary>"
Our prompt is geared toward paraphrasing: it asks the model to restate an input sentence in simpler terms. It contains two examples that demonstrate the task to the model. The sentence we want it to summarize is:
Killer whales have a diverse diet, although individual populations often specialize in particular types of prey.
prompt = '''"The killer whale or orca (Orcinus orca) is a toothed whale
belonging to the oceanic dolphin family, of which it is the largest member"
In summary: "The killer whale or orca is the largest type of dolphin"
"It is recognizable by its black-and-white patterned body"
In summary: "Its body has a black and white pattern"
"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey"
In summary: "'''
print(prompt)
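If you want to try other example pairs, the same prompt layout can be assembled programmatically. This is a minimal sketch: build_summary_prompt and demo_prompt are hypothetical names introduced here, and the helper simply reproduces the quoted-input / In summary: pattern used in this guide.

```python
def build_summary_prompt(examples, target):
    """Assemble a few-shot prompt from (text, summary) pairs.

    Each example is rendered as a quoted input followed by
    'In summary: "<summary>"'. The prompt ends with an open quote
    so the model continues with the summary for `target`.
    """
    lines = []
    for text, summary in examples:
        lines.append(f'"{text}"')
        lines.append(f'In summary: "{summary}"')
    lines.append(f'"{target}"')
    lines.append('In summary: "')
    return "\n".join(lines)

examples = [
    ("It is recognizable by its black-and-white patterned body",
     "Its body has a black and white pattern"),
]
demo_prompt = build_summary_prompt(
    examples,
    "Killer whales have a diverse diet, although individual populations "
    "often specialize in particular types of prey",
)
print(demo_prompt)
```

Because the prompt ends with an open quote, a closing quote is a natural stop sequence for the completion.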
2. Call the Cohere API to Get Several Completions for the Prompt
We request several completions from the model in a single API call:
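n_generations = 5
prediction = co.generate(
    model='large',
    prompt=prompt,
    # Return the likelihood of each generated token so we can score the outputs
    return_likelihoods='GENERATION',
    # Stop at the closing quote of the summary
    stop_sequences=['"'],
    max_tokens=50,
    temperature=0.7,
    num_generations=n_generations,
    k=0,
    p=0.75)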
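2a. Calculate the summed likelihood of each generation
We take the token likelihoods of each generation and sum them to get a total score for that generation. The scores are negative, which is consistent with log-likelihoods: a value closer to zero means the model considers the text more probable.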
# Collect each generation's text and its summed token likelihood
gens = []
likelihoods = []
for gen in prediction.generations:
    gens.append(gen.text)

    # Sum the likelihoods of the generated tokens
    sum_likelihood = 0
    for t in gen.token_likelihoods:
        sum_likelihood += t.likelihood
    likelihoods.append(sum_likelihood)
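One thing to keep in mind is that a summed score tends to favor shorter outputs, since every extra token adds another negative term. The toy sketch below uses made-up per-token values (not real API output) and two hypothetical helpers to compare the sum with a length-normalized mean, which is one common alternative.

```python
# Made-up per-token log-likelihoods for two hypothetical generations
short_gen = [-0.5, -0.7, -0.4]
long_gen = [-0.3, -0.4, -0.3, -0.4, -0.3, -0.4]

def sum_score(token_logliks):
    """Total log-likelihood of the sequence."""
    return sum(token_logliks)

def mean_score(token_logliks):
    """Average log-likelihood per token (length-normalized)."""
    return sum(token_logliks) / len(token_logliks)

for name, gen in [("short", short_gen), ("long", long_gen)]:
    print(name, round(sum_score(gen), 3), round(mean_score(gen), 3))
```

With these numbers the short sequence wins on the summed score while the long one wins on the per-token mean, so the choice of score can change the ranking.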
We then rank the generations by their likelihood scores.
pd.options.display.max_colwidth = 200
# Create a dataframe for the generated sentences and their likelihood scores
df = pd.DataFrame({'generation':gens, 'likelihood': likelihoods})
# Drop duplicates
df = df.drop_duplicates(subset=['generation'])
# Sort by highest sum likelihood
df = df.sort_values('likelihood', ascending=False, ignore_index=True)
print('Candidate summaries for the sentence: \n"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey."')
df
The model suggests the following candidate summaries for the sentence:
"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey."
|   | generation | likelihood |
|---|---|---|
| 0 | Killer whales have a diverse diet" | -3.208850 |
| 1 | Its diet is diverse" | -3.487236 |
| 2 | Their diet is diverse" | -3.761171 |
| 3 | Different populations have different diets" | -6.415764 |
| 4 | Their diet consists of a variety of marine life" | -11.764865 |
In many cases, you can get better results by creating multiple generations and then ranking and filtering them. In this case we're ranking the generations by their summed token likelihoods.
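The rank-and-filter step can also be sketched without pandas. In the example below, the scores are copied from the table in this guide, with one row repeated to show deduplication; rank_candidates and its min_score cutoff are hypothetical additions for illustration. The helper also strips the trailing quote that appears in the generations.

```python
def rank_candidates(candidates, min_score=None):
    """Deduplicate, rank, and optionally filter (text, score) pairs.

    `min_score` is a hypothetical cutoff added for this sketch.
    """
    best = {}
    for text, score in candidates:
        # Remove surrounding whitespace and the trailing stop-sequence quote
        cleaned = text.strip().rstrip('"').strip()
        # Keep the highest score seen for each distinct text
        if cleaned not in best or score > best[cleaned]:
            best[cleaned] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    if min_score is not None:
        ranked = [(t, s) for t, s in ranked if s >= min_score]
    return ranked

candidates = [
    ('Their diet is diverse"', -3.761171),
    ('Killer whales have a diverse diet"', -3.208850),
    ('Their diet consists of a variety of marine life"', -11.764865),
    ('Their diet is diverse"', -3.761171),  # repeated row
]
for text, score in rank_candidates(candidates, min_score=-10.0):
    print(f"{score:8.3f}  {text}")
```

A fixed cutoff like this is only one possible filter; a sensible threshold depends on the prompt and on how long the outputs are.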
Hyperparameters
It's worth spending some time learning the various hyperparameters of the generation endpoint. For example, temperature tunes the degree of randomness in the generations. Other parameters include top-k and top-p, as well as frequency_penalty and presence_penalty, which can reduce the amount of repetition in the output of the model. See the API reference of the generate endpoint for more details on all the parameters.
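To build some intuition for temperature and top-p, here is a toy illustration on a made-up four-token distribution. It follows the usual textbook definitions (softmax with temperature, then nucleus filtering) and is not a description of how the endpoint is implemented; the function names and logits are assumptions for this sketch.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize the kept tokens; everything else gets probability 0
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four tokens
for t in (0.3, 0.7, 1.5):
    probs = apply_temperature(logits, t)
    print(t, [round(x, 3) for x in top_p_filter(probs, 0.75)])
```

Lower temperatures concentrate probability on the top token, so fewer tokens survive the top-p cut; higher temperatures flatten the distribution and let more tokens through.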