Controlling Generation with top-k & top-p

The method of picking output tokens is a key concept in text generation with language models. There are several methods (also called decoding strategies) for picking the output token, and two of the leading ones are top-k sampling and top-p sampling.

Let’s look at an example where the input to the model is the prompt "The name of that country is the":

language model input: 'the name of that country is the' and the output is: 'United'

Example output of a generation language model

The output token in this case, United, was picked in the last step of processing, after the language model has processed the input and calculated a likelihood score for every token in its vocabulary. Each score indicates how likely that token is to be the next token in the sentence (based on all the text the model was trained on).

language model output probabilities of tokens: United 12%, Netherlands 2.7%, Czech 1.9%

The model calculates a likelihood for each token in its vocabulary. The decoding strategy then picks one as the output.

1- Pick the top token: greedy decoding

You can see in this example that we picked the token with the highest likelihood, ‘United’.

Greedy decoding always picks the top token. In this case, that's the token 'United'

Always picking the highest scoring token is called "Greedy Decoding". It's useful but has some drawbacks.

This strategy is called greedy decoding. It’s a reasonable strategy but has some drawbacks, such as outputs with repetitive loops of text. Think of the suggestions in your smartphone’s auto-suggest: when you continually pick the highest suggested word, the text may devolve into repeated sentences.
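To make this concrete, here is a minimal sketch of greedy decoding over the three probabilities from the example. The `probs` dictionary and the `greedy_pick` helper are illustrative names, not part of any library, and a real model would score every token in its vocabulary rather than three.

```python
# Toy next-token probabilities copied from the example above.
# A real model assigns a score to every token in its vocabulary.
probs = {"United": 0.12, "Netherlands": 0.027, "Czech": 0.019}


def greedy_pick(probs):
    """Return the single token with the highest probability."""
    return max(probs, key=probs.get)


print(greedy_pick(probs))  # United
```

Because there is no randomness involved, the same scores always produce the same token.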

2- Pick from amongst the top tokens: top-k

Another commonly used strategy is to sample from a shortlist of the top 3 tokens. This approach allows the other high-scoring tokens a chance of being picked. The randomness introduced by this sampling helps the quality of generation in a lot of scenarios.

top-k shortlists the top three tokens (United, Netherlands, Czech), then samples one of the three

Adding some randomness helps make output text more natural. In top-3 decoding, we first shortlist three tokens then sample one of them considering their likelihood scores.

More broadly, choosing the top three tokens means setting the top-k parameter to 3. Changing the top-k parameter sets the size of the shortlist the model samples from as it outputs each token. Setting top-k to 1 gives us greedy decoding.
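The shortlist-then-sample idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: `top_k_sample` is a hypothetical helper name, the probabilities are the three from the example, and sampling is weighted by the likelihood scores of the shortlisted tokens.

```python
import random

# Toy next-token probabilities copied from the example above.
probs = {"United": 0.12, "Netherlands": 0.027, "Czech": 0.019}


def top_k_sample(probs, k, rng=random):
    """Shortlist the k most likely tokens, then sample one of them."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank tokens from most to least likely and keep the first k.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    shortlist = ranked[:k]
    tokens = [token for token, _ in shortlist]
    weights = [prob for _, prob in shortlist]
    # random.choices treats the weights as relative, so the shortlisted
    # scores do not need to sum to 1.
    return rng.choices(tokens, weights=weights, k=1)[0]


print(top_k_sample(probs, k=1))  # United (k=1 reduces to greedy decoding)
print(top_k_sample(probs, k=3))  # one of the three tokens, chosen at random
```

With k set to 3, United is still the most likely pick, but Netherlands and Czech now have a chance of being selected.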

3- Pick from amongst the top tokens whose probabilities add up to 15%: top-p

The difficulty of selecting the best top-k value opens the door for a popular decoding strategy that dynamically sets the size of the shortlist of tokens. This method, called nucleus sampling (or top-p sampling), shortlists the top tokens whose sum of likelihoods does not exceed a certain value. A toy example with a top-p value of 0.15 could look like this:

top-p of 0.15 shortlists the top tokens whose probabilities add up to 15% or less. In this case it shortlists the tokens United and Netherlands.

In top-p, the size of the shortlist is dynamically selected based on the sum of likelihood scores reaching some threshold.
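Here is a minimal sketch of the toy example above, following the rule as described here: keep adding tokens, most likely first, while their summed likelihood does not exceed p. The helper names `top_p_shortlist` and `top_p_sample` are hypothetical. Always keeping at least the top token is an assumption made so the shortlist is never empty, and some implementations also keep the token that crosses the threshold, so treat the cutoff rule as illustrative.

```python
import random

# Toy next-token probabilities copied from the example above.
probs = {"United": 0.12, "Netherlands": 0.027, "Czech": 0.019}


def top_p_shortlist(probs, p):
    """Keep the top tokens whose summed probability does not exceed p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    shortlist, total = [], 0.0
    for token, prob in ranked:
        # Assumption: the top token is always kept, even if it alone
        # exceeds p, so there is always something to sample from.
        if shortlist and total + prob > p:
            break
        shortlist.append((token, prob))
        total += prob
    return shortlist


def top_p_sample(probs, p, rng=random):
    """Sample one token from the top-p shortlist, weighted by likelihood."""
    shortlist = top_p_shortlist(probs, p)
    tokens = [token for token, _ in shortlist]
    weights = [prob for _, prob in shortlist]
    return rng.choices(tokens, weights=weights, k=1)[0]


print([token for token, _ in top_p_shortlist(probs, 0.15)])
# ['United', 'Netherlands']
print(top_p_sample(probs, 0.15))  # United or Netherlands
```

United and Netherlands sum to 14.7%, which fits under the 15% threshold. Adding Czech would bring the total to 16.6%, so it is left out.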

Top-p is usually set to a high value (like 0.75) with the purpose of limiting the long tail of low-probability tokens that may be sampled. We can use both top-k and top-p together. If both k and p are enabled, p acts after k.
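The ordering of the two filters can be sketched as follows. `filter_top_k_top_p` and `sample_token` are hypothetical helper names, and the sketch assumes that p is compared against the original (not renormalized) probabilities of the tokens that survive the top-k cut, which the text above does not specify.

```python
import random

# Toy next-token probabilities copied from the example above.
probs = {"United": 0.12, "Netherlands": 0.027, "Czech": 0.019}


def filter_top_k_top_p(probs, k=None, p=None):
    """Apply top-k first (if set), then top-p (if set), to the ranked tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if k is not None:
        ranked = ranked[:k]  # k acts first
    if p is not None:  # p then acts on whatever k left behind
        kept, total = [], 0.0
        for token, prob in ranked:
            # Assumption: always keep at least the top token.
            if kept and total + prob > p:
                break
            kept.append((token, prob))
            total += prob
        ranked = kept
    return ranked


def sample_token(probs, k=None, p=None, rng=random):
    """Sample one token from the filtered shortlist, weighted by likelihood."""
    shortlist = filter_top_k_top_p(probs, k=k, p=p)
    tokens = [token for token, _ in shortlist]
    weights = [prob for _, prob in shortlist]
    return rng.choices(tokens, weights=weights, k=1)[0]


print([t for t, _ in filter_top_k_top_p(probs, k=3, p=0.15)])
# ['United', 'Netherlands']
print([t for t, _ in filter_top_k_top_p(probs, k=1, p=0.15)])
# ['United']
```

In the second call, k has already cut the shortlist down to one token, so p has nothing left to remove.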