

Our models learn to model language by reading text scraped from the internet. Given a sentence such as I like to bake cookies, the model is asked to repeatedly predict the next token, marked [?] below:

I [?]
I like [?]
I like to [?]
I like to bake [?]
I like to bake cookies

The model learns that the word to is quite likely to follow the word like in English, and that the word cookies is likely to follow the word bake.
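As a rough illustration of the prediction task above, the sketch below splits a sentence into (context, target) pairs, one per prediction step. It assumes whitespace-separated words stand in for tokens, which is a simplification (real tokenizers often split text into sub-word pieces), and the function name `next_token_examples` is hypothetical rather than part of any API.

```python
def next_token_examples(sentence):
    """Return one (context, target) pair for each token the model must predict."""
    # Assumption: split on whitespace; a real tokenizer may split words further.
    tokens = sentence.split()
    # For each position i, the context is everything before it and the target is token i.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_examples("I like to bake cookies")
for context, target in pairs:
    # Show the context with a [?] placeholder, then the token that fills it.
    print(" ".join(context), "[?]", "->", target)
```

Each printed line corresponds to one of the prediction steps listed above: the model sees the context and is scored on how well it anticipated the target token.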


The likelihood of a token can be thought of as a number (typically between -15 and 0, where values closer to 0 mean higher likelihood) that quantifies how strongly the model expected this token to appear at that point in a sentence. If a token has a low likelihood, the model was not expecting it and is surprised by its use. Conversely, if a token has a high likelihood, the model was confident that it would be used.

For example, using our shrimp model, the likelihood of to in the sentence I like to is roughly -1, which is quite high: the model was fairly confident that the tokens I like would be followed by the token to. Similarly, the likelihood of cookies in the sentence I like to bake cookies is roughly -2.5. This is a bit lower than the previous example, which makes intuitive sense (brownies or cake would also have been reasonable options), but it is still quite high. However, if we change the sentence to I like to bake chairs, the likelihood of the token chairs is considerably lower, around -15, meaning that the model is extremely surprised by its use in this sentence.
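To get a feel for these numbers, it can help to convert them into ordinary probabilities. The sketch below assumes the likelihood values are natural-log probabilities, which the text does not state explicitly but which is consistent with a range that tops out at 0. The values in the dictionary are the approximate figures quoted above, not fresh model output, and `to_probability` is a hypothetical helper.

```python
import math

# Approximate likelihoods quoted in the text for the final token of each sentence.
log_likelihoods = {"to": -1.0, "cookies": -2.5, "chairs": -15.0}


def to_probability(log_likelihood):
    """Convert a log-likelihood to a probability (assumes natural logarithm)."""
    return math.exp(log_likelihood)


probabilities = {token: to_probability(value) for token, value in log_likelihoods.items()}
for token, probability in probabilities.items():
    print(f"{token}: {probability:.2g}")
```

Under this assumption, to comes out at roughly a 37% chance, cookies at roughly 8%, and chairs at roughly 3 in 10 million, which matches the intuition that the last sentence is extremely surprising to the model.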