Token likelihood is a useful tool for model evaluation. For instance, suppose you've finetuned a model and would like to know how much it has improved over the baseline: you could use token likelihoods to compare the two models' performance on some held-out text. Here is a quick demonstration of how to use the return_likelihoods parameter of the Generate endpoint for this purpose.
Let's say we've finetuned a small model on Shakespeare data. We'd like to confirm that the finetuned model assigns higher likelihood to Shakespeare text than the baseline model does. To do this, we could hold out a snippet like the following from the training data:
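For illustration, take the opening lines of Sonnet 18 (any held-out Shakespeare excerpt would work the same way):

    "Shall I compare thee to a summer's day?
    Thou art more lovely and more temperate:
    Rough winds do shake the darling buds of May,
    And summer's lease hath all too short a date."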
Then we could use the following example code to retrieve the average log-likelihood of the above snippet:
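Below is a minimal sketch using the classic Cohere Python SDK (cohere.Client and co.generate); the API key and model IDs are placeholders, and the response fields (generations, token_likelihoods, likelihood) follow that SDK's shape. Setting max_tokens=0 asks the endpoint to score the prompt without generating anything, and return_likelihoods="ALL" returns a log-likelihood for each prompt token:

```python
import cohere

# Placeholder API key and model IDs: substitute your own values.
co = cohere.Client("YOUR_API_KEY")
BASELINE_MODEL = "baseline-model-id"    # hypothetical baseline model ID
FINETUNED_MODEL = "finetuned-model-id"  # hypothetical finetuned model ID

# The held-out Shakespeare snippet from above.
snippet = (
    "Shall I compare thee to a summer's day?\n"
    "Thou art more lovely and more temperate:\n"
    "Rough winds do shake the darling buds of May,\n"
    "And summer's lease hath all too short a date."
)

def average_log_likelihood(text: str, model: str) -> float:
    """Score `text` under `model` and return its average token log-likelihood."""
    response = co.generate(
        model=model,
        prompt=text,
        max_tokens=0,              # score the prompt only; generate nothing
        return_likelihoods="ALL",  # return per-token log-likelihoods for the prompt
    )
    token_liks = response.generations[0].token_likelihoods
    # The first token has no preceding context, so it may come back without a
    # likelihood; skip any such tokens when averaging.
    scores = [t.likelihood for t in token_liks if t.likelihood is not None]
    return sum(scores) / len(scores)

print("baseline: ", average_log_likelihood(snippet, BASELINE_MODEL))
print("finetuned:", average_log_likelihood(snippet, FINETUNED_MODEL))
```

Averaging the per-token log-likelihoods, rather than summing them, keeps the score comparable across texts of different lengths.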
The following are the average log-likelihoods of the snippet under the baseline and finetuned models:
This demonstrates that finetuning this model increased the likelihood of Shakespeare data!