
Model Evaluation

Token likelihood is a useful tool for model evaluation. For instance, suppose you've finetuned a model and would like to know how much it has improved over the baseline: you could use token likelihoods to compare the two models' performance on some held-out text. Here is a quick demonstration of how to use the return_likelihoods parameter of the Generate endpoint for model evaluation.

Example Setup

Let's say we've finetuned a small model on Shakespeare data. We'd like to confirm that the finetuned model assigns higher likelihood to Shakespeare text than the baseline model does. To do this, we could hold out the following snippet from the training data:

"To be, or not to be: that is the question:"
"Whether ’tis nobler in the mind to suffer"
"The slings and arrows of outrageous fortune,"
"Or to take arms against a sea of troubles,"
"And by opposing end them. To die: to sleep..."

Then we could use the following example code to retrieve the average log-likelihood of the above snippet:
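The sketch below assumes the Cohere Python SDK; the API key placeholder is hypothetical, and the model IDs small and finetuned-small stand in for the baseline and finetuned models from this example. Exact client setup and response fields may vary by SDK version.

```python
# A minimal sketch, assuming the Cohere Python SDK (`pip install cohere`).
import cohere

co = cohere.Client("YOUR_API_KEY")  # assumption: replace with your API key

snippet = (
    "To be, or not to be: that is the question:\n"
    "Whether ’tis nobler in the mind to suffer\n"
    "The slings and arrows of outrageous fortune,\n"
    "Or to take arms against a sea of troubles,\n"
    "And by opposing end them. To die: to sleep..."
)

# "small" is the baseline; "finetuned-small" stands in for the finetuned model's ID.
for model in ["small", "finetuned-small"]:
    response = co.generate(
        model=model,
        prompt=snippet,
        max_tokens=0,              # generate nothing; we only want to score the prompt
        return_likelihoods="ALL",  # return token-level and average log-likelihoods
    )
    # generations[0].likelihood holds the average log-likelihood of the prompt's tokens
    print(f"{model}: {response.generations[0].likelihood}")
```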

Results

The following are the average log-likelihoods of the snippet using the baseline and finetuned small models:

| Model           | Average Log-Likelihood |
|-----------------|------------------------|
| small           | -1.44                  |
| finetuned-small | -1.12                  |

The finetuned model's average log-likelihood (-1.12) is higher (less negative) than the baseline's (-1.44), corresponding to an average per-token probability of roughly exp(-1.12) ≈ 0.33 versus exp(-1.44) ≈ 0.24. This demonstrates that finetuning increased the likelihood the model assigns to Shakespeare text!