
Model Evaluation

The Likelihood endpoint can be a useful tool for model evaluation. For instance, suppose you've finetuned a model and would like to know how much it has improved over the baseline. You could use the Likelihood endpoint to compare the two models' performance on some held-out text. Here is a quick demonstration of how to use the Likelihood endpoint for model evaluation.

Example Setup

Let's say we've finetuned a small model on Shakespeare data. We'd like to confirm that the finetuned model assigns higher likelihood to Shakespeare text than the baseline model does. To do this, we could hold out the following snippet from the training data:

"To be, or not to be: that is the question:"
"Whether ’tis nobler in the mind to suffer"
"The slings and arrows of outrageous fortune,"
"Or to take arms against a sea of troubles,"
"And by opposing end them. To die: to sleep..."

Then we could use the following example code to compute the average per-token log-likelihood of the above snippet.
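Below is a minimal sketch in Python. It assumes the Likelihood endpoint is reachable over HTTP, authenticated with a bearer token, and that it returns per-token log-likelihoods in a `token_likelihoods` field; the URL, request body, and response shape shown here are placeholders, so check the API reference for the exact call.

```python
import os
import requests

# NOTE: the endpoint URL, request body, and response fields below are
# placeholders -- substitute the values from the Likelihood endpoint's
# API reference for your deployment.
API_URL = "https://api.example.com/likelihood"
API_KEY = os.environ["API_KEY"]

snippet = (
    "To be, or not to be: that is the question:\n"
    "Whether 'tis nobler in the mind to suffer\n"
    "The slings and arrows of outrageous fortune,\n"
    "Or to take arms against a sea of troubles,\n"
    "And by opposing end them. To die: to sleep..."
)

def average_log_likelihood(model: str, text: str) -> float:
    """Request per-token log-likelihoods for `text` and return their mean."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "text": text},
    )
    response.raise_for_status()
    token_likelihoods = response.json()["token_likelihoods"]  # assumed field name
    return sum(t["likelihood"] for t in token_likelihoods) / len(token_likelihoods)

# Compare the baseline and finetuned models on the held-out snippet.
for model in ["small", "finetuned-small"]:
    print(f"{model}: {average_log_likelihood(model, snippet):.2f}")
```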

Results

The following are the average per-token log-likelihoods of the snippet under the baseline and finetuned small models:

Model              Average Log-Likelihood
small              -1.44
finetuned-small    -1.12

This demonstrates that finetuning increased the likelihood the model assigns to Shakespeare text!
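If a more intuitive scale helps, an average log-likelihood can be converted to a per-token perplexity via exp(-average log-likelihood). This is a standard transformation, not a value the endpoint returns:

```python
import math

# Per-token perplexity = exp(-average log-likelihood)
print(math.exp(1.44))  # baseline small model: ~4.22
print(math.exp(1.12))  # finetuned-small: ~3.06
```

The finetuned model's lower perplexity means it is, on average, less "surprised" by each token of the held-out Shakespeare text.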