Comparing Baseline and Custom Models

Token likelihoods are a useful tool for model evaluation. For instance, if you've trained a custom model and want to know how much it has improved over the default model, you can use token likelihoods to compare how well each model predicts some held-out text. Here is a quick demonstration of how to use the return_likelihoods parameter of the Generate endpoint for this kind of evaluation.
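As a rough illustration of what an average log-likelihood summarizes, the sketch below averages per-token log-likelihoods by hand. The numbers are made up for illustration and are not output from any model, and the function name `average_log_likelihood` is hypothetical.

```python
from typing import Sequence


def average_log_likelihood(token_log_likelihoods: Sequence[float]) -> float:
    """Return the mean of per-token log-likelihoods."""
    if not token_log_likelihoods:
        raise ValueError("need at least one token log-likelihood")
    return sum(token_log_likelihoods) / len(token_log_likelihoods)


# Hypothetical per-token log-likelihoods for a short text (illustrative values only).
baseline_scores = [-3.0, -2.5, -3.5]
custom_scores = [-1.0, -1.5, -0.5]

baseline_avg = average_log_likelihood(baseline_scores)
custom_avg = average_log_likelihood(custom_scores)

# Log-likelihoods are at most 0; a value closer to 0 means the model
# found the text more probable.
print(f"baseline: {baseline_avg:.2f}, custom: {custom_avg:.2f}")
```

Because the score is averaged per token, it can be compared across models on the same text regardless of the text's length.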

Example Setup

Let's say we've custom-trained a medium model on Shakespeare data. We'd like to confirm that this custom model assigns a higher likelihood to Shakespeare text than the default model does. To do this, we could hold out the following snippet from the training data:

"To be, or not to be: that is the question:"
"Whether ’tis nobler in the mind to suffer"
"The slings and arrows of outrageous fortune,"
"Or to take arms against a sea of troubles,"
"And by opposing end them. To die: to sleep..."

Then we could use the following example code to retrieve the average log-likelihood of the above snippet:

cURL:

curl --location --request POST 'https://api.cohere.ai/generate' \
  --header 'Authorization: BEARER {api_key}' \
  --header 'Content-Type: application/json' \
  --data-raw '{
      "model": "medium",
      "prompt": "To be, or not to be: that is the question: Whether ’tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them. To die: to sleep...",
      "max_tokens": 1,
      "temperature": 1,
      "k": 0,
      "p": 0.75,
      "return_likelihoods": "ALL"
    }'

Python:

import cohere
co = cohere.Client('{api_key}')
response = co.generate(
  model='medium',
  prompt='To be, or not to be: that is the question: Whether ’tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them. To die: to sleep...',
  max_tokens=1,
  temperature=1,
  k=0,
  p=0.75,
  return_likelihoods='ALL')
print('Likelihood: {}'.format(response.generations[0].likelihood))

Node.js:

const cohere = require('cohere-ai');
cohere.init('{api_key}');
(async () => {
  const response = await cohere.generate({
    model: 'medium',
    prompt: 'To be, or not to be: that is the question: Whether ’tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them. To die: to sleep...',
    max_tokens: 1,
    temperature: 1,
    k: 0,
    p: 0.75,
    return_likelihoods: 'ALL'
  });
  console.log(`Likelihood: ${response.body.generations[0].likelihood}`);
})();

Go:

package main

import (
  "fmt"

  cohere "github.com/cohere-ai/cohere-go"
)

func main() {
  co, err := cohere.CreateClient("{api_key}")
  if err != nil {
    fmt.Println(err)
    return
  }

  response, err := co.Generate(cohere.GenerateOptions{
    Model:             "medium",
    Prompt:            `To be, or not to be: that is the question: Whether ’tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them. To die: to sleep...`,
    MaxTokens:         1,
    Temperature:       1,
    K:                 0,
    P:                 0.75,
    ReturnLikelihoods: "ALL",
  })
  if err != nil {
    fmt.Println(err)
    return
  }

  fmt.Println("Likelihood:", *response.Generations[0].Likelihood)
}

CLI:

co model generate medium 'To be, or not to be: that is the question: Whether ’tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them. To die: to sleep...' --max-tokens=1 --temperature=1 --k=0 --p=0.75 --return_likelihoods=ALL
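
To produce a side-by-side comparison, the same request can be run once per model. The sketch below keeps the API call behind a caller-supplied function so the comparison logic can be read on its own. `compare_models`, `best_model`, `score_fn`, and the stand-in scorer are hypothetical names; in practice `score_fn` would wrap a `co.generate(...)` call like the Python example above and return `response.generations[0].likelihood`.

```python
from typing import Callable, Dict, Iterable


def compare_models(
    score_fn: Callable[[str, str], float],
    models: Iterable[str],
    text: str,
) -> Dict[str, float]:
    """Score the same held-out text under each model.

    score_fn(model, text) is assumed to return the average
    log-likelihood that `model` assigns to `text`.
    """
    return {model: score_fn(model, text) for model in models}


def best_model(scores: Dict[str, float]) -> str:
    """Return the model with the highest (closest to zero) average log-likelihood."""
    return max(scores, key=scores.get)


# Stand-in scorer with fixed, made-up values so the sketch runs offline.
# A real scorer would call the Generate endpoint with return_likelihoods='ALL'.
def fake_score(model: str, text: str) -> float:
    return {"model-a": -2.0, "model-b": -1.0}[model]


scores = compare_models(fake_score, ["model-a", "model-b"], "held-out text")
print(scores)
print("higher likelihood:", best_model(scores))
```

Scoring every model on exactly the same text keeps the comparison fair, since average log-likelihood depends on the text being scored.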

Results

The following are the average log-likelihoods of the snippet using the baseline and custom medium models:

Model            Average Log-Likelihood
medium           -2.99
custom-medium    -1.12

The custom model's average log-likelihood is closer to zero (-1.12 versus -2.99), which demonstrates that customizing this model increased the likelihood it assigns to Shakespeare text!
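
One way to read the size of this gap is to convert each average log-likelihood into a per-token perplexity, and the difference into a likelihood ratio. The sketch below does that arithmetic on the two values reported above, assuming the log-likelihoods are natural logarithms (the base is not stated here); the variable names are illustrative.

```python
import math

# Average log-likelihoods from the results table.
baseline_avg = -2.99
custom_avg = -1.12

# Assumption: values are natural-log likelihoods averaged per token.
baseline_perplexity = math.exp(-baseline_avg)
custom_perplexity = math.exp(-custom_avg)

# Difference in average log-likelihood, and the matching per-token
# likelihood ratio (custom relative to baseline).
gain = custom_avg - baseline_avg
per_token_ratio = math.exp(gain)

print(f"perplexity: baseline {baseline_perplexity:.1f}, custom {custom_perplexity:.1f}")
print(f"gain: {gain:.2f} per token, ratio {per_token_ratio:.1f}x")
```

Lower perplexity means the model is less surprised by the text, so under this assumption the custom model is noticeably better matched to the held-out snippet.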