Finetuning

Finetuning retrains the last few layers of a model to create a custom model specialized for your data.

When to Finetune

Finetuning a large language model is only necessary when you need to teach it something extremely niche, such as the different gaits of a horse or your company's unique knowledge base. Common knowledge, like the colour of the sky, does not require finetuning. Finetuning is also helpful for generating or understanding data in a specific writing style or format, and it may be helpful regardless of which of our endpoints you are using.

Data Input

Our platform accepts both direct data uploads and links to data, and always deletes your data after the finetune completes. To link to data, you can use any URL that is publicly accessible. If you would like to link to data in a Google Cloud Storage or AWS bucket while keeping the files secured, you can use a signed URL. The easiest way to obtain a signed URL for GCS is to copy the download link in the web UI.
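
If you prefer to generate a signed URL programmatically, here is a minimal sketch using the official google-cloud-storage Python client; the bucket and file names are hypothetical, and the one-hour expiration is just a reasonable default:

from datetime import timedelta

from google.cloud import storage  # pip install google-cloud-storage

def make_signed_url(bucket_name: str, blob_name: str) -> str:
    """Return a V4 signed URL granting read access for one hour."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(hours=1),
        method="GET",
    )

# Hypothetical bucket and object names, for illustration only.
print(make_signed_url("my-finetune-data", "train.txt"))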

Data Size

Reasonable dataset sizes are between 1MB and 500MB of raw text.
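
As a quick local check before uploading, assuming your dataset lives in a single plain-text file (the file name is hypothetical):

import os

size_mb = os.path.getsize("train.txt") / 1e6  # hypothetical file name
assert 1 <= size_mb <= 500, f"dataset is {size_mb:.1f}MB; expected 1-500MB"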

Data Quality

We recommend performing some common checks on data quality (a filtering sketch follows the list) and removing:

  • data with excessive spacing or newlines
  • highly repetitive data
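
Here is a minimal sketch of both checks, treating each training example as a string; the whitespace run lengths, repetition threshold, and sample data are illustrative assumptions, not part of our API:

import re
from collections import Counter

def is_clean(example: str) -> bool:
    """Reject examples with excessive whitespace or heavy repetition."""
    # Excessive spacing or newlines: runs of 4+ spaces/tabs or 3+ newlines.
    if re.search(r"[ \t]{4,}|\n{3,}", example):
        return False
    # Highly repetitive: a single token accounts for over half of all tokens.
    tokens = example.split()
    if tokens and Counter(tokens).most_common(1)[0][1] > len(tokens) / 2:
        return False
    return True

examples = ["a normal sentence.", "spam spam spam spam", "too      many      spaces"]
print([ex for ex in examples if is_clean(ex)])  # ['a normal sentence.']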

Separators

When finetuning, you have the option to pass in a separator that denotes a "unit" of training data. We recommend using a distinctive string, such as --SEPARATOR--, to distinguish training examples from one another. When working with longform text, ensure that the examples produced by splitting on the separator are not too short relative to the length of text you would like to generate.

For example, when finetuning a model to generate haikus, an example input .txt file might look like:

visualizations
of computational graphs
as the thunder storms
--SEPARATOR--
when i die, bury
me under a v3-8
in europe west 4
--SEPARATOR--
i can make you cry
using just five syllables:
anisotropy
--SEPARATOR--
beneath the oak tree
gazing into the distance
watching tensors flow
--SEPARATOR--
these shenanigans
will not be the death of me
scatter_nd will
--SEPARATOR--
torch or tensorflow?
the answer is crystal clear:
it's obviously jax.
--SEPARATOR--
i have on my ribs
attention is all you need
tattooed in red ink
--SEPARATOR--
spin me in weight space
paint me in half precision
we're best in chaos
--SEPARATOR--
the one thing worse than
good ol' anisotropy:
off-by-one error
--SEPARATOR--
seventeen zero
six. zero three seven six
two; is all you need
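
Once the file is assembled, a quick sanity check is to split on the separator and inspect how many examples you have and how short the shortest one is; the file name below is hypothetical:

SEPARATOR = "--SEPARATOR--"

with open("haikus.txt") as f:  # hypothetical file name
    examples = [ex.strip() for ex in f.read().split(SEPARATOR) if ex.strip()]

print(f"{len(examples)} examples")
print(f"shortest example: {min(len(ex) for ex in examples)} characters")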