Finetuning retrains the last few layers of a model to create a custom model specialized for your data.
Finetuning a large language model is only necessary when you need to teach it something niche, such as the different gaits of a horse or your company's unique knowledge base. Common knowledge, like the colour of the sky, does not require finetuning. Finetuning is also helpful for generating or understanding data in a specific writing style or format. Finetuning may be helpful regardless of which of our endpoints you are using.
Our platform lets you either upload your data directly or link to it, and always deletes your data after the finetune completes. To link to data, you can use any publicly accessible URL. If you would like to link to data in a Google Cloud Storage or Amazon S3 bucket while keeping the files secured, you can use a signed URL. The easiest way to obtain a signed URL for GCS is to copy the download link in the web UI.
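If you prefer the command line, a signed URL for a GCS object can also be generated with `gsutil signurl`. This is a sketch: the service-account key file, bucket, and object names below are placeholders for your own.

```shell
# Generate a GET-only signed URL valid for 10 minutes.
# "service-account.json", "my-bucket", and "train.txt" are placeholder names.
gsutil signurl -d 10m -m GET service-account.json gs://my-bucket/train.txt
```

The `-d` flag controls how long the link stays valid; keep it short, since anyone holding the URL can download the object until it expires.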
Reasonable dataset sizes are between 1 MB and 500 MB of raw text.
We recommend performing some common checks on data quality and removing:
- data with excessive spacing or newlines
- highly repetitive data
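The checks above can be scripted. The sketch below filters out examples dominated by whitespace or by repeated lines; the thresholds are illustrative heuristics, not official recommendations, so tune them for your data.

```python
def is_clean(example: str,
             max_whitespace_ratio: float = 0.3,
             min_unique_line_ratio: float = 0.5) -> bool:
    """Heuristic quality checks; thresholds here are illustrative only."""
    if not example.strip():
        return False
    # Reject examples with excessive spacing or newlines.
    whitespace_ratio = sum(c.isspace() for c in example) / len(example)
    if whitespace_ratio > max_whitespace_ratio:
        return False
    # Reject highly repetitive examples: few unique lines relative to total.
    lines = [line for line in example.splitlines() if line.strip()]
    if lines and len(set(lines)) / len(lines) < min_unique_line_ratio:
        return False
    return True

examples = [
    "a short but varied training example\nwith two distinct lines",
    "spam\nspam\nspam\nspam",   # highly repetitive
    "   \n\n\n  x",             # mostly whitespace
]
cleaned = [e for e in examples if is_clean(e)]
```

A single pass with filters like these is usually enough; for large corpora you may also want near-duplicate detection across examples, which this sketch does not attempt.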
When finetuning, there is an option to pass in a separator that denotes a "unit" of training data. We recommend using a special string, such as
--SEPARATOR--, to distinguish training examples from one another. When preparing longform text, make sure that after splitting on the separator, each resulting example is not much shorter than the text you would like the model to generate.
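Assembling a training file then amounts to joining your examples with the separator string. A minimal sketch, where the output filename and the example texts are placeholders:

```python
SEPARATOR = "--SEPARATOR--"

# Placeholder training examples; substitute your own cleaned data.
examples = [
    "First training example, long enough to resemble a real unit of data.",
    "Second training example, also a complete standalone unit.",
]

# Join the examples with the separator on its own line and write to disk.
with open("train.txt", "w") as f:
    f.write(f"\n{SEPARATOR}\n".join(examples))
```

Putting the separator on its own line keeps it easy to spot when you inspect the file, and splitting on `"\n--SEPARATOR--\n"` recovers the original examples exactly.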
For example, when finetuning a model to generate haikus, an example input .txt file might look like: