Our language models understand "tokens" rather than characters or bytes. A token can be part of a word, an entire word, or punctuation. Very common words like "water" have their own unique tokens. A longer, less frequent word might be encoded into 2-3 tokens; for example, "waterfall" is encoded into two tokens, one for "water" and one for "fall". Note that tokenization is sensitive to whitespace and capitalization, so the same word can be encoded differently depending on its case or on whether a space precedes it.
Here are some reference points for estimating how many tokens a text contains:
- one word tends to be about 2-3 tokens
- a verse of a song is about 128 tokens
- this short article has about 300 tokens
The number of tokens per word depends on the complexity of the text. Simple text may approach 1 token per word on average, while complex text that uses less common words may require 3-4 tokens per word on average. Our models can currently process sequences of at most 1024 tokens.
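The per-word ratios above can be turned into a quick back-of-the-envelope check. The following is a minimal sketch, not an exact counter: the function names are hypothetical, words are approximated by splitting on whitespace, and the bounds of 1 and 4 tokens per word are taken from the ranges quoted above. Only the real tokenizer gives exact counts.

```python
MAX_TOKENS = 1024  # sequence limit stated in the text


def estimate_token_range(text, low_per_word=1.0, high_per_word=4.0):
    """Return a rough (low, high) token estimate based on the word count.

    Words are approximated by splitting on whitespace, which is an
    assumption made for illustration only.
    """
    words = len(text.split())
    return int(words * low_per_word), int(words * high_per_word)


def may_exceed_limit(text, limit=MAX_TOKENS):
    """True if the pessimistic (high) estimate is above the limit."""
    _, high = estimate_token_range(text)
    return high > limit


low, high = estimate_token_range("the quick brown fox")
print(low, high)  # 4 16
```

A text of 300 words, for instance, is estimated at 300 to 1200 tokens, so it might not fit and is worth checking in the playground.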
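Our vocabulary of tokens is created using Byte Pair Encoding (BPE), which builds a vocabulary by repeatedly merging the most frequent pair of adjacent symbols into a new symbol.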
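To make the idea concrete, here is a toy, character-level sketch of Byte Pair Encoding. It is not our production tokenizer: the corpus, the function names, and the number of merges are all made up for illustration, and a real vocabulary is learned from far more data and works on bytes rather than characters.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    # Each distinct word starts as a tuple of characters, weighted by count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus: "water" is very common, "fall" is common, "waterfall" is rare.
toy_corpus = ["water"] * 10 + ["fall"] * 5 + ["waterfall"]
merges = learn_bpe(toy_corpus, num_merges=7)

print(encode("waterfall", merges))  # ['water', 'fall']
print(encode("Water", merges))      # ['W', 'a', 't', 'e', 'r']
```

In this toy setup the frequent word "water" becomes a single token, the rare word "waterfall" is split into two, and the capitalized "Water" falls back to individual characters because no learned merge starts with "W". This mirrors the sensitivity to capitalization described earlier.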
The easiest way to determine a good number of tokens to request is to guess and check using our playground. It is common to request more tokens than required and then post-process the result to extract the desired output.
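One simple form of such post-processing is to cut the generated text at the first marker that signals the end of the part you want. The sketch below assumes the output is already available as a string; `trim_completion`, the sample text, and the choice of a blank line as the marker are hypothetical and not part of any API.

```python
def trim_completion(text, stop_sequences=("\n\n",)):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()


# Stand-in for text generated with a generous token budget.
raw_output = "Paris.\n\nQ: What is the capital of Spain?\nA: Madrid."
print(trim_completion(raw_output))  # Paris.
```

If none of the markers appear, the text is returned unchanged apart from trailing whitespace.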