Our language models understand "tokens" rather than characters or bytes. One token can be a part of a word, an entire word, or punctuation. Very common words like "water" will have their own unique tokens. A longer, less frequent word might be encoded into 2-3 tokens; for example, "waterfall" gets encoded into two tokens, one for "water" and one for "fall". Note that tokenization is sensitive to whitespace and capitalization, so the same word may be encoded differently depending on whether it is capitalized or preceded by a space.
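
To make the idea concrete, here is a minimal sketch of splitting text into tokens with a greedy longest-match lookup. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are made up for illustration; this is not our actual tokenizer or vocabulary, and real token boundaries may differ.

```python
# Hypothetical three-entry vocabulary, chosen only to mirror the examples above.
TOY_VOCAB = {"water", "fall", " water"}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text by repeatedly taking the longest vocabulary entry that
    matches at the current position; unknown characters become
    single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("waterfall"))  # ['water', 'fall']
print(toy_tokenize(" water"))     # [' water']
print(toy_tokenize("Water"))      # ['W', 'a', 't', 'e', 'r']
```

In this toy setup the leading space and the capital letter each change the result, which is the kind of sensitivity described above.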
Here are some references to calibrate how many tokens are in a text:
- one word tends to be about 2-3 tokens
- a verse of a song is about 128 tokens
- this short article has about 300 tokens
The number of tokens per word depends on the complexity of the text. Simple text may approach 1 token per word on average, while complex texts may use less common words that require 3-4 tokens per word on average.
Our representation models are currently limited to processing sequences with a maximum length of 1024 tokens. Generation models vary: the Small model has a maximum length of 1024 tokens, while Medium and Large support up to 2048 tokens.
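
As a rough sketch of how these limits might be checked before sending text to a model, the snippet below estimates a token count from a word count and compares it with the maximum lengths stated above. The tokens-per-word ratio is only an estimate, and the dictionary keys and function names are hypothetical labels, not API identifiers. An exact count requires running the actual tokenizer.

```python
import math

# Maximum sequence lengths from the text above; the keys are hypothetical labels.
MAX_TOKENS = {
    "representation": 1024,
    "generation-small": 1024,
    "generation-medium": 2048,
    "generation-large": 2048,
}

def estimate_tokens(text, tokens_per_word=2.5):
    """Estimate the token count from the whitespace-separated word count.
    tokens_per_word is an assumption: closer to 1 for simple text,
    3-4 for complex text."""
    return math.ceil(len(text.split()) * tokens_per_word)

def probably_fits(text, model, tokens_per_word=2.5):
    """Return True if the estimated token count is within the model's limit."""
    return estimate_tokens(text, tokens_per_word) <= MAX_TOKENS[model]

document = "word " * 500  # a 500-word placeholder text
print(estimate_tokens(document))                     # 1250
print(probably_fits(document, "generation-small"))   # False
print(probably_fits(document, "generation-medium"))  # True
```

Because the ratio varies with the text, a conservative (higher) `tokens_per_word` is the safer choice when the estimate is close to a limit.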
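Our vocabulary of tokens is created using Byte Pair Encoding (BPE), a method that builds a vocabulary by repeatedly merging the most frequent pair of adjacent symbols in a training corpus.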
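
For readers unfamiliar with Byte Pair Encoding, the following is a minimal, character-level sketch of the general technique: start from single characters and repeatedly merge the most frequent adjacent pair. It is a toy illustration only. The function names, the tiny corpus, and the number of merges are assumptions, and it does not reproduce how our production vocabulary was built.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    corpus = Counter(tuple(word) for word in words)  # word as symbols -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    """Tokenize a word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe(["water"] * 10 + ["fall"] * 10, num_merges=7)
print(encode("waterfall", merges))  # ['water', 'fall']
```

With this toy corpus, "water" and "fall" each end up as a single token, so the unseen word "waterfall" is encoded as those two tokens.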
The easiest way to determine a good number of tokens to request for a generation is to guess and check using our playground. It is common to request more tokens than required and then post-process the result to extract the desired output.
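
One possible form of that additional processing is sketched below: cut the generated text at the first stop sequence, or drop a trailing sentence that was cut off by the token limit. The helper names and the choice of stop sequence are assumptions for illustration; the right post-processing depends on the prompt format.

```python
def trim_at_stop(text, stop_sequences=("\n\n",)):
    """Keep only the text before the earliest stop sequence, if any appears."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()

def keep_complete_sentences(text):
    """Drop a trailing fragment that does not end in '.', '!' or '?'."""
    end = max(text.rfind(mark) for mark in ".!?")
    return text[: end + 1] if end != -1 else text

generated = "First answer.\n\nSecond, unrelated continuation"
print(trim_at_stop(generated))  # First answer.

truncated = "Tokens are pieces of text. They can be par"
print(keep_complete_sentences(truncated))  # Tokens are pieces of text.
```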