
Tokens

Tokens are the basic units of text that a language model processes. They are typically words or sub-words that have been broken down from the original input text through a process called tokenization.

Explanation

In the context of Large Language Models (LLMs), tokens are the fundamental units the model reads and writes. Tokenization breaks raw text into smaller units the model can represent numerically. Common approaches include word-based tokenization (splitting text on whitespace and punctuation) and subword tokenization (splitting words into smaller, more frequent units such as 'un' or 'ing'). Subword tokenization handles rare or out-of-vocabulary words more gracefully, since any unfamiliar word can still be composed from known pieces.

The choice of tokenization method and the size of the vocabulary significantly affect a model's performance, memory usage, and ability to generalize to new text. LLMs operate by repeatedly predicting the next token in a sequence, which is how they generate coherent, contextually relevant text. The number of tokens in a prompt or generated output also determines the cost of using an LLM, as many services bill by token consumption.
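The subword idea above can be sketched with a greedy longest-match tokenizer. This is a simplified illustration, not how any particular model tokenizes: the vocabulary here is a toy example, and real tokenizers (e.g. BPE or SentencePiece) learn their vocabularies from data and use more sophisticated merge rules.

```python
# Toy subword vocabulary -- an illustrative assumption, not a real model's vocabulary.
TOY_VOCAB = {"un", "believ", "able", "token", "ization", "the", "a"}

def tokenize(word, vocab):
    """Greedy longest-match subword tokenization sketch."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a match is found.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Fall back to a single character for out-of-vocabulary pieces.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unbelievable", TOY_VOCAB))   # → ['un', 'believ', 'able']
print(tokenize("tokenization", TOY_VOCAB))   # → ['token', 'ization']
```

Note that the rare word "unbelievable" is never stored whole; it is assembled from three frequent pieces. Counting the returned list (e.g. `len(tokens)`) is also the simplest way to see how prompt length translates into billable tokens.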

Related Terms