
Tokenization

Tokenization is the process of breaking down a text string into smaller units called tokens. These tokens can be words, parts of words, or even characters, depending on the specific tokenization method used.

Explanation

In Natural Language Processing (NLP), and particularly in Large Language Models (LLMs), tokenization is the crucial first step: it converts raw text into a numerical representation that the model can process. Different tokenization algorithms exist, each with its own strengths and weaknesses. Common approaches include word-based tokenization (splitting text on whitespace and punctuation), subword tokenization (breaking words into smaller, more frequent units such as 'un', 'ing', or 'est', which helps handle out-of-vocabulary words), and character-based tokenization (treating each character as a token).

The choice of tokenization method can significantly affect an LLM's performance. Subword tokenization, for example, keeps the vocabulary compact while still representing rare or unseen words. Each token is then mapped to a unique integer ID in the model's vocabulary, enabling the mathematical operations and pattern recognition the model performs. Efficient tokenization is also critical for optimizing LLM inference speed and memory usage, since the number of tokens determines how much computation a given input requires.
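The subword approach described above can be sketched as a greedy longest-match over a vocabulary. This is a minimal illustration with a hand-picked toy vocabulary, not a production tokenizer; real LLM tokenizers learn their vocabularies from data (e.g. via byte-pair encoding) and use more sophisticated matching:

```python
def tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position,
    falling back to single characters for out-of-vocabulary input."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking until a match.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry matched: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens

# Toy vocabulary chosen for illustration only.
vocab = {"un", "believ", "able", "token", "ization", " "}

tokens = tokenize("unbelievable tokenization", vocab)
# → ['un', 'believ', 'able', ' ', 'token', 'ization']

# As in a real model, each token maps to an integer ID in the vocabulary.
token_to_id = {tok: i for i, tok in enumerate(sorted(vocab))}
ids = [token_to_id[t] for t in tokens]
```

Note how "unbelievable" never appears in the vocabulary, yet it is still represented losslessly as three known subwords; fully unknown text degrades gracefully to character tokens rather than failing.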
