Natural Language Processing

Tokenization

The process of breaking down a sequence of text into smaller units, such as words, characters, or subwords, called tokens.

Explanation

Tokenization is a fundamental step in Natural Language Processing (NLP) in which raw text is converted into a format that machine learning models can process. It involves splitting strings into individual components (tokens), which are then mapped to numerical IDs and, ultimately, to vector representations (embeddings). Common methods include word-level, character-level, and subword tokenization (such as Byte Pair Encoding or WordPiece). Effective tokenization helps models manage vocabulary size, handle out-of-vocabulary words, and capture linguistic nuances.
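The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names (`tokenize`, `build_vocab`, `encode`) and the `<unk>` placeholder are our own choices, and real systems typically use subword algorithms rather than this simple word-level split.

```python
import re

def tokenize(text):
    # Word-level tokenization: lowercase, then split into
    # runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    # Map each unique token to an integer id; reserve 0 for
    # the <unk> placeholder used for out-of-vocabulary tokens.
    vocab = {"<unk>": 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Convert tokens to numerical ids, falling back to <unk>
    # for anything not seen when the vocabulary was built.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

tokens = tokenize("Tokenization breaks text into tokens.")
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
```

A word absent from the vocabulary (for example, "models") encodes to the `<unk>` id, which is exactly the out-of-vocabulary problem that subword methods like Byte Pair Encoding mitigate by breaking rare words into smaller, known pieces.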

Related Terms