LLM Architectures

Transformers

A Transformer is a deep learning architecture that uses self-attention mechanisms to process entire sequences of data simultaneously rather than sequentially. It serves as the foundational framework for nearly all modern large language models, enabling them to capture relationships between words regardless of how far apart they appear in a text.

Explanation

Introduced by Google researchers in the 2017 paper 'Attention Is All You Need,' the Transformer architecture revolutionized natural language processing by replacing recurrent architectures such as RNNs and LSTMs. Its core innovation is self-attention, which lets the model weigh the importance of every part of the input relative to every other part. Because it eliminates sequential processing, the Transformer can be massively parallelized on GPUs during training, which made it practical to scale models to hundreds of billions of parameters and beyond.

The original architecture pairs an encoder, which processes the input, with a decoder, which generates the output, though many modern generative models (like GPT) use a decoder-only variant. By connecting every position directly to every other position, the design also sidesteps the vanishing-gradient problem that limited how much context recurrent networks could retain, allowing the model to maintain context over much longer sequences than previous architectures.
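The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention, not a full Transformer: the function name, the toy sizes, and the reuse of the same matrix for queries, keys, and values are all simplifications for clarity (in a real model, Q, K, and V come from separate learned linear projections). The optional causal mask shows the decoder-only behavior mentioned above, where each token may attend only to itself and earlier tokens.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Compute softmax(Q K^T / sqrt(d_k)) V over a sequence."""
    d_k = Q.shape[-1]
    # Pairwise relevance of every position to every other position.
    scores = Q @ K.T / np.sqrt(d_k)          # shape: (seq, seq)
    if causal:
        # Decoder-only masking: block attention to future positions.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of all (visible) value vectors.
    return weights @ V

# Toy example: a sequence of 4 tokens with embedding dimension 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# Reusing x as Q, K, and V keeps the sketch self-contained.
out = scaled_dot_product_attention(x, x, x, causal=True)
print(out.shape)  # (4, 8): one output vector per input token
```

Note that all four output vectors are produced in one matrix multiplication rather than one step at a time, which is the parallelism that lets Transformers train efficiently on GPUs.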

Related Terms