Architecture / LLMs
Mixture of Experts (MoE)
Mixture of Experts (MoE) is a machine learning architecture that uses a sparse ensemble of specialized sub-networks, known as 'experts,' to process data. Instead of activating the entire neural network for every input, a gating mechanism selectively routes each input to the most relevant experts, allowing for massive model scaling with efficient computation.
Explanation
In a traditional dense model, every parameter is used for every input token. In contrast, an MoE model consists of multiple 'experts' (usually feed-forward layers) and a gating network (or router). When a token enters the MoE layer, the router determines which top-k experts (often just one or two out of many) are best suited to handle that specific piece of information. This process, known as conditional computation, enables models to increase their total parameter count and 'knowledge capacity' significantly without a linear increase in the floating-point operations (FLOPs) required for inference. MoE is a critical architectural choice for state-of-the-art Large Language Models (LLMs) because it allows for the efficiency of a smaller model during runtime while maintaining the reasoning capabilities of a much larger one.
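The routing mechanism described above can be sketched in a few lines. The following is a minimal NumPy illustration, not a production implementation: the dimensions, the number of experts, and the randomly initialized weights are all hypothetical, and each 'expert' is reduced to a single linear map rather than a full feed-forward block. It shows the essential pattern: the router scores every expert, only the top-k are evaluated, and their outputs are combined with renormalized gate weights.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2  # hypothetical sizes for illustration

# Randomly initialized stand-ins for trained parameters:
# a linear router and one linear map per "expert".
W_router = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(token):
    # Router: one gate probability per expert for this token.
    gate = softmax(token @ W_router)
    # Conditional computation: select only the top-k experts.
    chosen = np.argsort(gate)[-top_k:]
    # Renormalize the gate weights over the selected experts.
    weights = gate[chosen] / gate[chosen].sum()
    # Only the chosen experts run; the others cost no FLOPs here.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d_model)
out = moe_layer(token)
```

Note that the total parameter count grows with `n_experts`, while the per-token compute depends only on `top_k`, which is exactly the decoupling of capacity from inference cost described above.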