
Distillation

Distillation, also known as knowledge distillation, is a model compression technique in which a smaller, more efficient model (the student) is trained to mimic the behavior of a larger, more complex model (the teacher). The goal is to transfer the knowledge learned by the teacher to the student, enabling the student to achieve comparable performance with significantly fewer parameters and lower computational cost.

Explanation

Knowledge distillation trains a student model on the output probabilities, or "soft targets," produced by a pre-trained teacher model. Rather than learning only from the ground-truth labels, the student learns to replicate the teacher's probabilistic output distribution, which often carries richer information about inter-class relationships and the model's uncertainty. In practice, this typically means using the teacher's logits (pre-softmax outputs) with a temperature parameter that softens the probabilities and exposes more of this information to the student during training.

The student's loss function usually combines two terms: the cross-entropy loss between the student's predictions and the true labels, and a distillation loss measuring the difference between the student's and teacher's output distributions (commonly the Kullback-Leibler divergence).

Distillation is valuable for deploying models on resource-constrained devices, accelerating inference, and improving generalization. It is frequently applied to large language models to create smaller, faster versions that retain much of the original model's capabilities.
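The combined loss described above can be sketched in NumPy. This is a minimal, illustrative implementation of the common Hinton-style formulation; the function names, the `alpha` weighting between the two terms, and the default temperature are assumptions, not a reference implementation:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T yields a softer distribution.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    # Hard-label term: cross-entropy between student predictions and ground truth.
    p_student = softmax(student_logits)
    ce = -np.log(p_student[true_label])

    # Soft-target term: KL divergence between temperature-softened teacher
    # and student distributions. The T**2 factor compensates for the 1/T
    # scaling of gradients, as suggested in the original distillation paper.
    p_teacher_T = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    kl = np.sum(p_teacher_T * (np.log(p_teacher_T) - np.log(p_student_T)))

    # alpha balances the hard-label and soft-target objectives (an assumed
    # convention; some implementations weight the terms differently).
    return alpha * ce + (1.0 - alpha) * (T ** 2) * kl
```

When the student's logits match the teacher's exactly, the KL term vanishes and only the hard-label cross-entropy remains; during training, the gradient of this combined loss pulls the student toward both the true labels and the teacher's softened distribution.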
