
Distillation

Distillation, also known as knowledge distillation, is a model compression technique in which a smaller, more efficient model (the student) is trained to mimic the behavior of a larger, more complex model (the teacher). The goal is to transfer the knowledge learned by the teacher to the student, enabling the student to achieve comparable performance with significantly fewer parameters and lower computational cost.

Explanation

Knowledge distillation trains a student model on the output probabilities, or "soft targets," produced by a pre-trained teacher model. Rather than learning only from the ground-truth labels, the student learns to replicate the teacher's probabilistic output distribution, which often carries richer information about inter-class relationships and the model's uncertainty. In practice, this typically means using the teacher's logits (pre-softmax outputs) with a temperature parameter that softens the probabilities and exposes more of this information to the student during training.

The student's loss function usually combines two terms: the cross-entropy loss between the student's predictions and the true labels, and a distillation loss measuring the difference between the student's and teacher's output distributions (commonly the Kullback-Leibler divergence).

Distillation is valuable for deploying models on resource-constrained devices, accelerating inference, and improving generalization. It is frequently applied to large language models to create smaller, faster versions that retain much of the original model's capabilities.
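The combined loss described above can be sketched in NumPy. This is a minimal, illustrative implementation of the common Hinton-style formulation; the function names, the `alpha` weighting between the two terms, and the default temperature are assumptions, not a reference implementation:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T yields a softer distribution.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    # Hard-label term: cross-entropy between student predictions and ground truth.
    p_student = softmax(student_logits)
    ce = -np.log(p_student[true_label])

    # Soft-target term: KL divergence between temperature-softened teacher
    # and student distributions. The T**2 factor compensates for the 1/T
    # scaling of gradients, as suggested in the original distillation paper.
    p_teacher_T = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    kl = np.sum(p_teacher_T * (np.log(p_teacher_T) - np.log(p_student_T)))

    # alpha balances the hard-label and soft-target objectives (an assumed
    # convention; some implementations weight the terms differently).
    return alpha * ce + (1.0 - alpha) * (T ** 2) * kl
```

When the student's logits match the teacher's exactly, the KL term vanishes and only the hard-label cross-entropy remains; during training, the gradient of this combined loss pulls the student toward both the true labels and the teacher's softened distribution.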
