Knowledge Distillation¶

Distill “knowledge” from a large ANN into a small ANN
- Larger DNNs are easier to train
- Smaller DNNs are easier to deploy
Targets
- Hard targets: one-hot labels; no info about the wrong classes
- Soft targets: full class probabilities; carry info about the wrong classes (see the sketch below)
    - Obtained via expert annotation, or
    - From a trained NN (the teacher)
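
A minimal sketch of the difference, assuming a 3-class problem with made-up teacher logits (PyTorch):

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one image over 3 classes (values made up).
teacher_logits = torch.tensor([4.0, 1.5, -2.0])

# Hard target: one-hot label -- says nothing about the wrong classes.
hard_target = torch.tensor([1.0, 0.0, 0.0])

# Soft target: temperature-softened softmax of the teacher's logits.
# The non-zero mass on the wrong classes encodes how similar they are
# to the correct one.
T = 5.0
soft_target = F.softmax(teacher_logits / T, dim=0)
print(soft_target)  # approx. [0.52, 0.32, 0.16] vs. the hard [1, 0, 0]
```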

- Training
    - Use softmax with temperature, usually \(T=5\)
    - Loss function: distillation loss + student loss (sketched below)
- Inference
    - Use \(T=1\), i.e. plain softmax
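
A minimal sketch of the combined loss, assuming the usual KL-divergence form of the distillation term and an assumed weighting `alpha`; `distillation_loss` is a hypothetical helper name (PyTorch):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=5.0, alpha=0.5):
    # Distillation loss: KL divergence between the temperature-softened
    # teacher and student distributions.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    # Gradients of the soft term scale as 1/T^2, so multiply by T^2
    # to keep both terms on a comparable scale.
    distill = distill * T * T

    # Student loss: ordinary cross-entropy against the hard labels (T=1).
    student = F.cross_entropy(student_logits, labels)

    return alpha * distill + (1 - alpha) * student

# Usage inside a training step (teacher frozen, no gradients):
#   with torch.no_grad():
#       teacher_logits = teacher(x)
#   loss = distillation_loss(student(x), teacher_logits, y)
```

At inference only the student is kept and used with a plain softmax (\(T=1\)).
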
Teacher targets can come from an ensemble of the following (combined as in the sketch after this list):
- multiple initializations
- multiple teacher architectures
- Specialists & generalists
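
One common way to combine them, assumed here, is to average the teachers' temperature-softened probabilities; `ensemble_soft_targets` is a hypothetical helper name:

```python
import torch
import torch.nn.functional as F

def ensemble_soft_targets(teacher_logits_list, T=5.0):
    # teacher_logits_list: list of (batch, num_classes) logit tensors,
    # e.g. from different initializations or architectures.
    probs = [F.softmax(logits / T, dim=-1) for logits in teacher_logits_list]
    # Average the softened probabilities into a single soft target.
    return torch.stack(probs, dim=0).mean(dim=0)
```
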
Distillation Types¶
| Type | Description |
|---|---|
| Offline | Pre-trained teacher network |
| Collaborative/mutual learning | Teacher & student trained simultaneously |
| Self-distillation | e.g. progressive hierarchical inference |
Distillation Algorithms¶
| Algorithm | Description |
|---|---|
| Adversarial | Teacher also acts as the discriminator in a GAN, supplementing the training data to “teach” the true data distribution |
| Multi-Teacher | |
| Cross-Modal | Teacher trained on RGB distills knowledge to a student learning on heat maps; unlabeled image pairs are needed |
| Graph-Based | |
| Attention-Based | |
| Data-Free | |
| Quantized | Full-precision network transfers knowledge to a quantized network |
| Lifelong | |
| NAS-Based | |
Knowledge Types¶
| Knowledge type | Description | Notes |
|---|---|---|
| Response-based | Output probabilities as soft targets | Most common |
| Feature-based | Match the output/weights of one or more “hint layers” with an MSE loss, or minimize the difference in attention maps between student & teacher | Sketched below |
| Relation-based | Correlations between feature maps, e.g. the Gramian | Sketched below |
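
A minimal sketch of the feature-based and relation-based losses, assuming student and teacher feature maps of matching spatial size (and, for the Gramian, matching channel count); `regressor` is a hypothetical adaptation layer and the function names are made up:

```python
import torch
import torch.nn.functional as F

def hint_loss(student_feat, teacher_feat, regressor):
    # Feature-based: match a student layer to a teacher "hint layer".
    # regressor (e.g. a 1x1 conv) maps the student's channels to the teacher's.
    return F.mse_loss(regressor(student_feat), teacher_feat)

def gram_matrix(feat):
    # Relation-based: channel-channel correlations (Gramian) of a feature map.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (h * w)

def relation_loss(student_feat, teacher_feat):
    # Match the correlation structure rather than the raw activations.
    return F.mse_loss(gram_matrix(student_feat), gram_matrix(teacher_feat))
```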


