Knowledge Distillation¶

Distill “knowledge” from a large ANN into a small ANN
- Larger DNNs are easier to train
- Smaller DNNs are easier to deploy
Targets
- Hard targets: one-hot labels; no info about the wrong classes
- Soft targets: full class probabilities; carry info about the wrong classes (see the sketch below)
    - Obtained via expert annotation, or
    - From a trained NN (the teacher)
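
A minimal sketch of the difference, assuming a 3-class problem with made-up teacher logits (PyTorch):

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one image over 3 classes (values made up).
teacher_logits = torch.tensor([4.0, 1.5, -2.0])

# Hard target: one-hot label -- says nothing about the wrong classes.
hard_target = torch.tensor([1.0, 0.0, 0.0])

# Soft target: temperature-softened softmax of the teacher's logits.
# The non-zero mass on the wrong classes encodes how similar they are
# to the correct one.
T = 5.0
soft_target = F.softmax(teacher_logits / T, dim=0)
print(soft_target)  # approx. [0.52, 0.32, 0.16] vs. the hard [1, 0, 0]
```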

- Training
    - Use softmax with temperature, usually \(T=5\)
    - Loss function: distillation loss + student loss (sketched below)
- Inference
    - Use \(T=1\), i.e. plain softmax
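
A minimal sketch of the combined loss, assuming the usual KL-divergence form of the distillation term and an assumed weighting `alpha`; `distillation_loss` is a hypothetical helper name (PyTorch):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=5.0, alpha=0.5):
    # Distillation loss: KL divergence between the temperature-softened
    # teacher and student distributions.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    # Gradients of the soft term scale as 1/T^2, so multiply by T^2
    # to keep both terms on a comparable scale.
    distill = distill * T * T

    # Student loss: ordinary cross-entropy against the hard labels (T=1).
    student = F.cross_entropy(student_logits, labels)

    return alpha * distill + (1 - alpha) * student

# Usage inside a training step (teacher frozen, no gradients):
#   with torch.no_grad():
#       teacher_logits = teacher(x)
#   loss = distillation_loss(student(x), teacher_logits, y)
```

At inference only the student is kept and used with a plain softmax (\(T=1\)).
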
Teacher targets can come from an ensemble of the following (combined as in the sketch after this list):
- multiple initializations
- multiple teacher architectures
- Specialists & generalists
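
One common way to combine them, assumed here, is to average the teachers' temperature-softened probabilities; `ensemble_soft_targets` is a hypothetical helper name:

```python
import torch
import torch.nn.functional as F

def ensemble_soft_targets(teacher_logits_list, T=5.0):
    # teacher_logits_list: list of (batch, num_classes) logit tensors,
    # e.g. from different initializations or architectures.
    probs = [F.softmax(logits / T, dim=-1) for logits in teacher_logits_list]
    # Average the softened probabilities into a single soft target.
    return torch.stack(probs, dim=0).mean(dim=0)
```
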
Distillation Types¶
| Type | Description |
|---|---|
| Offline | Pre-trained teacher network |
| Collaborative/mutual learning | Teacher & student trained simultaneously |
| Self-distillation | e.g. progressive hierarchical inference |
Distillation Algorithms¶
| Algorithm | Description |
|---|---|
| Adversarial | Teacher also acts as the discriminator in a GAN, supplementing the training data to “teach” the true data distribution |
| Multi-Teacher | |
| Cross-Modal | Teacher trained on RGB distills knowledge to a student learning on heat maps; unlabeled image pairs are needed |
| Graph-Based | |
| Attention-Based | |
| Data-Free | |
| Quantized | Full-precision network transfers knowledge to a quantized network |
| Lifelong | |
| NAS-Based | |
Knowledge Types¶
| Knowledge type | Description | Notes |
|---|---|---|
| Response-based | Output probabilities as soft targets | Most common |
| Feature-based | Match the output/weights of one or more “hint layers” with an MSE loss, or minimize the difference in attention maps between student & teacher | Sketched below |
| Relation-based | Correlations between feature maps, e.g. the Gramian | Sketched below |
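
A minimal sketch of the feature-based and relation-based losses, assuming student and teacher feature maps of matching spatial size (and, for the Gramian, matching channel count); `regressor` is a hypothetical adaptation layer and the function names are made up:

```python
import torch
import torch.nn.functional as F

def hint_loss(student_feat, teacher_feat, regressor):
    # Feature-based: match a student layer to a teacher "hint layer".
    # regressor (e.g. a 1x1 conv) maps the student's channels to the teacher's.
    return F.mse_loss(regressor(student_feat), teacher_feat)

def gram_matrix(feat):
    # Relation-based: channel-channel correlations (Gramian) of a feature map.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (h * w)

def relation_loss(student_feat, teacher_feat):
    # Match the correlation structure rather than the raw activations.
    return F.mse_loss(gram_matrix(student_feat), gram_matrix(teacher_feat))
```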


