In a joint optimization model, knowledge distillation transfers information from a large, complex teacher model to small, lightweight student models. Dynamic knowledge distillation allows the student models to learn from the teacher model on the fly. However, the performance of a joint optimization model that uses dynamic knowledge distillation suffers if the teacher model carries too much noise from the negative labels or too little information about them. This disclosure describes techniques to implement dynamic knowledge distillation by using temperature to control how much information about negative labels is transmitted from a teacher model to a student model in a joint optimization model. Setting the temperature high transmits more information about the negative labels, while setting it low suppresses noise from the negative labels in the teacher model.
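The effect of temperature on negative labels can be illustrated with a minimal sketch. It assumes the conventional temperature-scaled softmax and a KL-divergence distillation loss; the disclosure does not specify either, and the function names, logits, and temperature values below are hypothetical, chosen only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities, softened by the given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature):
    """KL divergence from the teacher's softened distribution to the student's.

    Assumed loss form; the disclosure does not state which loss is used.
    """
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical teacher logits for three classes; class 0 is the positive label.
teacher_logits = [6.0, 2.0, 1.0]

low_t = softmax_with_temperature(teacher_logits, 0.5)
high_t = softmax_with_temperature(teacher_logits, 4.0)

# Probability mass the teacher assigns to the negative labels (classes 1 and 2).
negative_mass_low = sum(low_t[1:])
negative_mass_high = sum(high_t[1:])

print(f"negative-label mass at T=0.5: {negative_mass_low:.4f}")
print(f"negative-label mass at T=4.0: {negative_mass_high:.4f}")
```

At the higher temperature the teacher's output spreads more probability onto the negative labels, so the student sees more of their relative structure; at the lower temperature that mass shrinks toward zero, which mutes whatever noise those labels carry.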

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.