Despite significant accuracy improvements in convolutional neural network (CNN) based object detectors, these models often require prohibitive runtimes to process an image for real-time applications. State-of-the-art models often use very deep networks with a large number of floating point operations. Efforts such as model compression learn compact models with fewer parameters, but with much reduced accuracy. In this work, we propose a new framework to learn compact and fast object detection networks with improved accuracy using knowledge distillation [20] and hint learning [34]. Although knowledge distillation has demonstrated excellent improvements for simpler classification setups, the complexity of detection poses new challenges in the form of regression, region proposals and less voluminous labels. We address this through several innovations such as a weighted cross-entropy loss to address class imbalance, a teacher bounded loss to handle the regression component and adaptation layers to better learn from intermediate teacher distributions. We conduct comprehensive empirical evaluation with different distillation configurations over multiple datasets including PASCAL, KITTI, ILSVRC and MS-COCO. Our results show consistent improvement in accuracy-speed trade-offs for modern multi-class detection models.




On the other hand, seminal works on knowledge distillation show that a shallow or compressed model trained to mimic the behavior of a deeper or more complex model can recover some or all of the accuracy drop [3, 20, 34]. However, those results are shown only for problems such as classification, using simpler networks without strong regularization such as dropout.

Applying distillation techniques to multi-class object detection, in contrast to image classification, is challenging for several reasons. First, the performance of detection models suffers more degradation with compression, since detection labels are more expensive and therefore usually less voluminous. Second, knowledge distillation was proposed for classification under the assumption that each class is equally important, which does not hold for detection, where the background class is far more prevalent. Third, detection is a more complex task that combines elements of both classification and bounding box regression. Finally, an added challenge is that we focus on transferring knowledge within the same domain (images of the same dataset) with no additional data or labels, as opposed to other works that might rely on data from other domains (such as high-quality and low-quality image domains, or image and depth domains).






To address the above challenges, we propose a method to train fast models for object detection with knowledge distillation. Our contributions are four-fold: 

• We propose an end-to-end trainable framework for learning compact multi-class object detection models through knowledge distillation (Section 3.1). To the best of our knowledge, this is the first successful demonstration of knowledge distillation for the multi-class object detection problem.

• We propose new losses that effectively address the aforementioned challenges. In particular, we propose a weighted cross-entropy loss for classification that accounts for the imbalance in the impact of misclassification for the background class as opposed to object classes (Section 3.2), a teacher bounded regression loss for knowledge distillation (Section 3.3) and adaptation layers for hint learning that allow the student to better learn from the distribution of neurons in intermediate layers of the teacher (Section 3.4).

• We perform comprehensive empirical evaluation using multiple large-scale public benchmarks. Our study demonstrates the positive impact of each of the above novel design choices, resulting in significant improvement in object detection accuracy using compressed fast networks, consistently across all benchmarks (Sections 4.1 – 4.3).

• We present insights into the behavior of our framework by relating it to the generalization and under-fitting problems (Section 4.4).




Related Works



Overall Structure

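For the backbone network, the authors distill with the hint learning approach from FitNets: adaptation layers are added so that the student's feature map dimensions match the teacher's.

For the classification output, a weighted cross-entropy loss is used to address the severe class imbalance.

For the regression task, in addition to the original smooth L1 loss, the authors propose a teacher bounded regression loss, which treats the teacher's regression prediction as an upper bound: when the student's regression result is better than the teacher's, this loss is 0.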


Knowledge Distillation for Classification with Imbalanced Classes
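
As a rough illustration of the weighted cross-entropy idea, the sketch below computes a soft-target cross-entropy between teacher and student class probabilities in which each class term is scaled by a weight, then mixes it with the usual hard-label cross-entropy. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the mixing parameter `mu`, and the particular weight values (a larger weight on the background class at index 0) are assumptions chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_soft_cross_entropy(student_logits, teacher_logits, class_weights):
    """Soft-target loss: -sum_c w_c * P_t(c) * log P_s(c)."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return -sum(w * pt * math.log(ps)
                for w, pt, ps in zip(class_weights, p_t, p_s))


def hard_cross_entropy(student_logits, label):
    """Standard cross-entropy against the ground-truth class index."""
    return -math.log(softmax(student_logits)[label])


def classification_loss(student_logits, teacher_logits, label,
                        class_weights, mu=0.5):
    """Mix of hard-label and weighted soft-label terms (mu is hypothetical)."""
    hard = hard_cross_entropy(student_logits, label)
    soft = weighted_soft_cross_entropy(student_logits, teacher_logits,
                                       class_weights)
    return mu * hard + (1.0 - mu) * soft


if __name__ == "__main__":
    # Class 0 plays the role of "background"; the weight 1.5 is an assumption.
    student = [2.0, 0.5, -1.0]
    teacher = [3.0, 0.0, -2.0]
    print(classification_loss(student, teacher, 0, [1.5, 1.0, 1.0]))
```

With all weights set to 1 the soft term reduces to the ordinary distillation cross-entropy; raising the background weight increases the penalty for disagreeing with the teacher on background proposals.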



Knowledge Distillation for Regression with Teacher Bounds
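
The teacher bounded regression loss described in these notes can be sketched as follows: the student is penalized only while its box error is not better than the teacher's, and the penalty is added to the usual smooth L1 loss against the ground truth. This is a minimal pure-Python sketch under stated assumptions: the squared L2 error measure, the `margin` parameter, the weight `nu`, and all function names are illustrative choices, not details confirmed by the notes.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two box-offset vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smooth_l1(pred, target):
    """Smooth L1 loss summed over coordinates (threshold of 1)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def teacher_bounded_loss(student_box, teacher_box, target_box, margin=0.0):
    """Penalize the student only if it is not better than the teacher.

    Returns the student's squared error when
    student_error + margin > teacher_error, and 0 otherwise.
    """
    s_err = squared_l2(student_box, target_box)
    t_err = squared_l2(teacher_box, target_box)
    return s_err if s_err + margin > t_err else 0.0


def regression_loss(student_box, teacher_box, target_box,
                    nu=0.5, margin=0.0):
    """Smooth L1 to the ground truth plus the weighted bounded term."""
    return (smooth_l1(student_box, target_box)
            + nu * teacher_bounded_loss(student_box, teacher_box,
                                        target_box, margin))


if __name__ == "__main__":
    target = [0.0, 0.0, 0.0, 0.0]
    teacher = [0.5, 0.0, 0.0, 0.0]
    # Student already beats the teacher: the bounded term vanishes.
    print(teacher_bounded_loss([0.1, 0.0, 0.0, 0.0], teacher, target))
    # Student is worse than the teacher: the bounded term is active.
    print(teacher_bounded_loss([1.0, 0.0, 0.0, 0.0], teacher, target))
```

The design point is that the teacher's regression output is used as a bound rather than as a target, so a student that is already more accurate than the teacher is not pulled back toward the teacher's prediction.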


Hint Learning with Feature Adaptation


The hint learning loss in FitNets
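
To make the role of the adaptation layer concrete, the sketch below maps a student feature map to the teacher's channel dimension with a per-location linear map (the equivalent of a 1x1 convolution) and then measures a squared L2 distance to the teacher's feature map. It is a minimal pure-Python sketch: the list-of-lists feature layout, the function names, and the choice of a summed squared L2 distance are assumptions for illustration, and in practice the adaptation weights would be learned jointly with the student.

```python
def adapt(student_features, weight, bias):
    """Apply a per-location linear map (like a 1x1 convolution).

    student_features: list of spatial locations, each a list of C_s values.
    weight: C_t rows of C_s values; bias: C_t values.
    Returns features with C_t channels at every location.
    """
    return [
        [sum(w * x for w, x in zip(row, feat)) + b
         for row, b in zip(weight, bias)]
        for feat in student_features
    ]


def hint_l2_loss(adapted_features, teacher_features):
    """Sum of squared differences between adapted student and teacher."""
    return sum(
        (a - t) ** 2
        for a_vec, t_vec in zip(adapted_features, teacher_features)
        for a, t in zip(a_vec, t_vec)
    )


if __name__ == "__main__":
    # Student has 2 channels, teacher has 3 (hypothetical toy sizes).
    student = [[1, 2], [3, 4]]
    weight = [[1, 0], [0, 1], [1, 1]]
    bias = [0, 0, 0]
    teacher = [[1, 2, 4], [3, 4, 7]]
    adapted = adapt(student, weight, bias)
    print(adapted, hint_l2_loss(adapted, teacher))
```

Without the adaptation step the two feature maps could not be compared directly when their channel counts differ, which is the dimension-matching role described in these notes.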


We propose a novel framework for learning compact and fast CNN based object detectors with knowledge distillation. Highly complicated detector models are used as teachers to guide the learning process of efficient student models. Combining knowledge distillation and the hint framework with our newly proposed loss functions, we demonstrate consistent improvements over various experimental setups. Notably, the compact models trained with our learning framework execute significantly faster than the teachers with almost no loss of accuracy on the PASCAL dataset. Our empirical analysis reveals the presence of an under-fitting issue in object detector learning, which could provide useful insights for further advances in the field.






