SSD-Single Shot Detector

Contents

The SSD model
- Main improvements
- Model description
Training
- Choosing scales and aspect ratios for default boxes
- Matching strategy
- Training objective
- Hard negative mining
- Data augmentation
Experimental results
- Base network parameters
- PASCAL VOC2007
- Model ablation study
- PASCAL VOC2012
- COCO
- Inference speed comparison

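Earlier posts covered two classic families of object detectors: the one-stage YOLO series and the two-stage R-CNN series. The two have clearly distinct characteristics:

The R-CNN models are complex and computationally heavy, but they reach relatively high accuracy. Faster R-CNN can reach mAP = 78.8.
The YOLO series drops the region-proposal step and predicts object boxes and classes directly from the network. It is much faster than R-CNN, at some cost in accuracy.

SSD, proposed in 2016, set out to combine the speed of YOLO with the accuracy of R-CNN. It seems that YOLOv2 (YOLO9000) had already surpassed SSD, and SSD itself has since had a number of improved variants. This post covers the original SSD.

As a model that balances accuracy against inference speed (FPS), SSD reports: For 300 × 300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X. That is, for a 300 × 300 input image it reaches 74.3% mAP while still running at 59 FPS.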

The SSD model

SSD can be viewed as a one-stage detector: rather than splitting the work into region proposals plus prediction, it predicts boxes and classes directly with a set of convolutional layers. It borrows the anchor concept from Faster R-CNN and also adopts a pyramid idea in the spirit of SPP or FPN, predicting from feature maps at multiple resolutions, so that both small and large objects can be detected well. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.

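Main improvements

The first section of the paper lists several key improvements: Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales.

This passage names three improvements:

using a small convolutional filter to predict: convolution kernels, used as convolutional layers, predict the 4 box values and the class. YOLO and R-CNN mostly predict through a fully connected layer, whereas SSD adds a number of convolutional layers for prediction.
using separate predictors (filters) for different aspect ratio detections: this is much like anchors. At each location of a feature map, predictions are made for boxes of several different shapes.
perform detection at multiple scales: different convolutional layers yield feature maps of different scales and resolutions, which is the pyramid idea of SPP and FPN.

These improvements are discussed in detail below.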

At the end of the first section the paper lists its contributions, that is, the effect of these improvements:

We introduce SSD, a single-shot detector for multiple categories that is faster than the previous state-of-the-art for single shot detectors (YOLO), and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN). In other words, a good balance between speed and accuracy.
The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps. Prediction uses small convolution kernels, namely 3 × 3 kernels.
To achieve high detection accuracy we produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio. This is multi-scale prediction.
These design features lead to simple end-to-end training and high accuracy, even on low resolution input images, further improving the speed vs accuracy trade-off.
Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, COCO, and ILSVRC and are compared to a range of recent state-of-the-art approaches.

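Model description

The paper gives a figure comparing the SSD and YOLO v1 architectures:

The paper's basic description of the network is:
The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. The early network layers are based on a standard architecture used for high quality image classification (truncated before any classification layers), which we will call the base network.

As I understand it, the SSD model can be split into three parts:
- The first part is convolutional feature extraction, using VGG-16 as the backbone without its last three fully connected layers. It ends at the layer labeled Conv5_3 in the figure, which is the last convolutional layer of the fifth group in VGG-16.
- The second part is the six feature layers used for prediction. The added layers pair 1 × 1 convolutions for channel reduction with 3 × 3 convolutions, and the spatial resolution shrinks from layer to layer (through VGG's pooling up to Conv7, and through stride-2 or unpadded 3 × 3 convolutions in the added layers of the paper's figure). This is the part referred to above: prediction with 3 × 3 kernels, plus the SPP/FPN-style pyramid.
- The last part is non-maximum suppression (NMS), which removes the large number of useless bounding boxes.
Convolutional predictors for detection. In the comparison figure above, YOLO ends with a fully connected layer that produces a 7 × 7 × 30 output, in which each 1 × 1 × 30 vector describes two bounding boxes, so 98 boxes are predicted in total. SSD, as the paper notes, does not predict with a fully connected layer but with small convolution kernels that produce the corresponding predictions. In the figure, each horizontal line is a 3 × 3 convolution whose output is the final prediction directly.
Multi-scale feature maps for detection. This part of the paper describes the second part above: six feature layers that provide feature maps at different resolutions for detection. SSD uses these 6 feature maps to detect objects at different scales: lower layers predict small objects and higher layers predict large ones, because higher layers have a much larger receptive field. The last layer is 1 × 1, so its receptive field is the whole image and it effectively predicts the class of the entire image.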
- The first feature layer takes the output of the last convolutional layer of the fourth group in VGG, Conv4_3. This feature map is 38 × 38 × 512 and is convolved directly with a 3 × 3 × (4 × (classes + 4)) convolutional layer, so the output feature map has 4 × (classes + 4) channels. For PASCAL VOC with classes = 20, that is 24 × 4 = 96 channels (implementations typically add a background class, which makes it 4 × (21 + 4) = 100). The spatial output size is 38 × 38, so the predictor on the first feature layer produces a 38 × 38 × 96 output, where each 1 × 1 × 96 vector can be read as the prediction of the 3 × 3 convolution at that location. There are 38 × 38 = 1444 locations, each with $k$ predictions. For this topmost classifier $k = 4$ , giving 5776 predictions in total, each containing 24 values (20 classes + [x, y, w, h]). The four-tuple [x, y, w, h] is relative: it is expressed relative to the position of the default box, also called the prior box, which is covered later. For a feature layer of size m × n with p channels, the basic element for predicting parameters of a potential detection is a 3 × 3 × p small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the m × n locations where the kernel is applied, it produces an output value. The bounding box offset output values are measured relative to a default box position relative to each feature map location.
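- In the figure above there are 6 horizontal lines from top to bottom, and each line is the 3 × 3 prediction classifier for one scale. The classifiers differ per scale: from top to bottom, k takes the values $[4, 6, 6, 6, 4, 4]$ (where the figure does not state it, the value is the same as for the line above). The total number of predictions is therefore $38 \times 38 \times 4 + 19 \times 19 \times 6 + 10 \times 10 \times 6 + 5 \times 5 \times 6 + 3 \times 3 \times 4 + 1 \times 1 \times 4 = 8732$ . NMS is then applied to these 8732 predictions to remove unnecessary boxes.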
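The per-layer terms of the sum above can be checked with a few lines of Python. This is only an arithmetic sketch; the names `feature_map_sizes` and `boxes_per_location` are illustrative and not taken from any SSD implementation.

```python
# Feature-map side lengths and default boxes per location for SSD300,
# taken from the terms of the sum above.
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_location = [4, 6, 6, 6, 4, 4]


def count_default_boxes(sizes, boxes_per_loc):
    """Return the per-layer prediction counts and their total."""
    per_layer = [s * s * k for s, k in zip(sizes, boxes_per_loc)]
    return per_layer, sum(per_layer)


per_layer, total = count_default_boxes(feature_map_sizes, boxes_per_location)
print(per_layer)  # [5776, 2166, 600, 150, 36, 4]
print(total)      # 8732
```

The first layer alone contributes about two thirds of all boxes, which is why most predictions target small objects.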
Default boxes and aspect ratios. As noted in the previous item, every element (cell) of every feature map produces predictions. As I understand it, a 3 × 3 convolution with (classes + 4) output channels applied to a 38 × 38 feature map yields a 38 × 38 × (classes + 4) output map, and using four such convolutions gives the 4 different predictions per location mentioned above. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, we compute c class scores and the 4 offsets relative to the original default box shape. This results in a total of (c + 4)k filters that are applied around each location in the feature map, yielding (c + 4)kmn outputs for a m × n feature map.

The dashed boxes in the right-hand figure are the bounding box predictions made by the 4 different convolution filters for one location (cell) of a feature map: $[\Delta x, \Delta y, \Delta w, \Delta h]$ , all predicted as offsets relative to the default box.
Moreover, larger ground-truth objects, such as the dog, need to be predicted at a higher layer, where the feature map is smaller and the semantic information is stronger. That is the red predicted box, on a feature map of only 4 × 4 (for example).
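To make the idea of "offsets relative to a default box" concrete, here is a minimal sketch of turning a predicted $[\Delta x, \Delta y, \Delta w, \Delta h]$ back into a box. It assumes the encoding used in the paper's localization loss (center offsets normalized by the default box width and height, log-ratio offsets for width and height). The function name `decode_box` is hypothetical, and the extra "variance" scaling that many implementations apply is deliberately left out.

```python
import math


def decode_box(default_box, offsets):
    """Apply predicted offsets to a default box.

    default_box: (cx, cy, w, h) in normalized image coordinates.
    offsets:     (dx, dy, dw, dh) as predicted by the network.
    Assumed encoding: center offsets are scaled by the default box size,
    width/height offsets are log-ratios.
    """
    d_cx, d_cy, d_w, d_h = default_box
    dx, dy, dw, dh = offsets
    cx = d_cx + dx * d_w
    cy = d_cy + dy * d_h
    w = d_w * math.exp(dw)
    h = d_h * math.exp(dh)
    return cx, cy, w, h


# Zero offsets return the default box itself.
print(decode_box((0.5, 0.5, 0.2, 0.2), (0.0, 0.0, 0.0, 0.0)))
```

A positive `dw` widens the box and a negative one narrows it, so the network only has to learn small corrections around each default box.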

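Training

The sections above describe the network structure. Next comes the training method proposed in the paper: in short, how the ground-truth labels are used to train the model.

Choosing scales and aspect ratios for default boxes

As mentioned in the model section, everything the model predicts is an offset, and an offset has to be relative to some box. That box is what the paper calls the default box, also known as the prior box.

This subsection of the paper describes how the size, aspect ratio, and center position of the prior boxes in the image are determined.

First the size, or scale. It is computed with the following formula:
$s_k=s_{min}+\frac{s_{max}-s_{min}}{m-1}(k-1), k\in [1,m]$
Here $m$ is the number of feature layers used for prediction, and in the paper $m = 6$ . $k$ is the index of the layer and $s_k$ is the scale for that layer. The paper uses $s_{max}=0.9$ and $s_{min}=0.2$ . The computed $s_k$ is a fraction of the input size, and the default box grows as $k$ increases: higher layers have larger receptive fields, so their default boxes need to be larger to give good predictions.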
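- In the experiments section the paper states: We set default box with scale 0.1 on conv4_3. That is, of the six feature maps, the one inside VGG gets scale 0.1. For a 300 × 300 input image (SSD300), the default box size on this feature map is therefore 30 pixels.
- From the second feature layer onward, i.e. Conv7 (FC7) in the figure, the formula applies with this layer as $k = 1$ , so $s_1=0.2$ and the default box size is 60. Because the first layer has been taken out, $m = 5$ . The full list is $s = [30, 60, 112, 165, 217, 270]$ (112.5 and 217.5 truncated).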
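The scale list above follows directly from the formula. A small sketch, assuming the conventions just described (scale 0.1 for conv4_3, then the formula with $m = 5$ for the remaining layers); `default_box_scales` is an illustrative name, and reference implementations may round these values differently.

```python
def default_box_scales(s_min=0.2, s_max=0.9, m=5):
    """Scales s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1) for k = 1..m."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]


input_size = 300
# conv4_3 gets the special scale 0.1; the other five layers use the formula.
scales = [0.1] + default_box_scales()
sizes_px = [round(s * input_size, 1) for s in scales]
print(sizes_px)  # [30.0, 60.0, 112.5, 165.0, 217.5, 270.0]
```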
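With the scale fixed, next comes the aspect ratio. We impose different aspect ratios for the default boxes, and denote them as $a_r$ . In the paper $a_r \in \{1,2,3,\frac{1}{2}, \frac{1}{3}\}$ . Given an aspect ratio and the $s_k$ above, the default box dimensions are fixed: the width is $w_k = s_k\sqrt{a_r}$ and the height is $h_k = s_k/\sqrt{a_r}$ . For aspect ratio 1, one extra box with scale $s_k^{'}=\sqrt{s_k s_{k+1}}$ is added, so each location of a feature layer has up to 6 default boxes of different shapes.
With the sizes settled, next is the position. In the paper a box is positioned by its center, and the center coordinates are given as $(\frac{i+0.5}{|f_k|}, \frac{j+0.5}{|f_k|})$ . Here i and j are indices into the feature layer (row i, column j), and $|f_k|$ is the size of the feature layer, i.e. [38, 19, 10, 5, 3, 1]. In other words, the center is the center point of each grid cell.
So each cell has 6 size and aspect-ratio combinations (4 on the layers that omit ratios 3 and 1/3), for 8732 default boxes in total. Combined with the $\Delta x$ and other offsets predicted by the 3 × 3 kernels, this yields 8732 predictions, which are then filtered with NMS.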
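Putting scale, aspect ratio, and center together, the default boxes of a single feature layer can be generated as in the sketch below. It follows the formulas above literally; the function name and the `(cx, cy, w, h)` tuple layout are assumptions made for illustration, and real implementations usually also clip boxes to the image.

```python
import math


def default_boxes_for_layer(f_k, s_k, s_k_next, aspect_ratios):
    """Default boxes (cx, cy, w, h), normalized to [0, 1], for an f_k x f_k map.

    Per cell: one box per aspect ratio a with w = s_k * sqrt(a), h = s_k / sqrt(a),
    plus one extra square box of scale sqrt(s_k * s_k_next).
    """
    boxes = []
    s_prime = math.sqrt(s_k * s_k_next)
    for i in range(f_k):
        for j in range(f_k):
            # Center of cell (i, j), as in the formula above.
            cx = (i + 0.5) / f_k
            cy = (j + 0.5) / f_k
            for a in aspect_ratios:
                boxes.append((cx, cy, s_k * math.sqrt(a), s_k / math.sqrt(a)))
            boxes.append((cx, cy, s_prime, s_prime))
    return boxes


# conv4_3 in SSD300: 38 x 38 cells, scale 0.1, next scale 0.2, ratios {1, 2, 1/2}.
boxes = default_boxes_for_layer(38, 0.1, 0.2, [1, 2, 0.5])
print(len(boxes))  # 5776
```

Three aspect ratios plus the extra square box give the 4 boxes per location on this layer. Passing `[1, 2, 3, 0.5, 1 / 3]` instead gives the 6 boxes used on the middle layers.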

Matching strategy

When training an object detector, the most important step is computing the loss. Each prediction corresponds to a default box (the default box plus the predicted offset is the final value), so the question is which ground truth that default box should be compared against when computing the loss. This is called the matching strategy. The paper describes it as follows: We begin by matching each ground truth box to the default box with the best jaccard overlap (as in MultiBox [7]). Unlike MultiBox, we then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5). This simplifies the learning problem, allowing the network to predict high scores for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap.

In other words, the first step takes only the default box with the highest IoU for each ground truth and uses it as the prediction on which the loss is computed. The paper then extends this strategy: every default box whose IoU with a ground truth exceeds a threshold is also used in the loss. This works better.
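The two matching steps can be sketched in plain Python as follows. This is an illustrative version, not the paper's code: it assumes boxes in corner format `(xmin, ymin, xmax, ymax)`, and the names `iou` and `match` are hypothetical.

```python
def iou(box_a, box_b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match(default_boxes, gt_boxes, threshold=0.5):
    """Return {default_box_index: gt_index} for all positive matches."""
    matches = {}
    if not default_boxes or not gt_boxes:
        return matches
    # Threshold rule: a default box is positive if its best overlap with
    # any ground truth exceeds the threshold.
    for d, dbox in enumerate(default_boxes):
        best_g = max(range(len(gt_boxes)), key=lambda g: iou(dbox, gt_boxes[g]))
        if iou(dbox, gt_boxes[best_g]) > threshold:
            matches[d] = best_g
    # Best-overlap rule, applied last so it takes priority: every ground
    # truth keeps its single best default box, even below the threshold.
    for g, gbox in enumerate(gt_boxes):
        best_d = max(range(len(default_boxes)), key=lambda d: iou(default_boxes[d], gbox))
        matches[best_d] = g
    return matches


defaults = [(0.0, 0.0, 0.5, 0.5), (0.1, 0.1, 0.6, 0.6), (0.5, 0.5, 1.0, 1.0)]
gts = [(0.0, 0.0, 0.6, 0.6), (0.9, 0.9, 1.0, 1.0)]
print(match(defaults, gts))
```

In this toy example the first ground truth is matched by two default boxes through the threshold rule, while the small second ground truth only gets a match through the best-overlap rule.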

Training objective

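Since every default box above the IoU threshold takes part in the loss, the loss function has to be adapted accordingly. The loss function is:

First let $x_{ij}^p=\{1,0\}$ indicate whether the i-th default box is matched to the j-th ground truth box of category p. It is either 0 or 1. Under the matching strategy above, one ground truth may be matched by several default boxes, so $\sum_i{x_{ij}^p} \ge 1$ .
The overall loss is the sum of the localization loss and the classification (confidence) loss, balanced by the weight $\alpha$ . The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf):
$L(x,c,l,g)=\frac{1}{N}(L_{conf}(x,c) + \alpha L_{loc}(x,l,g))$
- N is the number of matched default boxes. If N = 0, meaning every default box in the image is treated as background, the loss is set to L = 0.
- x is the match indicator defined above.
- $L_{conf}(x,c)$ is the confidence (classification) loss.
- $L_{loc}(x,l,g)$ is the localization loss.
- $\alpha$ is the weight.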
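The localization loss is computed as follows:
- It uses the Smooth L1 loss.
- l is the predicted box and g is the ground truth box.
- cx and cy are the center of the bounding box.
- w and h are the width and height.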
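For reference, the localization loss as given in the SSD paper (restated here from the paper, with $d$ denoting the default box) can be written as:

$L_{loc}(x,l,g)=\sum_{i\in Pos}^{N}\sum_{m\in\{cx,cy,w,h\}}x_{ij}^{k}\,\mathrm{smooth}_{L1}(l_i^m-\hat{g}_j^m)$
$\hat{g}_j^{cx}=(g_j^{cx}-d_i^{cx})/d_i^{w}, \quad \hat{g}_j^{cy}=(g_j^{cy}-d_i^{cy})/d_i^{h}, \quad \hat{g}_j^{w}=\log(g_j^{w}/d_i^{w}), \quad \hat{g}_j^{h}=\log(g_j^{h}/d_i^{h})$

The network therefore regresses the ground truth expressed relative to the matched default box: center offsets normalized by the default box size, and log-ratios for width and height.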
The confidence (classification) loss is computed as follows; it is not covered in detail here:
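For reference, the confidence loss in the SSD paper is a softmax loss over the class confidences $c$ (restated here from the paper, where class 0 is background):

$L_{conf}(x,c)=-\sum_{i\in Pos}^{N}x_{ij}^{p}\log(\hat{c}_i^{p})-\sum_{i\in Neg}\log(\hat{c}_i^{0}), \quad \hat{c}_i^{p}=\frac{\exp(c_i^{p})}{\sum_{p}\exp(c_i^{p})}$

The first sum covers positive default boxes and their matched category. The second sum covers the negatives, which should be classified as background.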

Hard negative mining

After the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. As the paper notes, and as with other prior-box detectors, most prior boxes never reach the IoU threshold with any ground truth, so most of them end up as negatives. This is where hard negative mining comes in: it decides which negatives are used to train the model, because the model cannot learn well when positives and negatives are badly imbalanced. Instead of using all the negative examples, we sort them using the highest confidence loss for each default box and pick the top ones so that the ratio between the negatives and positives is at most 3:1. We found that this leads to faster optimization and a more stable training.

So not all negatives are used. They are sorted by confidence loss and the highest-loss negative default boxes are kept, such that negative : positive is at most 3:1.
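The selection rule can be sketched in a few lines. This is a simplified, per-image illustration with a hypothetical function name; real implementations work on batched tensors.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Pick the negatives that contribute to the confidence loss.

    conf_losses: per-default-box confidence loss (list of floats).
    is_positive: per-default-box flag from the matching step (list of bools).
    Returns the sorted indices of the selected negatives.
    """
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Highest confidence loss first: these are the "hardest" negatives.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return sorted(negatives[: neg_pos_ratio * num_pos])


losses = [0.2, 2.5, 0.1, 1.7, 0.9, 0.05, 3.0, 0.4]
positive = [True, False, False, False, False, False, False, False]
print(hard_negative_mining(losses, positive))  # [1, 3, 6]
```

With one positive and a 3:1 ratio, only the three negatives with the largest loss are kept, and the remaining four are ignored for this training step.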

Data augmentation

In short, this is random cropping. The original text is quoted directly without further comment.
To make the model more robust to various input object sizes and
shapes, each training image is randomly sampled by one of the following options:

Use the entire original input image.
Sample a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
Randomly sample a patch.

The size of each sampled patch is [0.1, 1] of the original image size, and the aspect ratio is between 1/2 and 2. We keep the overlapped part of the ground truth box if the center of it is in the sampled patch. After the aforementioned sampling step, each sampled patch is resized to fixed size and is horizontally flipped with probability of 0.5, in addition to applying some photo-metric distortions similar to those described in [14].

Experimental results

Base network parameters

The original text is quoted directly here without further comment.

Our experiments are all based on VGG16 [15], which is pre-trained on the ILSVRC CLS-LOC dataset [16]. Similar to DeepLab-LargeFOV [17], we convert fc6 and fc7 to convolutional layers, subsample parameters from fc6 and fc7, change pool5 from 2 × 2 − s2 to 3 × 3 − s1, and use the à trous algorithm [18] to fill the "holes". We remove all the dropout layers and the fc8 layer. We fine-tune the resulting model using SGD with initial learning rate $10^{-3}$, 0.9 momentum, 0.0005 weight decay, and batch size 32.

PASCAL VOC2007

Besides setting the default box scale of the first feature layer to 0.1 of the input, this section also states that three of the feature maps use only 4 default boxes per location (the k = 4 case above), leaving out aspect ratios 1:3 and 3:1: For conv4_3, conv10_2 and conv11_2, we only associate 4 default boxes at each feature map location – omitting aspect ratios of 1/3 and 3.

Other details: Since, as pointed out in [12], conv4_3 has a different feature scale compared to the other layers, we use the L2 normalization technique introduced in [12] to scale the feature norm at each location in the feature map to 20 and learn the scale during back propagation. We use the $10^{-3}$ learning rate for 40k iterations, then continue training for 10k iterations with $10^{-4}$ and $10^{-5}$ .

The results are as follows:

Error analysis:

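Model ablation study

Effects of various design choices and components on SSD performance
- Data augmentation is very important.
- Using different default box shapes is effective: adding aspect ratios 1/2, 2, 1/3, and 3 helps accuracy.
- Atrous (dilated) convolution improves speed.
Effects of using multiple output layers
- Predicting from multi-scale feature maps helps considerably.
- However, boxes at the image boundary need to be handled carefully.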