[Paper Close Reading] Dynamic Coarse-to-Fine Learning for Oriented Tiny Object Detection

Paper: [2304.08876] Dynamic Coarse-to-Fine Learning for Oriented Tiny Object Detection (arxiv.org)

Code: https://github.com/ChaselTsui/mmrotate-dcfl

The English is typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution.

1. TL;DR

1.1. Thoughts

(1) Why am I, a brain-science student, reading this? May the world be free of unpaid labor

(2) I noticed it as soon as I started writing the subheadings: the paper is divided really finely. Favorability ++

(3) As a layperson, I feel this paper proposes quite a lot of things

1.2. Paper summary figure

2. Section-by-Section Close Reading

2.1. Abstract

        ①The extreme geometric shapes (tiny size) and limited features (few pixels) of oriented tiny objects cause serious mismatch (inaccurate positional priors?) and imbalance (inaccurate positive-sample features?) issues

        ②They propose a dynamic prior and a coarse-to-fine assigner, together called DCFL

posterior  adj. situated at the back; rear  n. buttocks

2.2. Introduction

        ①Oriented bounding boxes greatly eliminate redundant background area, especially in aerial images

        ②Comparison figure:

where M* denotes the matching function;

green, blue and red boxes are true positive, false positive, and false negative predictions respectively,

the left figure set is static and the right is dynamic

        ③Figure of mismatch and imbalance issues:

each point in the left figure denotes a prior location (that's a lot of prior points... and why are they laid out so regularly? Is this some one-stage detector?)

Does the pie chart mean every box is at some fixed angle? When no box is rotated, the average number of positive samples is 5.2? Or does it mean boxes rotate freely, and it reports how many positive samples boxes at a particular angle get? The pie chart makes no external comparison; it only compares within this figure.

The bar chart shows the average number of positive samples under different anchor sizes

        ④They introduce a dynamic Prior Capturing Block (PCB) as their prior method. On top of this, they use Cross-FPN-layer Coarse Positive Samples (CPS) to assign labels. After that, they re-rank these candidates by predictions (posterior), and represent gt with a finer Dynamic Gaussian Mixture Model (DGMM)

eradicate  vt. to root out; eliminate; put an end to  n. eradicator

2.3. Related Work

2.3.1. Oriented Object Detection

(1)Prior for Oriented Objects

(2)Label Assignment

2.3.2. Tiny Object Detection

(1)Multi-scale Learning

(2)Label Assignment

(3)Context Information

(4)Feature Enhancement

2.4. Method

(1)Overview

        ①For a set of dense priors P\in\mathbb{R}^{W\times H\times C}, where W denotes width, H denotes height, and C denotes the number of shape-information channels (what is this, those points?), map it to D with a Deep Neural Network (DNN):

D=\mathrm{DNN}_{h}(P)

where \mathrm{DNN}_{h} represents the detection head (as a layperson I don't quite get "detection head"... it feels like just a function?);

one part D_{cls}\in\mathbb{R}^{W\times H\times A} of D denotes the classification scores, where A is the number of classes (will the W\times H entries be larger on the layer whose samples are more likely judged positive?);

the other part D_{reg}\in\mathbb{R}^{W\times H\times B} of D denotes the regression outputs, where B is the number of box parameters (a chatbot says box parameters are things like w, h, x, y, a)

        ②In static methods, the positive labels assigned to P are G=\mathcal{M}_{s}(P,GT)

        ③In dynamic methods, the positive label set G integrates posterior information: G={\mathcal M}_{d}(P,D,GT)

        ④The loss function:

\mathcal{L}=\sum_{i=1}^{N_{pos}}\mathcal{L}_{pos}(D_{i},G_{i})+\sum_{j=1}^{N_{neg}}\mathcal{L}_{neg}(D_{j},y_{j})

where N_{pos} and N_{neg} represent the numbers of positive and negative samples, and y_j is the negative label set
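To make the loss in ④ concrete, here is a toy sketch. The per-sample losses are stood in by a squared error (positives) and a simple log loss (negatives); the paper actually uses focal loss and IoU loss, so all names and stand-in losses here are illustrative only.

```python
import numpy as np

def detection_loss(pos_preds, pos_targets, neg_preds, neg_labels):
    """Toy version of L = sum_i L_pos(D_i, G_i) + sum_j L_neg(D_j, y_j)."""
    # stand-in positive loss: squared error against the assigned label G_i
    l_pos = sum((d - g) ** 2 for d, g in zip(pos_preds, pos_targets))
    # stand-in negative loss: penalize confident scores on background samples
    l_neg = sum(-np.log(1.0 - d + 1e-12) for d, _ in zip(neg_preds, neg_labels))
    return float(l_pos + l_neg)
```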

        ⑤Modelling the dynamic counterparts \tilde{D}, {\mathcal M}_{d}, and \tilde{G}:

\tilde{D}=\mathrm{DNN}_{h}(\underbrace{\mathrm{DNN}_{p}(P)}_{\text{Dynamic Prior }\tilde{P}})

\tilde{G}=\mathcal{M}_{d}(\mathcal{M}_{s}(\tilde{P},GT),\tilde{GT})

\mathcal{L}=\sum_{i=1}^{\tilde{N}_{pos}}\mathcal{L}_{pos}(\tilde{D}_{i},\tilde{G}_{i})+\sum_{j=1}^{\tilde{N}_{neg}}\mathcal{L}_{neg}(\tilde{D}_{j},y_{j})

2.4.1. Dynamic Prior

        ①Flexibility may alleviate the mismatch problem

        ②Each prior represents a feature point

        ③The structure of Prior Capturing Block (PCB):

the surrounding information is considered via dilated convolution. Dynamic priors are then captured by a Deformable Convolution Network (DCN). Moreover, the offsets learned from the regression branch guide feature extraction in the classification branch and improve alignment between the two tasks.

        ④To achieve dynamic prior capturing, each prior location \mathbf{p}(x,y) is initialized with each feature point's spatial location \mathbf{s}. In each iteration, the offset set \Delta \mathbf{o} of each prior position is captured to update \mathbf{s}:

\tilde{\mathbf{s}}=\mathbf{s}+st\sum_{i=1}^{n}\Delta\mathbf{o}_{i}/2n

where st denotes the stride of the feature map and n denotes the number of offsets;

a 2D Gaussian distribution \mathcal{N}_{p}(\boldsymbol{\mu}_{p},\boldsymbol{\Sigma}_{p}) is taken as the prior distribution;

the dynamic \tilde{\mathbf{s}} serves as the Gaussian mean vector \boldsymbol{\mu}_{p} (wait, what??);
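The location update in ④ can be sketched in a few lines of numpy. This assumes the offsets arrive as an (n, 2) array (in the paper they are produced by the deformable convolution); the function name is my own.

```python
import numpy as np

def update_prior_location(s, offsets, stride):
    """s_tilde = s + st * sum_i(offset_i) / (2n), the update in 2.4.1 ④."""
    offsets = np.asarray(offsets, dtype=float)   # (n, 2) offsets from the DCN
    n = len(offsets)
    return np.asarray(s, dtype=float) + stride * offsets.sum(axis=0) / (2 * n)
```

For example, a feature point at (0, 0) with stride 8 and two offsets of (1, 1) each moves to (4, 4).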

        ⑤A square \left ( w,h,\theta \right ) is preset on each feature point

        ⑥The covariance matrix:

\Sigma_p=\begin{bmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{bmatrix}\begin{bmatrix}\frac{w^2}{4}&0\\0&\frac{h^2}{4}\end{bmatrix}\begin{bmatrix}\cos\theta&\sin\theta\\-\sin\theta&\cos\theta\end{bmatrix}\\\\ =\begin{bmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{bmatrix}\begin{bmatrix}\frac{w}{2}&0\\0&\frac{h}{2}\end{bmatrix}\begin{bmatrix}\frac{w}{2}&0\\0&\frac{h}{2}\end{bmatrix}\begin{bmatrix}\cos\theta&\sin\theta\\-\sin\theta&\cos\theta\end{bmatrix}\\\\ =RR^T
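The covariance construction in ⑥ is easy to check numerically; a small sketch (function name illustrative):

```python
import numpy as np

def gaussian_cov(w, h, theta):
    """Sigma_p = R_theta · diag(w^2/4, h^2/4) · R_theta^T from a preset square (w, h, theta)."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])            # rotation matrix R_theta
    lam = np.diag([w ** 2 / 4.0, h ** 2 / 4.0])  # squared half-extents
    return rot @ lam @ rot.T
```

With theta = 0, w = 4, h = 2 this gives diag(4, 1); and factoring R = R_theta · diag(w/2, h/2) reproduces the \Sigma_p = RR^T form at the end of the derivation.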

dilate  v. to expand; (cause to) swell; enlarge    deformable  adj. able to be deformed; subject to strain

2.4.2. Coarse Prior Matching

        ①For priors, restricting gt to a single FPN layer may cause sub-optimal layer selection, while releasing gt to all layers may slow convergence

        ②Therefore, they propose Cross-FPN-layer Coarse Positive Sample (CPS) candidates, expanding the candidate set to spatial locations near gt and to adjacent FPN layers

        ③The Generalized Jensen-Shannon Divergence (GJSD) between \mathcal{N}_{p}(\boldsymbol{\mu}_{p},\boldsymbol{\Sigma}_{p}) and \mathcal{N}_{g}(\boldsymbol{\mu}_{g},\boldsymbol{\Sigma}_{g}) constructs the CPS:

\mathrm{GJSD}(\mathcal{N}_{p},\mathcal{N}_{g})=(1-\alpha)\mathrm{KL}(\mathcal{N}_{\alpha},\mathcal{N}_{p})+\alpha\mathrm{KL}(\mathcal{N}_{\alpha},\mathcal{N}_{g})

\left\{\begin{matrix} \operatorname{KL}\left(P\|Q\right)=\sum_{x} P\left(x\right)\log\frac{P\left(x\right)}{Q\left(x\right)} \\\\ \operatorname{KL}\left(P\|Q\right)=\int P\left(x\right)\log\frac{P\left(x\right)}{Q\left(x\right)}dx \end{matrix}\right.

which yields a closed-form solution;

where \Sigma_{\alpha}=(\Sigma_{p}\Sigma_{g})_{\alpha}^{\Sigma}=\left((1-\alpha)\Sigma_{p}^{-1}+\alpha\Sigma_{g}^{-1}\right)^{-1};

\begin{aligned} \mu_{\alpha}& =\left(\mu_{p}\mu_{g}\right)_{\alpha}^{\mu} \\ &=\Sigma_{\alpha}\left((1-\alpha)\Sigma_{p}^{-1}\mu_{p}+\alpha\Sigma_{g}^{-1}\mu_{g}\right) \end{aligned}

and since \mathcal{N}_{p} and \mathcal{N}_{g} are treated homogeneously, \alpha =0.5
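Since both KL terms above are between Gaussians, the GJSD has a closed form. A minimal numpy sketch with function names of my own (not the paper's code), using the standard closed-form Gaussian KL:

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """Closed-form KL( N(mu0, S0) || N(mu1, S1) ) for k-dim Gaussians."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def gjsd(mu_p, S_p, mu_g, S_g, alpha=0.5):
    """GJSD(N_p, N_g) = (1-a) KL(N_a||N_p) + a KL(N_a||N_g), Sec. 2.4.2 ③."""
    S_p_inv, S_g_inv = np.linalg.inv(S_p), np.linalg.inv(S_g)
    S_a = np.linalg.inv((1 - alpha) * S_p_inv + alpha * S_g_inv)
    mu_a = S_a @ ((1 - alpha) * S_p_inv @ mu_p + alpha * S_g_inv @ mu_g)
    return ((1 - alpha) * kl_gauss(mu_a, S_a, mu_p, S_p)
            + alpha * kl_gauss(mu_a, S_a, mu_g, S_g))
```

As a sanity check, GJSD is 0 for identical Gaussians and grows as the prior and gt drift apart.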

        ④For each gt, the top K priors with the highest GJSD are chosen (picking the most divergent ones?)

2.4.3. Finer Dynamic Posterior Matching

        ①This section contains two main steps: a posterior re-ranking strategy and a Dynamic Gaussian Mixture Model (DGMM) constraint

        ②The Possibility of becoming a True prediction (PT) for the i^{th} sample D_i is:

PT_i=\frac{1}{2}Cls(D_i)+\frac{1}{2}IoU(D_i,gt_i)

the top Q samples with the highest scores are chosen as Medium Positive Sample (MPS) candidates
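The PT re-ranking in ② can be sketched in a few lines, assuming classification scores and IoUs are already computed per sample (names illustrative):

```python
import numpy as np

def select_mps(cls_scores, ious, q):
    """PT_i = 0.5*Cls(D_i) + 0.5*IoU(D_i, gt_i); keep the top-Q as MPS (2.4.3 ②)."""
    pt = 0.5 * np.asarray(cls_scores) + 0.5 * np.asarray(ious)
    return np.argsort(pt)[::-1][:q]   # indices of the Q highest PT scores
```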

        ③They apply DGMM, which places a geometry center and a semantic center in each object, to filter out distant samples

        ④For a specific instance gt_i, the mean vector \boldsymbol{\mu}_{i,1} of the first Gaussian is the geometry center \left ( cx_i,cy_i \right ), and the \boldsymbol{\mu}_{i,2} derived from MPS denotes the semantic center \left ( sx_i,sy_i \right )

        ⑤Parameterizing an instance:

DGMM_i(s|x,y)=\sum_{m=1}^2w_{i,m}\sqrt{2\pi|\Sigma_{i,m}|}\mathcal{N}_{i,m}(\mu_{i,m},\Sigma_{i,m})

where w_{i,m} denotes the weight of each Gaussian distribution, and the weights sum to 1;

\Sigma_{i,m} equals gt's \boldsymbol{\Sigma}_{g} (what is this? m can be 1 or 2, so would g's covariance serve both the semantic center and the geometry center?)

        ⑥Any sample with DGMM(s|MPS)<e^{-g} is assigned a negative mask
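The DGMM scoring in ⑤ and the negative-mask rule in ⑥ can be sketched as follows (a minimal numpy version of the formula as written above; function names and the scalar form are my own):

```python
import numpy as np

def dgmm_score(xy, weights, mus, sigmas):
    """DGMM_i(s|x,y) = sum_m w_m * sqrt(2*pi*|Sigma_m|) * N(x,y; mu_m, Sigma_m).

    Two components: m=1 the geometry centre, m=2 the semantic centre (2.4.3 ④⑤).
    """
    xy = np.asarray(xy, dtype=float)
    total = 0.0
    for w, mu, S in zip(weights, mus, sigmas):
        d = xy - np.asarray(mu, dtype=float)
        det, S_inv = np.linalg.det(S), np.linalg.inv(S)
        density = np.exp(-0.5 * d @ S_inv @ d) / (2 * np.pi * np.sqrt(det))
        total += w * np.sqrt(2 * np.pi * det) * density
    return total

def is_negative(xy, weights, mus, sigmas, g=2.0):
    """2.4.3 ⑥: samples with DGMM(s|MPS) < e^{-g} get a negative mask."""
    return dgmm_score(xy, weights, mus, sigmas) < np.exp(-g)
```

Note that the sqrt(2π|Σ|) factor rescales each component so its contribution depends only on the Mahalanobis distance to the centre, which is what makes a fixed threshold e^{-g} meaningful.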

2.5.  Experiments

2.5.1. Datasets

        ①Datasets: DOTA-v1.0/v1.5/v2.0, DIOR-R, VisDrone, and MS COCO

        ②Ablation dataset: DOTA-v2.0, which contains the largest number of tiny objects

        ③Comparison datasets: DOTA-v1.0, DOTA-v1.5, DOTA-v2.0, VisDrone2019, MS COCO, and DIOR-R

2.5.2. Implementation Details

        ①Batch size: 4

        ②Framework based: MMDetection and MMRotate

        ③Backbone: ImageNet pre-trained models

        ④Learning rate: 0.005 with SGD

        ⑤Momentum: 0.9

        ⑥Weight decay: 0.0001

        ⑦Default backbone: ResNet-50 with FPN

        ⑧Loss: Focal loss for classifying and IoU loss for regression

        ⑨Data augmentation: random flipping

        ⑩On DOTA-v1.0 and DOTA-v2.0, the official setting is used to crop images to 1024×1024, with an overlap of 200 and 12 epochs

        ⑪On the other datasets, the input size is set to 1024×1024 (overlap 200), 800×800, 1333×800, and 1333×800 for DOTA-v1.5, DIOR-R, VisDrone, and COCO respectively. Epochs are set to 40, 40, 12, and 12 on DOTA-v1.5, DIOR-R, COCO, and VisDrone respectively

2.5.3. Main Results

(1)Results on DOTA series

        ①Comparison table on DOTA-v2.0 OBB:

where red marks the best and blue the second-best performance on each metric

        ②Comparison table on DOTA-v1.0 OBB:

        ③Comparison table on DOTA-v1.5 OBB:

(2)Results on DIOR-R

        ①Comparison table on DIOR-R:

        ②Results on the typical tiny-object classes vehicle, bridge, and windmill:

(3)Results on HBB Datasets

        ①Comparison table on VisDrone, MS COCO and DOTA-v2.0 HBB:

2.5.4. Ablation Study

(1)Effects of Individual Strategy

        ①A prior is employed on each feature point

        ②Individual effectiveness:

(2)Comparisons of Different CPS

        ①Ablation:

(3)Fixed Prior and Dynamic Prior

        ①Ablation:

(4)Detailed Design in PCB

(5)Effects of Parameters

2.6. Analysis

(1)Reconciliation of imbalance problems

(2)Visualization

(3)Speed

2.7. Conclusion

3. Supplementary Knowledge

4. Reference List

Xu, C. et al. (2023) 'Dynamic Coarse-to-Fine Learning for Oriented Tiny Object Detection', CVPR. doi: https://doi.org/10.48550/arXiv.2304.08876
