[VarifocalNet] VarifocalNet: An IoU-aware Dense Object Detector (CVPR 2021 oral)


1. Motivation


Prior work uses the classification score or a combination of classification and predicted localization scores to rank candidates.


Generally, the classification score is used to rank the bounding box in NMS.

This harms the detection performance, because the classification score is not always a good estimate of the bounding box localization accuracy [10] and accurately localized detections with low classification scores may be mistakenly removed in NMS.

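Earlier methods add an extra IoU score or centerness score as a localization accuracy estimate. At test time the classification score is multiplied by that score, and the product serves as the "classification score" used in NMS. The authors argue that this is sub-optimal: an extra network branch that predicts a localization score is inelegant and incurs additional computation.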

They are sub-optimal because multiplying the two imperfect predictions may lead to a worse rank basis and we show in Section 3 that the upper bound of the performance achieved by such methods is limited.
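
As a minimal sketch of the ranking scheme criticized here, the snippet below multiplies each classification score by a localization estimate (predicted IoU or centerness) and sorts the candidates by the product. The function name `rank_scores` and all numbers are hypothetical, chosen only for illustration.

```python
def rank_scores(cls_scores, loc_scores):
    """Multiply each classification score by its localization estimate."""
    return [c * l for c, l in zip(cls_scores, loc_scores)]


# Hypothetical candidates: the first is confidently classified but poorly
# localized, the second is the opposite.
cls_scores = [0.9, 0.6]
loc_scores = [0.5, 0.9]  # predicted IoU or centerness (made-up values)

combined = rank_scores(cls_scores, loc_scores)
# Candidate indices sorted from highest to lowest combined score.
order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(order)
```

Both factors are themselves predictions, so an error in either one changes the final order, which is the weakness the quoted sentence points at.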


Instead of predicting an additional localization accuracy score, can we merge it into the classification score?

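Since neither of the previous approaches yields a reliable ranking, the authors propose the IACS (IoU-aware Classification Score) as a joint representation of classification and localization. They also design the Varifocal Loss and a star-shaped bounding box representation, which is used for predicting the IACS and for refining the bounding box. Combining these two new components with a bbox refinement branch improves the performance of the FCOS+ATSS detector.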

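As shown in Table 1, the authors analyze on val2017 how replacing different predictions of the FCOS+ATSS model with ground-truth information affects accuracy. Replacing the predicted centerness in FCOS with gt_ctrness or gt_iou raises AP to 41.1 and 43.5, respectively. Even with these oracle values the gain is modest, and no ground-truth information is available at actual inference (test2017), so the authors conclude that ranking by ctr × score or iou × score cannot deliver high performance.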

This indicates that using the product of either the predicted centerness score or the IoU score and the classification score to rank detections is certainly unable to bring significant performance gain.

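With gt_bbox, AP reaches 56.1 even without centerness. With gt_cls (the score at the ground-truth label position set to 1), however, the result differs widely depending on whether centerness is used (43.1 vs. 58.1), which shows that centerness helps select accurately localized boxes.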

Because the centerness score can differentiate accurate and inaccurate boxes to some extent

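The authors also replace the classification score with gt_IoU. They name this setting gt_cls_iou; the earlier setting, in which gt_IoU replaces the centerness, is named gt_ctr_iou. gt_IoU is defined as follows: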

The IoU between the predicted bounding box and the ground-truth one (termed as gt IoU).

The most surprising result is the one obtained by replacing the classification score of the ground-truth class with the gt IoU (gt cls iou).



This in fact reveals that there already exist accurately localized bounding boxes in the large candidate pool for most objects.

Replacing the classification score of the ground-truth class with the gt IoU is the most promising selection measure. We refer to the element of such a score vector as the IoU-aware Classification Score (IACS).

2. Contribution

  • The authors show that ranking candidate boxes correctly is key to high-performing dense detectors, and that the IACS achieves a good ranking.

We show that accurately ranking candidate detections is critical for high performing dense object detectors, and IACS achieves a better ranking than other methods (Section 3).

  • The authors propose the Varifocal Loss.

We propose a new Varifocal Loss for training dense object detectors to regress the IACS.

  • The authors propose a star-shaped bbox representation for computing the IACS and refining the bbox.

We design a new star-shaped bounding box feature representation for computing the IACS and refining the bounding box.

  • The authors name the resulting network VarifocalNet, or VFNet for short.

We develop a new dense object detector based on the FCOS [9]+ATSS [12] and the proposed components, named VarifocalNet or VFNet for short, to exploit the advantage of the IACS. An illustration of our method is shown in Figure 1.

3. Method

Compared with the FCOS+ATSS, it has three new components: the varifocal loss, the star-shaped bounding box feature representation and the bounding box refinement.

3.1 IACS – IoU-Aware Classification Score

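The authors replace the value at the gt class label position (originally 1) with the IoU between the predicted bbox and the gt bbox. The labels at all other positions remain 0.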

We define the IACS as a scalar element of a classification score vector, in which the value at the ground-truth class label position is the IoU between the predicted bounding box and its ground truth, and 0 at other positions.
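
The definition above can be sketched as a small target-building routine. This is an illustrative sketch, not the authors' code: the helper names `box_iou` and `iacs_target` are hypothetical, and boxes are assumed to be `(x1, y1, x2, y2)` corner tuples.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2); assumed corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iacs_target(num_classes, gt_label, pred_box, gt_box):
    """Score vector: IoU at the ground-truth class position, 0 elsewhere."""
    target = [0.0] * num_classes
    target[gt_label] = box_iou(pred_box, gt_box)
    return target


# Hypothetical foreground point: 3 classes, ground-truth class index 1.
target = iacs_target(3, 1, (0, 0, 2, 2), (1, 0, 3, 2))
print(target)
```

Unlike a one-hot label, this target changes during training as the predicted box moves, since the IoU is recomputed from the current prediction.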

3.2 Varifocal Loss

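The original Focal Loss:

$$
\mathrm{FL}(p, y) = \begin{cases} -\alpha (1 - p)^\gamma \log(p) & \text{if } y = 1 \\ -(1 - \alpha)\, p^\gamma \log(1 - p) & \text{otherwise} \end{cases}
$$

The authors borrow the way Focal Loss handles class imbalance, but unlike Focal Loss, which treats positive and negative samples equally, they treat them asymmetrically (I take "asymmetric" to mean that the two branches of the formula are not symmetric functions). The proposed Varifocal Loss:

$$
\mathrm{VFL}(p, q) = \begin{cases} -q \left( q \log(p) + (1 - q) \log(1 - p) \right) & q > 0 \\ -\alpha\, p^\gamma \log(1 - p) & q = 0 \end{cases}
$$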

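Here p is the predicted IACS and q is the target score. For a foreground point, q for the ground-truth class is the IoU between the predicted bbox and the gt bbox; for a background point, the target q is 0 for all classes, as shown in Figure 1.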

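The authors note that the Varifocal Loss only reduces the loss contribution of negative samples, by scaling them with $p^\gamma$. It does not down-weight positive samples (q > 0) in the same way, because positives are comparatively rare. Inspired by PISA, the authors weight each positive sample by its target q: a positive sample with a high gt_IoU contributes relatively more to the loss. This focuses training on high-quality positive samples, which have an important impact on achieving a high AP.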

To balance the losses of positive and negative samples, the authors also add an adjustable hyperparameter $\alpha$ to the negative-sample loss term.
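
A per-element sketch of the loss described in this section may help. It assumes the form in which positives (q > 0) use binary cross-entropy weighted by q, and negatives (q = 0) are scaled by `alpha * p ** gamma`. The function name and the default values `alpha=0.75`, `gamma=2.0` are assumptions for illustration, not settings confirmed by these notes.

```python
import math


def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-12):
    """Per-element Varifocal Loss sketch.

    p: predicted IACS in (0, 1); q: target (IoU for positives, 0 otherwise).
    alpha and gamma defaults are illustrative assumptions.
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
    if q > 0:
        # Positive sample: BCE against the soft target q, weighted by q.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative sample: down-weighted by alpha * p^gamma.
    return -alpha * (p ** gamma) * math.log(1.0 - p)


neg_easy = varifocal_loss(0.1, 0.0)   # low score on background
neg_hard = varifocal_loss(0.9, 0.0)   # high score on background
pos_low = varifocal_loss(0.5, 0.2)    # positive with low IoU target
pos_high = varifocal_loss(0.5, 0.8)   # positive with high IoU target
print(neg_easy, neg_hard, pos_low, pos_high)
```

With the same prediction p, the positive with the higher IoU target yields the larger loss, which matches the idea of focusing training on high-quality positives; easy negatives are suppressed by the `p ** gamma` factor.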

3.3 Star-Shaped Box Feature Representation

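The authors propose a nine-point sampling scheme that represents a bbox using deformable convolution. They argue that this representation captures the geometry of the bbox and its nearby contextual information, which is effective for encoding the misalignment between the predicted bbox and the gt bbox.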

It uses the features at nine fixed sampling points (yellow circles in Figure 1) to represent a bounding box with the deformable convolution [13, 14]. This new representation can capture the geometry of a bounding box and its nearby contextual information, which is essential for encoding the misalignment between the predicted bounding box and the ground-truth one.

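Given a location (x, y), an initial bbox is first regressed with a 3x3 convolution. Following FCOS, this bbox is encoded as a 4D vector $(l', t', r', b')$, the distances from (x, y) to the left, top, right and bottom sides of the box. The nine sampling points are then: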

$$
(x, y),\ (x - l', y),\ (x, y - t'),\ (x + r', y),\ (x, y + b'),\ (x - l', y - t'),\ (x + r', y - t'),\ (x - l', y + b'),\ (x + r', y + b')
$$

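The offsets of these points relative to (x, y) can be fed to the convolution as its offsets, so that the bbox is represented through deformable convolution. Because the points are chosen by hand and require no additional prediction overhead, the authors consider the new representation computationally efficient.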
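
The nine points and their offsets relative to (x, y) can be sketched as follows. The helper names are hypothetical, and the top-right corner is taken to be (x + r', y − t'), which is what the box geometry implies. How these offsets are packed for a specific deformable convolution implementation (for example, relative to the regular 3x3 kernel grid) is not covered here.

```python
def star_points(x, y, l, t, r, b):
    """Nine sampling points for a box encoded as distances (l, t, r, b)."""
    return [
        (x, y),                          # center location
        (x - l, y), (x, y - t),          # left and top edge points
        (x + r, y), (x, y + b),          # right and bottom edge points
        (x - l, y - t), (x + r, y - t),  # top-left and top-right corners
        (x - l, y + b), (x + r, y + b),  # bottom-left and bottom-right corners
    ]


def star_offsets(x, y, l, t, r, b):
    """Offsets of the nine points relative to the location (x, y)."""
    return [(px - x, py - y) for px, py in star_points(x, y, l, t, r, b)]


# Hypothetical location and initial box distances.
pts = star_points(10, 10, 2, 3, 4, 5)
offs = star_offsets(10, 10, 2, 3, 4, 5)
print(pts)
print(offs)
```

The offsets depend only on the already-regressed (l', t', r', b'), so no extra prediction branch is needed to obtain them.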

3.4 Bounding Box Refinement


We model the bounding box refinement as a residual learning problem.

For the initially regressed bbox $(l', t', r', b')$, the authors first extract the star-shaped representation to encode it. Based on this representation, the network learns four distance scaling factors $(\Delta l, \Delta t, \Delta r, \Delta b)$, which give the refined bounding box:

$$
(l, t, r, b) = (\Delta l \times l', \Delta t \times t', \Delta r \times r', \Delta b \times b')
$$
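
A small worked example of the refinement step, assuming the scaling factors have already been predicted. The function names and the numbers are hypothetical; `distances_to_corners` is an extra helper, not part of the notes, that converts the distance encoding back to corner coordinates.

```python
def refine_box(initial, scales):
    """Scale each initial distance (l', t', r', b') by its learned factor."""
    return tuple(d * s for d, s in zip(initial, scales))


def distances_to_corners(x, y, dist):
    """Convert distances (l, t, r, b) at location (x, y) to (x1, y1, x2, y2)."""
    l, t, r, b = dist
    return (x - l, y - t, x + r, y + b)


initial = (2.0, 3.0, 4.0, 5.0)  # (l', t', r', b'), made-up values
scales = (1.5, 1.0, 0.5, 2.0)   # (Δl, Δt, Δr, Δb), made-up values

refined = refine_box(initial, scales)
corners = distances_to_corners(10.0, 10.0, refined)
print(refined, corners)
```

A factor of 1.0 leaves a side untouched, so the refinement branch only has to learn a correction to the initial box rather than the box itself.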

3.5 VarifocalNet

The network architecture is shown in Figure 3. It is divided into two subnets: the localization subnet and the classification subnet.

3.6 Loss Function and Inference

Loss Function


The inference of the VFNet is straightforward. It involves simply forwarding an input image through the network and an NMS post-processing step for removing redundant detections.
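
The NMS post-processing step can be sketched as a plain greedy NMS in which the ranking score is assumed to be the predicted IACS, with no extra multiplication by a centerness or IoU branch. The function names, the `(x1, y1, x2, y2)` box format and the threshold value are assumptions for illustration only.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2); assumed corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the best-scored box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections; scores stand in for predicted IACS values.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.8, 0.9, 0.7]
keep = nms(boxes, scores, iou_thr=0.5)
print(keep)
```

Because the score already encodes localization quality, the box that survives among overlapping candidates is the one the network believes is best localized.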

4. Experiments

4.1 Varifocal Loss

4.2 Individual Component Contribution

4.3 Comparison with State-of-the-Art

4.5 Generality and Superiority of Varifocal Loss


