CVPR 2020: Object Detection(2)

 

Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector

Problem

Few-shot object detection aims to detect objects of unseen classes from only a few training examples.

Insight

Central to our method are the Attention-RPN and the multi-relation module, which fully exploit the similarity between the few-shot training examples and the test set to detect novel objects while suppressing false detections in the background.

A new few-shot detection dataset.

Technical overview

Our network architecture uses ResNet-50 as the backbone. The support image (in green) and the query image (in blue) are fed into the weight-shared backbone. The RPN uses the attention feature generated by depth-wise cross correlation between the compact 1 × 1 × C support feature and the H × W × C query feature. The class scores produced by the patch-relation head (top), the global-relation head (middle), and the local-correlation head (bottom) are added together as the final matching score, and the bounding-box predictions are generated by the patch-relation head.

Attention RPN. The support feature is average-pooled to a 1 × 1 × C vector, which is then depth-wise cross-correlated with the query feature; the output is used as the attention feature and fed into the RPN to generate proposals.
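Because the pooled support vector is a 1 × 1 × C kernel, the depth-wise cross correlation above reduces to a per-channel scaling of the query feature. A minimal numpy sketch (the function name and shapes are illustrative, not the authors' code):

```python
import numpy as np

def attention_feature(query, support):
    """Sketch of the Attention-RPN feature. `query` is H x W x C and
    `support` is Hs x Ws x C. The support feature is global-average-pooled
    to a 1 x 1 x C vector; depth-wise cross correlation with a 1x1 kernel
    is then just a per-channel scaling of the query feature."""
    kernel = support.mean(axis=(0, 1))  # 1 x 1 x C support vector
    return query * kernel               # H x W x C attention feature

# toy check: the output keeps the query's spatial shape
q = np.random.rand(8, 8, 256)
s = np.random.rand(5, 5, 256)
a = attention_feature(q, s)
```

In a real implementation this would be a grouped convolution (e.g., a depth-wise conv with the support vector as kernel), but the arithmetic is the same.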

The Dataset

Proof

Don't Even Look Once: Synthesizing Features for Zero-Shot Detection

  • Author: Pengkai Zhu, Hanxiao Wang, Venkatesh Saligrama
  • Arxiv: 1911.07933

Problem

Zero-Shot Detection

Assumption in prior work

At a fundamental level, while vanilla detectors are capable of proposing bounding boxes that include unseen objects, they are often incapable of assigning high confidence to unseen objects, due to the inherent precision/recall trade-off that requires rejecting background objects.

Technical overview

Don’t Even Look Once (DELO) synthesizes visual features for unseen objects and augments existing training algorithms to incorporate unseen object detection.

(a) An illustration of seen/unseen classes and their semantic descriptions. (b) A vanilla detector trained on seen objects only tends to suppress the confidence scores of unseen objects. (c) The proposed approach: we first train a visual-feature generator on a pool of visual features of foreground/background objects and their semantics with a balanced ratio, then use it to synthesize visual features for unseen objects. Finally, we add the synthesized visual features back to the pool and retrain the confidence-predictor module of the vanilla detector. The retrained confidence predictor can be plugged back into the detector to detect unseen objects.
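The synthesize-and-augment step can be sketched in a few lines. Everything here is a toy stand-in (a linear generator with Gaussian noise, made-up dimensions), not the paper's learned generator:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_features(generator_w, semantics, n_per_class):
    """Synthesize n_per_class visual features per unseen class from its
    semantic vector, via a linear map plus noise (stand-in for the
    learned generator in DELO)."""
    feats = []
    for s in semantics:
        mean = generator_w @ s
        feats.append(mean + 0.1 * rng.standard_normal((n_per_class, mean.size)))
    return np.vstack(feats)

sem_dim, feat_dim = 4, 8
generator_w = rng.standard_normal((feat_dim, sem_dim))
unseen_semantics = rng.standard_normal((3, sem_dim))   # 3 unseen classes

seen_pool = rng.standard_normal((20, feat_dim))        # features of seen objects
synth = generate_features(generator_w, unseen_semantics, n_per_class=5)
augmented_pool = np.vstack([seen_pool, synth])         # retrain the confidence
                                                       # predictor on this pool
```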

Proof

Zero-shot detection evaluation results on various datasets and seen/unseen splits. TU = Test-Unseen, TS = Test-Seen, and TM = Test-Mix denote different data configurations. Overall average precision (AP) in % is reported; the highest AP for every setting is in bold.

Multiple Anchor Learning for Visual Object Detection

  • Author: Wei Ke, Tianliang Zhang, Zeyi Huang, Qixiang Ye, Jianzhuang Liu, Dong Huang
  • Arxiv: 1912.02252

Problem

Classification and localization are the two pillars of visual object detectors.

Assumption in prior work

In CNN-based detectors, these two modules are usually optimized under a fixed set of candidate (or anchor) bounding boxes. This configuration significantly limits the possibility of jointly optimizing classification and localization.

Technical overview

Our approach, referred to as Multiple Anchor Learning (MAL), constructs anchor bags and selects the most representative anchors from each bag. Such an iterative selection process is potentially NP-hard to optimize. To address this issue, we solve MAL by repetitively depressing the confidence of selected anchors by perturbing their corresponding features. In an adversarial selection-depression manner, MAL not only pursues optimal solutions but also fully leverages multiple anchors/features to learn a detection model.

In the feature pyramid network, an anchor bag A_i is constructed for each object b_i. Together with network parameter learning, i.e., back-propagation, MAL evaluates the joint classification and localization confidence of each anchor in A_i. This confidence is used for anchor selection and indicates the importance of each anchor during network parameter evolution.
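One selection-depression iteration over an anchor bag can be sketched as follows. This is a toy illustration (names, the joint-confidence product, and the scaling-based depression are simplifications of the feature perturbation MAL actually learns):

```python
import numpy as np

def select_and_depress(cls_conf, loc_conf, features, depress=0.5):
    """Toy sketch of one MAL selection-depression step for a single anchor
    bag. The anchor with the highest joint classification*localization
    confidence is selected, and its feature is then depressed (scaled
    down here) so that other anchors in the bag can be explored in later
    iterations."""
    joint = cls_conf * loc_conf      # joint confidence per anchor in the bag
    selected = int(np.argmax(joint)) # pick the most representative anchor
    features = features.copy()
    features[selected] *= depress    # adversarial depression of its feature
    return selected, features

cls_c = np.array([0.9, 0.6, 0.8])
loc_c = np.array([0.5, 0.9, 0.7])
feats = np.ones((3, 4))
idx, new_feats = select_and_depress(cls_c, loc_c, feats)
```

Note how the third anchor wins here despite not having the best classification score: the joint confidence (0.8 × 0.7 = 0.56) is what drives selection.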

MAL implementation. During training, it adds anchor selection and anchor depression modules to RetinaNet. During testing, it uses exactly the same architecture as RetinaNet. "U" and "V" respectively denote convolutional feature maps before and after depression; "M" and "M′" respectively denote an activation map before and after depression.

Proof

Rethinking Classification and Localization in R-CNN

  • Author: Yue Wu, Yinpeng Chen, Lu Yuan, Zicheng Liu, Lijuan Wang, Hongzhi Li, Yun Fu
  • Arxiv: 1904.06493

Problem

Classification and Localization misalignment

Assumption in prior work

Modern R-CNN based detectors share the RoI feature extractor head for both classification and localization tasks, based upon the correlation between the two tasks.

Insight

In contrast, we found that different head structures (i.e. fully connected head and convolution head) have opposite preferences towards these two tasks. Specifically, the fully connected head is more suitable for the classification task, while the convolution head is more suitable for the localization task.

Technical overview

We propose a double-head method to separate these two tasks into different heads (i.e. a fully connected head for classification and a convolution head for box regression).
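The split can be illustrated with two tiny heads sharing one RoI feature. Shapes and layer choices below are illustrative only (the paper's conv head is a deeper stack, here reduced to a 1x1 channel mix):

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_head(roi, w):
    """Fully connected classification head: flatten the RoI feature,
    then apply a linear layer to get class logits."""
    return roi.reshape(-1) @ w

def conv_head(roi, k):
    """Convolution regression head (here a single 1x1 conv, i.e. a
    per-position channel mix), followed by spatial average pooling."""
    mixed = np.tensordot(roi, k, axes=([2], [0]))  # H x W x 4
    return mixed.mean(axis=(0, 1))                 # 4 box deltas

roi = rng.standard_normal((7, 7, 16))           # RoI-pooled feature
w_cls = rng.standard_normal((7 * 7 * 16, 5))    # 5 classes (toy)
k_reg = rng.standard_normal((16, 4))            # 4 box-regression outputs

logits = fc_head(roi, w_cls)   # classification branch
deltas = conv_head(roi, k_reg) # localization branch
```

The design intuition from the paper: the fc head sees the whole flattened RoI and is spatially sensitive, which suits classification, while the conv head preserves spatial structure, which suits box regression.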

Proof

Revisiting the Sibling Head in Object Detector

Problem

Classification and Localization misalignment

Assumption in prior work

This paper provides the observation that the spatial misalignment between the two object functions in the sibling head can considerably hurt the training process.

Insight

For classification and regression, TSD decouples them in the spatial dimension by generating two disentangled proposals, one for each task, both estimated from the shared proposal.

Technical overview

Illustration of the proposed TSD integrated with Faster R-CNN [30]. Input images are first fed into the FPN backbone, and then the region proposal P is generated by the RPN. TSD takes the RoI feature of P as input and estimates the derived proposals P̂c and P̂r for classification and localization. Finally, two parallel branches predict the specific category and regress the precise box, respectively.
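The disentanglement itself is easy to picture: the shared proposal spawns two shifted copies, one per task. A toy sketch (the offset parameterization here is a simplification; TSD actually learns point-wise deformations for the classification branch):

```python
def derive_proposals(p, dc, dr):
    """Toy sketch of TSD's spatial disentanglement. From the shared
    proposal p = (x, y, w, h), separately predicted offsets dc and dr
    (fractions of width/height) produce the classification proposal P_c
    and the regression proposal P_r, so the two tasks no longer operate
    on one shared spatial region."""
    x, y, w, h = p
    p_c = (x + dc[0] * w, y + dc[1] * h, w, h)  # proposal for classification
    p_r = (x + dr[0] * w, y + dr[1] * h, w, h)  # proposal for localization
    return p_c, p_r

# shared proposal at (10, 20) of size 8x6, with two different task offsets
p_c, p_r = derive_proposals((10.0, 20.0, 8.0, 6.0), (0.25, 0.0), (-0.25, 0.5))
```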

Proof

DR Loss: Improving Object Detection by Distributional Ranking

Problem

Imbalance issue in one-stage detector

Assumption in prior work

One-stage detectors are efficient but can suffer from imbalance in two respects: the imbalance between classes, and the imbalance within the distribution of background candidates, where only a few candidates are hard to identify.

Technical overview

First, we convert the classification problem into a ranking problem to alleviate the class-imbalance problem. Then, we propose to rank the distribution of foreground candidates above that of background ones in the constrained worst-case scenario. This strategy not only handles the imbalance among background candidates but also improves the efficiency of the ranking algorithm.

First, we re-weight examples to derive the constrained distributions for foreground and background from the original distributions, respectively. Then, we learn to rank the expectation of the derived distribution of foreground above that of background by a large margin.
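The re-weight-then-rank idea can be sketched with a simplified loss. This is a toy version (the temperature-based re-weighting and hinge below are stand-ins for the paper's constrained worst-case formulation and smoothed loss):

```python
import numpy as np

def dr_loss(fg_scores, bg_scores, margin=0.5, temp=4.0):
    """Toy sketch of distributional ranking. Examples are re-weighted
    toward the worst case: hard (low-score) foregrounds and hard
    (high-score) backgrounds receive larger weights. The loss then ranks
    the foreground expectation above the background expectation by a
    margin."""
    wf = np.exp(-temp * fg_scores); wf /= wf.sum()  # emphasize hard foreground
    wb = np.exp(temp * bg_scores);  wb /= wb.sum()  # emphasize hard background
    e_fg = (wf * fg_scores).sum()                   # derived fg expectation
    e_bg = (wb * bg_scores).sum()                   # derived bg expectation
    return max(0.0, margin - (e_fg - e_bg))         # hinge ranking loss

fg = np.array([0.9, 0.8, 0.2])   # one hard foreground (0.2)
bg = np.array([0.1, 0.05, 0.6])  # one hard background (0.6)
loss = dr_loss(fg, bg)
```

Because each image contributes one ranking between two derived expectations rather than millions of per-candidate terms, the class imbalance largely cancels out of the objective.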

Proof