Fast RCNN - Girshick - ICCV 2015 - Caffe Code



  • Title: Fast RCNN
  • Task: Object Detection
  • Author: Ross Girshick
  • Arxiv: 1504.08083
  • Date: April 2015
  • Published: ICCV 2015


  • An improvement to R-CNN, introducing the RoI Pooling design
  • The article is clearly structured

Motivation & Design

R-CNN’s Drawbacks

  • Training is a multi-stage process (Proposal, Classification, Regression)
  • Training takes time and effort
  • Inference is time-consuming

The reason it is time-consuming is that the CNN is run separately on each proposal, with no shared computation.



The picture above shows the architecture of Fast R-CNN. The feature extractor produces a feature map from the image, and the region proposals from the Selective Search algorithm are mapped onto that feature map as RoIs (Regions of Interest). An RoI Pooling operation is then performed on each RoI to obtain a feature vector of fixed length, which is fed to the classification and bounding-box regression heads.

This structure of Fast R-CNN is the prototype of the meta-structure used in the main 2-stage method of the detection task. The entire system consists of several components: Proposal, Feature Extractor, Object Recognition & Localization. The Proposal part is replaced by RPN (Faster R-CNN), the Feature Extractor part uses SOTA’s classified CNN network (ResNet, etc.), and the last part is often a parallel multitasking structure (Mask R-CNN, etc.).

RoI Pooling

This operation unifies RoIs (on the feature map) of different sizes. Concretely, each RoI is divided into a target number of grid cells, and max pooling is performed within each cell, yielding RoI feature vectors of the same length regardless of RoI size.
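The grid-and-max-pool idea can be sketched in a few lines of NumPy. This is a minimal illustration (the function name and the integer bin-edge rounding are my assumptions, not the paper's Caffe implementation, which also handles the spatial-scale mapping and backpropagation):

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Naive RoI max pooling sketch (illustrative, not the paper's code).
    feature_map: (H, W) array; roi: (x0, y0, x1, y1) in feature-map coords.
    Divides the RoI into an output_size grid and max-pools each cell."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    oh, ow = output_size
    out = np.zeros((oh, ow), dtype=float)
    # Integer bin edges: a coarse quantization, in the spirit of SPP/Fast R-CNN
    ys = np.linspace(0, h, oh + 1).astype(int)
    xs = np.linspace(0, w, ow + 1).astype(int)
    for i in range(oh):
        for j in range(ow):
            cell = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            if cell.size:
                out[i, j] = cell.max()
    return out
```

Whatever the RoI's size, the output is always `output_size`, so the downstream FC layers see a fixed-length vector.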

Mini-batch Sampling

The article points out that SPPNet trains slowly because RoIs drawn from different pictures cannot share computation. Fast R-CNN therefore adopts the following mini-batch sampling strategy: first sample N images, then sample R/N RoIs from each image, forming a mini-batch of size R (the paper uses N = 2, R = 128).

When sampling, a fixed 25% of the RoIs are positive samples (IoU with ground truth of at least 0.5); RoIs with IoU in [0.1, 0.5) are treated as hard negative examples for the background class.
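The per-image sampling rule above can be sketched as follows. The helper name and signature are illustrative assumptions; the thresholds (25% foreground fraction, IoU ≥ 0.5 for positives, [0.1, 0.5) for hard background) follow the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rois(ious, rois_per_image=64, fg_fraction=0.25):
    """Sketch of Fast R-CNN's RoI sampling (illustrative, not the paper's code).
    ious: each RoI's max IoU with any ground-truth box.
    Keeps ~25% foreground (IoU >= 0.5); background is drawn from
    IoU in [0.1, 0.5) as hard examples."""
    ious = np.asarray(ious)
    fg = np.flatnonzero(ious >= 0.5)
    bg = np.flatnonzero((ious >= 0.1) & (ious < 0.5))
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))
    n_bg = rois_per_image - n_fg
    fg_pick = rng.choice(fg, size=n_fg, replace=False)
    bg_pick = rng.choice(bg, size=n_bg, replace=len(bg) < n_bg)
    return fg_pick, bg_pick
```

Because all 64 RoIs come from the same image, they share one forward pass through the convolutional layers, which is the source of the training speedup over SPPNet and R-CNN.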

Multi-task Loss

After the RoI feature vector is obtained, the subsequent heads form a parallel structure. Fast R-CNN unifies the classification and regression losses into a single multi-task loss, and replaces the L2 loss with the Smooth L1 loss for regression.
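The Smooth L1 loss is quadratic near zero and linear for large residuals, which makes training less sensitive to outliers than L2. A minimal sketch of the piecewise definition used in the paper:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss from Fast R-CNN, applied elementwise:
    0.5 * x^2        if |x| < 1
    |x| - 0.5        otherwise
    """
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)
```

In the full multi-task loss this is summed over the four box-regression targets and added, for positive RoIs only, to the classification log loss.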

Fine Tuning

The article also finds that, for the pre-trained VGG network, updating the parameters of the conv layers as well, rather than only the FC layers, improves performance. Classification and regression are thus trained jointly in a single framework (the proposal step, still Selective Search here, is only folded in later by Faster R-CNN).

Design Evaluation

The article concludes with a discussion of the system structure:

  • Multi-task loss training does improve over training each task separately.
  • On scale invariance, multi-scale input gives a slight accuracy improvement over single-scale, but at a much higher time cost; to some extent, the CNN can learn scale invariance intrinsically.
  • Training on more data (VOC) further improves mAP.
  • The Softmax classifier performs slightly better than the “one vs rest” SVMs, as it introduces competition between classes.
  • More proposals do not necessarily lead to better performance.

Performance & Ablation Study