Convolutional Neural Network Must Reads: Xception, ShuffleNet, ResNeXt and DenseNet

 

Xception: Deep Learning with Depthwise Separable Convolutions - Chollet - 2016

Info

  • Title: Xception: Deep Learning with Depthwise Separable Convolutions
  • Author: F. Chollet
  • Arxiv: 1610.02357
  • Date: Oct. 2016

Highlights & Drawbacks

  • Replaces the 1×1 and 3×3 convolutions in the Inception unit with depthwise separable convolutions

Motivation & Design

The article points out that the assumption behind the Inception unit is that cross-channel correlations and spatial correlations can be sufficiently decoupled. The same idea underlies the factorization of convolutions along the spatial dimensions: the 3×3 convolutions in Inception-v3 are replaced by a 1×3 convolution followed by a 3×1 convolution.
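For reference, a minimal sketch of this spatial factorization (illustrative only; the channel count 64 is arbitrary, and this is not the Inception-v3 code):

import torch.nn as nn

# A 3x3 convolution factorized along the spatial dimensions: a 1x3 convolution
# followed by a 3x1 convolution covers the same 3x3 receptive field with
# 6/9 of the parameters.
factorized_3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1), bias=False),
    nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0), bias=False),
)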

Further, Xception rests on a stronger assumption: cross-channel and spatial correlations are completely decoupled. This is exactly what depthwise separable convolution models. A simplified Inception module:

[Figure: a simplified Inception module]

is equal to:

[Figure: an equivalent reformulation, one large 1×1 convolution followed by spatial convolutions on non-overlapping channel segments]

Pushing the number of channel segments to the extreme, with one spatial convolution per channel, we obtain the depthwise separable convolution:

[Figure: the "extreme" version, one spatial convolution per output channel of the 1×1 convolution]

NetScope Visualization and source code: awesome_cnn.
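As a concrete reference, here is a minimal PyTorch sketch of a depthwise separable convolution (an illustrative re-implementation, not Chollet's Keras code). Note that Xception's "extreme Inception" applies the 1×1 convolution before the per-channel spatial convolution, while the standard depthwise separable convolution sketched here applies it after; the paper argues this ordering is immaterial once the blocks are stacked.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # Depthwise separable convolution: a per-channel (depthwise) spatial
    # convolution followed by a 1x1 (pointwise) convolution mixing channels.
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # groups=in_ch means each input channel is convolved independently
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=padding, groups=in_ch, bias=False)
        # the 1x1 convolution models the cross-channel correlations
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Usage: 32 -> 64 channels on a 56x56 feature map
x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])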

[Figure: the Xception architecture]

ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design - Ma et al. - ECCV 2018

Info

  • Title: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design
  • Task: Image Classification
  • Author: N. Ma, X. Zhang, H.-T. Zheng, and J. Sun
  • Date: Jul. 2018
  • Arxiv: 1807.11164
  • Published: ECCV 2018

Highlights & Drawbacks

  • Detailed analysis from hardware perspective
  • Design guidelines for efficient CNN architecture

Motivation & Design


There is a discrepancy between the indirect metric (FLOPs) and the direct metric (speed): FLOPs does not account for factors such as memory access cost (MAC) and the degree of parallelism, which are critical in the actual computation.

The authors propose four guidelines for efficient CNN architecture design, based on analysis from the perspective of the hardware and the processing framework:

  1. Equal channel width minimizes memory access cost (MAC): the closer the input and output channel counts c1 and c2 are, the faster the layer runs (a numerical check follows this list).

  2. Excessive group convolution increases MAC: a larger number of groups lowers speed.

  3. Network fragmentation (many small operators inside a block) reduces the degree of parallelism.

  4. Element-wise operations (ReLU, additions, etc.) are non-negligible.
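For guideline 1, the paper shows that a 1×1 convolution with c1 input channels, c2 output channels, and an h×w feature map costs B = hwc1c2 FLOPs and MAC = hw(c1 + c2) + c1c2, which for fixed B is minimized when c1 = c2. A quick numerical check of this bound (feature-map size chosen for illustration):

# Guideline 1, numerically: fix the FLOPs B = h*w*c1*c2 of a 1x1 convolution
# and compare MAC = h*w*(c1 + c2) + c1*c2 for different channel splits.
h = w = 56
for c1, c2 in [(32, 128), (64, 64), (128, 32)]:
    flops = h * w * c1 * c2                # identical for all three configs
    mac = h * w * (c1 + c2) + c1 * c2      # memory access cost
    print(f"c1={c1:3d} c2={c2:3d} FLOPs={flops} MAC={mac}")
# The balanced split (64, 64) yields the smallest MAC.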

The final ShuffleNet v2 unit: [Figure: the stride-1 and stride-2 ShuffleNet v2 units]

Performance & Ablation Study

Image classification results: [Figure]

Object detection results: [Figure]

Code

PyTorch

Channel Shuffle

import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    batchsize, num_channels, height, width = x.size()

    channels_per_group = num_channels // groups

    # reshape to (batch, groups, channels_per_group, height, width)
    x = x.view(batchsize, groups, channels_per_group, height, width)

    # swap the group and per-group channel dimensions
    x = torch.transpose(x, 1, 2).contiguous()

    # flatten back to (batch, channels, height, width)
    x = x.view(batchsize, -1, height, width)

    return x
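A quick sanity check of the shuffle pattern on a toy tensor:

# With 4 channels in 2 groups, channels (0, 1, 2, 3) are interleaved to (0, 2, 1, 3)
x = torch.arange(4.0).view(1, 4, 1, 1)
print(channel_shuffle(x, 2).flatten())  # tensor([0., 2., 1., 3.])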

Inverted Residual Module

class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, benchmodel):
        super(InvertedResidual, self).__init__()
        self.benchmodel = benchmodel
        self.stride = stride
        assert stride in [1, 2]

        oup_inc = oup//2
        
        if self.benchmodel == 1:
            # stride-1 unit: only half of the channels pass through this branch
            self.banch2 = nn.Sequential(
                # pw
                nn.Conv2d(oup_inc, oup_inc, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup_inc),
                nn.ReLU(inplace=True),
                # dw
                nn.Conv2d(oup_inc, oup_inc, 3, stride, 1, groups=oup_inc, bias=False),
                nn.BatchNorm2d(oup_inc),
                # pw-linear
                nn.Conv2d(oup_inc, oup_inc, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup_inc),
                nn.ReLU(inplace=True),
            )                
        else:
            # stride-2 (down-sampling) unit: both branches process the full input
            self.banch1 = nn.Sequential(
                # dw
                nn.Conv2d(inp, inp, 3, stride, 1, groups=inp, bias=False),
                nn.BatchNorm2d(inp),
                # pw-linear
                nn.Conv2d(inp, oup_inc, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup_inc),
                nn.ReLU(inplace=True),
            )        
    
            self.banch2 = nn.Sequential(
                # pw
                nn.Conv2d(inp, oup_inc, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup_inc),
                nn.ReLU(inplace=True),
                # dw
                nn.Conv2d(oup_inc, oup_inc, 3, stride, 1, groups=oup_inc, bias=False),
                nn.BatchNorm2d(oup_inc),
                # pw-linear
                nn.Conv2d(oup_inc, oup_inc, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup_inc),
                nn.ReLU(inplace=True),
            )
          
    @staticmethod
    def _concat(x, out):
        # concatenate along channel axis
        return torch.cat((x, out), 1)        

    def forward(self, x):
        if self.benchmodel == 1:
            # channel split: the first half passes through untouched
            x1 = x[:, :(x.shape[1] // 2), :, :]
            x2 = x[:, (x.shape[1] // 2):, :, :]
            out = self._concat(x1, self.banch2(x2))
        elif self.benchmodel == 2:
            out = self._concat(self.banch1(x), self.banch2(x))

        return channel_shuffle(out, 2)
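A usage sketch of the two unit types (channel sizes are illustrative, loosely following stage 2 of ShuffleNet v2 1×; benchmodel=1 denotes the stride-1 unit with channel split, benchmodel=2 the stride-2 down-sampling unit):

# stride-1 unit: channel split, half the channels pass through unchanged
block = InvertedResidual(116, 116, stride=1, benchmodel=1)
x = torch.randn(1, 116, 28, 28)
print(block(x).shape)   # torch.Size([1, 116, 28, 28])

# stride-2 unit: both branches are applied, spatial size is halved
down = InvertedResidual(116, 232, stride=2, benchmodel=2)
print(down(x).shape)    # torch.Size([1, 232, 14, 14])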

(ResNeXt) Aggregated Residual Transformations for Deep Neural Networks - Xie et al. - CVPR 2017

Info

  • Title: Aggregated Residual Transformations for Deep Neural Networks
  • Task: Image Classification
  • Author: S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He
  • Arxiv: 1611.05431
  • Date: Nov. 2016
  • Published: CVPR 2017
  • 1st Runner Up in ILSVRC 2016

Highlights & Drawbacks

The core idea of ResNeXt is to standardize the multi-path structure of the Inception module. Instead of hand-designed 1×1, 3×3, and 5×5 convolution branches, ResNeXt introduces a new hyper-parameter with a clear meaning for network design.

The authors propose a new dimension for designing neural networks, called cardinality. Besides depth (the number of layers) and width (the number of channels), cardinality describes the number of parallel paths inside one module. In contrast to the Inception module, all paths share exactly the same hyper-parameters. A shortcut connection is added around the aggregated paths.

Motivation & Design

Three classical patterns in neural network design:

  • Repeat: Starting with AlexNet and VGG, repeating the same structure is one of the most popular patterns in deep networks.
  • Multi-path: Introduced by the Inception series: split the input, transform it with convolutions of multiple sizes, then concatenate the results.
  • Skip-connection: Applied to image recognition by ResNet. Rewriting the target function as an identity mapping plus a residual function allows direct interaction between shallow and deep layers.

The residual function is rewritten as

$$\mathbf{y} = \mathbf{x} + \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x}),$$

where C denotes the number of transformations (paths) aggregated inside the layer, a.k.a. the cardinality.

[Figure: a ResNeXt block and its equivalent reformulations]

As the number of paths increases, the number of channels per path is reduced to keep the capacity of the network roughly constant.
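Because all paths share the same hyper-parameters, the aggregation is equivalent to a single grouped convolution with groups = C. A minimal PyTorch sketch of a ResNeXt bottleneck in this grouped-convolution form (an illustration, not the authors' Torch implementation; channel sizes follow the 32×4d setting):

import torch
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    # ResNeXt block in grouped-convolution form: the cardinality C
    # becomes the `groups` argument of the 3x3 convolution.
    def __init__(self, in_ch, bottleneck_ch, cardinality=32):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(in_ch, bottleneck_ch, 1, bias=False),
            nn.BatchNorm2d(bottleneck_ch),
            nn.ReLU(inplace=True),
            # C paths of width bottleneck_ch // C, fused into one grouped conv
            nn.Conv2d(bottleneck_ch, bottleneck_ch, 3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_ch, in_ch, 1, bias=False),
            nn.BatchNorm2d(in_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # y = x + aggregated transformations
        return self.relu(x + self.transform(x))

# Usage: C=32 paths of 4 channels each (the "32x4d" setting)
x = torch.randn(1, 256, 56, 56)
print(ResNeXtBottleneck(256, 128, cardinality=32)(x).shape)  # (1, 256, 56, 56)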

NetScope Visualization and source code (PyTorch + Caffe): awesome_cnn.

Performance & Ablation Study

[Figures: ImageNet classification results and ablation study]

Code

Torch

(DenseNet) Densely Connected Convolutional Networks - Huang et al. - CVPR 2017

Info

  • Title: Densely Connected Convolutional Networks
  • Task: Image Classification
  • Author: Gao Huang, Zhuang Liu, Laurens van der Maaten and Kilian Weinberger
  • Arxiv: 1608.06993
  • Published: CVPR 2017 (Best Paper Award)

Highlights

DenseNet takes the idea of the shortcut connection to its fullest. Inside a dense block, the output of each layer is passed to all following layers. Note that unlike the addition in ResNet, the DenseNet shortcut is concatenation, so the deeper the layer, the larger its number of input channels.

Motivation & Design

[Figure: dense connectivity within a dense block]

The entire network is divided into dense blocks and transition layers. The former are densely connected internally and keep the feature-map size unchanged; the latter sit between dense blocks and perform the downsampling.
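Concretely, a transition layer in the paper is BN-ReLU followed by a 1×1 convolution, which in the DenseNet-BC variant also compresses the channel count, and 2×2 average pooling. A minimal sketch:

import torch.nn as nn

# A DenseNet transition layer: 1x1 convolution (here with the DenseNet-BC
# compression factor 0.5) followed by 2x2 average pooling for downsampling.
def transition(in_ch, compression=0.5):
    out_ch = int(in_ch * compression)
    return nn.Sequential(
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.AvgPool2d(2),
    )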

Within each dense block, the input dimension of a layer grows with depth, because the outputs of all preceding layers are concatenated to it. Each layer contributes a fixed number of output channels, which the paper calls the growth rate, a hyper-parameter of the model. With a growth rate of 32, the number of channels entering the last layer of DenseNet-121 grows to 1024.
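The concatenation mechanism and the resulting channel growth can be sketched as follows (a minimal illustration that omits the paper's bottleneck 1×1 convolutions):

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    # One layer inside a dense block: BN-ReLU-Conv producing growth_rate
    # channels, concatenated onto everything the layer received as input.
    def __init__(self, in_ch, growth_rate):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.layer(x)], dim=1)

# Channels grow linearly: 64 input channels + 6 layers x growth rate 32 = 256,
# matching the first dense block of DenseNet-121.
x = torch.randn(1, 64, 56, 56)
for _ in range(6):
    x = DenseLayer(x.shape[1], 32)(x)
print(x.shape)  # torch.Size([1, 256, 56, 56])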

[Netscope Visualization](http://ethereon.github.io/netscope/#/gist/56cb18697f42eb0374d933446f45b151) and source code: awesome_cnn.

Performance & Ablation Study

The authors ran experiments on both CIFAR and ImageNet. DenseNet achieves performance comparable to ResNet, and after adding the bottleneck layers and the compression technique (DenseNet-BC), it matches ResNet with considerably fewer parameters:

[Figure: comparison with ResNet]

Code

Caffe