Few-shot Video-to-Video Synthesis


Info

  • Title: Few-shot Video-to-Video Synthesis
  • Task: Video-to-Video Translation
  • Authors: Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, Bryan Catanzaro
  • Date: Oct. 2019
  • Published: NeurIPS 2019

Abstract

Video-to-video synthesis (vid2vid) aims at converting an input semantic video, such as videos of human poses or segmentation masks, to an output photorealistic video. While the state-of-the-art of vid2vid has advanced significantly, existing approaches share two major limitations. First, they are data-hungry. Numerous images of a target human subject or a scene are required for training. Second, a learned model has limited generalization capability. A pose-to-human vid2vid model can only synthesize poses of the single person in the training set. It does not generalize to other humans that are not in the training set. To address the limitations, we propose a few-shot vid2vid framework, which learns to synthesize videos of previously unseen subjects or scenes by leveraging few example images of the target at test time. Our model achieves this few-shot generalization capability via a novel network weight generation module utilizing an attention mechanism. We conduct extensive experimental validations with comparisons to strong baselines using several large-scale video datasets including human-dancing videos, talking-head videos, and street-scene videos. The experimental results verify the effectiveness of the proposed framework in addressing the two limitations of existing vid2vid approaches.

Motivation & Design

Comparison between the vid2vid (left) and the proposed few-shot vid2vid (right).

Existing vid2vid methods [7, 12, 57] do not consider generalization to unseen domains. A trained model can only be used to synthesize videos similar to those in the training set. For example, a pose-to-human vid2vid model can only generate videos of the person in the training set. To synthesize a new person, one needs to collect a dataset of the new person and use it to train a new vid2vid model. In contrast, our few-shot vid2vid model does not have these limitations: it can synthesize videos of new persons by leveraging a few example images provided at test time.

Formulation

The generator $F$ takes two additional input arguments: one is a set of $K$ example images $\{e_1, e_2, \ldots, e_K\}$ of the target domain, and the other is the set of their corresponding semantic images $\{s_{e_1}, s_{e_2}, \ldots, s_{e_K}\}$. That is, $\tilde{x}_t = F(\tilde{x}_{t-\tau}^{t-1}, \mathbf{s}_{t-\tau}^{t}, \{e_1, \ldots, e_K\}, \{s_{e_1}, \ldots, s_{e_K}\})$, where $\tilde{x}_{t-\tau}^{t-1}$ denotes the previously generated frames and $\mathbf{s}_{t-\tau}^{t}$ the current and past input semantic frames.

This modeling allows $F$ to leverage the example images given at test time to extract useful patterns for synthesizing videos of the unseen domain. We propose a network weight generation module $E$ for extracting these patterns. Specifically, $E$ is designed to extract patterns from the provided example images and use them to compute the network weights $\theta_H$ for the intermediate image synthesis network $H$.
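A minimal PyTorch sketch of this dynamic-weight idea (my illustration, not the authors' implementation): an MLP playing the role of $E_P$ maps an appearance vector $q$ to the parameters of one convolutional layer inside $H$, which is then applied with the functional conv API. The class name `DynamicConv` and all layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

class DynamicConv(nn.Module):
    """One conv layer of H whose parameters are generated from q (the E_P role)."""
    def __init__(self, q_dim=256, in_ch=64, out_ch=64, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        # E_P-style MLP: predicts out_ch*in_ch*k*k weights plus out_ch biases from q.
        self.weight_mlp = nn.Linear(q_dim, out_ch * in_ch * k * k + out_ch)

    def forward(self, feat, q):
        theta = self.weight_mlp(q)                       # theta_H for this layer
        n_w = self.out_ch * self.in_ch * self.k * self.k
        weight = theta[:n_w].view(self.out_ch, self.in_ch, self.k, self.k)
        bias = theta[n_w:]
        return nnf.conv2d(feat, weight, bias, padding=self.k // 2)

layer = DynamicConv()
feat = torch.randn(1, 64, 32, 32)    # intermediate feature map inside H
q = torch.randn(256)                 # appearance representation extracted by E_F
out = layer(feat, q)                 # convolution parameterized by the generated theta_H
print(out.shape)                     # torch.Size([1, 64, 32, 32])
```

The point of the sketch is that the parameters of this layer are not learned constants of $H$; they are produced on the fly from the example images, which is what lets the model adapt to an unseen subject at test time.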

Overall Architecture

(a) Architecture of the vid2vid framework. (b) Architecture of the proposed few-shot vid2vid framework. It includes a network weight generation module $E$ that maps example images to part of the network weights for video synthesis. The module $E$ consists of three sub-networks: $E_F$, $E_P$, and $E_A$ (used when $K > 1$). The sub-network $E_F$ extracts features $q$ from the example images. When there are multiple example images ($K > 1$), $E_A$ combines the extracted features by estimating soft attention maps $\alpha$ and computing a weighted average of the features. The final representation is then fed into the network $E_P$ to generate the weights $\theta_H$ for the image synthesis network $H$.

Network Weight Generation

We decompose $E$ into two sub-networks: an example feature extractor $E_F$ and a multi-layer perceptron $E_P$. The network $E_F$ consists of several convolutional layers and is applied to the example image $e_1$ to extract an appearance representation $q$. The representation $q$ is then fed into $E_P$ to generate the weights $\theta_H$ for the intermediate image synthesis network $H$.
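A hedged sketch of the $E_F$ role for the $K = 1$ case (the conv stack, feature sizes, and class name are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class ExampleFeatureExtractor(nn.Module):
    """The E_F role: a small conv stack mapping an example image to features q."""
    def __init__(self, C=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, C, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, e1):
        feat = self.convs(e1)        # B x C x H/8 x W/8
        return feat.flatten(2)       # B x C x N with N = (H/8) * (W/8)

E_F = ExampleFeatureExtractor()
e1 = torch.randn(1, 3, 128, 128)     # the single example image (K = 1)
q = E_F(e1)                          # appearance representation q
print(q.shape)                       # torch.Size([1, 32, 256])
```

Flattening $q$ and feeding it to an $E_P$-style MLP, as in the `DynamicConv` sketch above, would then produce the weights $\theta_H$ of $H$.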

Attention-based Aggregation

When $K > 1$, an attention sub-network $E_A$ is applied to each of the example semantic images $s_{e_k}$. This results in a key vector $a_k \in \mathbb{R}^{C \times N}$, where $C$ is the number of channels and $N = H \times W$ is the spatial dimension of the feature map. We also apply $E_A$ to the current input semantic image $s_t$ to extract its key vector $a_t \in \mathbb{R}^{C \times N}$. We then compute the attention weight $\alpha_k \in \mathbb{R}^{N \times N}$ by taking the matrix product $\alpha_k = (a_k)^T \otimes a_t$. The attention weights are used to compute a weighted average of the appearance representations, $q = \sum_{k=1}^{K} q_k \otimes \alpha_k$, which is then fed into the multi-layer perceptron $E_P$ to generate the network weights $\theta_H$.
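A hedged sketch of this aggregation step using the shapes given above ($a_k, a_t \in \mathbb{R}^{C \times N}$, $\alpha_k \in \mathbb{R}^{N \times N}$). The softmax normalization axis, the random tensors, and all sizes are my assumptions for illustration, not taken from the paper:

```python
import torch

# Shapes follow the text: keys a_k, a_t in R^{C x N}, attention alpha_k in R^{N x N}.
C, N, K = 32, 16 * 16, 3                        # channels, spatial locations, examples

q_k = torch.randn(K, C, N)                      # appearance features of the K examples (from E_F)
a_k = torch.randn(K, C, N)                      # keys of the K example semantic images (from E_A)
a_t = torch.randn(C, N)                         # key of the current semantic image s_t (from E_A)

# alpha_k = (a_k)^T (x) a_t : one N x N attention map per example image
scores = torch.einsum('kcn,cm->knm', a_k, a_t)  # K x N x N
alpha = torch.softmax(scores, dim=0)            # soft attention across the K examples (assumed axis)

# q = sum_k q_k (x) alpha_k : attention-weighted average of the appearance features
q = torch.einsum('kcn,knm->cm', q_k, alpha)     # C x N, subsequently fed into E_P
print(q.shape)                                  # torch.Size([32, 256])
```

Intuitively, the attention lets each spatial location of the current semantic map pick which example image (e.g., which viewpoint or pose) to borrow appearance information from.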

Experiments & Ablation Study

Results of the model are available at the Project Site.

Code

Project Site

PyTorch