Mocycle-GAN: Unpaired Video-to-Video Translation - Yang Chen - ACM MM 2019

Info

Title: Mocycle-GAN: Unpaired Video-to-Video Translation
Task: Video-to-Video Translation
Author: Yang Chen, Yingwei Pan, Ting Yao, Xinmei Tian and Tao Mei
Date: Aug. 2019
Arxiv: 1908.09514
Published: ACM MM 2019
Affiliation: USTC

Abstract

Unsupervised image-to-image translation is the task of translating an image from one domain to another in the absence of any paired training examples and tends to be more applicable to practical applications. Nevertheless, the extension of such synthesis from image-to-image to video-to-video is not trivial, especially when capturing spatio-temporal structures in videos. The difficulty originates from the fact that not only the visual appearance in each frame but also the motion between consecutive frames should be realistic and consistent across transformation. This motivates us to explore both appearance structure and temporal continuity in video synthesis. In this paper, we present a new Motion-guided Cycle GAN, dubbed Mocycle-GAN, that novelly integrates motion estimation into an unpaired video translator. Technically, Mocycle-GAN capitalizes on three types of constraints: an adversarial constraint discriminating between synthetic and real frames, cycle consistency encouraging an inverse translation on both frame and motion, and motion translation validating the transfer of motion between consecutive frames. Extensive experiments are conducted on video-to-labels and labels-to-video translation, and superior results are reported when compared to state-of-the-art methods. More remarkably, we qualitatively demonstrate our Mocycle-GAN for both flower-to-flower and ambient condition transfer.

Motivation & Design

Comparison between two unpaired translation approaches and Mocycle-GAN

(a) Cycle-GAN exploits a cycle-consistency constraint to model appearance structure for unpaired image-to-image translation. (b) Recycle-GAN utilizes temporal predictors ($P_X$ and $P_Y$) to explore cycle consistency across both domains and time for unpaired video-to-video translation. (c) Mocycle-GAN explicitly models motion across frames with optical flow ($f_{x_t}$ and $f_{y_s}$), and pursues cycle consistency on motion that enforces the reconstruction of motion. Motion translation is further exploited to transfer the motion across domains via motion translators ($M_X$ and $M_Y$), strengthening the temporal continuity in video synthesis. A dotted line denotes a consistency constraint between its two endpoints.

The overview of Mocycle-GAN for unpaired video-to-video translation ($X$: source domain; $Y$: target domain). Note that only the forward cycle $X \to Y \to X$ is depicted for simplicity. Mocycle-GAN consists of generators ($G_X$ and $G_Y$) to synthesize frames across domains, discriminators ($D_X$ and $D_Y$) to distinguish real frames from synthetic ones, and a motion translator ($M_X$) for motion translation across domains. Given two real consecutive frames $x_t$ and $x_{t+1}$, we first translate them into the synthetic frames $\tilde{x}_t$ and $\tilde{x}_{t+1}$ via $G_X$, which are then transformed into the reconstructed frames $x_t^{rec}$ and $x_{t+1}^{rec}$ through the inverse mapping $G_Y$. In addition, two optical flows $f_{x_t}$ and $f_{x_t^{rec}}$ are obtained with FlowNet to represent the motion before and after the forward cycle.
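The data flow of the forward cycle described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name forward_cycle, the toy generators, and toy_flow are all hypothetical stand-ins. In particular, toy_flow is a placeholder with the right output shape and is not FlowNet.

```python
import numpy as np

def forward_cycle(x_t, x_t1, G_X, G_Y, flow_fn):
    """Run the forward cycle X -> Y -> X on two consecutive frames."""
    # Translate both real source frames into the target domain.
    x_t_syn, x_t1_syn = G_X(x_t), G_X(x_t1)
    # Map the synthetic frames back to the source domain.
    x_t_rec, x_t1_rec = G_Y(x_t_syn), G_Y(x_t1_syn)
    # Estimate motion before and after the cycle.
    f_x = flow_fn(x_t, x_t1)
    f_x_rec = flow_fn(x_t_rec, x_t1_rec)
    return {
        "x_t_syn": x_t_syn, "x_t1_syn": x_t1_syn,
        "x_t_rec": x_t_rec, "x_t1_rec": x_t1_rec,
        "f_x": f_x, "f_x_rec": f_x_rec,
    }

# Toy stand-ins (assumptions): intensity inversion is its own inverse,
# so G_Y exactly undoes G_X here.
G_X = lambda frame: 1.0 - frame
G_Y = lambda frame: 1.0 - frame

def toy_flow(a, b):
    # NOT FlowNet: a placeholder that returns an (H, W, 2) field
    # derived from the frame difference, only to exercise the data flow.
    diff = b - a
    return np.stack([diff, diff], axis=-1)

rng = np.random.default_rng(0)
x_t = rng.random((4, 4))    # grayscale frame at time t
x_t1 = rng.random((4, 4))   # grayscale frame at time t+1
out = forward_cycle(x_t, x_t1, G_X, G_Y, toy_flow)
```

With real networks, the reconstructed frames and flows would only approximate the originals, which is what the cycle consistency constraints penalize.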

During training, we leverage three kinds of spatial/temporal constraints to explore appearance structure and temporal continuity for video translation:

Adversarial Constraint ($L_{Adv}$) ensures that each synthetic frame looks realistic through adversarial learning;
Frame and Motion Cycle Consistency Constraints ($L_{FC}$ and $L_{MC}$) encourage an inverse translation on both frames and motions;
Motion Translation Constraint ($L_{MT}$) validates the transfer of motion across domains in video synthesis. Specifically, the motion translator $M_X$ converts the optical flow $f_{x_t}$ in the source domain to $\tilde{f}_{x_t}$ in the target domain, which is then used to warp the synthetic frame $\tilde{x}_t$ to the subsequent frame $W(\tilde{f}_{x_t}, \tilde{x}_t)$. This constraint encourages the synthetic subsequent frame $\tilde{x}_{t+1}$ to be consistent with the warped version $W(\tilde{f}_{x_t}, \tilde{x}_t)$ at the traceable points, leading to pixel-wise temporal continuity.
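These notes do not state how the constraints are combined into one training objective. A common choice for multi-term GAN objectives is a weighted sum, sketched below under that assumption. The function name generator_objective and the weights lam_fc, lam_mc, and lam_mt are hypothetical and their defaults are arbitrary, not values from the paper.

```python
def generator_objective(l_adv, l_fc, l_mc, l_mt,
                        lam_fc=1.0, lam_mc=1.0, lam_mt=1.0):
    """Combine the four loss terms into one scalar (assumed weighted sum).

    l_adv: adversarial loss, l_fc: frame cycle consistency loss,
    l_mc: motion cycle consistency loss, l_mt: motion translation loss.
    The lam_* trade-off weights are illustrative assumptions.
    """
    return l_adv + lam_fc * l_fc + lam_mc * l_mc + lam_mt * l_mt

# Example with made-up loss values, weighting frame reconstruction more heavily.
total = generator_objective(l_adv=0.7, l_fc=0.2, l_mc=0.1, l_mt=0.05, lam_fc=10.0)
```

Setting a weight to zero removes the corresponding term, which is one way to set up the kind of per-component comparison reported in the ablation study.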

Motion Cycle Consistency Constraint
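As a rough illustration of this constraint, the sketch below compares the optical flow before the forward cycle with the flow after it. The use of a mean absolute (L1) distance is an assumption made here for simplicity, and the function names are hypothetical. The flows are plain arrays of shape (H, W, 2) rather than outputs of a real flow estimator.

```python
import numpy as np

def l1_distance(a, b):
    """Mean absolute difference between two arrays of the same shape."""
    return float(np.mean(np.abs(a - b)))

def motion_cycle_loss(f_x, f_x_rec):
    """Penalize motion that is not preserved by the cycle X -> Y -> X.

    f_x:     flow between the real frames x_t and x_{t+1}, shape (H, W, 2)
    f_x_rec: flow between the reconstructed frames, shape (H, W, 2)
    """
    return l1_distance(f_x, f_x_rec)

# Original motion: every pixel moves one pixel to the right.
f_x = np.zeros((2, 3, 2))
f_x[..., 0] = 1.0
# Reconstructed motion drifts by half a pixel vertically.
f_x_rec = f_x.copy()
f_x_rec[..., 1] = 0.5

print(motion_cycle_loss(f_x, f_x_rec))  # 0.25
```

The loss is zero only when the reconstructed frames move exactly as the original frames did.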

Motion Translation Constraint
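The warping step behind this constraint can be illustrated with a small NumPy sketch. It makes several simplifying assumptions: grayscale frames, a nearest-neighbour forward warp instead of a differentiable one, and an L1 penalty restricted to traceable pixels. The motion translator itself is not modelled, so flow_translated stands for whatever flow it would output. All function names are hypothetical.

```python
import numpy as np

def warp_forward(flow, frame):
    """Move each pixel of `frame` along `flow` (nearest-neighbour forward warp).

    flow[..., 0] is the horizontal displacement, flow[..., 1] the vertical one.
    Returns the warped frame and a boolean mask of target pixels that
    received a value (the traceable points). If several source pixels land
    on the same target, the last one written wins.
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    # Drop pixels that would leave the image.
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    warped = np.zeros_like(frame)
    traceable = np.zeros((h, w), dtype=bool)
    warped[ty[inside], tx[inside]] = frame[inside]
    traceable[ty[inside], tx[inside]] = True
    return warped, traceable

def motion_translation_loss(flow_translated, x_t_syn, x_t1_syn):
    """L1 gap between the synthetic next frame and the warped synthetic frame,
    averaged over traceable pixels only."""
    warped, traceable = warp_forward(flow_translated, x_t_syn)
    if not traceable.any():
        return 0.0
    return float(np.mean(np.abs(x_t1_syn[traceable] - warped[traceable])))

# Synthetic frame at time t and a translated flow shifting everything right by 1.
x_t_syn = np.arange(12.0).reshape(3, 4)
flow_translated = np.zeros((3, 4, 2))
flow_translated[..., 0] = 1.0

# A synthetic next frame that matches the warp everywhere it is traceable.
x_t1_syn = np.full((3, 4), 99.0)
x_t1_syn[:, 1:] = x_t_syn[:, :-1]

loss = motion_translation_loss(flow_translated, x_t_syn, x_t1_syn)
```

In this example the first column of the next frame has no source pixel, so it is excluded from the loss, and the remaining pixels match the warped frame exactly.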

The Training Procedure

Performance & Ablation Study

Examples of (a) video-to-labels and (b) labels-to-video results on the Viper dataset under various ambient conditions. The original inputs, the outputs of different models, and the ground-truth outputs are shown.

Ablation study for each design (i.e., Motion Cycle Consistency (MC) and Motion Translation (MT)) in Mocycle-GAN for video-to-labels on Viper.
