Info
- Title: Semantic Photo Manipulation with a Generative Image Prior
- Task: Image Manipulation
- Author: David Bau, Hendrik Strobelt, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, Antonio Torralba
- Date: July 2019
- Published: ACM SIGGRAPH 2019
- Affiliation: MIT CSAIL
Highlights & Drawbacks
- Image-specific generator that preserves the original image's details while applying semantic edits
- Interactive tool for semantic editing
- An optimization step is needed after each edit, which takes about 30 seconds on a modern GPU
Motivation & Design
The role of deep generative models will be to provide latent semantic representations in which concepts can be directly manipulated, and to preserve image realism when semantic changes are made.
Overall process:
- We first compute a latent vector $z = E(x)$ representing $x$.
- We then apply a semantic vector space operation $z_e = edit(z)$ in the latent space; this could add, remove, or alter a semantic concept in the image.
- Finally, we regenerate the image from the modified $z_e$.
Unfortunately, as can be seen in (b), the input image $x$ usually cannot be precisely generated by the generator $G$, so (c) using the generator $G$ to create the edited image $x_e = G(z_e)$ results in the loss of many attributes and details of the original image (a). Therefore, to generate the edited image, we propose a new last step: (d) we learn an image-specific generator $G'$ which can produce $x'_e = G'(z_e)$ that is faithful to the original image $x$ in the unedited regions.
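Below is a minimal PyTorch-style sketch of this pipeline. The encoder and generator are toy stand-ins (the paper uses a pretrained GAN and a trained encoder), and the semantic direction and strength are illustrative placeholders, not the paper's actual edit operations.

```python
import math
import torch
import torch.nn as nn

# Toy stand-ins for the paper's pretrained encoder E and generator G.
class ToyEncoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(latent_dim))
    def forward(self, x):
        return self.net(x)

class ToyGenerator(nn.Module):
    def __init__(self, latent_dim=128, out_shape=(3, 64, 64)):
        super().__init__()
        self.out_shape = out_shape
        self.net = nn.Linear(latent_dim, math.prod(out_shape))
    def forward(self, z):
        return self.net(z).view(-1, *self.out_shape)

def edit(z, direction, strength=1.0):
    # A vector-space edit: move z along a semantic direction.
    return z + strength * direction

E, G = ToyEncoder(), ToyGenerator()
x = torch.rand(1, 3, 64, 64)            # input photograph
z = E(x)                                # 1) compute latent vector z = E(x)
direction = torch.randn_like(z)         # hypothetical semantic direction
z_e = edit(z, direction, strength=0.5)  # 2) edit in latent space, z_e = edit(z)
x_e = G(z_e)                            # 3) fast preview of the edit, G(z_e)
# Final step (d): fine-tune an image-specific G' on x and render x'_e = G'(z_e).
```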
Controllable Image Synthesis with GANs
We seek a latent code $z$ that minimizes the reconstruction loss between the input image $x$ and the generated image $G(z)$:
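The equation itself is not reproduced in these notes; based on the text above, it is presumably of the form below, where $\ell$ is an image reconstruction loss (e.g., a pixel or perceptual distance):

$$z = \arg\min_{z} \, \ell\big(G(z), x\big)$$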
To ensure that the image-specific generator $G'$ has a latent space structure similar to the original generator $G$, we construct $G'$ by preserving all the early layers of $G$ exactly and applying perturbations only at the layers of the network that determine fine-grained details.
A small network $R$ is trained to produce small perturbations $\delta_i$ that multiply each layer's output in $G_F$ by $1 + \delta_i$. Each $\delta_i$ has the same number of channels and dimensions as the feature map of $G_F$ at layer $i$. This multiplicative change adjusts each feature map activation to be faithful to the output image. (Similar results can be obtained with additive $\delta_i$.) Formally, we construct $G'_F$ as follows:
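The formula after "as follows" is not reproduced in these notes; a hedged reconstruction from the description above, writing $r'_i$ for the feature map entering layer $i$ of $G'_F$ and $\odot$ for element-wise multiplication, is:

$$r'_{i+1} = \mathrm{layer}_i(r'_i) \odot (1 + \delta_i)$$

A minimal PyTorch sketch of this multiplicative perturbation follows. For brevity it learns each $\delta_i$ directly as a parameter rather than predicting it with the small network $R$, and the layer and feature-map shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PerturbedLayer(nn.Module):
    """Wrap a (frozen) layer of G_F and scale its output by (1 + delta)."""
    def __init__(self, layer, feature_shape):
        super().__init__()
        self.layer = layer
        for p in self.layer.parameters():
            p.requires_grad_(False)               # keep G_F's original weights fixed
        # delta has the same channels and spatial size as the layer's feature map
        self.delta = nn.Parameter(torch.zeros(feature_shape))
    def forward(self, r):
        return self.layer(r) * (1.0 + self.delta)  # multiplicative adjustment

layer = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # stand-in layer of G_F
perturbed = PerturbedLayer(layer, feature_shape=(64, 16, 16))
r = torch.randn(1, 64, 16, 16)
out = perturbed(r)                                    # same shape as layer(r)
```

Optimizing only these perturbations (or the small network $R$ that produces them) against the original image $x$ recovers fine details while leaving the latent space structure of $G$ intact.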
Performance & Ablation Study
Examples of the editing workflow. From left to right: the input image $x$ is first converted to the GAN image $G(z)$ and edited by painting a mask; the effect of this edit can be previewed at interactive rates as $G(z_e)$, and finally rendered using image-specific adaptation as $G'(z_e)$.
Changing the appearance of domes, grass, and trees. In each section, we show the original image $x$, the user's edit overlaid on $x$, and three variations under different selections of the reference image. Additionally, we show reconstructions of the reference image from $G$. In (c), we fix the reference image and only vary the strength term $s$.
Code
Related
- Deep Generative Models(Part 1): Taxonomy and VAEs
- Deep Generative Models(Part 2): Flow-based Models(include PixelCNN)
- Deep Generative Models(Part 3): GANs
- PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications - Salimans - ICLR 2017
- Gated PixelCNN: Conditional Image Generation with PixelCNN Decoders - van den Oord - NIPS 2016
- PixelRNN & PixelCNN: Pixel Recurrent Neural Networks - van den Oord - ICML 2016
- VQ-VAE: Neural Discrete Representation Learning - van den Oord - NIPS 2017
- VQ-VAE-2: Generating Diverse High-Fidelity Images with VQ-VAE-2 - Razavi - 2019
- Image to Image Translation(1): pix2pix, S+U, CycleGAN, UNIT, BicycleGAN, and StarGAN
- Image to Image Translation(2): pix2pixHD, MUNIT, DRIT, vid2vid, SPADE, INIT, and FUNIT