Understanding FaceShifter: a new face-swapping model

A simple yet complete explanation

Nowadays, deep learning can achieve marvelous results in the field of image synthesis and manipulation. We have seen websites that hallucinate imaginary people, videos showing famous people saying things they never said, and tools that make people dance, all with enough realism to fool most of us. One of the latest such feats is FaceShifter [1], a deep learning model that swaps faces in images and outperforms previous state-of-the-art methods. In this article, we are going to understand how it works.

Problem Statement

We have a source face image Xₛ and a target face image Xₜ, and we want to produce a new face image Yₛₜ that has the attributes of Xₜ (pose, lighting, eyeglasses, etc.) but the identity of the person in Xₛ. This problem statement is summarized in figure 1. Now, we move on to explaining the model.

Figure 1. The problem statement of face swapping. The shown result is from the FaceShifter model. Adapted from [1].

The FaceShifter Model

FaceShifter consists of two networks called AEI-Net & HEAR-Net. AEI-Net produces a preliminary face-swapping result, and HEAR-Net refines it. Let's break down this pipeline.

AEI-Net

AEI-Net is an acronym for “Adaptive Embedding Integration Network”: it adaptively integrates an identity embedding with attribute embeddings. It consists of 3 sub-networks:

  1. Identity encoder: an encoder concerned with embedding Xₛ into a space that describes the identity of the face in the image.
  2. Multi-level attributes encoder: An encoder concerned with embedding Xₜ into a space that describes the attributes that we want to preserve when we swap faces.
  3. AAD Generator: A generator that integrates the outputs of the two previous sub-networks to produce the swapped face, i.e. the face in Xₜ with the identity of Xₛ.

AEI-Net is shown in figure 2. Let’s flesh out its details.

Figure 2. The architecture of AEI-Net. Adapted from [1].
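
Before diving into each sub-network, here is a minimal sketch of how the three pieces fit together, written in PyTorch-style Python. The class and argument names are my own for illustration and do not come from the official implementation.

```python
import torch.nn as nn


class AEINet(nn.Module):
    """High-level composition of AEI-Net: identity encoder + attribute encoder + AAD generator."""

    def __init__(self, identity_encoder, attribute_encoder, aad_generator):
        super().__init__()
        self.identity_encoder = identity_encoder    # pre-trained face recognition network (frozen)
        self.attribute_encoder = attribute_encoder  # U-Net-like network, trained from scratch
        self.aad_generator = aad_generator          # stack of AAD ResBlocks

    def forward(self, x_source, x_target):
        z_id = self.identity_encoder(x_source)    # a single identity vector
        z_att = self.attribute_encoder(x_target)  # a list of multi-resolution attribute feature maps
        y_st = self.aad_generator(z_id, z_att)    # preliminary swapped face Yₛₜ*
        return y_st, z_id, z_att
```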

Identity Encoder

This sub-network projects the source image Xₛ into a lower-dimensional feature space. The output is a single vector, which we will call zᵢ, as seen in figure 3. This vector encodes the identity of the face in Xₛ, which means it should capture the features we humans use to differentiate between the faces of different people, like the shape of the eyes, the distance between the eyes and the mouth, the curvature of the mouth, and so on.

The authors don't train this encoder themselves; they use a network pre-trained for face recognition. This is expected to meet our requirements, since a network that differentiates faces has to extract features related to identity.

Figure 3. The identity network. Adapted from [1].
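
As a rough sketch of what this looks like in code (assuming we have some pre-trained face-recognition backbone at hand; the wrapper below and its normalization step are my own illustration, not the paper's exact setup):

```python
import torch.nn as nn
import torch.nn.functional as F


class IdentityEncoder(nn.Module):
    """Wraps a frozen face-recognition backbone and exposes its embedding as z_id."""

    def __init__(self, face_recognition_backbone):
        super().__init__()
        self.backbone = face_recognition_backbone
        for p in self.backbone.parameters():  # frozen: used only as a feature extractor
            p.requires_grad = False

    def forward(self, x_source):
        z_id = self.backbone(x_source)   # (batch, d) identity embedding
        return F.normalize(z_id, dim=1)  # unit length, convenient for cosine-based losses
```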

Multi-level Attributes Encoder

This sub-network encodes the target image Xₜ. Unlike the identity encoder, it produces multiple embeddings, specifically 8, collectively called zₐ, each describing the attributes of Xₜ at a different spatial resolution. Attributes here mean the configuration of the face in the target image: the pose of the face, its outline, the facial expression, hairstyle, skin color, background, scene lighting, and so on. The encoder is a ConvNet with a U-Net-like structure, as can be seen in figure 4, where the output embeddings are simply the feature maps of each level in the upscaling/decoding part. Note that this sub-network isn't pre-trained.

Figure 4. The Multi-level Attributes Encoder architecture. Adapted from [1].

Representing Xₜ with multiple embeddings is necessary because a single embedding at a single spatial resolution would lose information needed to produce the required output image with a swapped face (i.e. there are too many fine details we want to preserve from Xₜ to compress the image that much). This is evident in the ablation study the authors ran, where representing Xₜ with only the first 3 zₐ embeddings instead of all 8 resulted in blurrier outputs, as seen in figure 5.

Figure 5. The effect of using multiple embeddings to represent the target. Compressed is the output if we use the first 3 zₐ embeddings, and AEI-Net is when we use all 8 of them. Adapted from [1].
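
To give a feel for the “feature maps of each level in the decoding part” idea, here is a toy U-Net-like encoder that returns its decoder feature maps as zₐ. The channel counts and number of levels are made up to keep the sketch short; the real network uses 8 levels and a different layer configuration.

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """Downsampling block: halves the spatial resolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )


class MultiLevelAttributeEncoder(nn.Module):
    """Toy U-Net-like encoder-decoder returning decoder feature maps at several resolutions."""

    def __init__(self):
        super().__init__()
        self.down1 = conv_block(3, 32)    # e.g. 128x128 -> 64x64
        self.down2 = conv_block(32, 64)   # 64x64 -> 32x32
        self.down3 = conv_block(64, 128)  # 32x32 -> 16x16 (bottleneck)
        self.up1 = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)  # 16x16 -> 32x32
        self.up2 = nn.ConvTranspose2d(128, 32, 4, stride=2, padding=1)  # 32x32 -> 64x64 (after skip concat)

    def forward(self, x_target):
        d1 = self.down1(x_target)
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        u1 = torch.cat([self.up1(d3), d2], dim=1)  # U-Net skip connection
        u2 = torch.cat([self.up2(u1), d1], dim=1)
        return [d3, u1, u2]  # z_att: attribute feature maps at increasing resolution
```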

AAD Generator

AAD Generator is an acronym for “Adaptive Attentional Denormalization Generator”. It integrates the outputs of the previous two sub-networks, in order of increasing spatial resolution, to produce the final output of AEI-Net. It does so by stacking a novel block called the AAD ResBlock, shown in figure 6.

Figure 6. The AAD Generator architecture in the left image, and the AAD ResBlock in the right image. Adapted from [1].

The new piece of this block is the AAD layer. Let's break it down into 3 parts, as in figure 7. At a high level, part 1 tells us how to edit the input feature map hᵢₙ to make it more like Xₜ in terms of attributes. Concretely, it outputs two tensors of the same size as hᵢₙ: one containing scaling values that multiply each cell of hᵢₙ, and one containing shifting values that are added to it. The input to the layers of part 1 is one of the attribute embeddings. Similarly, part 2 tells us how to edit hᵢₙ to make it more like Xₛ in terms of identity, and its input is the identity embedding zᵢ.

Figure 7. The architecture of the AAD layer. Adapted from [1].

Part 3 is tasked with choosing which part (1 or 2) we should listen to at each cell/pixel. For example, at cells/pixels related to the mouth, this network will tell us to listen more to part 2, since the mouth is closely related to identity. This was shown empirically in an experiment depicted in figure 8.

Figure 8. An experiment showing what part 3 in the AAD layer learnt. The images on the right show the output of part 3 at different steps/spatial resolutions throughout the AAD Generator. Bright regions indicate cells where we should listen to the identity (i.e. part 2), and black regions are where we should listen to the attributes (i.e. part 1). Notice that at high spatial resolutions we are mainly listening to part 1. Adapted from [1].

Thus, the AAD Generator builds the final image step by step, deciding at each step the best way to upscale the current feature map given the identity and attribute embeddings.
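
To make the three parts concrete, here is a minimal PyTorch-style sketch of a single AAD layer. It assumes the attribute map zₐ fed to this layer has already been brought to the spatial size of hᵢₙ; the specific layer choices (1×1 convolutions, instance normalization) are illustrative rather than copied from the official code.

```python
import torch
import torch.nn as nn


class AADLayer(nn.Module):
    """Sketch of an Adaptive Attentional Denormalization layer."""

    def __init__(self, h_channels, att_channels, id_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(h_channels, affine=False)
        # Part 1: attribute-conditioned scale/shift, computed from the attribute feature map
        self.gamma_att = nn.Conv2d(att_channels, h_channels, kernel_size=1)
        self.beta_att = nn.Conv2d(att_channels, h_channels, kernel_size=1)
        # Part 2: identity-conditioned scale/shift, computed from the identity vector
        self.gamma_id = nn.Linear(id_dim, h_channels)
        self.beta_id = nn.Linear(id_dim, h_channels)
        # Part 3: per-pixel attention mask choosing between identity and attributes
        self.mask_conv = nn.Conv2d(h_channels, 1, kernel_size=1)

    def forward(self, h_in, z_att, z_id):
        h = self.norm(h_in)
        # Part 1: edit h to match the target's attributes
        a = self.gamma_att(z_att) * h + self.beta_att(z_att)
        # Part 2: edit h to match the source's identity (broadcast over spatial dims)
        gamma_i = self.gamma_id(z_id)[:, :, None, None]
        beta_i = self.beta_id(z_id)[:, :, None, None]
        i = gamma_i * h + beta_i
        # Part 3: blending mask in [0, 1]; 1 means "listen to the identity edit"
        m = torch.sigmoid(self.mask_conv(h))
        return (1 - m) * a + m * i
```

The returned feature map is a per-pixel blend: where the mask is close to 1 the identity edit wins, and where it is close to 0 the attribute edit wins, which is the behaviour visualized in figure 8.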

Now, we have a network, the AEI-Net, that can embed Xₛ & Xₜ and integrate them in a way that fulfills our target. We will call the output of AEI-Net Yₛₜ*.

Training losses

Generally speaking, losses are the mathematical formulation of what we want the network to do. There are 4 losses for training AEI-Net:

  1. We want it to output a realistic human face, so we have an adversarial loss, just like in any GAN.
  2. We want the generated face to have the identity of Xₛ. The only mathematical object we have that represents identity is zᵢ, so this goal is expressed as an identity loss defined on zᵢ (see the sketch below).
  3. We want the output to have the attributes of Xₜ, expressed as an attribute loss defined on the zₐ embeddings.
  4. The authors added another loss based on the idea that the network should output Xₜ itself whenever Xₜ and Xₛ are the same image: a reconstruction loss.

I believe this last loss is necessary to drive zₐ to actually encode attributes, since the attribute encoder isn't pre-trained like the identity encoder. Without it, as a guess, AEI-Net could neglect Xₜ altogether and have zₐ produce just zeros.

Our total loss is just a weighted summation of the previous losses.
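
For reference, here is a LaTeX reconstruction of how these terms look, following the formulation in [1] (Yₛₜ* is the AEI-Net output; the λ coefficients are hyperparameters chosen by the authors):

```latex
\begin{aligned}
\mathcal{L}_{id}  &= 1 - \cos\big(z_{id}(Y^{*}_{s,t}),\; z_{id}(X_s)\big) \\
\mathcal{L}_{att} &= \tfrac{1}{2} \sum_{k=1}^{8} \big\lVert z^{k}_{att}(Y^{*}_{s,t}) - z^{k}_{att}(X_t) \big\rVert_2^2 \\
\mathcal{L}_{rec} &= \begin{cases}
  \tfrac{1}{2}\,\lVert Y^{*}_{s,t} - X_t \rVert_2^2 & \text{if } X_t = X_s \\
  0 & \text{otherwise}
\end{cases} \\
\mathcal{L}_{AEI\text{-}Net} &= \mathcal{L}_{adv} + \lambda_{att}\,\mathcal{L}_{att} + \lambda_{id}\,\mathcal{L}_{id} + \lambda_{rec}\,\mathcal{L}_{rec}
\end{aligned}
```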

HEAR-Net

The AEI-Net is a complete network that can do face swapping. However, it is not good enough at preserving occlusions. Specifically, whenever there is an item occluding part of the face in the target image that should remain in the final output (like eyeglasses, a hat, hair, or a hand), AEI-Net removes it. Such items should persist, since they are not related to the identity being changed. Consequently, the authors added another network called the “Heuristic Error Acknowledging Refinement network” (HEAR-Net) whose single job is recovering such occlusions.

They noticed that when they fed AEI-Net the same image as both inputs (i.e. Xₛ = Xₜ), it still didn't preserve occlusions, as shown in figure 9.

Figure 9. The output of AEI-Net when we input the same image as Xₛ & Xₜ. Notice how the chains from the turban have been lost in the output. Adapted from [1].

Hence, instead of making the input to HEAR-Net Yₛₜ* and Xₜ, they made it Yₛₜ* & (Xₜ − Yₜₜ*), where Yₜₜ* is the output of AEI-Net when Xₛ & Xₜ are the same image. This difference points HEAR-Net towards the pixels where the occlusions weren't preserved. HEAR-Net is shown in figure 10.

Figure 10. The architecture of HEAR-Net. Adapted from [1].
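
Putting the whole inference pipeline together, here is a short sketch that reuses the hypothetical AEINet wrapper from earlier, so the exact call signatures are an assumption rather than the official API:

```python
import torch


def face_swap(aei_net, hear_net, x_source, x_target):
    """Sketch of the full FaceShifter inference pipeline."""
    with torch.no_grad():
        y_st, _, _ = aei_net(x_source, x_target)   # preliminary swap Yₛₜ*
        y_tt, _, _ = aei_net(x_target, x_target)   # AEI-Net "reconstruction" of the target, Yₜₜ*
        heuristic_error = x_target - y_tt          # highlights regions (e.g. occlusions) AEI-Net lost
        y_final = hear_net(y_st, heuristic_error)  # refined, occlusion-aware output Yₛₜ
    return y_final
```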

Training Losses

The losses of HEAR-Net are:

  1. An identity loss for preserving the identity of Xₛ in the final output.
  2. A change loss that keeps the final output close to Yₛₜ*.
  3. A reconstruction loss built upon the fact that if Xₛ & Xₜ are the same image, then the output of HEAR-Net should be Xₜ. These terms are sketched below.

The total loss is the summation of these losses.
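
In LaTeX form, following [1] (here Yₛₜ denotes the final HEAR-Net output and Yₛₜ* the AEI-Net output; treat this as a reconstruction of the paper's notation rather than a verbatim copy):

```latex
\begin{aligned}
\mathcal{L}'_{id} &= 1 - \cos\big(z_{id}(Y_{s,t}),\; z_{id}(X_s)\big) \\
\mathcal{L}_{chg} &= \big\lvert Y^{*}_{s,t} - Y_{s,t} \big\rvert \\
\mathcal{L}'_{rec} &= \begin{cases}
  \tfrac{1}{2}\,\lVert Y_{s,t} - X_t \rVert_2^2 & \text{if } X_t = X_s \\
  0 & \text{otherwise}
\end{cases} \\
\mathcal{L}_{HEAR\text{-}Net} &= \mathcal{L}'_{rec} + \mathcal{L}'_{id} + \mathcal{L}_{chg}
\end{aligned}
```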

Results

The results of FaceShifter are phenomenal. In figure 11, you can find some examples of its generalization performance on images outside the dataset it was trained on (i.e. from the wild). Notice how it works correctly even under varied and difficult conditions.

Figure 11. Results demonstrating the outstanding performance of FaceShifter. Adapted from [1].