Deep Video Portraits

Photo-realistic re-animation of portrait videos using only an input video

Erez Posner
Heartbeat

--

Synthesizing and editing video portraits—i.e., videos framed to show a person’s head and upper body—is an important problem in computer graphics, with applications in video editing and movie postproduction, visual effects, visual dubbing, virtual reality, and telepresence, among others.

The problem of synthesizing a photo-realistic video portrait of a target actor that mimics the actions of a source actor—and especially where the source and target actors can be different subjects—is still an open problem.

Until now, there hasn’t been an approach that gives full control over the rigid head pose, facial expressions, and eye motion of the target actor; the method reviewed here offers exactly that, and can even modify facial identity to some extent.

In this post, I’m going to review “Deep Video Portraits”, which presents a novel approach that enables photo-realistic re-animation of portrait videos using only an input video.

I’ll cover two things: first, a short definition of a DeepFake; second, an overview of the paper “Deep Video Portraits” in the words of the authors.

1. Defining DeepFakes

The word DeepFake combines the terms “deep learning” and “fake”, and refers to manipulated videos or other digital representations that produce fabricated images and sounds that appear to be real but have in fact been generated by deep neural networks.

2. Deep Video Portraits

2.1 Overview

The core method presented in the paper provides full control over the head of a target actor by transferring the rigid head pose, facial expressions, and eye motion of a source actor, while preserving the target’s identity and appearance.

On top of that, a full video of the target is synthesized, including consistent upper-body posture, hair, and background.

Figure 1. Facial reenactment results from “DVP”. Expressions are transferred from the source to the target actor, while the head pose (rotation and translation) as well as the eye gaze of the target actor are retained

The overall architecture of the paper’s framework is illustrated below in Figure 2.

First, the source and target actors are tracked using a state-of-the-art monocular (single-image) face reconstruction approach, which fits a 3D morphable model (3DMM) to each actor in every video frame.

The resulting sequence of low-dimensional parameter vectors represents the actor’s identity, head pose, expression, eye gaze, and the scene lighting for every video frame.

Then, the head pose, expression, and/or eye gaze parameters of the source are taken and mixed with the illumination and identity parameters of the target. This allows the network to generate a full-head reenactment while preserving the target actor’s identity and appearance.
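To make this mixing step concrete, here’s a minimal Python sketch. The parameter group names are my own, and I copy values directly for simplicity, whereas the paper transfers pose, expression, and gaze in a relative manner:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FaceParams:
    """Illustrative per-frame parameter groups (names are my own)."""
    rotation: np.ndarray      # rigid head rotation
    translation: np.ndarray   # rigid head translation
    identity: np.ndarray      # identity (geometry / reflectance) coefficients
    expression: np.ndarray    # expression coefficients
    gaze: np.ndarray          # eye gaze for both eyes
    illumination: np.ndarray  # spherical harmonics lighting coefficients

def mix_parameters(source: FaceParams, target: FaceParams) -> FaceParams:
    """Drive pose, expression, and gaze from the source while keeping the
    target's identity and scene lighting (direct copy here; the paper
    transfers these values relative to a reference frame)."""
    return FaceParams(
        rotation=source.rotation,
        translation=source.translation,
        identity=target.identity,          # preserves the target's look
        expression=source.expression,
        gaze=source.gaze,
        illumination=target.illumination,  # preserves the target's lighting
    )
```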

Next, new synthetic renderings of the target actor are generated based on the mixed parameters. These renderings are the input to the paper’s novel “rendering-to-video translation network”, which is trained to convert the synthetic input into photo-realistic output.

Figure 2. Deep video portraits enable a source actor to fully control a target video portrait. First, a low-dimensional parametric representation (left) of both videos is obtained using monocular face reconstruction. The head pose, expression, and eye gaze can then be transferred in parameter space (middle). Finally, conditioning input images are rendered and converted into a photo-realistic video portrait of the target actor (right). Obama video courtesy of the White House (public domain)

2.2 Face Reconstruction from a single image

3D morphable models are used for face analysis because the intrinsic properties of 3D faces provide a representation that is invariant to intra-personal variations such as pose and illumination. Given a single facial input image, a 3DMM can recover the 3D face (shape and texture) and scene properties (pose and illumination) via a fitting process.

The authors employ a state-of-the-art dense face reconstruction approach that fits a parametric model of the face and illumination to each video frame. It obtains a meaningful parametric face representation for both the source and the target, given an input video sequence.

Equation 1. The source actor video sequence, where N_s denotes the total number of source frames.

The parametric face representation consists of a set of parameters P per frame; the corresponding parameter sequence fully describes the source or target facial performance.

Equation 2. The parametric face representation that best describes each frame of the input video sequence.

The set of reconstructed parameters P encodes the rigid head pose, facial identity coefficients, expression coefficients, gaze direction for both eyes, and spherical harmonics illumination coefficients. Overall, the face reconstruction process estimates 261 parameters per video frame.
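As a rough illustration, here’s how such a 261-dimensional per-frame vector could be split into named groups. The individual group sizes below are my own assumption (they merely sum to 261); only the total and the group names come from the post:

```python
import numpy as np

# Assumed per-frame split of the 261 reconstructed parameters; the sizes
# below are illustrative, chosen only so that they sum to 261.
PARAM_SIZES = {
    "rotation": 3,        # rigid head rotation
    "translation": 3,     # rigid head translation
    "identity": 160,      # geometry + skin reflectance coefficients
    "expression": 64,     # expression coefficients
    "gaze": 4,            # gaze direction for both eyes
    "illumination": 27,   # 3 bands of spherical harmonics per RGB channel
}
assert sum(PARAM_SIZES.values()) == 261

def split_params(p: np.ndarray) -> dict:
    """Split a flat 261-dimensional parameter vector into named groups."""
    groups, offset = {}, 0
    for name, size in PARAM_SIZES.items():
        groups[name] = p[offset:offset + size]
        offset += size
    return groups

frame_params = split_params(np.zeros(261))
```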

Below are more details on the parametric face representation and the fitting process.

2.2.1 Parametric Face Representation

The paper represents the space of facial identity with a parametric head model and the space of facial expressions with an affine model. Mathematically, geometry variation is modeled through an affine model v ∈ R^(3N) that stacks the per-vertex deformations of the underlying template mesh with N vertices, as follows:

v = a_{geo} + Σ_k α_k · b_k^{geo} + Σ_k δ_k · b_k^{exp}

Equation 3. The affine model stacking per-vertex deformations of the underlying template mesh with N vertices; α and δ denote the identity and expression coefficients.

Where a_{geo} ∈ R^(3N) stores the average facial geometry. The geometry bases b_k^{geo} were computed by applying principal component analysis (PCA) to 200 high-quality face scans, and the expression bases b_k^{exp} were obtained in the same manner from blendshapes.
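Here’s a small NumPy sketch that evaluates this affine model with randomly generated bases; the vertex count and the number of basis vectors are placeholders:

```python
import numpy as np

N = 5000                       # number of mesh vertices (placeholder)
K_ID, K_EXP = 80, 64           # number of identity / expression basis vectors (placeholders)

rng = np.random.default_rng(0)
a_geo = rng.standard_normal(3 * N)           # average geometry, stacked (x, y, z) per vertex
B_geo = rng.standard_normal((3 * N, K_ID))   # identity basis (PCA of face scans in the paper)
B_exp = rng.standard_normal((3 * N, K_EXP))  # expression basis (derived from blendshapes)

def face_geometry(alpha: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Affine face model: average geometry plus identity and expression offsets."""
    v = a_geo + B_geo @ alpha + B_exp @ delta
    return v.reshape(N, 3)     # one 3D position per vertex

vertices = face_geometry(np.zeros(K_ID), np.zeros(K_EXP))  # the mean face
```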

2.2.2 Image Formation Model

To render synthetic head images, a full perspective camera is assumed that maps model-space 3D points v via camera space to 2D points on the image plane. The perspective mapping Π contains the multiplication with the camera intrinsics and the perspective division.
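A short NumPy sketch of such a projection, with placeholder intrinsics: a rigid transform into camera space, multiplication with the camera matrix, and the perspective division:

```python
import numpy as np

# Placeholder camera intrinsics: focal lengths and principal point (pixels).
K = np.array([[1000.0,    0.0, 320.0],
              [   0.0, 1000.0, 240.0],
              [   0.0,    0.0,   1.0]])

def project(v_model: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Map model-space 3D points (N, 3) to 2D image points (N, 2)."""
    v_cam = v_model @ R.T + t           # model space -> camera space (rigid head pose)
    v_img = v_cam @ K.T                 # apply camera intrinsics
    return v_img[:, :2] / v_img[:, 2:]  # perspective division

# Example: project a few points in front of the camera under an identity pose.
pts_2d = project(np.random.rand(10, 3) + np.array([0.0, 0.0, 2.0]),
                 np.eye(3), np.zeros(3))
```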

In addition, based on a distant illumination assumption, spherical harmonics basis functions are used to approximate the incoming radiance B from the environment.

B(r_i, n_i) = r_i · Σ_{b=1}^{B²} ɣ_b · y_b(n_i)

Equation 4. Spherical harmonics basis functions y_b are used to approximate the incoming radiance B from the environment.

Where B is the number of spherical harmonics bands, ɣ_b the spherical harmonics coefficients, y_b the spherical harmonics basis functions, and r_i and n_i the reflectance and unit normal vector of the i-th vertex, respectively.
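With B = 3 bands there are nine basis functions per color channel. Below is a sketch of the per-vertex shading using the standard real spherical harmonics constants; treating reflectance and lighting as grayscale scalars is my simplification:

```python
import numpy as np

def sh_basis(n: np.ndarray) -> np.ndarray:
    """First 9 real spherical harmonics basis functions at unit normals n (N, 3)."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),        # band 0
        0.488603 * y,                      # band 1
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,                  # band 2
        1.092548 * y * z,
        0.315392 * (3.0 * z**2 - 1.0),
        1.092548 * x * z,
        0.546274 * (x**2 - y**2),
    ], axis=1)                             # (N, 9)

def shade(reflectance: np.ndarray, normals: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """Per-vertex radiance: r_i times the SH-approximated environment lighting."""
    return reflectance * (sh_basis(normals) @ gamma)

# Example: grayscale shading of random unit normals under placeholder lighting.
n = np.random.randn(100, 3)
n /= np.linalg.norm(n, axis=1, keepdims=True)
radiance = shade(np.full(100, 0.8), n, np.random.rand(9))
```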

2.3 Synthetic Conditioning Input

Using the face reconstruction approach described above, a face is reconstructed in each frame of the source and target videos. Next, the rigid head pose, expression, and eye gaze of the target actor are modified. These parameters are copied in a relative manner from the source to the target.

Then the authors render synthetic conditioning images of the target actor’s face model under the modified parameters using hardware rasterization.

For each frame, three different conditioning inputs are generated: a color rendering, a correspondence image, and an eye gaze image.

Figure 3. The synthetic input used for conditioning the rendering-to-video translation network: (1) colored face rendering under target illumination, (2) correspondence image, and (3) the eye gaze image

The color rendering shows the modified target actor model under the estimated target illumination, while keeping the target identity (geometry and skin reflectance) fixed. This image provides a good starting point for the following rendering-to-video translation, since in the face region only the delta to a real image has to be learned.

A correspondence image encoding the index of the parametric model’s vertex that projects into each pixel is also rendered to keep the 3D information.

Finally, a gaze map supplies information about the eye gaze direction and blinking.

All of the images are stacked to obtain the input to the rendering-to-video translation network.
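Here’s a hedged sketch of how such a conditioning tensor could be assembled; the channel counts and the length of the temporal window are assumptions, since the post doesn’t spell them out:

```python
import numpy as np

H, W = 256, 256   # output resolution (placeholder)
WINDOW = 3        # number of frames stacked over time (assumed window length)

def conditioning_frame(color, correspondence, gaze):
    """Stack the three per-frame conditioning images along the channel axis."""
    # color: (H, W, 3) rendering under target illumination
    # correspondence: (H, W, 3) per-pixel encoding of the model vertex index
    # gaze: (H, W, 1) eye gaze / blink map
    return np.concatenate([color, correspondence, gaze], axis=-1)  # (H, W, 7)

def conditioning_input(frames):
    """Concatenate the per-frame stacks of a short temporal window."""
    return np.concatenate(frames, axis=-1)  # (H, W, 7 * WINDOW)

frames = [conditioning_frame(np.zeros((H, W, 3)),
                             np.zeros((H, W, 3)),
                             np.zeros((H, W, 1))) for _ in range(WINDOW)]
x = conditioning_input(frames)   # network input for one output frame
```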

2.4 Rendering-To-Video Translation

The generated conditioning images, stacked over space and time, are the input to the rendering-to-video translation network.

The network learns to convert the synthetic input into full frames of a photo-realistic target video, in which the target actor now mimics the head motion, facial expression, and eye gaze of the synthetic input.

The network learns to synthesize the entire actor in the foreground: not only the face, for which conditioning input exists, but also all other parts of the actor, such as hair and body, so that they comply with the target head pose.

It also synthesizes the appropriately modified and filled-in background, even including some consistent lighting effects between the foreground and background.

The network shown in Figure 4 follows an encoder-decoder architecture and is trained in an adversarial manner.

Figure 4. The rendering-to-video translation network follows an encoder-decoder architecture
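For intuition, here’s a minimal PyTorch encoder-decoder with skip connections in the same spirit; it’s an illustrative stand-in, not the paper’s exact architecture, and the input channel count matches the conditioning sketch above:

```python
import torch
import torch.nn as nn

class RenderToVideoNet(nn.Module):
    """Minimal encoder-decoder sketch (not the paper's exact architecture).
    Maps a stacked conditioning input to a 3-channel RGB frame."""

    def __init__(self, in_channels: int = 21, base: int = 64):
        super().__init__()
        # Encoder: strided convolutions halve the spatial resolution.
        self.enc1 = nn.Sequential(nn.Conv2d(in_channels, base, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1),
                                  nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2))
        self.enc3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 4, 2, 1),
                                  nn.BatchNorm2d(base * 4), nn.LeakyReLU(0.2))
        # Decoder: transposed convolutions upsample back, with skip connections.
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1),
                                  nn.BatchNorm2d(base * 2), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base, 4, 2, 1),
                                  nn.BatchNorm2d(base), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base * 2, 3, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d3 = self.dec3(e3)
        d2 = self.dec2(torch.cat([d3, e2], dim=1))   # skip connection
        return self.dec1(torch.cat([d2, e1], dim=1))

# Example: one 256x256 conditioning stack with 21 channels (7 channels x 3 frames).
y = RenderToVideoNet()(torch.zeros(1, 21, 256, 256))  # -> (1, 3, 256, 256)
```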

The training objective function comprises a conditional adversarial loss and an L1 photometric loss.

T* = argmin_T max_D E_cGAN(T, D) + λ E_L1(T)

Equation 5. The rendering-to-video translation objective: a conditional adversarial loss E_cGAN plus a weighted L1 photometric loss E_L1.

During adversarial training, the discriminator D tries to get better at classifying given images as real or synthetic, while the transformation network T tries to get better at fooling the discriminator. The L1 loss penalizes the distance between the synthesized image T(x) and the ground-truth image y, keeping the output close to the ground truth, while the adversarial term encourages sharp, realistic detail in the synthesized output:

E_L1(T) = E_{x,y}[ ||T(x) − y||_1 ]

Equation 6. The L1 photometric reproduction loss.
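Here’s a hedged PyTorch sketch of the losses for one training step under this objective; the λ weight is a placeholder, and D stands for any discriminator that takes the conditioning input concatenated with a real or synthesized frame:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
LAMBDA = 100.0   # placeholder weight on the L1 term

def discriminator_loss(D, x, y_real, y_fake):
    """D sees the conditioning input x together with a real or synthesized frame."""
    real_logits = D(torch.cat([x, y_real], dim=1))
    fake_logits = D(torch.cat([x, y_fake.detach()], dim=1))
    return bce(real_logits, torch.ones_like(real_logits)) + \
           bce(fake_logits, torch.zeros_like(fake_logits))

def generator_loss(D, x, y_real, y_fake):
    """T tries to fool D, plus an L1 term that keeps T(x) close to the ground truth y."""
    fake_logits = D(torch.cat([x, y_fake], dim=1))
    adversarial = bce(fake_logits, torch.ones_like(fake_logits))
    return adversarial + LAMBDA * l1(y_fake, y_real)
```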

3. Experiments & Results

This approach enables us to take full control of the rigid head pose, facial expression, and eye motion of a target actor in a video portrait, thus opening up a wide range of video rewrite applications.

3.1 Reenactment under full head control

This approach is the first that can photo-realistically transfer the full 3D head pose (spatial position and rotation), facial expression, as well as eye gaze and eye blinking of a captured source actor to a target actor video.

Figure 5 shows some examples of full-head reenactment between different source and target actors. Here, the authors use the full target video for training and the source video as the driving sequence.

As can be seen, the output of their approach achieves a high level of realism and faithfully mimics the driving sequence, while still retaining the mannerisms of the original target actor.

Figure 5. Qualitative results of full-head reenactment

3.2 Facial Reenactment and Video Dubbing

Besides full-head reenactment, the approach also enables facial reenactment. In this experiment, the authors replaced the expression coefficients of the target actor with those of the source actor before synthesizing the conditioning input to the rendering-to-video translation network.

Here, the head pose (rotation and translation) and eye gaze remain unchanged. Figure 6 shows facial reenactment results.

Figure 6. Facial reenactment results
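In terms of the parameter-mixing sketch from earlier, this facial-reenactment variant swaps only the expression coefficients (the dictionary keys here are illustrative):

```python
def facial_reenactment_params(source: dict, target: dict) -> dict:
    """Replace only the expression coefficients; head pose, gaze, identity,
    and illumination all stay those of the target actor."""
    mixed = dict(target)                        # keep every target parameter group
    mixed["expression"] = source["expression"]  # drive expressions from the source
    return mixed
```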

Video dubbing can also be achieved by modifying the facial motion of actors who originally speak another language so that it matches an English translation spoken by a professional dubbing actor in a dubbing studio.

More precisely, the captured facial expressions of the dubbing actor could be transferred to the target actor, while leaving the original target gaze and eye blinks intact.

4. Discussion

In this post, I presented Deep Video Portraits, a novel approach that enables photo-realistic re-animation of portrait videos using only an input video.

In contrast to existing approaches that are restricted to manipulations of facial expressions only, the authors are the first to transfer the full 3D head position, head rotation, facial expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor.

The authors have shown, through experiments and a user study, that their method outperforms prior work, both in terms of model performance and expanded capabilities. This opens doors to many applications, like video reenactment for virtual reality and telepresence, interactive video editing, and visual dubbing.

5. Conclusions

As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn.

Till then, see you in the next post! 😄

For the enthusiastic reader:
For more details on “Deep Video Portraits”, check out the official project page or their video demo.

