EMO2

End-Effector Guided Audio-Driven Avatar Video Generation

Linrui Tian Siqi Hu Qi Wang Bang Zhang Liefeng Bo

Institute for Intelligent Computing, Alibaba Group

Live Demo

AI Girl - My Heart Will Go On

Vocal Source: Celine Dion - My Heart Will Go On. Covered by Emma Heesters

AI Marilyn Monroe - Unholy

Vocal Source: Unholy - Sam Smith, Kim Petras

AI Storm Trooper - HellDivers 2

Vocal Source: INTRO CINEMATIC - HELLDIVERS™ 2

Dr. Emmett Brown - Rick and Morty

Character: Dr. Emmett Brown in "Back to the Future"

Abstract

We propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures. Unlike existing methods that focus on generating full-body or half-body poses, we investigate the challenges of audio-driven gesture generation and identify the weak correspondence between audio features and full-body gestures as a key limitation.

To address this, we redefine the task as a two-stage process. In the first stage, we generate hand poses directly from audio input, leveraging the stronger correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements.

Method

Method Overview

End-Effector Guided Approach

The motivation behind our method: human motion, like that of robots, involves planning the "end-effector" (EE), typically the hands, toward a target. The rest of the body then cooperates with the EE, following inverse kinematics principles.
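To make the inverse kinematics idea concrete, the following is a minimal sketch for a planar two-link arm: given a target position for the hand (the end-effector), it solves for the shoulder and elbow angles that place the hand there. The link lengths, the function names, and the 2D simplification are illustrative assumptions and are not part of EMO2; this page describes a diffusion model that produces body movements, not an analytic solver.

```python
import math


def two_link_ik(x, y, l1=0.30, l2=0.25):
    """Return (shoulder, elbow) angles in radians that place the hand at (x, y).

    l1 and l2 are hypothetical upper-arm and forearm lengths in meters.
    """
    d2 = x * x + y * y
    # Law of cosines gives the elbow angle from the shoulder-to-hand distance.
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target is out of reach")
    elbow = math.acos(cos_elbow)
    # Shoulder angle: direction to the target minus the offset caused by the bent elbow.
    shoulder = math.atan2(y, x) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow)
    )
    return shoulder, elbow


def forward(shoulder, elbow, l1=0.30, l2=0.25):
    """Forward kinematics: hand position for the given joint angles."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y


# The hand target determines the joint angles; forward kinematics recovers the target.
angles = two_link_ik(0.4, 0.2)
print(angles, forward(*angles))
```

The point of the sketch is the direction of the computation: the hand position is chosen first, and the remaining joints follow from it.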

Stage 1: Audio to Hand Poses

Generate hand poses directly from the audio input, leveraging the stronger correlation between audio signals and hand movements.

Stage 2: Diffusion Synthesis

Employ a diffusion model to synthesize video frames, conditioned on the hand poses from Stage 1, with realistic facial expressions and body movements.
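The data flow between the two stages can be sketched as follows. This is only a structural illustration under stated assumptions: every name here (HandPose, audio_to_hand_poses, synthesize_frames, generate_video) is hypothetical, the 42-keypoint hand layout is an assumed convention, and both stage bodies are trivial placeholders standing in for the learned models, which this page does not specify. Passing the audio features to the second stage is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


@dataclass
class HandPose:
    # Hypothetical representation: flattened 2D keypoints for both hands.
    keypoints: List[float]


def audio_to_hand_poses(
    audio_features: Sequence[Sequence[float]], num_keypoints: int = 42
) -> List[HandPose]:
    """Stage 1 placeholder: emit one hand pose per audio frame.

    A real system would use a learned audio-to-motion model; here the mean
    absolute feature value simply stands in for a predicted pose.
    """
    poses = []
    for feat in audio_features:
        energy = sum(abs(v) for v in feat) / max(len(feat), 1)
        poses.append(HandPose([energy] * (2 * num_keypoints)))
    return poses


def synthesize_frames(
    reference_image: Any,
    audio_features: Sequence[Sequence[float]],
    hand_poses: Sequence[HandPose],
) -> List[Dict[str, Any]]:
    """Stage 2 placeholder: bundle the conditioning signals for each frame.

    A real system would run a diffusion model here; this stub only records
    what each frame would be conditioned on.
    """
    if len(audio_features) != len(hand_poses):
        raise ValueError("expected one hand pose per audio frame")
    return [
        {"reference": reference_image, "audio": a, "hand_pose": p}
        for a, p in zip(audio_features, hand_poses)
    ]


def generate_video(reference_image: Any, audio_features: Sequence[Sequence[float]]):
    """Run stage 1, then feed its hand poses into stage 2."""
    hand_poses = audio_to_hand_poses(audio_features)
    return synthesize_frames(reference_image, audio_features, hand_poses)


frames = generate_video("portrait.png", [[0.1, -0.2], [0.3, 0.4]])
print(len(frames))
```

Keeping the hand poses as an explicit intermediate result mirrors the two-stage split described above: the first stage can be inspected or replaced without touching the frame synthesizer.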

Generated Results

Singing Performances

By inputting a single character image and vocal audio, such as singing, our method can generate vocal avatar videos featuring not only expressive facial expressions but also a variety of body poses.

Karina - LOSER

Vocal Source: Yonezu Kenshi 「LOSER」┃Cover by Raon Lee

AI Girl - Attention

Vocal Source: Charlie Puth - Attention (Emma Heesters Cover)

Speaking in Multiple Languages

Our method supports speech in multiple languages and brings images to life by responding to tonal variations in the audio, enabling the creation of dynamic, expressive avatars.

Elon Musk - Original Speech

Vocal Source: Musk's Speech

Elon Musk - Talk Show

Vocal Source: Trevor's Talkshow

Complex Hand Movements

Our method can generate complex and smooth hand movements, bringing the avatar to life with a vivid performance.

Karina - 明明

Complex hand dance performance

Jang Won Young - 想你

Fluid hand gesture performance

Method Comparison

vs Vlogger

Comparison with the Vlogger method

vs CyberHost

Comparison with the CyberHost method