Vocal Source: Celine Dion - My Heart Will Go On. Covered by Emma Heesters
Vocal Source: Unholy - Sam Smith, Kim Petras
Vocal Source: INTRO CINEMATIC - HELLDIVERS™ 2
Character: Dr. Emmett Brown in "Back to the Future"
We propose a novel audio-driven talking head method that simultaneously generates highly expressive facial expressions and hand gestures. Existing methods focus on generating full-body or half-body poses. We instead examine the challenges of audio-driven gesture generation and identify the weak correspondence between audio features and full-body gestures as a key limitation.
To address this, we redefine the task as a two-stage process. In the first stage, we generate hand poses directly from audio input, leveraging the stronger correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements.
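The data flow between the two stages can be sketched as follows. This is only an illustration of the interfaces, not the authors' implementation: `HandPose`, `generate_hand_poses`, `synthesize_frames`, and `animate` are hypothetical names, the audio features are reduced to one number per frame, and both stages are toy placeholders standing in for the learned models.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class HandPose:
    """Hypothetical hand pose: 2D positions of the left and right hands."""
    left: Tuple[float, float]
    right: Tuple[float, float]


def generate_hand_poses(audio_features: Sequence[float]) -> List[HandPose]:
    """Stage 1 placeholder: map each audio frame to one hand pose.

    A real system would use a learned audio-to-motion model here;
    this toy rule simply raises both hands as the audio value grows.
    """
    return [HandPose(left=(-0.3, a), right=(0.3, a)) for a in audio_features]


def synthesize_frames(
    reference_image: str,
    audio_features: Sequence[float],
    hand_poses: Sequence[HandPose],
) -> List[Dict]:
    """Stage 2 placeholder: stands in for the diffusion-based video model.

    It only bundles the per-frame conditioning signals (reference image,
    audio, hand pose) that such a model would receive.
    """
    if len(audio_features) != len(hand_poses):
        raise ValueError("expected one hand pose per audio frame")
    return [
        {"index": i, "reference": reference_image, "audio": a, "hands": p}
        for i, (a, p) in enumerate(zip(audio_features, hand_poses))
    ]


def animate(reference_image: str, audio_features: Sequence[float]) -> List[Dict]:
    """Run stage 1, then feed its hand poses into stage 2."""
    hand_poses = generate_hand_poses(audio_features)
    return synthesize_frames(reference_image, audio_features, hand_poses)
```

The point of the sketch is the ordering: hand poses are produced from audio alone, and the frame synthesizer is then conditioned on the reference image, the audio, and those hand poses together.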
The motivation behind our method: human motion, like that of robots, involves planning the "end-effector" (EE), typically the hands, toward a target. The rest of the body then follows the EE in accordance with inverse kinematics principles.
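To make the inverse kinematics analogy concrete, here is a minimal sketch of the textbook closed-form solution for a planar two-link arm. It is not part of the proposed method; the function names and link lengths are illustrative assumptions. Given only the end-effector (hand) position, the shoulder and elbow angles follow from it.

```python
import math


def two_link_ik(x: float, y: float, l1: float, l2: float):
    """Return (shoulder, elbow) angles in radians that place the tip of a
    planar two-link arm at (x, y). l1 and l2 are the link lengths.
    This returns one of the two mirror-image solutions."""
    dist_sq = x * x + y * y
    # Law of cosines gives the elbow angle from the target distance.
    cos_elbow = (dist_sq - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target is out of reach")
    elbow = math.acos(cos_elbow)
    # Shoulder angle: direction to the target minus the offset of link 2.
    shoulder = math.atan2(y, x) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow)
    )
    return shoulder, elbow


def forward(shoulder: float, elbow: float, l1: float, l2: float):
    """Forward kinematics: tip position for the given joint angles."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y
```

Running `forward` on the angles returned by `two_link_ik` recovers the requested hand position, which is the sense in which the rest of the limb "cooperates" with the end-effector.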
Stage 1: Generate hand poses directly from audio input, leveraging the stronger correlation between audio signals and hand movements.
Stage 2: Employ a diffusion model to synthesize video frames with realistic facial expressions and body movements.
Given a single character image and a vocal audio clip, such as singing, our method generates vocal avatar videos featuring not only expressive facial expressions but also a variety of body poses.
Vocal Source: Yonezu Kenshi 「LOSER」┃Cover by Raon Lee
Vocal Source: Charlie Puth - Attention (Emma Heesters Cover)
Our method supports speech in multiple languages and brings images to life by responding to tonal variations in the audio, enabling the creation of dynamic, expressive avatars.
Vocal Source: Musk's Speech
Vocal Source: Trevor's Talkshow
Our method can generate complex and smooth hand movements, bringing the avatar to life with a vivid performance.
Complex hand dance performance
Fluid hand gestures performance
Comparison with Vlogger method
Comparison with CyberHost method