SwapAnyHead

Controllable and Expressive One-Shot Video Head Swapping

Chaonan Ji, Jinwei Qi, Peng Zhang, Bang Zhang, Liefeng Bo

Tongyi Lab, Alibaba Group • ICCV 2025

Overview

Video Head Swapping Overview

Revolutionary Head Swapping Technology

Illustration of Video Head Swapping. Given a reference image and video sequence as input, our model can generate high-fidelity head swapping results that accommodate diverse hairstyles, expressions, and identities.

One-Shot Swapping

Seamless head transplantation from single reference image

Expression Control

Controllable facial expressions and movements

High Fidelity

Preserve identity and background seamlessly

Abstract

In this paper, we propose a novel diffusion-based multi-condition controllable framework for video head swapping, which seamlessly transplants a human head from a static image into a dynamic video while preserving the original body and background of the target video, and further allows users to tweak the head's expressions and movements during swapping as needed.

Existing face-swapping methods mainly focus on localized facial replacement while neglecting holistic head morphology, whereas head-swapping approaches struggle with hairstyle diversity and complex backgrounds; moreover, none of these methods allow users to modify the transplanted head's expressions after swapping.

To tackle these challenges, our method incorporates several innovative strategies within a unified latent diffusion paradigm. Experimental results demonstrate that our method excels at seamless background integration while preserving the identity of the source portrait, and shows superior expression transfer for both real and virtual characters.

Method

Method Pipeline

Diffusion-Based Multi-Condition Framework

The framework of our method. During training, we first preprocess the input video to obtain an inpainted background and detect 3D landmarks using MediaPipe. Both the inpainted background and the 3D landmarks serve as conditional inputs, and one frame is randomly selected as the reference image.
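To make the landmark conditioning concrete, here is a minimal sketch of turning per-frame 3D landmarks (MediaPipe outputs normalized x, y in [0, 1] plus a relative depth z) into a single-channel conditioning map for the diffusion model. The depth-to-intensity mapping and dot radius are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rasterize_landmarks(landmarks, height, width, radius=1):
    """Draw normalized 3D landmarks into a single-channel conditioning map.
    `landmarks` has shape (N, 3) with x, y in [0, 1] and z a relative depth;
    nearer points are drawn brighter (an assumed convention)."""
    canvas = np.zeros((height, width), dtype=np.float32)
    z = landmarks[:, 2]
    if z.max() > z.min():
        # Map depth to [0.5, 1.0] so depth ordering survives rasterization.
        intensity = 0.5 + 0.5 * (z.max() - z) / (z.max() - z.min())
    else:
        intensity = np.ones_like(z)
    for (x, y, _), v in zip(landmarks, intensity):
        px, py = int(x * (width - 1)), int(y * (height - 1))
        y0, y1 = max(py - radius, 0), min(py + radius + 1, height)
        x0, x1 = max(px - radius, 0), min(px + radius + 1, width)
        # Keep the brighter (nearer) point where dots overlap.
        canvas[y0:y1, x0:x1] = np.maximum(canvas[y0:y1, x0:x1], v)
    return canvas
```

A map like this can be stacked with the inpainted background as extra input channels to the denoiser.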

Identity-Preserving Context Fusion

Shape-agnostic mask strategy with hair enhancement for robust identity preservation across diverse hair types and complex backgrounds.
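One way to read "shape-agnostic" is that the inpainting mask should not trace the head silhouette (which would leak head-shape information); a simple sketch is a landmark-derived bounding box inflated extra at the top to cover long or voluminous hair. The expansion ratios below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def shape_agnostic_head_mask(landmarks_xy, height, width,
                             expand=0.25, hair_expand=0.6):
    """Build a rectangular head mask from 2D facial landmarks (pixel coords).
    The rectangle ignores the exact head contour and is inflated more at the
    top so diverse hairstyles are fully covered (ratios are assumptions)."""
    x0, y0 = landmarks_xy.min(axis=0)
    x1, y1 = landmarks_xy.max(axis=0)
    w, h = x1 - x0, y1 - y0
    left = max(int(x0 - expand * w), 0)
    right = min(int(x1 + expand * w), width)
    top = max(int(y0 - hair_expand * h), 0)      # extra headroom for hair
    bottom = min(int(y1 + expand * h), height)
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[top:bottom, left:right] = 1
    return mask
```

Because the mask carries no silhouette cue, the model must synthesize the swapped head's outline from the reference image rather than copy the original head's shape.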

3DMM-Driven Retargeting

Disentangled 3D landmarks that decouple identity, expression, and head poses for precise expression control.
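The disentanglement idea can be sketched as a 3DMM-style composition: identity and expression live in separate linear bases in canonical space, and head pose is a rotation applied afterward. Retargeting then keeps the source's identity coefficients while taking expression coefficients and pose from the driving frame. The toy bases and shapes below are assumptions, not the paper's model.

```python
import numpy as np

def compose_landmarks(mean, basis_id, basis_exp, alpha, beta, pose_R):
    """3DMM-style landmark composition.
    mean: (3N,) mean shape; basis_id, basis_exp: (3N, K) linear bases;
    alpha, beta: (K,) identity / expression coefficients; pose_R: (3, 3).
    For retargeting, `alpha` comes from the source head while `beta` and
    `pose_R` come from the driving video frame."""
    flat = mean + basis_id @ alpha + basis_exp @ beta  # canonical space
    pts = flat.reshape(-1, 3)                          # (N, 3)
    return pts @ pose_R.T                              # apply head pose
```

Because identity, expression, and pose enter through separate terms, editing the expression after swapping amounts to replacing `beta` without touching `alpha`.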

Scale-Aware Retargeting

Advanced scaling strategy to minimize cross-identity expression distortion for higher transfer precision.

Latent Diffusion Paradigm

Unified diffusion model with additional pixel-level ID losses for enhanced identity consistency.
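A sketch of what such a training objective could look like: the standard diffusion noise-prediction MSE plus a pixel-level identity term, here written as cosine distance between face embeddings of the decoded prediction and the reference. The embedding network and the `id_weight` value are assumptions, not the paper's exact loss.

```python
import numpy as np

def combined_loss(noise_pred, noise_gt, emb_pred, emb_ref, id_weight=0.1):
    """Diffusion noise-prediction MSE plus a pixel-level ID term.
    emb_pred / emb_ref: face embeddings of the decoded prediction and the
    reference image (embedding network assumed, e.g. a face recognizer)."""
    diffusion_mse = np.mean((noise_pred - noise_gt) ** 2)
    cos = np.dot(emb_pred, emb_ref) / (
        np.linalg.norm(emb_pred) * np.linalg.norm(emb_ref))
    return diffusion_mse + id_weight * (1.0 - cos)
```

Computing the ID term in pixel space requires decoding the latent prediction back to an image each step, which is the price of supervising identity directly rather than in latent space.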

Results

Video Head Swapping

Male Subject

High-fidelity head swapping with natural expression transfer

Long Hair Model

Complex hairstyle preservation with seamless integration

Comic Character

Stylized character head swapping with artistic preservation

Alternative Comic Style

Diverse artistic style adaptation and expression transfer

Female Subject

Gender-specific features preservation with natural movement

Expressive Transfer

Dynamic expression mapping with emotional fidelity

Comic Book Hero

Superhero character head swapping with action preservation

Joker Character

Complex facial expression transfer with character integrity

Film Clips

Cinematic Application

Film-style head swapping with professional quality output

Expression Control

Dynamic Expression Editing

Real-time expression control and modification capabilities