Coupled Diffusion Sampling for Training-Free Multi-View Image Editing

Hadi Alzayer1,2 Yunzhi Zhang1 Chen Geng1 Jia-Bin Huang2 Jiajun Wu1
1Stanford University 2University of Maryland
📄 Paper 💻 Code (Soon) 📚 BibTeX

Text Conditioned Relighting (Coupling Stable Virtual Camera and IC-Light)

We perform text-conditioned relighting by steering Stable Virtual Camera with IC-Light using coupled sampling. Here we show relighting of a rendered object (the helmet) as well as a real-world scene (the Jeep example). We compare against prior works on composing diffusion models, and include the outputs of Stable Virtual Camera and IC-Light without composing them.

Lighting prompt: "sci-fi, RGB glowing, cyberpunk lighting"
Input
Liu et al. 2022
Du et al. 2023
Stable Virtual Camera
Per-Image IC-Light
Ours
Lighting prompt: "snowy soft lighting"
Input
Liu et al. 2022
Du et al. 2023
Stable Virtual Camera
Per-Image IC-Light
Ours
Lighting prompt: "white studio lighting"
Input
Liu et al. 2022
Du et al. 2023
Stable Virtual Camera
Per-Image IC-Light
Ours
Lighting prompt: "rocky canyon desert"
Input
Liu et al. 2022
Du et al. 2023
Stable Virtual Camera
Per-Image IC-Light
Ours
Lighting prompt: "snowy soft lighting"
Input
Liu et al. 2022
Du et al. 2023
Stable Virtual Camera
Per-Image IC-Light
Ours
Lighting prompt: "beach sunset lighting"
Input
Liu et al. 2022
Du et al. 2023
Stable Virtual Camera
Per-Image IC-Light
Ours
Lighting prompt: "sunset over the sea"
Input
Liu et al. 2022
Du et al. 2023
Stable Virtual Camera
Per-Image IC-Light
Ours
Select Example:

Stylization (Coupling Stable Virtual Camera and ControlNet)

We perform stylization by steering Stable Virtual Camera with a 2D ControlNet using coupled sampling. We compare against prior works on composing diffusion models, as well as specialized stylization baselines. We also include the outputs of Stable Virtual Camera and ControlNet without composing them.

Editing prompt: "red ancient helmet"
Input
Liu et al. 2022
Du et al. 2023
Hunyuan3D
TEXTure
Trellis
Tailor3D
Stable Virtual Camera
Per-Image ControlNet
Ours
Editing prompt: "a rusty red jeep"
Input
Liu et al. 2022
Du et al. 2023
Hunyuan3D
TEXTure
Trellis
Tailor3D
Stable Virtual Camera
Per-Image ControlNet
Ours
Editing prompt: "a deep blue and teal starwars helmet"
Input
Liu et al. 2022
Du et al. 2023
Hunyuan3D
TEXTure
Trellis
Tailor3D
Stable Virtual Camera
Per-Image ControlNet
Ours
Editing prompt: "a green armored viking iron smith"
Input
Liu et al. 2022
Du et al. 2023
Hunyuan3D
TEXTure
Trellis
Tailor3D
Stable Virtual Camera
Per-Image ControlNet
Ours
Editing prompt: "a man wearing a black business suit"
Input
Liu et al. 2022
Du et al. 2023
Hunyuan3D
TEXTure
Trellis
Tailor3D
Stable Virtual Camera
Per-Image ControlNet
Ours
Editing prompt: "a marble and jade statue"
Input
Liu et al. 2022
Du et al. 2023
Hunyuan3D
TEXTure
Trellis
Tailor3D
Stable Virtual Camera
Per-Image ControlNet
Ours
Select Example:

Spatial Editing (Coupling Stable Virtual Camera and Magic Fixup)

We perform spatial editing by steering Stable Virtual Camera with Magic Fixup (which takes a coarse edit as input for the spatial edit) using coupled sampling. We create various 3D spatial edits, shown as "Coarse Edit", to prompt the edits. We compare against prior works on composing diffusion models, as well as SDEdit (which the Magic Fixup paper suggests as an alternative that also takes a coarse edit as input). We also include the outputs of Stable Virtual Camera and Magic Fixup without composing them.

[Interactive comparison gallery with six examples. For each example we show: Input, Coarse Edit, Liu et al. 2022, Du et al. 2023, SDEdit, Stable Virtual Camera, Per-Image Magic Fixup, and Ours.]

Envmap Conditioned Relighting (Coupling Stable Virtual Camera and Neural-Gaffer)

We perform environment-map-conditioned relighting by steering Stable Virtual Camera with Neural Gaffer using coupled sampling. We compare against prior works on composing diffusion models, as well as a specialized 3D relighting baseline that combines Neural Gaffer with a ground-truth NeRF (a privileged input). We also include the outputs of Stable Virtual Camera and Neural Gaffer without composing them.

[Interactive comparison gallery with five examples. For each example we show: Input, Liu et al. 2022, Du et al. 2023, Neural Gaffer 3D, Stable Virtual Camera, Per-Image Neural-Gaffer, Ours, and Ground Truth.]

How does it work?

Coupled Diffusion Overview

Let's illustrate coupled diffusion sampling with text-to-image generation. Normally, when we generate images for two different prompts (e.g., "Japanese Samurai" and "Astronaut on Mars"), we get completely independent samples (left). When we introduce coupling in the sampling process (effectively nudging the intermediate denoised latents toward each other), we get samples that are spatially aligned while each image remains faithful to its conditioning prompt (right). When we apply this coupling between a 2D editing model (run on each image individually) and a multi-view diffusion model, we obtain multi-view editing in a training-free manner!
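
To make the mechanism concrete, below is a minimal sketch of a coupled sampling loop under an assumed, simplified interface: step_a and step_b are hypothetical single-step denoising functions for the two diffusion models, and lam is the coupling strength. The actual implementation may differ in the sampler details and where exactly the coupling is applied.

import torch

def coupled_sampling(step_a, step_b, x_a, x_b, timesteps, lam=0.01):
    """Minimal sketch of coupled diffusion sampling (hypothetical interface).

    step_a, step_b: callables mapping (latent, t) to the next, less-noisy
    latent under each diffusion model (e.g., one DDIM step each).
    lam: coupling strength; after every denoising step the two latents are
    nudged toward each other along the gradient of ||x_a - x_b||^2.
    """
    for t in timesteps:
        # Independent denoising step under each model / prompt.
        x_a = step_a(x_a, t)
        x_b = step_b(x_b, t)

        # Coupling update: pull the intermediate latents toward each other.
        # (The factor of 2 from the L2 gradient is absorbed into lam.)
        diff = x_a - x_b
        x_a, x_b = x_a - lam * diff, x_b + lam * diff
    return x_a, x_b

# Toy usage with identity "models" and random latents, for illustration only.
if __name__ == "__main__":
    za, zb = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
    identity_step = lambda x, t: x
    za, zb = coupled_sampling(identity_step, identity_step, za, zb, range(50))

With lam = 0 the loop reduces to two independent samplers; larger lam trades per-model independence for alignment between the two trajectories.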

Application: Generating video editing data (Wan2.1)

We can create a paired video-editing dataset by coupling a text-to-video model with itself, using two different prompts. As we show below, coupling samples with distinct prompts produces results that are highly aligned yet remain faithful to their respective prompts. We generate these results with the video model Wan2.1 14B; a small sketch of how this self-coupling could be wired up follows the prompt list below.

"car in the desert..."
"motorcycle in the desert..."
"A man walking with an umbrella through a neon-lit city..."
"A woman running through a neon-lit city..."
"A woman running along a tropical beach..."
"A woman hiking up a snowy mountain ridge..."
"A man walking down a busy city street at sunset..."
"A woman walking down a busy city street at sunset..."

Application: Extending Video Generation (Wan2.1)

One application of coupled sampling for video is extending text-to-video generation. In particular, we can sample two videos with the same prompt and couple the end of the first video with the start of the second. Here we show an example of extended T2V generation with the prompt "A man walking in the city."
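
A minimal sketch of the coupling term for this extension is shown below, assuming the clips are frame-major latent tensors and that some number of overlapping frames (a hypothetical overlap parameter) is coupled; the exact frame ranges used in practice may differ.

import torch

def overlap_coupling_update(x1, x2, overlap, lam=0.01):
    """One coupling update for video extension (sketch).

    x1, x2: latent videos of shape (T, C, H, W) sampled with the same prompt.
    The last `overlap` frames of x1 are pulled toward the first `overlap`
    frames of x2 (and vice versa), so the two clips can be concatenated into
    one longer, coherent video after sampling.
    """
    diff = x1[-overlap:] - x2[:overlap]
    x1, x2 = x1.clone(), x2.clone()
    x1[-overlap:] -= lam * diff
    x2[:overlap] += lam * diff
    return x1, x2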

Distance Function Comparison

Our formulation requires choosing a distance function to minimize during coupled sampling. We use the standard Euclidean (L2) distance in our experiments, but here we show the effect of different distance functions. We find that L1 distance produces results similar to L2, as expected, since both metrics encourage minimizing the pixel-wise difference. In contrast, using cosine distance produces poor results.
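
For reference, the three distances compared below could be implemented roughly as follows (a sketch; the coupling update then takes a gradient step on the chosen distance):

import torch
import torch.nn.functional as F

def l2_distance(a, b):
    return (a - b).pow(2).sum()

def l1_distance(a, b):
    return (a - b).abs().sum()

def cosine_distance(a, b):
    return 1.0 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0)

def coupling_grad(dist_fn, a, b):
    # Gradient of the chosen distance w.r.t. latent a; the coupling step
    # is then a <- a - lam * grad (and symmetrically for b).
    a = a.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(dist_fn(a, b.detach()), a)
    return grad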

[Comparisons for three lighting targets (Rocky, Sea, Snow). For each we show: Input, L2 Distance, L1 Distance, Cosine Distance.]

Effects of Coupling Strength

The main hyperparameter in our method is the coupling strength λ. Here we show the effect of varying λ on the multi-view sample. When λ is 0.0, we get a typical image-to-multi-view output from Stable Virtual Camera. As we increase λ, the 2D model steers the remaining views to match the input. We find that λ in the range of approximately [0.01, 0.02] produces consistent and faithful results. As we increase λ further, the output exhibits inconsistencies and flickering. Naturally, the best value of λ depends on the backbone models used and the desired task.
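
A toy sweep over λ, reusing the coupled_sampling sketch above with placeholder identity models, shows the mechanical effect of the hyperparameter: larger values pull the two sampling trajectories closer together (in the real setting this is what trades flickering against over-constraining each model).

import torch

lams = [0.0, 0.005, 0.01, 0.015, 0.025, 0.035, 0.05]
z_a, z_b = torch.randn(8, 4, 64, 64), torch.randn(8, 4, 64, 64)
identity_step = lambda x, t: x
for lam in lams:
    out_a, out_b = coupled_sampling(identity_step, identity_step,
                                    z_a.clone(), z_b.clone(),
                                    range(50), lam=lam)
    # Larger lam -> smaller residual distance between the two samples.
    print(f"lam={lam}: mean abs diff = {(out_a - out_b).abs().mean().item():.3f}")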

[Interactive comparison: Input, and outputs for coupling strength λ ∈ {0.0, 0.005, 0.01, 0.015, 0.025, 0.035, 0.05}.]

Ablation: Guiding the Multi-View Model Only

Our main goal is to steer the multi-view samples using the 2D editing model. However, we show that it is essential to apply the coupling to both the multi-view and 2D models so that they guide each other. Otherwise, the 2D model steers the output in highly inconsistent directions, resulting in flickering output.
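
The difference between this ablation and the full method amounts to whether the coupling update moves one latent or both, roughly as in the sketch below (same notation as the earlier sketches):

def coupling_update(x_mv, x_2d, lam, both_sides=True):
    # One-sided coupling (the ablation) only moves the multi-view latent
    # toward the 2D latent; the full method also guides the 2D latent back.
    diff = x_mv - x_2d
    x_mv = x_mv - lam * diff
    if both_sides:
        x_2d = x_2d + lam * diff
    return x_mv, x_2d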

[Comparison: Input, Only Guiding Multi-view, Ours.]

Limitations and Failure Cases

2D Model Identity Preservation

In our multi-view editing, the source of identity preservation is the 2D model. This is because the multi-view generation is conditioned on a single edited image (since we cannot produce multiple consistent edited images), and the only way the multi-view model "observes" the input is through its coupling with the 2D model. As a result, when the 2D model struggles to perfectly preserve the identity (as shown here with IC-Light relighting), we inherit the 2D model's identity-preservation limitations. However, this limitation will naturally diminish as the underlying base models improve.

[Two examples, each showing the Input and the Relit Output.]

Out-of-Distribution Edits

Since we rely on pre-trained models, steering them toward out-of-distribution edits cannot guarantee multi-view consistency. For example, when relighting to a very dark target lighting, as shown below, we can encounter some flickering in the coupled output. However, we expect that as these models are trained on larger datasets, they will become increasingly resilient.

[Video: relighting result for an out-of-distribution (very dark) target lighting.]

BibTeX

@article{alzayer2025coupleddiffusion,
    title={Coupled Diffusion Sampling for Training-Free Multi-View Image Editing},
    author={Alzayer, Hadi and Zhang, Yunzhi and Geng, Chen and Huang, Jia-Bin and Wu, Jiajun},
    journal={arXiv preprint arXiv:2510.14981},
    year={2025}
}