Coupled Diffusion Sampling for Training-Free Multi-View Image Editing

Hadi Alzayer1,2 Yunzhi Zhang1 Chen Geng1 Jia-Bin Huang2 Jiajun Wu1
1Stanford University 2University of Maryland

TL;DR: We propose a general framework for combining diffusion models. One application is lifting various 2D editing diffusion models to multi-view by coupling the 2D model with a multi-view diffusion model.


Spatial Editing Results

Here we combine Magic Fixup (a 2D spatial editing model) and Stable Virtual Camera (a multi-view diffusion model) to obtain multi-view spatial editing.

Stylization Results

Here we combine ControlNet (a 2D stylization model) and Stable Virtual Camera (a multi-view diffusion model) to obtain multi-view stylization.

Relighting Results

Neural Gaffer Relighting

Here we combine Neural Gaffer (a 2D relighting model conditioned on environment maps) and Stable Virtual Camera (a multi-view diffusion model) to obtain multi-view relighting.

IC-Light Relighting

Here we combine IC-Light (a 2D text-conditioned relighting model) and Stable Virtual Camera (a multi-view diffusion model) to obtain multi-view text-conditioned relighting.

How does it work?

Let's illustrate coupled diffusion sampling with text-to-image generation. Normally, when we generate images for two different prompts (e.g. "Japanese Samurai" and "Astronaut on Mars"), we get completely independent samples (left). When we introduce coupling into the sampling process, effectively nudging the intermediate denoised latents toward each other, we get samples that are spatially aligned while each image remains faithful to its conditioning prompt (right). When we apply this coupling between a 2D editing model (run on each view individually) and a multi-view diffusion model, we obtain multi-view editing in a training-free manner!
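For intuition, here is a minimal PyTorch sketch of what such coupled sampling could look like. The names denoise_a, denoise_b, and coupling_weight, and the simple pull-toward-the-mean update, are illustrative assumptions rather than the paper's exact formulation; the point is only that the two samplers run in lockstep and their intermediate latents are nudged toward each other at every step.

import torch

def coupled_sampling(denoise_a, denoise_b, shape, num_steps=50,
                     coupling_weight=0.2, device="cpu"):
    """Sample from two diffusion models in lockstep, nudging their
    intermediate latents toward each other after every denoising step.

    denoise_a, denoise_b: hypothetical callables mapping (latent, step)
        to the next latent, standing in for one sampler step of each
        model (e.g. a 2D editing model applied per view, and a
        multi-view diffusion model).
    coupling_weight: strength of the nudge; 0 recovers independent sampling.
    """
    x_a = torch.randn(shape, device=device)
    x_b = torch.randn(shape, device=device)
    for t in range(num_steps):
        # Each model performs its own denoising step.
        x_a = denoise_a(x_a, t)
        x_b = denoise_b(x_b, t)
        # Coupling: pull both latents toward their mean so the two
        # trajectories stay aligned while each model keeps following
        # its own conditioning.
        mean = 0.5 * (x_a + x_b)
        x_a = torch.lerp(x_a, mean, coupling_weight)
        x_b = torch.lerp(x_b, mean, coupling_weight)
    return x_a, x_b

Setting coupling_weight to 0 recovers two fully independent samples, while larger values trade per-model fidelity for cross-sample alignment.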

Coupling Text-to-Image Generation

By introducing coupling into text-to-image generation with FLUX, we can produce pairs of samples that are closely aligned, while each sample remains faithful to its conditioning prompt.

Coupling Text-to-Video Generation

We can also perform coupled sampling in text-to-video generation with the Wan2.1 video model. Here we show coupled video samples that illustrate close alignment between the generated videos.

BibTeX

@article{alzayer2025coupleddiffusion,
  title={Coupled Diffusion Sampling for Training-Free Multi-View Image Editing},
  author={Alzayer, Hadi and Zhang, Yunzhi and Geng, Chen and Huang, Jia-Bin and Wu, Jiajun},
  journal={arXiv preprint arXiv:2510.14981},
  year={2025}
}