TL;DR: We propose a general framework for combining diffusion models. One application is lifting a variety of 2D editing diffusion models to multi-view by combining the 2D model with a multi-view diffusion model.
Here we combine Magic Fixup (2D spatial editing model) and Stable Virtual Camera (multi-view diffusion model) to obtain multi-view spatial editing
Here we combine ControlNet (2D stylization model) and Stable Virtual Camera (multi-view diffusion model) to obtain multi-view stylization
Here we combine Neural Gaffer (2D relighting model conditioned on Environment Maps) and Stable Virtual Camera (multi-view diffusion model) to obtain multi-view relighting
Here we combine IC-Light (2D text-conditioned relighting model) and Stable Virtual Camera (multi-view diffusion model) to obtain multi-view text-conditioned relighting
Let's illustrate coupled diffusion sampling with text-to-image generation. Normally, when we generate images for two different prompts (e.g. "Japanese Samurai" and "Astronaut on Mars"), we get completely independent samples (left). When we introduce coupling into the sampling process (effectively nudging the intermediate denoised latents towards each other), we get samples that are spatially aligned, while each image remains faithful to its conditioning prompt (right). When we instead couple a 2D editing model (applied to each image individually) with a multi-view diffusion model, we obtain multi-view editing in a training-free manner!
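To make the idea concrete, here is a minimal sketch of what coupled sampling could look like for two denoising chains. This is illustrative pseudocode under our own assumptions, not the paper's exact update rule: `model_a.step` / `model_b.step` stand in for a standard DDIM-style denoising step, and `coupling_weight` is a hypothetical knob controlling how strongly the intermediate latents are nudged toward each other.

```python
import torch

def coupled_sample(model_a, cond_a, model_b, cond_b, timesteps, coupling_weight=0.1):
    """Sample two diffusion chains jointly by nudging their latents together.

    model_a / model_b: diffusion models exposing a `step(x, t, cond)` method
    that denoises latent x from timestep t (hypothetical interface).
    cond_a / cond_b: each model's own conditioning (e.g. a text prompt embedding).
    """
    # Start both chains from independent Gaussian noise of the same shape.
    x_a = torch.randn(1, 4, 64, 64)
    x_b = torch.randn(1, 4, 64, 64)

    for t in timesteps:
        # Each model takes its own denoising step with its own conditioning.
        x_a = model_a.step(x_a, t, cond_a)
        x_b = model_b.step(x_b, t, cond_b)

        # Coupling: pull the two intermediate latents toward their mean,
        # so the samples become spatially aligned while each one stays
        # faithful to its own conditioning.
        mean = 0.5 * (x_a + x_b)
        x_a = x_a + coupling_weight * (mean - x_a)
        x_b = x_b + coupling_weight * (mean - x_b)

    return x_a, x_b
```

The same pattern extends to coupling two different models, e.g. a 2D editing model run per view and a multi-view diffusion model run over all views, which is how the multi-view editing results above are obtained.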
By introducing coupling into text-to-image generation with FLUX, we can produce pairs of samples that are closely aligned, while each sample remains faithful to its conditioning prompt.
We can also perform coupled sampling with text-to-video generation using the Wan2.1 video model. Here we show coupled video samples that illustrate close alignment between the generated videos.
@article{alzayer2025coupleddiffusion,
title={Coupled Diffusion Sampling for Training-Free Multi-View Image Editing},
author={Alzayer, Hadi and Zhang, Yunzhi and Geng, Chen and Huang, Jia-Bin and Wu, Jiajun},
journal={arXiv preprint arXiv:2510.14981},
year={2025}
}