Zero-Shot Novel View and Depth Synthesis
with Multi-View Geometric Diffusion

Toyota Research Institute

Abstract

Current methods for 3D scene reconstruction from sparse posed images employ intermediate 3D representations, such as neural fields, voxel grids, or 3D Gaussians, to achieve multi-view consistent scene appearance and geometry.

In this paper we introduce Multi-view Geometric Diffusion (MVGD), a diffusion-based architecture capable of direct pixel-level generation of images and depth maps from novel viewpoints, given an arbitrary number of input views. Our method uses raymap conditioning both to augment visual features with spatial information from different viewpoints and to guide the generation of images and depth maps from novel views. A key aspect of our approach is the multi-task generation of images and depth maps, using learnable task embeddings to guide the diffusion process towards specific modalities.

We train this model on a collection of more than 60 million multi-view samples from publicly available datasets, and propose techniques to enable efficient and consistent learning in such diverse conditions. We also propose a novel strategy that enables the efficient training of larger models by incrementally fine-tuning smaller ones, with promising scaling behavior. Through extensive experiments, we report state-of-the-art results in multiple novel view synthesis benchmarks, as well as multi-view stereo and video depth estimation.
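The abstract mentions two conditioning signals: per-pixel raymaps derived from camera geometry, and learnable task embeddings that select the output modality. The snippet below is a minimal sketch of what such conditioning inputs could look like; it is not the authors' implementation, and the tensor shapes, module names, and origin-plus-direction raymap parameterization are assumptions.

# Sketch only: assumed shapes and conventions, not the released MVGD code.
import torch
import torch.nn as nn

def compute_raymap(K, cam_to_world, height, width):
    """Per-pixel ray origins and directions in the world frame, shape (6, H, W)."""
    v, u = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)  # (H, W, 3) homogeneous pixels
    dirs_cam = pix @ torch.linalg.inv(K).T                             # rays in the camera frame
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T                     # rotate into the world frame
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origins = cam_to_world[:3, 3].expand_as(dirs_world)                # shared camera center
    return torch.cat([origins, dirs_world], dim=-1).permute(2, 0, 1)   # (6, H, W)

class TaskEmbedding(nn.Module):
    """Learnable embeddings that steer generation toward a specific modality."""

    TASKS = {"image": 0, "depth": 1}

    def __init__(self, dim):
        super().__init__()
        self.table = nn.Embedding(len(self.TASKS), dim)

    def forward(self, task):
        return self.table(torch.tensor(self.TASKS[task]))

# Example: conditioning inputs for one 256x256 novel view (identity pose, assumed intrinsics).
K = torch.tensor([[128.0, 0.0, 128.0],
                  [0.0, 128.0, 128.0],
                  [0.0, 0.0, 1.0]])
raymap = compute_raymap(K, torch.eye(4), 256, 256)   # (6, 256, 256) conditioning raymap
task_token = TaskEmbedding(dim=768)("depth")         # (768,) modality token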

Novel view and depth synthesis results using incremental conditioning.

Red cameras indicate initial conditioning views, used to generate predictions for green cameras. After each generation, the predicted image is added to the set of conditioning views for future generations.
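As a rough illustration of this incremental conditioning loop, the sketch below grows the conditioning set with each prediction. The model.generate interface and its arguments are hypothetical, not the released API.

# Sketch only: hypothetical interface for the incremental conditioning loop.
def incremental_synthesis(model, cond_images, cond_cameras, novel_cameras):
    """Generate novel views one at a time, growing the conditioning set."""
    images, depths = [], []
    for camera in novel_cameras:
        # Generate an image and a depth map for the target camera,
        # conditioned on everything given or produced so far.
        image = model.generate(cond_images, cond_cameras, camera, task="image")
        depth = model.generate(cond_images, cond_cameras, camera, task="depth")
        images.append(image)
        depths.append(depth)
        # Feed the prediction back in as a conditioning view.
        cond_images = cond_images + [image]
        cond_cameras = cond_cameras + [camera]
    return images, depths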


Accumulated Pointcloud Result

We obtained these accumulated pointclouds by generating novel images and depth maps from various viewpoints (black cameras), using the same conditioning views (colored cameras), and stacking them together without any post-processing.
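A minimal sketch of this accumulation step follows: each predicted depth map is unprojected with its camera, colored by the corresponding predicted image, and concatenated across viewpoints without any post-processing. The shapes and conventions (pixel grid layout, camera-to-world poses) are assumptions rather than details from the paper.

# Sketch only: assumed shapes (depth (H, W), image (3, H, W), camera-to-world poses).
import torch

def unproject(depth, image, K, cam_to_world):
    """Lift a depth map to colored 3D points in the world frame."""
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)  # (H, W, 3)
    pts_cam = (pix @ torch.linalg.inv(K).T) * depth.unsqueeze(-1)      # back-project by depth
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    colors = image.permute(1, 2, 0)                                     # (H, W, 3)
    return pts_world.reshape(-1, 3), colors.reshape(-1, 3)

def accumulate(depths, images, Ks, poses):
    """Stack per-view colored point clouds without any post-processing."""
    points, colors = zip(*[
        unproject(d, i, K, T) for d, i, K, T in zip(depths, images, Ks, poses)
    ])
    return torch.cat(points), torch.cat(colors)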

Stacked Pointcloud Qualitative Result

To highlight the multi-view consistency of our method, predicted colored pointclouds from all novel viewpoints are stacked together for visualization without any post-processing. Red cameras are used as conditioning views, and novel images and depth maps are generated from green cameras.

BibTeX

If you find our paper useful, please consider citing:

@misc{guizilini2025zeroshotnovelviewdepth,
  title={Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion},
  author={Vitor Guizilini and Muhammad Zubair Irshad and Dian Chen and Greg Shakhnarovich and Rares Ambrus},
  year={2025},
  eprint={2501.18804},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.18804},
}