Predicting realistic ground views from satellite imagery in urban scenes is a challenging task due to the significant view gap between satellite and ground-view images. We propose a novel pipeline to tackle this challenge by generating geospecific views that maximally respect the weak geometry and texture from multi-view satellite images. Unlike existing approaches that hallucinate images from cues such as partial semantics or geometry derived from overhead satellite images, our method directly predicts ground-view images at geolocation using a comprehensive set of information from the satellite image, resulting in ground-level images with a resolution boost of a factor of ten or more. We leverage a novel building refinement method to reduce geometric distortions in satellite data at ground level, which ensures the creation of accurate conditions for view synthesis using diffusion networks. Moreover, we propose a novel geospecific prior, which prompts the distribution learning of diffusion models to respect image samples that are closer to the geolocation of the predicted images. We demonstrate that our pipeline is the first to generate close-to-real and geospecific ground views based merely on satellite images.
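One way to read the geospecific prior is as a preference for training or reference samples near the target geolocation. The sketch below illustrates that idea with a distance-based weighting: samples are down-weighted exponentially by their great-circle distance from the predicted view's location. The exponential kernel and the `scale_km` bandwidth are illustrative assumptions, not the paper's exact formulation.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def geospecific_weights(sample_coords, target_coord, scale_km=50.0):
    """Weight each (lat, lon) training sample by proximity to the target
    geolocation; nearer samples get weights closer to 1.

    scale_km is a hypothetical bandwidth controlling how quickly the
    influence of distant samples decays.
    """
    return [math.exp(-haversine_km(*c, *target_coord) / scale_km)
            for c in sample_coords]
```

Such weights could, for example, bias a fine-tuning data sampler toward imagery from the same city as the target view.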
Overview of our pipeline. Top-down View Stage and Projection Stage: the satellite textures are projected onto the refined 3D geometry and then projected back to ground-view 2D space. Ground-view Stage: the ground-view satellite texture and the corresponding high-frequency layout information serve as the conditions. Texture-guided Generation Stage: we use a recent diffusion model conditioned on the ground-view satellite textures and high-frequency information, together with the geospecific prior.
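The projection step above maps 3D points carrying satellite texture into a ground-view image plane. A common choice for ground views is an equirectangular panorama; the sketch below projects world points to panoramic pixel coordinates for a camera at a given geolocation. The axis convention (x east, y north, z up) and the panorama resolution are assumptions for illustration.

```python
import numpy as np

def world_to_pano(points, cam_pos, width=1024, height=512):
    """Project 3D world points to equirectangular (panoramic) pixel
    coordinates for a ground-level camera at cam_pos.

    points: (N, 3) array of XYZ coordinates (x east, y north, z up).
    Returns an (N, 2) array of (u, v) pixel coordinates.
    """
    d = points - cam_pos                         # rays from camera to points
    azimuth = np.arctan2(d[:, 0], d[:, 1])       # angle about the vertical axis
    elevation = np.arctan2(d[:, 2], np.linalg.norm(d[:, :2], axis=1))
    u = (azimuth / (2 * np.pi) + 0.5) * width    # wrap azimuth across [0, W)
    v = (0.5 - elevation / np.pi) * height       # zenith maps to the top row
    return np.stack([u, v], axis=1)
```

Splatting the satellite texture of each 3D point at its projected (u, v) yields the ground-view satellite texture used as a condition in the generation stage.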
The inherent randomness of the diffusion model makes the synthesis results (marked with orange rectangles) inconsistent with their neighboring views.
A large body of excellent satellite-to-ground synthesis work was introduced prior to ours.
Some geometry-based methods, such as Sat2Scene and Sat2Vid, bridge the top-down and ground views through predicted geometry and then perform ground-view synthesis by embedding the texture into point clouds.
Other works, such as CrossMLP and PanoGAN, directly learn the relationship between top-down views and ground views with auxiliary information such as semantics.
@misc{xu2024geospecificviewgeneration,
  title={Geospecific View Generation -- Geometry-Context Aware High-resolution Ground View Inference from Satellite Views},
  author={Ningli Xu and Rongjun Qin},
  year={2024},
  eprint={2407.08061},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.08061},
}