Abstract
One of the core challenges in 3D computer vision is the estimation of 3D scene geometry. Traditionally, this task has been tackled predominantly with well-established, time-tested methods such as Structure-from-Motion, in which a pipeline of smaller algorithms each handles a specific subtask. However, treating each task separately makes the overall pipeline susceptible to errors and noise, which propagate to subsequent modules. Although recent work has improved the accuracy of such pipelines, this problem persists. A recent method, called DUSt3R \cite{wang_dust3r_2023}, addresses this issue with a holistic approach. It takes a pair of images as input and extracts information-rich structures called pointmaps. These pointmaps can be used in various downstream tasks, including camera parameter estimation, point matching, 3D reconstruction, and depth estimation. The success of this model makes it a promising tool in 3D computer vision.
To extend this functionality to an arbitrary number of images, \citet{wang_dust3r_2023} introduced a Global Alignment method that processes the images in pairs and applies an optimization algorithm to place the pointmaps in a common coordinate frame. However, the proposed alignment method suffers from quadratic computational complexity. In this thesis, we propose a novel, scalable Global Alignment method that reduces the original $O(N^2)$ complexity to a theoretical upper bound of $O(kB^2)$, where $B$