3D graphics have become a critical component in many fields, including video games, film production, architectural visualization, medical imaging, and virtual reality. Traditionally, a 3D model was reconstructed by estimating a depth map for each input image and fusing the depth maps into a complete 3D representation. However, this approach often produced missing geometry and artifacts wherever the depth maps disagreed, typically on transparent or low-textured surfaces.
To overcome these limitations, a team of researchers from Apple and the University of California, Santa Barbara proposed a method that directly infers scene-level 3D geometry with a deep neural network. Unlike the traditional pipeline, their approach eliminates the need for test-time optimization.
The researchers used a 3D convolutional neural network (CNN): image features are back-projected onto a voxel grid, and the network predicts the scene's truncated signed distance function (TSDF). The CNN produces smooth, consistent surfaces that fill gaps in low-textured or transparent regions. During training, the researchers used trilinear interpolation to align the ground-truth TSDF with the model's voxel grid, although this interpolation introduced noise into the fine details.
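To make the interpolation step concrete, here is a minimal NumPy sketch of trilinear sampling: given a dense TSDF volume and continuous query coordinates in voxel units, each sample is a weighted average of the eight surrounding voxel values. The function name and the (D, H, W) layout are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def trilinear_sample(volume, pts):
    """Trilinearly interpolate `volume` (D, H, W) at continuous
    coordinates `pts` (N, 3), given in voxel units."""
    # Clamp so all 8 surrounding corners stay inside the volume.
    pts = np.clip(pts, 0, np.array(volume.shape) - 1 - 1e-6)
    lo = np.floor(pts).astype(int)   # lower corner indices
    f = pts - lo                     # fractional offsets in [0, 1)
    out = np.zeros(len(pts))
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                # Weight of this corner: product of per-axis weights.
                w = (np.where(dz, f[:, 0], 1 - f[:, 0])
                     * np.where(dy, f[:, 1], 1 - f[:, 1])
                     * np.where(dx, f[:, 2], 1 - f[:, 2]))
                out += w * volume[lo[:, 0] + dz, lo[:, 1] + dy, lo[:, 2] + dx]
    return out
```

Because each sample blends eight neighboring TSDF values, sharp zero-crossings in the ground truth are smoothed, which is exactly the source of the detail loss the article describes.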
To improve results, the researchers instead supervised predictions at points where the ground-truth TSDF is known exactly, which improved accuracy by about 10%. However, the voxels, which discretize 3D space, were too coarse (4 cm or larger) to capture the finer geometric detail visible in natural images. To address this, the researchers sampled image features directly at each query point via back-projection, in addition to the CNN's grid features.
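The per-point feature sampling works by projecting a 3D query point into each view with the camera parameters and reading the feature map at the resulting pixel. The following is a hedged single-view sketch using a standard pinhole model; the function name, nearest-neighbor lookup (bilinear in practice), and matrix conventions are assumptions for illustration.

```python
import numpy as np

def sample_image_feature(feat, K, cam_from_world, point_world):
    """Project a world-space query point into one view and sample its
    feature map with a nearest-neighbor lookup.
    feat: (C, H, W) feature map; K: (3, 3) intrinsics;
    cam_from_world: (4, 4) world-to-camera transform."""
    p = cam_from_world @ np.append(point_world, 1.0)  # to camera space
    if p[2] <= 0:                                     # behind the camera
        return None
    uv = (K @ p[:3])[:2] / p[2]                       # perspective divide
    u, v = int(round(uv[0])), int(round(uv[1]))
    _, H, W = feat.shape
    if not (0 <= u < W and 0 <= v < H):               # outside the image
        return None
    return feat[:, v, u]
```

In a multi-view setting, the features sampled from all visible views would then be aggregated (e.g., averaged) before being fed to the prediction head, so each query point sees image detail at full image resolution rather than voxel resolution.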
Back-projection alone blurs the feature volume, because each pixel's features are smeared along its entire viewing ray. The researchers mitigated this by running an initial multi-view stereo depth estimation and using it to guide the feature volume. With this guidance, the network learned fine details without additional training or extra 3D convolution layers.
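One simple way such depth guidance can be realized, shown purely as an illustrative assumption rather than the paper's actual mechanism, is to down-weight back-projected features for voxels whose camera-space depth lies far from the stereo depth estimate, concentrating features near the estimated surface instead of smearing them along the ray:

```python
import numpy as np

def depth_guided_weight(voxel_depths, est_depth, tau=0.08):
    """Down-weight features for voxels far from the multi-view-stereo
    depth estimate, reducing blur along the viewing ray.
    voxel_depths: (N,) camera-space depths of voxels on one pixel ray;
    est_depth: scalar MVS depth for that pixel; tau: falloff in meters."""
    return np.exp(-np.abs(voxel_depths - est_depth) / tau)
```

A voxel at exactly the estimated depth keeps full weight, while voxels in front of or behind the surface contribute exponentially less.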
In conclusion, the end-to-end network proposed by the Apple and UCSB researchers generates detailed 3D reconstructions from posed images. It improves on the traditional pipeline by reducing missing geometry and artifacts, and it allows the output resolution to be chosen without retraining or adding convolution layers.