Submitted by alik31239 t3_117blae in MachineLearning

In the abstract of the NeRF paper (https://arxiv.org/abs/2003.08934), the described framework is the following: the user inputs a set of images with known camera poses, and after training the network they can generate images of the same scene from new angles.

However, the paper itself builds a network that takes 5D vectors as input (3 location coordinates + 2 viewing angles) and outputs a color and volume density for each such coordinate. I don't understand where I'm supposed to get those 5D coordinates from. My training data certainly doesn't have them - I only have a collection of images. The same goes for inference data. It seems that the paper assumes not only a collection of images but also a 3D representation of the scene, while the abstract doesn't require the latter. What am I missing here?
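For concreteness, here's a bare-bones PyTorch sketch (my own simplification, not the paper's actual architecture - it drops the positional encoding, the skip connection and the coarse/fine pair) of the interface I'm describing: a 5D query in, (RGB, density) out.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """5D coordinate in -> (RGB, volume density) out."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),      # x, y, z, theta, phi
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                 # r, g, b, sigma
        )

    def forward(self, coords_5d):                 # coords_5d: (batch, 5)
        out = self.net(coords_5d)
        rgb = torch.sigmoid(out[..., :3])         # colors squashed to [0, 1]
        sigma = torch.relu(out[..., 3:])          # non-negative volume density
        return rgb, sigma
```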

24

Comments


marixer t1_j9b0x65 wrote

The step you're missing is estimating the camera positions and angles with something like COLMAP, which recovers them by extracting features from the images, matching them across views and triangulating. That pose data is then used alongside the RGB images to train the NeRF
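To make that concrete, here's a rough NumPy sketch (my own, with assumed names and a plain pinhole camera) of how a recovered pose turns into the 5D inputs: cast a ray through each pixel, sample points along it, and each sample's (x, y, z) plus the ray's viewing direction is what the network gets queried with.

```python
import numpy as np

def rays_for_image(H, W, focal, c2w):
    """Per-pixel ray origins and unit directions in world space,
    given image size, focal length and a 4x4 camera-to-world pose."""
    i, j = np.meshgrid(np.arange(W), np.arange(H), indexing="xy")
    # Directions in camera space (camera looks down -z, as in the paper).
    dirs = np.stack([(i - W / 2) / focal,
                     -(j - H / 2) / focal,
                     -np.ones_like(i, dtype=float)], axis=-1)
    dirs_world = dirs @ c2w[:3, :3].T                       # rotate into world space
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    return origins, dirs_world

def sample_5d_inputs(origins, dirs, near=2.0, far=6.0, n_samples=64):
    """Points along each ray; (x, y, z) plus the ray's viewing direction
    (a unit vector, i.e. the two angles theta/phi) are the 5D inputs."""
    t = np.linspace(near, far, n_samples)
    pts = origins[..., None, :] + dirs[..., None, :] * t[:, None]   # (H, W, n_samples, 3)
    view_dirs = np.broadcast_to(dirs[..., None, :], pts.shape)      # (H, W, n_samples, 3)
    return pts, view_dirs
```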

36

buyIdris666 t1_j9ctyh8 wrote

Yup. NeRF just replaced the reconstruction step that comes after you "register" all the camera positions using traditional algorithms, usually via COLMAP.

Not saying that's a bad thing - existing algorithms are already good at estimating camera positions and parameters. It was the 3D reconstruction step that was previously lacking.

For anyone wanting to try this, I suggest using NeRF-W. The original NeRF required extremely accurate camera parameter estimates that you're not going to get with a cell-phone camera and COLMAP. NeRF-W is capable of making some fine adjustments to the poses as it runs. It even works decently well at reconstructing scenes from random internet photos.

The workflow is: run COLMAP to register the camera positions used to take the pictures and estimate the camera parameters, then export those into the NeRF model. Most of the NeRF repos are already set up to make this easy.
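As a rough illustration of the export step (assuming COLMAP's documented text output, images.txt, and a plain quaternion-to-matrix conversion), something like this parses the registered poses and builds the 4x4 camera-to-world matrices most NeRF repos expect:

```python
import numpy as np

def quat_to_rot(qw, qx, qy, qz):
    """Rotation matrix from a unit quaternion (w, x, y, z)."""
    return np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])

def load_colmap_poses(images_txt="sparse/0/images.txt"):
    poses = {}
    with open(images_txt) as f:
        lines = [l.strip() for l in f if l.strip() and not l.startswith("#")]
    # Each registered image takes two lines; the first holds the pose,
    # the second the 2D keypoints, which we skip here.
    for line in lines[::2]:
        fields = line.split()
        qw, qx, qy, qz, tx, ty, tz = map(float, fields[1:8])
        name = fields[9]
        R = quat_to_rot(qw, qx, qy, qz)   # COLMAP stores world-to-camera
        t = np.array([tx, ty, tz])
        c2w = np.eye(4)
        c2w[:3, :3] = R.T                 # invert to get camera-to-world
        c2w[:3, 3] = -R.T @ t
        poses[name] = c2w
    return poses
```

Bear in mind that COLMAP's camera convention differs from the one most NeRF code assumes, so the conversion scripts that ship with the repos usually also flip a couple of axes - easiest to just use whatever script your chosen repo provides.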

This paper is a good overview of how to build a NeRF from random unaligned images. They did it using frames from a sitcom, but you could take a similar approach to NeRF almost anything: https://arxiv.org/abs/2207.14279

12

MediumOrder5478 t1_j9b7ln5 wrote

You need to use a program like COLMAP for sparse scene reconstruction to recover the camera intrinsics (focal length, lens distortion) and extrinsics (camera position and orientation)
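If those terms are new, here's a toy NumPy example (made-up numbers) of what they do: the extrinsics move a world point into the camera frame, the intrinsics project it to a pixel, and lens distortion would be an extra correction on top of this.

```python
import numpy as np

fx = fy = 800.0               # intrinsics: focal length in pixels
cx, cy = 320.0, 240.0         # intrinsics: principal point
K = np.array([[fx, 0, cx],
              [0, fy, cy],
              [0,  0,  1]])

R = np.eye(3)                        # extrinsics: camera orientation (world-to-camera)
t = np.array([0.0, 0.0, 4.0])        # extrinsics: camera translation

X_world = np.array([0.5, -0.2, 0.0])   # a 3D point in the scene
x_cam = R @ X_world + t                # extrinsics move it into the camera frame
u, v, w = K @ x_cam                    # intrinsics project it onto the image plane
print(u / w, v / w)                    # pixel coordinates (no distortion modeled here)
```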

1

Pyramid_Jumper t1_j9ayed5 wrote

Been a while since I’ve read the paper, but I don’t think you’re missing anything - apart from data in the correct format, that is. You’ll need the aforementioned 5D vectors to be able to train/use this model.

If you can’t get that data, then I’d suggest you look at other work that cites NeRF and may have data in a format similar to the data you do have.

−2

harharveryfunny t1_j9aydo9 wrote

Here's the key, thanks to ChatGPT:

Data preparation: First, the training data is preprocessed to convert the 2D images and camera poses into a set of 3D points and corresponding colors. Each 2D image is projected onto a 3D point cloud using the corresponding camera pose, resulting in a set of 3D points with associated colors.

−8

harharveryfunny t1_j9b30et wrote

Not sure why this got downvoted given that it's correct. ChatGPT is also quite capable of explaining how this mapping is learnt (using a view-consistency loss that maps from the 3D voxels back to a 2D view and compares it to the image).
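For what it's worth, here's a rough NumPy sketch of that render-back-and-compare idea (names and shapes are my own assumptions, and it's only the compositing step, not the full pipeline): the predicted (color, density) samples along a ray get alpha-composited into one pixel, which is compared against the corresponding pixel of the training image.

```python
import numpy as np

def render_ray(colors, sigmas, t_vals):
    """colors: (N, 3), sigmas: (N,), t_vals: (N,) sample depths along one ray."""
    deltas = np.append(np.diff(t_vals), 1e10)               # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)                  # opacity of each segment
    trans = np.cumprod(np.append(1.0, 1.0 - alphas))[:-1]    # accumulated transmittance
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)           # rendered RGB for this pixel

def photometric_loss(rendered_rgb, true_rgb):
    return np.sum((rendered_rgb - true_rgb) ** 2)            # compare to the input image
```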

−3

tdgros t1_j9b43pe wrote

It's downvoted because it doesn't add anything to the conversation. OP has already stated that they know what the input is; they just don't know where to get it from. Someone already answered correctly at the top.

8

harharveryfunny t1_j9bf30y wrote

OP's question seems to be how to get from 2D images to the 3D voxels, no? But anyway, if they've got their answer, that's good.

Edit: I guess they were talking about camera position for the photos, not mapping to 3D.

−4

tdgros t1_j9bfds3 wrote

Just read the post!

>However, the paper itself builds a network that takes 5D vectors as input (3 location coordinates + 2 viewing angles) and outputs a color and volume density for each such coordinate. I don't understand where I'm supposed to get those 5D coordinates from. My training data certainly doesn't have them - I only have a collection of images.

7