Submitted by alik31239 t3_117blae in MachineLearning

In the abstract of the NeRF paper (https://arxiv.org/abs/2003.08934), the described framework is the following: the user inputs a set of images with known camera poses, and after training the network they can generate images of the same scene from new angles.

However, the paper itself builds a network that takes 5D vectors as input (3 location coordinates + 2 viewing angles) and outputs a color and volume density for each such coordinate. I don't understand where I'm supposed to get those 5D coordinates from. My training data certainly doesn't have them - I only have a collection of images. The same goes for inference data. It seems that the paper assumes not only a collection of images but also a 3D representation of the scene, while the abstract doesn't require the latter. What am I missing here?
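For concreteness, here's a bare-bones PyTorch sketch (my own simplification, not the paper's actual architecture - it drops the positional encoding, the skip connection and the coarse/fine pair) of the interface I'm describing: a 5D query in, (RGB, density) out.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """5D coordinate in -> (RGB, volume density) out."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),      # x, y, z, theta, phi
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                 # r, g, b, sigma
        )

    def forward(self, coords_5d):                 # coords_5d: (batch, 5)
        out = self.net(coords_5d)
        rgb = torch.sigmoid(out[..., :3])         # colors squashed to [0, 1]
        sigma = torch.relu(out[..., 3:])          # non-negative volume density
        return rgb, sigma
```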

24

Comments


marixer t1_j9b0x65 wrote

The step you're missing is estimating the camera positions and angles with something like COLMAP, which recovers them by extracting features from the images, matching them across views and triangulating. That pose data is then used alongside the RGB images to train the NeRF
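To make that concrete, here's a rough NumPy sketch (my own, with assumed names and a plain pinhole camera) of how a recovered pose turns into the 5D inputs: cast a ray through each pixel, sample points along it, and each sample's (x, y, z) plus the ray's viewing direction is what the network gets queried with.

```python
import numpy as np

def rays_for_image(H, W, focal, c2w):
    """Per-pixel ray origins and unit directions in world space,
    given image size, focal length and a 4x4 camera-to-world pose."""
    i, j = np.meshgrid(np.arange(W), np.arange(H), indexing="xy")
    # Directions in camera space (camera looks down -z, as in the paper).
    dirs = np.stack([(i - W / 2) / focal,
                     -(j - H / 2) / focal,
                     -np.ones_like(i, dtype=float)], axis=-1)
    dirs_world = dirs @ c2w[:3, :3].T                       # rotate into world space
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    return origins, dirs_world

def sample_5d_inputs(origins, dirs, near=2.0, far=6.0, n_samples=64):
    """Points along each ray; (x, y, z) plus the ray's viewing direction
    (a unit vector, i.e. the two angles theta/phi) are the 5D inputs."""
    t = np.linspace(near, far, n_samples)
    pts = origins[..., None, :] + dirs[..., None, :] * t[:, None]   # (H, W, n_samples, 3)
    view_dirs = np.broadcast_to(dirs[..., None, :], pts.shape)      # (H, W, n_samples, 3)
    return pts, view_dirs
```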

36

buyIdris666 t1_j9ctyh8 wrote

Yup. NeRF just replaced the reconstruction step that comes after you "register" all the camera positions using traditional algorithms, usually via COLMAP.

Not saying that's a bad thing - existing algorithms are already good at estimating camera positions and parameters. It was the 3D reconstruction step that was previously lacking.

For anyone wanting to try this, I suggest using NeRF-W. The original NeRF required extremely accurate camera parameter estimates that you're not going to get with a cell-phone camera and COLMAP. NeRF-W is capable of making some fine adjustments to the poses as it runs. It even works decently well at reconstructing scenes from random internet photos.

The workflow is: run COLMAP to register the camera positions used to take the pictures and estimate the camera parameters, then export those into the NeRF model. Most of the NeRF repos are already set up to make this easy.
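As a rough illustration of the export step (assuming COLMAP's documented text output, images.txt, and a plain quaternion-to-matrix conversion), something like this parses the registered poses and builds the 4x4 camera-to-world matrices most NeRF repos expect:

```python
import numpy as np

def quat_to_rot(qw, qx, qy, qz):
    """Rotation matrix from a unit quaternion (w, x, y, z)."""
    return np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])

def load_colmap_poses(images_txt="sparse/0/images.txt"):
    poses = {}
    with open(images_txt) as f:
        lines = [l.strip() for l in f if l.strip() and not l.startswith("#")]
    # Each registered image takes two lines; the first holds the pose,
    # the second the 2D keypoints, which we skip here.
    for line in lines[::2]:
        fields = line.split()
        qw, qx, qy, qz, tx, ty, tz = map(float, fields[1:8])
        name = fields[9]
        R = quat_to_rot(qw, qx, qy, qz)   # COLMAP stores world-to-camera
        t = np.array([tx, ty, tz])
        c2w = np.eye(4)
        c2w[:3, :3] = R.T                 # invert to get camera-to-world
        c2w[:3, 3] = -R.T @ t
        poses[name] = c2w
    return poses
```

Bear in mind that COLMAP's camera convention differs from the one most NeRF code assumes, so the conversion scripts that ship with the repos usually also flip a couple of axes - easiest to just use whatever script your chosen repo provides.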

This paper is a good overview of how to build a NeRF from random unaligned images. They did it using frames from a sitcom, but you could take a similar approach to NeRF almost anything: https://arxiv.org/abs/2207.14279

12

MediumOrder5478 t1_j9b7ln5 wrote

You need to use a program like COLMAP for sparse scene reconstruction to recover the camera intrinsics (focal length, lens distortion) and extrinsics (camera position and orientation)
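If those terms are new, here's a toy NumPy example (made-up numbers) of what they do: the extrinsics move a world point into the camera frame, the intrinsics project it to a pixel, and lens distortion would be an extra correction on top of this.

```python
import numpy as np

fx = fy = 800.0               # intrinsics: focal length in pixels
cx, cy = 320.0, 240.0         # intrinsics: principal point
K = np.array([[fx, 0, cx],
              [0, fy, cy],
              [0,  0,  1]])

R = np.eye(3)                        # extrinsics: camera orientation (world-to-camera)
t = np.array([0.0, 0.0, 4.0])        # extrinsics: camera translation

X_world = np.array([0.5, -0.2, 0.0])   # a 3D point in the scene
x_cam = R @ X_world + t                # extrinsics move it into the camera frame
u, v, w = K @ x_cam                    # intrinsics project it onto the image plane
print(u / w, v / w)                    # pixel coordinates (no distortion modeled here)
```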

1

Pyramid_Jumper t1_j9ayed5 wrote

Been a while since I’ve read the paper, but I don’t think you’re missing anything - apart from data in the correct format, that is. You’ll need the aforementioned 5D vectors to be able to train/use this model.

If you can’t get that data, then I’d suggest you look at other work that cites NeRF and may have data in a format similar to the data you do have.

−2

harharveryfunny t1_j9aydo9 wrote

Here's the key, thanks to ChatGPT:

Data preparation: First, the training data is preprocessed to convert the 2D images and camera poses into a set of 3D points and corresponding colors. Each 2D image is projected onto a 3D point cloud using the corresponding camera pose, resulting in a set of 3D points with associated colors.

−8

harharveryfunny t1_j9b30et wrote

Not sure why this got downvoted given that it's correct. ChatGPT is also quite capable of explaining how this mapping is learnt (using a view-consistency loss that maps from the 3D voxels back to a 2D view and compares it to the image).
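For what it's worth, here's a rough NumPy sketch of that render-back-and-compare idea (names and shapes are my own assumptions, and it's only the compositing step, not the full pipeline): the predicted (color, density) samples along a ray get alpha-composited into one pixel, which is compared against the corresponding pixel of the training image.

```python
import numpy as np

def render_ray(colors, sigmas, t_vals):
    """colors: (N, 3), sigmas: (N,), t_vals: (N,) sample depths along one ray."""
    deltas = np.append(np.diff(t_vals), 1e10)               # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)                  # opacity of each segment
    trans = np.cumprod(np.append(1.0, 1.0 - alphas))[:-1]    # accumulated transmittance
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)           # rendered RGB for this pixel

def photometric_loss(rendered_rgb, true_rgb):
    return np.sum((rendered_rgb - true_rgb) ** 2)            # compare to the input image
```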

−3

tdgros t1_j9b43pe wrote

It's downvoted because it doesn't add anything to the conversation. OP has already stated that they know what the input is; they just don't know where to get it from. Someone already answered correctly at the top.

8

harharveryfunny t1_j9bf30y wrote

OP's question seems to be how to get from 2D images to the 3D voxels, no? But anyway, if they've got their answer, that's good.

Edit: I guess they were talking about camera position for the photos, not mapping to 3D.

−4

tdgros t1_j9bfds3 wrote

Just read the post!

>However, the paper itself builds a network that takes 5D vectors as input (3 location coordinates + 2 viewing angles) and outputs a color and volume density for each such coordinate. I don't understand where I'm supposed to get those 5D coordinates from. My training data certainly doesn't have them - I only have a collection of images.

7