
buyIdris666 t1_j9h8zqj wrote

I like to think of networks with residual connections as an ensemble of small models cascaded together. The residual connections were created to avoid vanishing/exploding gradients in deep networks.

It's important to realize that each residual connection exposes the layer to the entire input. I don't like the name "residual" because it implies a small amount of data is transferred. Nope, it's the entire input.
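
To make that concrete, here's a minimal residual block sketch in PyTorch (my own toy example, not from any particular paper). The skip path adds the entire input tensor back onto the layer's output:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x, so the skip carries the whole input."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # The skip adds the *entire* input back, not some small "residue" of it.
        return torch.relu(self.body(x) + x)

x = torch.randn(1, 64, 32, 32)     # (batch, channels, height, width)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```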

Latent information the model has learned along the way passes through the bottleneck, which supposedly forces it to keep only the most important information. But the explanation above about receptive fields is also relevant.

Beyond the vanishing gradient problem that affects all deep networks, one of the biggest challenges with image models is getting them to understand the big picture. Pixels close to each other are very strongly related, so the network preferentially learns these close relations. The bottleneck can be seen as forcing the model to learn global things about the image, since resolution is usually halved each layer. A form of image compression, if you want to think of it that way.
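
Here's a toy sketch of that halving (made-up numbers, assuming a 256x256 input and strided convolutions for the downsampling):

```python
import torch
import torch.nn as nn

# Toy encoder: each stage halves the resolution with a strided conv,
# so deeper layers see a progressively larger fraction of the image.
stages = [
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),    # 256 -> 128
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),   # 128 -> 64
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 64 -> 32
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), # 32 -> 16
]

x = torch.randn(1, 3, 256, 256)  # (batch, channels, height, width)
for conv in stages:
    x = torch.relu(conv(x))
    print(tuple(x.shape))  # spatial size halves each stage; channels grow to compensate
```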

So the residual connections keep the model from forgetting what it started with, and the bottleneck forces it to learn more than just the relations between nearby pixels, similar in spirit to what the attention mechanism does in Transformers.

CNNs tend to work better than Transformers for images because their receptive fields build in the assumption that nearby pixels affect each other most. That makes them easier to train on images. Whether Transformers would work equally well with more training is an open question.

For a model similar to U-Net and other "bottlenecked" CNN architectures, check out denoising autoencoders: https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798
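
If it helps, this is roughly what a denoising autoencoder boils down to (a toy sketch with random tensors standing in for real images, sizes picked arbitrarily): corrupt the input with noise and train a bottlenecked network to reconstruct the clean version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny denoising autoencoder. Random tensors stand in for real images
# (think 28x28 grayscale flattened to 784 values); all sizes are arbitrary.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))  # 32-dim bottleneck
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
model = nn.Sequential(encoder, decoder)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(64, 784)  # stand-in for a batch of real images
for step in range(100):
    noisy = clean + 0.3 * torch.randn_like(clean)  # corrupt the input
    loss = F.mse_loss(model(noisy), clean)         # but reconstruct the *clean* target
    opt.zero_grad()
    loss.backward()
    opt.step()
```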

There is currently speculation that Diffusion models are simply a return of the ancient (7-year-old) denoising autoencoder with a fancy training schedule tacked on.

This explanation is entirely my intuition because nobody really knows how this stuff works lol

6

buyIdris666 t1_j9ctyh8 wrote

Yup. NeRF just replaced the reconstruction step after you "register" all the camera positions using traditional algorithms, usually via COLMAP.

Not saying that's a bad thing; existing algorithms are already good at estimating camera positions and parameters. It was the 3D reconstruction step that was previously lacking.

For anyone wanting to try this, I suggest using NeRF-W. The original NeRF required extremely accurate camera parameter estimates that you're not going to get with a cell phone camera and COLMAP. NeRF-W is capable of making some fine adjustments as it runs. It even works decently when reconstructing scenes from random internet photos.

The workflow is: COLMAP to register the camera positions used to take the pictures and estimate camera parameters, then export those into the NeRF model. Most of the NeRF repos are already set up to make this easy.
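
Roughly, the registration half looks like this (a sketch driving COLMAP's standard CLI from Python; the paths are placeholders and the final export step depends on which NeRF repo you're using):

```python
import os
import subprocess

# COLMAP's standard sparse pipeline estimates camera intrinsics and poses
# from a folder of photos. Paths below are placeholders.
IMAGES = "data/images"
DB = "data/colmap.db"
SPARSE = "data/sparse"
os.makedirs(SPARSE, exist_ok=True)

subprocess.run(["colmap", "feature_extractor",
                "--database_path", DB, "--image_path", IMAGES], check=True)
subprocess.run(["colmap", "exhaustive_matcher", "--database_path", DB], check=True)
subprocess.run(["colmap", "mapper",
                "--database_path", DB, "--image_path", IMAGES,
                "--output_path", SPARSE], check=True)

# data/sparse/0/ now holds cameras.bin and images.bin with the intrinsics and poses;
# most NeRF repos ship a conversion script that turns these into their own input
# format (e.g. a transforms.json).
```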

This paper is a good overview of how to build a NeRF from random unaligned images. They did it using frames from a sitcom, but you could take a similar approach to NeRF almost anything: https://arxiv.org/abs/2207.14279

12

buyIdris666 t1_j93m0ol wrote

Video will remain unsolved for a while.

LLMs came first because the bit rate is lowest. A sentence of text is only a few hundred bits of information.

Now image generation is getting good. It's still not perfect. The models are larger because a high-res image carries maybe 100x the information of a paragraph of text.

Video is even harder: 30 high-res images every second. Making long, coherent, believable videos takes an enormous amount of data and processing power.
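
Back-of-envelope raw sizes make the ordering obvious (compression and redundancy shrink the effective numbers a lot, but text < image < video still holds):

```python
# Rough raw sizes, ignoring compression (which changes the absolute numbers
# but not the ordering).
sentence_bits = 100 * 8               # ~100 ASCII characters of text
image_bits = 1024 * 1024 * 3 * 8      # one 1024x1024 RGB image, 8 bits per channel
video_bits = image_bits * 30 * 10     # 10 seconds at 30 frames per second

print(f"sentence: {sentence_bits:,} bits")
print(f"image:    {image_bits:,} bits (~{image_bits // sentence_bits:,}x the sentence)")
print(f"video:    {video_bits:,} bits (~{video_bits // image_bits}x the image)")
```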

5