bloc97

bloc97 t1_j63q1nk wrote

>It's simpler (which leads to progress)

I wouldn't say current diffusion models are simpler; in fact, they are much more complex than even the most "complex" GAN architectures. However, it's exactly because of all the other points that they have been able to become this complex. A vanilla GAN would never endure this much tweaking without mode collapse. Compare that to even the most basic score-based models, which remain stable throughout training.

Sometimes, the "It just works™" proposition is much more appealing than pipeline simplicity or speed.

2

bloc97 t1_j49ft0g wrote

Reply to comment by mugbrushteeth in [D] Bitter lesson 2.0? by Tea_Pearce

My bet is on "mortal computers" (a term coined by Hinton). Our current methods for training deep nets are extremely inefficient: CPUs and GPUs basically have to load data, process it, then save it back to memory. We could eliminate this bandwidth limitation by essentially printing a very large differentiable memory cell, with hardware connections inside representing the connections between neurons, which would allow us to do inference or backprop in a single step.

2

bloc97 t1_j2pj1c6 wrote

There are many ways to condition a diffusion model on time, but concatenating it to the input is the least efficient method because:

  1. The first layer of your model is a convolutional layer, and applying a convolution to a "time" image that has the same value everywhere is computationally wasteful. Early conv layers exist to detect local variations in an image (e.g. texture), so applying the same kernel over and over to a constant image gains you nothing.
  2. By giving t only to the first layer, the network has to waste resources/neurons to propagate that information through the network. This waste is compounded by the fact that the time information must be carried by every "pixel" in each convolutional feature map (because it is a ConvNet). Why not just skip all that and directly give the time embedding to deeper layers within the network? (See the sketch below.)
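For reference, here is a minimal sketch (PyTorch; the class and function names are just illustrative) of the usual alternative: project a sinusoidal time embedding and add it per-channel inside each block, instead of concatenating t to the input image:

```python
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim=128):
    # Standard sinusoidal time embedding used by most diffusion UNets.
    half = dim // 2
    freqs = torch.exp(-torch.arange(half) / half * torch.log(torch.tensor(10000.0)))
    angles = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class TimeConditionedBlock(nn.Module):
    # A conv block that receives the time embedding directly,
    # rather than reading t from an extra input channel at layer 1.
    def __init__(self, channels, time_dim=128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, channels)  # one bias per channel
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv(x))
        # Broadcast the projected embedding over the spatial dimensions.
        return h + self.time_proj(t_emb)[:, :, None, None]

# block = TimeConditionedBlock(channels=64)
# h = block(features, sinusoidal_embedding(timesteps))
```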
3

bloc97 t1_j2pfp6l wrote

GANs are generative models; you want a discriminative model (for regression?). You could start by predicting keypoints, similar to pose estimation: in your case, predict 3D coordinates for the four corners of the QR code, plus two points that determine the axis of the cylinder. Then you can remove the distortion by inverting the cylindrical projection.
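Concretely, a rough sketch of such a regression model (PyTorch; the backbone choice and the `KeypointRegressor` name are placeholders, not a reference implementation):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class KeypointRegressor(nn.Module):
    # Predicts 6 keypoints in 3D: the 4 QR-code corners
    # plus 2 points defining the cylinder axis (18 values total).
    def __init__(self, num_keypoints=6):
        super().__init__()
        self.num_keypoints = num_keypoints
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()              # keep the 512-d pooled features
        self.backbone = backbone
        self.head = nn.Linear(512, num_keypoints * 3)

    def forward(self, x):
        feats = self.backbone(x)                 # (B, 512)
        return self.head(feats).view(-1, self.num_keypoints, 3)

# model = KeypointRegressor()
# loss = nn.functional.mse_loss(model(images), target_keypoints)
```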

1

bloc97 t1_iy57elh wrote

With traditional gradient descent, probably not, as most operators in modern NN architectures are bottlenecked by memory bandwidth rather than compute. There's active research on alternative training methods, but they mostly have difficulty detecting malicious agents in the training pool. If you own all of the machines, those algorithms work; but when a non-negligible fraction of the agents are malicious, training might fail, or even produce models that contain backdoors or leak private training data.
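To illustrate the kind of defense that line of work looks at (a minimal numpy sketch, assuming each worker sends a flattened gradient; a coordinate-wise median is only one of several robust aggregation rules and not a complete fix):

```python
import numpy as np

def aggregate_gradients(worker_grads, robust=True):
    # worker_grads: shape (num_workers, num_params).
    # A plain mean lets a single malicious worker shift the update arbitrarily;
    # a coordinate-wise median bounds the influence of a minority of bad workers.
    grads = np.asarray(worker_grads)
    return np.median(grads, axis=0) if robust else np.mean(grads, axis=0)

# Four honest workers agree, one attacker sends a huge poisoned gradient.
honest = [np.array([0.1, -0.2, 0.05])] * 4
attacker = [np.array([1000.0, 1000.0, -1000.0])]
print(aggregate_gradients(honest + attacker, robust=False))  # badly skewed
print(aggregate_gradients(honest + attacker, robust=True))   # stays near the honest value
```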

5

bloc97 t1_ixjuivv wrote

This can be considered good news. If all data is exhausted, people will actually be forced to research more data-efficient algorithms. We humans don't ingest 100 GB of arXiv papers to do research, and we don't need billions of images to paint a cat sitting on a sofa. Until we figure out how to run GPT-3 on smartphones (maybe using neuromorphic computing?), we shouldn't be too worried about the trend toward bigger and bigger datasets, because small(er) networks can be trained successfully without that much data.

3

bloc97 t1_iwyeh1x wrote

>the GAN latent space is too compressed/folded

I remember reading a paper that showed that GANs often fold many dimensions of the "internal" latent space into singularities, with large swathes of flat space between them (it's related to the mode collapse problem of GANs).

Back to the question: I guess that when OP tries to invert the GAN using gradient descent, they're probably getting stuck in a local minimum. Try a global search metaheuristic on top of the gradient descent, like simulated annealing or genetic algorithms?
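A minimal sketch of what that could look like (PyTorch; `invert_gan` and its hyperparameters are illustrative, and the decaying noise injection is only a crude stand-in for a proper simulated-annealing schedule):

```python
import torch

def invert_gan(generator, target, latent_dim=512, restarts=16, steps=300, noise0=0.5):
    # Global + local search: several random restarts, each running gradient
    # descent on z with a decaying random perturbation to help escape the flat,
    # folded regions of the latent space before settling into a minimum.
    best_z, best_loss = None, float("inf")
    for _ in range(restarts):
        z = torch.randn(1, latent_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=0.05)
        for step in range(steps):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(generator(z), target)
            loss.backward()
            opt.step()
            with torch.no_grad():
                z += noise0 * (1 - step / steps) * torch.randn_like(z)
        with torch.no_grad():
            final_loss = torch.nn.functional.mse_loss(generator(z), target).item()
        if final_loss < best_loss:
            best_loss, best_z = final_loss, z.detach().clone()
    return best_z, best_loss
```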

9

bloc97 t1_ivqgf0q wrote

I was considering an unconditional latent diffusion model, but for conditional models the computation becomes much more complex (we might have to use Bayes' rule here). If we use Score-Based Generative Modeling (https://arxiv.org/abs/2011.13456), we could try to find and count all the unique local minima and saddle points, but it is not clear how we can do this...

3

bloc97 t1_ivpzu4j wrote

Theoretically, the upper bound on the number of distinct images is set by the number of bits required to encode each latent: a 64x64x4 latent with 32 bits per value gives (2^32)^(64x64x4) possible latents. However, many of those combinations would not be considered "images" (they are "out of distribution"), so the real number is likely much, much smaller than this, depending on the dataset and the network size.
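For scale, a quick back-of-the-envelope version of that bound (assuming 32 bits per latent value, as above):

```python
# Bits in a 64x64x4 latent at 32 bits per value, and the resulting
# (astronomically loose) upper bound on the number of distinct latents.
bits_per_value = 32
num_values = 64 * 64 * 4                       # 16,384 latent values
total_bits = bits_per_value * num_values       # 524,288 bits
print(f"upper bound = 2^{total_bits} ≈ 10^{int(total_bits * 0.30103)}")
```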

11

bloc97 t1_ivpper1 wrote

I mean having the divergence would definitely help, as we would have additional information about the shape of the loss landscape with respect to the parameters. The general idea would be to prefer areas with negative divergence, while trying to move and search through zero-divergence areas very quickly.

Edit: In a sense, the gradient alone only gives us information about the shape of the loss function at a single point, while having the Laplacian gives us a larger "field of view" on the landscape.
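A sketch of how one could estimate that quantity in practice (PyTorch; Hutchinson's trace estimator applied to the Hessian, since the Laplacian of the loss is exactly the trace of the Hessian, i.e. the divergence of the gradient field; the helper name and the flat-parameter assumption are mine):

```python
import torch

def loss_laplacian(loss_fn, params, num_samples=10):
    # Hutchinson estimate of tr(H) = Laplacian of the loss at `params`.
    # `params` is a flat 1-D tensor with requires_grad=True; loss_fn(params) is scalar.
    loss = loss_fn(params)
    grad = torch.autograd.grad(loss, params, create_graph=True)[0]
    trace_est = 0.0
    for _ in range(num_samples):
        v = (torch.randint(0, 2, params.shape) * 2 - 1).to(params.dtype)  # Rademacher
        hvp = torch.autograd.grad(grad, params, grad_outputs=v, retain_graph=True)[0]
        trace_est += torch.dot(v, hvp)         # E[v^T H v] = tr(H)
    return trace_est / num_samples

# Toy check: for L(w) = sum(w**2), the Hessian is 2*I, so the Laplacian is 2*dim.
w = torch.randn(5, requires_grad=True)
print(loss_laplacian(lambda p: (p ** 2).sum(), w))  # ≈ 10
```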

1