arg_max

arg_max t1_j9rt2ew wrote

The thing is that the theory behind diffusion models is at least 40-50 years old. Forward diffusion is a discretization of a stochastic differential equation that transforms the data distribution into a normal distribution. People figured out back in the 1970s that it is possible to reverse this process, i.e. to go from the normal distribution back to the data distribution using another SDE. The catch is that this reverse SDE contains the score function, the gradient of the log density of the data, and people just didn't know how to estimate that from data. Then some smart guys came along, picked up the ideas about denoising score matching from the 2000s and did the necessary engineering to make it work with deep nets.
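
For reference, a hedged sketch of the two equations in play, written in the standard notation of the score-based SDE literature (not quoted from any specific paper):

```latex
% Forward (noising) SDE: data distribution -> (approximately) normal distribution
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w

% Reverse-time SDE: runs from noise back to data, but needs the score \nabla_x \log p_t(x)
\mathrm{d}x = \big[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}
```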

The point I am making is that this problem was theoretically well understood a long time ago; it just took humanity many years to actually be able to compute it. But for AGI, we don't have such a recipe. There isn't one equation hidden in some old math book that will suddenly get us AGI. Reinforcement learning really is the only approach I can think of, but even there I just don't see how we would get there with the algorithms we are currently using.

7

arg_max t1_j6mg664 wrote

I think diffusion models are kind of a bad example. The SDE paper from Yang Song has shown that it's all about modeling the score function, and this can't be done with simple models. Apart from that, the big text2img models work inside the latent space of a deep VAE, make use of conditioning via cross attention, which isn't a thing in traditional ML, and use large language models to process the text input. All of their components are very DL-based.

13

arg_max t1_j6ald1n wrote

I think the thing that will come first is automated art. It starts with concept art creation, but I believe we will soon see usable 3D mesh generators, so you put in a prompt like "creepy alien with claws and a tail" and get a 3D mesh out of it. AI chats like many people here suggested are obviously possible with things like GPT, but in the end all of this has to be linked back to game logic. When an NPC tells you that it saw something at some place in the world, the game also has to generate something interesting for you to discover in that location. I don't think there are solutions for this yet, but I don't see why it couldn't happen. The problem is always that you need huge training sets to create those generative models, and training sets for things like levels, quests and so on just don't exist, so we will have to see if people figure out smart ways to solve this.

3

arg_max t1_j69zq9x wrote

The issue is that GPT is trained on previously collected data and is not kept up to date. It might be able to tell you if an article from 2020 is fake news because it might know what actually happened that day from news articles of the time. But GPT has no idea what happened today, so it won't be able to tell what is real and what is fake. You'd need some sort of continuous online learning to do this properly. It might still be able to detect the really crazy stuff, but it might also misclassify real news if the events are unexpected. For example, GPT probably has no idea that there is currently a war going on in Ukraine, so how should it know whether or not an article about this topic is fake?

1

arg_max t1_j60qav1 wrote

Typically, if your solver is not written in PyTorch/TensorFlow itself, you can't easily calculate gradients through it, as your computational graph doesn't capture the solver. If your solver is written in the framework and is differentiable, you might be able to just backpropagate through it. Otherwise, the Neural ODE paper that was linked here a few times has an adjoint formulation that gives you the gradient through the solver as the solution of another ODE, but this is specific to their problem and won't apply to things that aren't differential equations.
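
To make the second case concrete, here is a hedged sketch of a toy explicit-Euler solver written directly in PyTorch (the dynamics and loss are made up), so autograd captures every step and you can just backpropagate through the solve:

```python
import torch

# Toy explicit-Euler ODE solver built from torch ops only, so the whole
# solve ends up in the autograd graph and gradients flow through it.

def euler_solve(f, x0, t0, t1, steps=100):
    """Integrate dx/dt = f(x, t) from t0 to t1 with explicit Euler."""
    dt = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x = x + dt * f(x, t)
        t = t + dt
    return x

# Example: dx/dt = -theta * x, and we want d(loss)/d(theta).
theta = torch.tensor(0.5, requires_grad=True)
x0 = torch.tensor(1.0)

x1 = euler_solve(lambda x, t: -theta * x, x0, 0.0, 1.0)
loss = (x1 - 0.3) ** 2
loss.backward()
print(theta.grad)  # gradient flowed through all 100 Euler steps
```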

1

arg_max t1_j60jz1r wrote

Iterative refinement seems to be a big part of it. In a GAN, your network has to produce the image in a single forward pass. In diffusion models, the model actually sees the intermediate steps over and over and can make gradual improvements. Also, if you think about what the noise does: in the first few steps it removes all small details and only keeps low-frequency, large structures. So in the first steps, the model kind of has to focus on the overall composition. Then, as the noise level goes down, it can gradually start adding all the small details. On a more mathematical level, the noise smooths the distribution and widens its support in the [0,1]^D cube (D = image dimension, like 256x256x3). Typically people assume that the data lies on a low-dimensional manifold inside that cube, which can make sampling from it hard.
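
A purely schematic sketch of that contrast; the callables are placeholders and the update rule is only illustrative, not a real DDPM/DDIM step:

```python
import torch

# Schematic contrast between single-pass and iterative generation.
# "generator" and "denoiser" are placeholder callables.

def gan_sample(generator, z):
    # One forward pass: the network never gets to revisit its own output.
    return generator(z)

def iterative_sample(denoiser, shape=(1, 3, 64, 64), steps=50):
    x = torch.randn(shape)                  # start from pure noise
    for t in reversed(range(1, steps + 1)):
        level = t / steps                   # crude linear noise schedule
        x_hat = denoiser(x, level)          # current guess at the clean image
        # Keep part of the guess, re-inject some noise, refine again:
        # early iterations fix coarse structure, later ones add fine detail.
        x = (1 - level) * x_hat + level * torch.randn_like(x)
    return x
```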

Some support for this claim is that people were able to improve other generative models, like autoregressive models, using similarly noised distributions. Also, you can train GANs to sample from the intermediate distributions, which works better than standard GANs.

9

arg_max t1_j5r8qe6 wrote

What do you mean by "function represented by a neural network"? If you are hinting in the direction of universal approximation, then yes, you can approximate any continuous function arbitrarily well with a single hidden layer, sigmoid activations and unbounded width. There are also results showing an analogous statement for width-limited, arbitrarily deep networks (the required depth is not infinite, but it depends on the function you want to approximate and is, afaik, unbounded over the space of continuous functions). In practice, we are far away from either infinite width or infinite depth, so the specific configuration can matter.
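
For reference, a hedged sketch of the classical width-based statement (Cybenko/Hornik-style; the exact conditions vary between papers):

```latex
% For any continuous f on a compact set K \subset \mathbb{R}^n and any \varepsilon > 0,
% there exist a width N, weights w_i, vectors a_i and biases b_i such that
\sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} w_i \, \sigma(a_i^\top x + b_i) \Big| < \varepsilon
% where \sigma is a fixed sigmoidal activation; N depends on f and \varepsilon
% and is not bounded a priori over the space of continuous functions.
```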

1

arg_max t1_j2hw6f8 wrote

That is an interesting paper, BUT their method relies heavily on the structure of the task. In general, if you want to create a method that outputs algorithms, even choosing the output format is non-trivial. For humans, pseudo-code is probably the most natural way to present algorithms, but then you need some kind of language model, or at least a recurrent architecture, that can output solutions of different lengths (as not every program has the same length). And even once you get an output from the model, you first have to make sure that it is a valid program and, more importantly, that it solves the task. This means you have to verify the correctness of every method your model creates before you can even measure runtime.

But matrix multiplication is different. If you read the paper, you will see that every matrix multiplication algorithm can be written as a higher-order tensor, and given a tensor decomposition it's trivial to check the correctness of the corresponding matrix multiplication algorithm. This is not even a super novel insight; people knew that you can formulate the task of finding better matrix multiplication algorithms as a tensor decomposition optimization problem, BUT that problem is super hard to solve.
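
A hedged sketch of that tensor view, using the trivial rank-n^3 decomposition purely for illustration (Strassen-like algorithms correspond to lower-rank decompositions of the same tensor):

```python
import numpy as np

n = 2
# Matmul tensor T, indexed by (vec(A), vec(B), vec(C)): c_ij = sum_k a_ik * b_kj
T = np.zeros((n * n, n * n, n * n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            T[i * n + k, k * n + j, i * n + j] = 1.0

# Trivial rank-n^3 decomposition T = sum_r u_r (x) v_r (x) w_r
U, V, W = [], [], []
for i in range(n):
    for j in range(n):
        for k in range(n):
            u = np.zeros(n * n); u[i * n + k] = 1.0
            v = np.zeros(n * n); v[k * n + j] = 1.0
            w = np.zeros(n * n); w[i * n + j] = 1.0
            U.append(u); V.append(v); W.append(w)
U, V, W = map(np.array, (U, V, W))

# Correctness check = exact reconstruction of T (the "trivial to verify" part).
T_rec = np.einsum('rp,rq,rs->pqs', U, V, W)
assert np.array_equal(T_rec, T)

# Any valid decomposition yields a matmul algorithm, one multiplication per rank-1 term.
def matmul_from_decomposition(A, B):
    m = (U @ A.reshape(-1)) * (V @ B.reshape(-1))   # R scalar products
    return (W.T @ m).reshape(n, n)

A, B = np.random.randn(n, n), np.random.randn(n, n)
assert np.allclose(matmul_from_decomposition(A, B), A @ B)
```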

But not many real-world tasks are like this. For most problems you don't have such a nice output space, and at that point it becomes much, much harder to learn algorithms. I guess once people figure out a way to make models that can output verifiably correct pseudo-code, we will start to see tons of papers on new AI-generated heuristics for NP-hard problems and other problems that cannot yet be solved in optimal time.

6

arg_max t1_j136y5q wrote

Just to give you an idea about "optimal configuration" though, this is way beyond desktop PC levels:
> You will need at least 350GB GPU memory on your entire cluster to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8 (GPU/instance) x 16 (GB/GPU) = 512 GB memory.

https://alpa.ai/tutorials/opt_serving.html

9

arg_max t1_j136nbo wrote

CPU implementations are going to be very slow. I'd probably try renting an A100 VM, running some experiments, and measuring VRAM and RAM usage. But I'd be surprised if anything below a 24GB 3090 Ti does the job. The issue is that going above 24GB means you have to go to an A6000, which costs as much as four 3090s.
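
A hedged sketch of that measurement in PyTorch; model and batch are placeholders for whatever you actually want to run:

```python
import os
import psutil
import torch

def report_memory(model, batch, device="cuda"):
    """Run one forward pass and report peak VRAM plus current process RAM."""
    torch.cuda.reset_peak_memory_stats(device)
    model = model.to(device).eval()
    with torch.no_grad():
        model(batch.to(device))
    peak_vram_gb = torch.cuda.max_memory_allocated(device) / 1e9
    ram_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"peak VRAM: {peak_vram_gb:.1f} GB, process RAM: {ram_gb:.1f} GB")
```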

18

arg_max t1_j0z1p30 wrote

When? Probably now, if someone decides to put enough money into it.
All the big text-to-image models like DALL-E, Imagen and Stable Diffusion are not very novel in terms of methodology. They all rely heavily on existing ideas and combine them with more compute, bigger datasets and some tweaks.

Videos are not much more than 3D images with certain temporal constraints. There are already small-scale diffusion models for video, and I'm not saying it's trivial to get longer videos (recurrent learning is often a bit tricky), but I don't see why it would be impossible. It will probably take a few years before consumer hardware can run video generation, though; after all, we only just manage images at the moment.

1

arg_max t1_izymwa9 wrote

You basically need some kind of value function that estimates how good an assignment of teams is. For example, if each player has a score between 1 and 100, your value function could simply be the difference between the strongest and the weakest team, which you then minimize. Typically you design this by hand. Then you run a constrained optimization method that makes sure each player gets assigned to exactly one team and probably also takes team size into account. It's not really ML but more of an optimization problem. If you really want to, you could try to learn a player score, although it might be hard to collect training data for that.
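
A minimal sketch of the idea, using a greedy heuristic rather than a proper constrained optimizer (the score dict is made up):

```python
import math
import random

def balance_teams(scores, n_teams):
    """scores: dict of player -> score in [1, 100]. Each player joins exactly one team."""
    capacity = math.ceil(len(scores) / n_teams)   # rough team-size constraint
    teams = [[] for _ in range(n_teams)]
    totals = [0] * n_teams
    # Strongest players first, each one to the currently weakest non-full team.
    for player, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        open_teams = [t for t in range(n_teams) if len(teams[t]) < capacity]
        i = min(open_teams, key=lambda t: totals[t])
        teams[i].append(player)
        totals[i] += score
    return teams, max(totals) - min(totals)   # gap = the value we want to minimize

players = {f"p{i}": random.randint(1, 100) for i in range(12)}
teams, gap = balance_teams(players, n_teams=3)
print(teams, "gap:", gap)
```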

2

arg_max t1_izpadl8 wrote

I think the most prominent use case in CNNs is as a very simple, localised and fast operation that changes the number of channels without touching the spatial dimensions.

For example, deep ResNets have a bottleneck design. The input is something like an N x 256 x H x W tensor (N batch size, H and W spatial dimensions) with 256 channels. To save compute and memory, we might not want to apply the 3x3 conv to all 256 channels. So we use a 1x1 conv first to reduce the number of channels from 256 to 64. On this smaller tensor, we then apply a 3x3 conv that doesn't change the number of channels. Finally, we use another 1x1 conv to go back from 64 to 256 channels. So the first 1x1 conv decreases the number of channels while the second one restores the output to the original shape with 256 channels.
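
A minimal sketch of that block in PyTorch (shapes as in the example above; a real ResNet bottleneck would also include BatchNorm, ReLU and the residual connection):

```python
import torch
from torch import nn

bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),             # 1x1: squeeze 256 -> 64 channels
    nn.Conv2d(64, 64, kernel_size=3, padding=1),   # 3x3: spatial mixing on the cheaper 64 channels
    nn.Conv2d(64, 256, kernel_size=1),             # 1x1: expand 64 -> 256 channels
)

x = torch.randn(8, 256, 32, 32)                    # N x 256 x H x W
print(bottleneck(x).shape)                         # torch.Size([8, 256, 32, 32])
```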

2

arg_max t1_iwcn3y0 wrote

ImageNet-1k pretraining might not be the best for this, as it contains few plant classes. The bigger ImageNet-21k has a much larger selection of plants and might be better suited for you. timm has EfficientNetV2, BEiT, ViT and ConvNeXt models pretrained on it. I don't use Keras, but you might be able to find similar weights for that framework.
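
A hedged sketch of the timm route (PyTorch rather than Keras; the wildcard patterns and num_classes=120 are placeholders, so check the printed list for the names your timm version actually uses):

```python
import timm

# Model names with 21k-class ImageNet weights are tagged in21k or in22k
# depending on the timm version, so collect both.
candidates = timm.list_models('*in21k*', pretrained=True) + timm.list_models('*in22k*', pretrained=True)
print(candidates[:10])

# num_classes=120 stands in for however many plant classes you have.
model = timm.create_model(candidates[0], pretrained=True, num_classes=120)
```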

1

arg_max t1_iv2cb5u wrote

I think it kind of depends on what you want to do in the end. Machine learning can be complex, and learning how to implement state-of-the-art methods and understanding how they work can take years. If you want to do rather simple stuff like linear regression, you can probably just use a Java linear algebra library and implement it yourself. But more complex stuff like deep learning is done using specialised libraries like TensorFlow, PyTorch and so on, and I don't think you want to reimplement those in Java. So you could either use PyTorch in C++, wrap it and call it from Java, or write the ML stuff in Python, which has the best framework support, and then pass the data from Java to your Python program, compute there and send the results back to Java. There is also the Deep Java Library (DJL), but I have no experience with it and can't tell you how well it works. But yeah, ML is mostly done in Python or C++ these days.
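
A hedged sketch of the Python half of that Java-to-Python setup: the Java program would launch this script and exchange one JSON object per line over stdin/stdout, and predict() is a placeholder for whatever model you actually load:

```python
import json
import sys

def predict(features):
    # Stand-in for a real PyTorch/TensorFlow model call.
    return sum(features)

# Read one JSON request per line from Java, write one JSON response per line back.
for line in sys.stdin:
    request = json.loads(line)
    result = predict(request["features"])
    print(json.dumps({"id": request["id"], "prediction": result}), flush=True)
```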

1