yldedly t1_j9jpuky wrote

There are two aspects: scalability and inductive bias. DL is scalable because compositions of differentiable functions make backpropagation fast, and because those functions are mostly matrix multiplications, which makes GPU acceleration effective. Combine this with stochastic gradients, and you can train on very large datasets very quickly.
Inductive biases make DL effective in practice, not just in theory. While the universal approximation theorem guarantees that an architecture and weight-setting exist that approximate a given function, DL's bias towards low-dimensional smooth manifolds matches the structure of many real-world datasets, meaning that SGD will easily find a local optimum with these properties (and when it doesn't, for example on tabular data where discontinuities are common, DL performs worse than alternatives, even if with more data it would eventually approximate a discontinuity).
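To make the scalability point concrete, here's a minimal sketch (all names and sizes illustrative) of the whole recipe in miniature: the forward pass is a composition of differentiable functions, so the chain rule yields every gradient in one backward sweep, both passes are dominated by matrix multiplications, and minibatches make each step cheap:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
w_true = rng.normal(size=(10, 1))
y = np.tanh(X @ w_true)                    # synthetic target
W1 = rng.normal(size=(10, 32)) * 0.1
W2 = rng.normal(size=(32, 1)) * 0.1
lr, batch = 0.1, 64

def mse():
    return float(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))

loss_before = mse()
for step in range(500):
    idx = rng.integers(0, len(X), batch)   # stochastic minibatch
    Xb, yb = X[idx], y[idx]
    h = np.tanh(Xb @ W1)                   # forward: matmul + nonlinearity
    err = h @ W2 - yb
    gW2 = h.T @ err / batch                # backward: matmuls again
    gW1 = Xb.T @ ((err @ W2.T) * (1 - h ** 2)) / batch
    W1 -= lr * gW1
    W2 -= lr * gW2
loss_after = mse()
```

Every expensive line is a matmul, which is exactly the operation GPUs accelerate; scaling up is mostly a matter of making these matrices bigger.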


yldedly t1_j75rw5b wrote

Speaking as someone also working on an ambitious project that deviates a lot from mainstream ML, I encourage you to do the same thing I'm struggling with:

Try to implement the simplest possible version of your idea and test it on some toy problem to quickly get some insight.

Maybe start with one type of modulatory node and see how NEAT ends up using it?


yldedly t1_j45ycm8 wrote

>With enough scale we get crude compositionality, yes.

Depends on exactly what we mean. To take a simple example, if you have cos(x) and x^2, you can compose these to produce cos(x)^2 (or cos(x^2)). You can approximate the composition using a neural network if you have enough data on some interval x \in [a, b]. It will work well even for x that weren't part of the training set, as long as they are in the interval; outside the interval the approximation will be bad. But if you take cos(x), x^2 and compose(f, g) as building blocks, and search for a combination of these that approximates the data, the approximation will be good for all real numbers.

In the same way, you can learn a concept like "subject, preposition, object A, transitive verb, object B", where e.g. subject = "raccoon", preposition = "in a", object A = "spacesuit", transitive verb = "playing" and object B = "poker", by approximating it with a neural network, and it will work well if you have enough data in some high-dimensional subspace. But it won't work under arbitrary substitutions. Is it fair to call that crude compositionality?


yldedly t1_j3dn5mb wrote

>Any alternative which would be able to solve the same problems would probably require a similar architecture: lot of parameters, deep connections.

If handwritten character recognition (and generation) counts as one such problem, then here is a model that solves it with a handful of parameters: https://www.cs.cmu.edu/~rsalakhu/papers/LakeEtAl2015Science.pdf


yldedly t1_iydiq69 wrote

>The goal isn't to pass as human, it's to solve whatever problem is in front of you.

It's worth disambiguating between solving specific business problems, and creating intelligent (meaning broadly generalizing) programs that can solve problems. For the former, what Francois Chollet calls cognitive automation is often sufficient, if you can get enough data, and we're making great progress. For the latter, we haven't made much progress, and few people are even working on it. Lots of people are working on the former, and deluding themselves that one day it will magically become the latter.


yldedly t1_isecsiy wrote

>I think my work "communicating natural programs to humans and machines" will entertain you for hours. Give it a go.

I will, looks super interesting. I'm so jealous of you guys at MIT working on all this fascinating stuff :D

>It's my belief that we should program computers using natural utterances such as language, demonstration, doodles, etc. These "programs" are fundamentally probabilistic and admit multiple interpretations/executions.

That's an ambitious vision. I can totally see how that's the way to go if we want "human compatible" AI, in Stuart Russell's sense where AI is learning what the human wants to achieve, by observing their behavior (including language, demonstrations, etc).


yldedly t1_isb5nsi wrote

What evocative examples :P
I know probmods.org well, it's excellent. I wrote a blog post about program synthesis. I stumbled on the area during my PhD, where I did structure learning for probabilistic programs, and realized (a bit late) that I was actually trying to do program synthesis. So I'm very interested in it; I wish I had the chance to work on it more professionally. Looking forward to reading your blog!


yldedly t1_irvfafm wrote

There's a lot to unpack here. I agree that a large part of creating AGI is building in the right priors ("learning priors" is a bit of an oxymoron imo, since a prior is exactly the part you don't learn, but it makes sense that a posterior for a pre-trained model is a prior for a fine-tuned model).

Invariance and equivariance are a great example. Expressed mathematically, using symbols, it makes no sense to say a model is more or less equivariant - it either is or it isn't. If you explicitly build equivariance into a model (and apparently it's not as straightforward as e.g. just using convolutions), then this is really what you get. For example, the handwriting model from my blogpost has real translational equivariance (because the location of a character is sampled).

If you instead learn the equivariance, you will only ever learn a shortcut - something that works on training and test data, but not universally, as the paper from the twitter thread shows. Just like the networks that can solve the LEGO task for 6 variables don't generalize to any number of variables, learning "equivariance" on one dataset (even if it's a huge one) doesn't guarantee equivariance on another. A neural network can't represent an algorithm like "for all variables, do x", or constraints like "f(g(x)) = g(f(x)), for all x" - you can't represent universal quantifiers using finite-dimensional vectors.
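An illustrative check of the built-in vs. approximate distinction (not the handwriting model itself, just a stand-in): circular convolution satisfies the constraint f(shift(x)) = shift(f(x)) exactly, for every shift, because the constraint holds by construction. Zero-padded convolution only satisfies it away from the boundary, which is the same flavor of gap a learned shortcut leaves:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)
k = rng.normal(size=5)

def circ_conv(x, k):
    """Circular convolution: indices wrap, so shifts commute exactly."""
    n = len(x)
    return np.array([sum(k[j] * x[(i - j) % n] for j in range(len(k)))
                     for i in range(n)])

shift = 3
exact = np.allclose(circ_conv(np.roll(x, shift), k),
                    np.roll(circ_conv(x, k), shift))    # equivariance holds exactly

padded = np.convolve(x, k, mode="same")                 # zero padding at the edges
approx = np.allclose(np.convolve(np.roll(x, shift), k, mode="same"),
                     np.roll(padded, shift))            # fails near the boundary
```

The point is that `exact` is true for any x, k and shift, as a theorem about the architecture, while the padded version (or a learned approximation) only agrees on the interior - no amount of checking finitely many inputs turns the latter into the former.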

That being said, you can definitely learn some useful priors by training very large networks on very large data. An architecture like the Transformer allows for some very general-purpose priors, like "do something for pairs of tokens 4 tokens apart".
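A hypothetical sketch of such a prior (in the spirit of relative-position biases; the numbers are made up): adding a bias to attention scores that depends only on the offset between tokens lets a head attend to "the token 4 positions back" regardless of absolute position:

```python
import numpy as np

seq_len = 10
scores = np.zeros((seq_len, seq_len))   # pre-softmax attention scores
rel_bias = {-4: 5.0}                    # strongly favor the token 4 steps earlier

for i in range(seq_len):
    for j in range(seq_len):
        scores[i, j] += rel_bias.get(j - i, 0.0)   # bias depends only on offset

attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
# Each row i now concentrates on column i - 4, wherever that offset exists,
# independent of where in the sequence token i sits.
```

Because the bias is a function of the offset j - i alone, the pattern is reusable at every position - a general-purpose prior rather than one tied to specific absolute positions.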