Submitted by GraciousReformer t3_118pof6 in MachineLearning

"Deep learning is the only thing that currently works at scale it's the only class of algorithms that is able to discover arbitrary functions in a reasonable amount of time."

https://www.youtube.com/watch?v=p-OYPRhqRCg

I know of the universal approximation theorem. But is there any mathematical formulation of this statement?

122

Comments


relevantmeemayhere t1_j9ij6cc wrote

Lol. The fact that we have been using general linear models in every scientific field for decades should tell you all you need to know about this statement.

81

adventuringraw t1_j9in5sj wrote

I mean... the statement specifically uses the phrase 'arbitrary functions'. GLMs are a great tool in the toolbox, but the function family they optimize over is very far from 'arbitrary'.

I think the statement mostly means 'find very nonlinear functions of interest when dealing with very large numbers of samples from very high dimensional sample spaces'. GLMs are used in every scientific field, but certainly not for every application. For certain kinds of problems, some form of deep learning really is still the only game in town.

70

BoiElroy t1_j9ipbtg wrote

This is not the answer to your question, but one intuition I like about the universal approximation theorem, which I thought I'd share, is the comparison to a digital image. You use a finite set of pixels, each of which can take on a certain set of discrete values. With a 10 x 10 grid of pixels you can draw a crude approximation of a stick figure. With 1000 x 1000 you can capture a blurry but recognizable selfie. Within the finite pixels and the discrete values they can take, you can essentially capture anything you can dream of: every image in every movie ever made. Obviously there are other issues later, like whether your model's operational design domain matches the distribution of the training domain, or whether you just wasted a lot of GPU hours lol

3

Featureless_Bug t1_j9iy4yq wrote

Haven't heard of GLMs being successfully used for NLP and CV recently. And those are about the only things that would be described as large scale in ML. The statement is completely correct - even stuff like gradient boosting does not work at scale in that sense

26

randomoneusername t1_j9iyf7r wrote

I mean, there are two elements to this.

DL is definitely not the only algorithm that works at scale.

26

yldedly t1_j9j6gk8 wrote

>discover arbitrary functions

Uh, no. Not even close. DL can approximate arbitrary functions on a bounded interval given enough data, parameters and compute.

14

VirtualHat t1_j9j8805 wrote

Linear models assume the solution is of the form y = ax + b. If the true solution is not of this form, then the best fit the model can find is likely to be poor.

I think Emma Brunskill's notes explain this quite well. Essentially the model will underfit because it is too simple. I am assuming here that a large dataset implies a more complex, non-linear solution, but that is generally the case.

10

VirtualHat t1_j9j8uvr wrote

For example, in the Iris dataset, the class label is not a linear combination of the inputs. Therefore, if your model class is all linear models, you won't find the optimal solution, or in this case even a good one.

If you extend the model class to include non-linear functions, then your hypothesis space now at least contains a good solution, but finding it might be a bit more tricky.
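To make that concrete, here's a rough sketch (my own toy setup with scikit-learn; I'm using the two-moons synthetic data rather than Iris, since the gap between the model classes is easier to see there):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data whose class label is clearly not a linear function of the inputs.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)

# A linear model class vs. one that contains non-linear decision boundaries.
linear = LogisticRegression(max_iter=1000)
nonlinear = SVC(kernel="rbf")

print("linear accuracy:    ", cross_val_score(linear, X, y, cv=5).mean())
print("non-linear accuracy:", cross_val_score(nonlinear, X, y, cv=5).mean())
# The linear class simply doesn't contain a good solution here; the RBF-SVM class does.
```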

15

hpstring t1_j9jb96f wrote

Universal approximation is not enough; you need efficiency to make things work.

DL is the only class of algorithms that beats the curse of dimensionality when discovering a certain (very general) class of high-dimensional functions (related to Barron spaces). Correct me if this is not accurate.
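For reference, the result I think this refers to is Barron's theorem; stated roughly from memory (so check Barron, 1993, for the precise conditions):

```latex
% Barron (1993), roughly: if the Fourier transform \hat{f} of f satisfies
C_f \;=\; \int_{\mathbb{R}^d} \lVert \omega \rVert \, \lvert \hat{f}(\omega) \rvert \, d\omega \;<\; \infty,
% then for every n there is a one-hidden-layer sigmoidal network f_n with n units such that
\int_{B_r} \bigl( f(x) - f_n(x) \bigr)^2 \, \mu(dx) \;\le\; \frac{(2 r C_f)^2}{n}
% for any probability measure \mu on the ball B_r of radius r.
% The 1/n rate has no explicit dependence on the dimension d; d enters only through C_f.
```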

54

1bir t1_j9jbi5e wrote

Apparently* decision trees are also capable of [universal function approximation](https://cstheory.stackexchange.com/a/46405).

Whether the algorithms for training them do that as well as the ones for deep NNs in practice is a separate issue.

*Haven't seen (& probably wouldn't understand) a proof.

4

chief167 t1_j9jev01 wrote

Define scale

Language models? Sure. Images? Sure. Huge amounts of transaction data to search for fraud? Xgboost all the way lol.

The no free lunch theorem: there is no single approach that is best for every possible problem. Jeez, I hate it when marketing takes over. You learn this principle in the first chapter of literally every data course

20

suflaj t1_j9jjetb wrote

Because it requires the least amount of human intervention

Also because it subjectively sounds like magic to people who don't really understand it, so it both sells to management and to consumers.

At least it's easier for humans to cope by pretending it's magic than to accept that a lot of what AI can do is just stuff that is trivial and doesn't require humans to solve it.

−12

randomoneusername t1_j9jkzs7 wrote

The standalone statement you have there is very vague. Can I assume it's talking about NLP or CV projects?

On tabular data, even with non-linear relationships, ordinary boosting and ensemble algorithms can scale and be top of the game.

13

elmcity2019 t1_j9jm4nq wrote

I have been an applied data scientist for 10 years. I have built over 100k models using Python, Databricks, and DataRobot. I have never seen a DL model outcompete all the other algorithms. Granted, I am largely working with structured business data, but nonetheless DL isn't really competitive there.

−4

ktpr t1_j9jmqq7 wrote

I feel like ML boosters have recently been coming to this subreddit, making large claims, and then using the ensuing discussion, time, and energy from others to correct their clickbait content at our expense.

17

yldedly t1_j9jorh1 wrote

It's not from a paper, but it's pretty uncontroversial I think - though people like to forget about the "bounded interval" part, or at least what it implies about extrapolation.
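To illustrate the extrapolation point, a rough sketch (scikit-learn MLP, my own toy setup):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# The network approximates sin(x) well inside the training interval,
# but fails once you leave it.
rng = np.random.default_rng(0)
X_train = rng.uniform(-np.pi, np.pi, size=(2000, 1))
y_train = np.sin(X_train).ravel()

net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
net.fit(X_train, y_train)

X_in = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)          # inside the training range
X_out = np.linspace(2 * np.pi, 3 * np.pi, 200).reshape(-1, 1)  # well outside it
mse_in = np.mean((net.predict(X_in) - np.sin(X_in).ravel()) ** 2)
mse_out = np.mean((net.predict(X_out) - np.sin(X_out).ravel()) ** 2)
print(f"in-range MSE: {mse_in:.4f}, out-of-range MSE: {mse_out:.4f}")
# Typically the out-of-range error is orders of magnitude larger: the fit says
# nothing about extrapolation beyond the bounded training region.
```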

9

yldedly t1_j9jpuky wrote

There are two aspects, scalability and inductive bias. DL is scalable because compositions of differentiable functions make backpropagation fast, and because those functions are mostly matrix multiplications, which makes GPU acceleration effective. Combine this with stochastic gradients, and you can train on very large datasets very quickly.
Inductive biases make DL effective in practice, not just in theory. The universal approximation theorem only guarantees that an architecture and weight setting exist that approximate a given function; it's DL's bias towards low-dimensional smooth manifolds, which reflects many real-world datasets, that means SGD will easily find a local optimum with these properties (and when it doesn't, for example on tabular data where discontinuities are common, DL performs worse than alternatives, even if with more data it would eventually approximate a discontinuity).
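To make the scalability point concrete, a minimal sketch in plain NumPy (my own toy example, not any particular framework) of one minibatch SGD step on a tiny two-layer network; essentially everything is a matrix multiplication over the batch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, batch_size = 16, 32, 1, 64

W1 = 0.1 * rng.standard_normal((d_in, d_hidden))
W2 = 0.1 * rng.standard_normal((d_hidden, d_out))

X = rng.standard_normal((batch_size, d_in))   # one minibatch of inputs
y = rng.standard_normal((batch_size, d_out))  # dummy regression targets

# Forward pass: two matmuls and an elementwise nonlinearity.
H = np.tanh(X @ W1)
pred = H @ W2
loss = np.mean((pred - y) ** 2)

# Backward pass (chain rule through the composition): also just matmuls.
g_pred = 2.0 * (pred - y) / pred.size   # dLoss/dpred
g_W2 = H.T @ g_pred                     # dLoss/dW2
g_H = g_pred @ W2.T                     # dLoss/dH
g_W1 = X.T @ (g_H * (1.0 - H ** 2))     # dLoss/dW1 (tanh' = 1 - tanh^2)

# Stochastic gradient step using only this minibatch.
lr = 0.1
W1 -= lr * g_W1
W2 -= lr * g_W2
print(f"minibatch loss before the step: {loss:.4f}")
```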

4

GraciousReformer OP t1_j9jr4i4 wrote

"for example on tabular data where discontinuities are common, DL performs worse than alternatives, even if with more data it would eventually approximate a discontinuity." True. Is there references on this issue?

1

inspired2apathy t1_j9jsbz6 wrote

Is that entirely accurate though? There are all kinds of explicit dimensionality reduction methods. They can be combined with traditional ML models pretty easily for supervised learning. As I understand it, the unique thing DL gives us is a massive embedding that can encode/"represent" something like language or vision.

6

activatedgeek t1_j9jt721 wrote

I think the no free lunch theorem is misquoted here. The NFL also assumes that all datasets from the universe of datasets are equally likely. But that is objectively false. Structure is more likely than noise.

15

activatedgeek t1_j9jvj8h wrote

For generalization (performing well beyond the training data), there are at least two dimensions: flexibility and inductive biases.

Flexibility ensures that many functions "can" be approximated in principle. That's the universal approximation theorem. It is a descriptive result and does not prescribe how to find that function. This is not something unique to DL: deep random forests, Fourier bases, polynomial bases, and Gaussian processes are all universal function approximators (with some extra technical details).

The part unique to DL is that its inductive biases happen to match some of the complex structured problems, including vision and language, which makes it generalize well. Inductive bias is a loosely defined term; here are some examples and references.

CNNs provide an inductive bias towards functions that are translation equivariant (not exactly, only roughly, because of pooling layers); there's a small sketch of what equivariance means at the end of this comment. https://arxiv.org/abs/1806.01261

Graph neural networks provide a relational inductive bias. https://arxiv.org/abs/1806.01261

Neural networks overall prefer simpler solutions, embodying Occam’s razor, another inductive bias. This argument is made theoretically using Kolmogorov complexity. https://arxiv.org/abs/1805.08522
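Regarding the CNN example above, here's a toy sketch (plain NumPy/SciPy, my own example) of translation equivariance: shifting the input image shifts the filter response by the same amount, away from the boundary.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

shifted = np.roll(image, shift=1, axis=0)  # shift the image down by one row
shifted[0, :] = 0.0                        # zero-fill the row that rolled around

response = correlate2d(image, kernel, mode="valid")
response_shifted = correlate2d(shifted, kernel, mode="valid")

# The response to the shifted image equals the shifted response (border row aside).
print(np.allclose(response_shifted[1:], response[:-1]))  # True
```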

107

bloodmummy t1_j9jzvnr wrote

It strikes me that people who tout DL as a hammer-for-all-nails have never touched tabular data in their lives. Go try a couple of Kaggle tabular competitions and you'll soon realise that DL can be very dumb, cumbersome, and data-hungry. Ensemble models, decision tree models, and even feature-engineered linear regression models still rule there and curb-stomp DL all day long (in most cases).

Tabular data is also still the most common type of data used with ML. I'm not a "DL-hater", if there is such a thing; in fact my own research uses DL exclusively. But it isn't a magical wrench, and it won't become one.

8

NitroXSC t1_j9k09wt wrote

> Q2: Probably there are other classes but they haven't been discovered or are only at the early age of research.

I think there are many different classes that would work, but current DL is based in large part on matrix-vector operations, which can be implemented very efficiently on current hardware.

10

DigThatData t1_j9k17rr wrote

it's not. tree ensembles scale gloriously, as do approximate nearest neighbor methods. there are certain (and growing) classes of problems for which deep learning produces seemingly magical results, and for those it'll probably give you the best solution, but that doesn't mean it's the only path to a functional solution.

in any event, if you want to better understand scaling properties of DL algorithms, a good place to start is the "double descent" literature.

3

yldedly t1_j9k5n8n wrote

It depends a lot on what you mean by works. You can get a low test error with NNs on tabular data if you have enough of it. For smaller datasets, you'll get a lower test error using tree ensembles. For low out-of-distribution error neither will work.

3

Brudaks t1_j9k6mo0 wrote

Because being a universal function approximator is not sufficient to be useful in practice, and IMHO it is not even a particularly interesting property. We don't care whether something can approximate any function; we care whether it approximates the thing needed for a particular task, and being able to approximate it is a necessary but not a sufficient condition. We care about the efficiency of the approximation (e.g. a single-layer perceptron is a universal approximator only if you allow an impractical number of neurons). But even more important than how well the function can be approximated with a limited number of parameters is how well you can actually learn those parameters. This differs a lot between models: we don't care how well a model would fit the function with optimal parameters, we care how well it fits with the parameter values we can realistically identify with a bounded amount of computation.

That being said, we do use decision trees instead of DL: for some types of tasks the former outperform the latter, and for others it's the other way around.
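To illustrate the "impractical number of neurons" point above, a quick sketch (scikit-learn, my own toy setup): a single hidden layer is universal in principle, but with too few units it cannot even fit a mildly wiggly 1-D function on its own training data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 1))
y = np.sin(5 * X).ravel()

for width in (2, 200):
    net = MLPRegressor(hidden_layer_sizes=(width,), activation="tanh",
                       solver="lbfgs", max_iter=5000, random_state=0)
    net.fit(X, y)
    mse = np.mean((net.predict(X) - y) ** 2)
    print(f"{width:>3} hidden units, training MSE: {mse:.4f}")
# The 2-unit network underfits badly, while the wide one fits far more closely.
```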

3

hpstring t1_j9kf6te wrote

This is a very good answer! I want to add that apart from generalization, the fact that we have efficient optimization algorithms that can find quite good minima also contributes a lot to the deep learning magic.

4

relevantmeemayhere t1_j9khp8m wrote

As you mentioned, this is highly dependent on the functional relationship of the data.

You wouldn’t not use domain knowledge to determine that.

Additionally, non-linear models tend to have their own drawbacks, lack of interpretability and high variance being some of them.

2

relevantmeemayhere t1_j9ki2x1 wrote

Because they are useful for some problems and not others, like every algorithm? Nowhere in my statement did I say they are monolithic in their use across all subdomains of ML.

The statement was that deep learning is the only thing that works at scale. It’s not lol. Deep learning struggles in a lot of situations.

0

howtorewriteaname t1_j9kmbrv wrote

There's no mathematical formulation of that statement because there's no mathematical reasoning behind that statement. It's just an opinion (which I don't believe is true, btw).

2

chief167 t1_j9kt5ho wrote

We use gradient boosting at quite a big scale. Not LLM big, but still big. It's just not NLP or CV at all. It's for fraud detection in large transactional tabular datasets. And it outperforms basically all neural network approaches, shallow or deep.

2

chief167 t1_j9ku5mq wrote

I don't think it implies that all datasets are equally likely. I think it only implies that given all possible datasets, there is no best approach to modelling them. All possible != All are equally likely

But I don't have my book with me, and I don't trust the internet since it seems to lead to random blog posts instead of the original paper (Wikipedia gave a 404 in the footnotes).

0

Featureless_Bug t1_j9kuu22 wrote

Large scale is somewhere north of 1-2 TB of data. Even if you had that much, in the vast majority of cases tabular data has such a simple structure that you wouldn't need that much data to achieve the same performance - so frankly I wouldn't call any kind of tabular data large scale.

−2

30299578815310 t1_j9kziil wrote

This is just not true. As others noted, there are other algorithms which are universal approximators and run at scale. The key to the success of DNNs is unknown; one hypothesis is the lottery ticket hypothesis.


https://arxiv.org/abs/1803.03635

2

kvutxdy t1_j9l7on1 wrote

Universal approximation theorem only states that DNN can approximate Lipschitz functions, but not necessarily all functions.

2

adventuringraw t1_j9ll3fp wrote

I can see how the quote could be made slightly more accurate. In particular, tabular data in general is still better tackled with something like XGBoost instead of deep learning, so deep learning certainly hasn't turned everything into a nail for one universal hammer yet.

1

VirtualHat t1_j9ll5i2 wrote

Yes, that's right. For many problems, a linear model is just what you want. I guess what I'm saying is that the dividing line between when a linear model is appropriate vs when you want a more expressive model is often related to how much data you have.

1

VirtualHat t1_j9lmkg1 wrote

In my experience DNNs only help with data that has inherent structure (audio, video, images, etc.). I once had a large (~10M data points) tabular dataset and found that simply taking a random 2K subset and fitting an SVM gave the best results. I think this is usually the case, but people still want DNNs for some reason. If it were a vision problem then, of course, it'd be the other way around.

3

activatedgeek t1_j9lnhvv wrote

See Theorem 2 (Page 34) of The Supervised Learning No-Free-Lunch Theorems.

It conditions "uniformly" averaged over all "f" the input-output mapping, i.e. the function that generates the dataset (this is a noise-free case). It also provides "uniformly averaged over all P(f)", a distribution over the data-generating functions.

So while you could still have different data-generating distributions P(f), the result is defined over all such distributions uniformly averaged.

The NFL is sort of a worst-case result, and I think it pretty meaningless and inconsequential for the real world.

Let me know if I have misinterpreted this!

1

VirtualHat t1_j9loi32 wrote

It should be all continuous functions, but I can't really think of any problems where this would limit the solution. The set of all continuous functions is a very big set!

As a side note, I think it's quite interesting that the theorem doesn't cover periodic functions like sin over their whole domain, so I guess it's not quite all continuous functions, just continuous functions on a bounded input domain.
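For reference, the standard statement (Cybenko/Hornik style, roughly from memory) makes the compactness explicit, which is exactly why global periodicity isn't covered:

```latex
% Universal approximation, roughly: for any continuous f : K \to \mathbb{R}
% with K \subset \mathbb{R}^d compact, and any \varepsilon > 0, there exist
% n, a_i, b_i \in \mathbb{R} and w_i \in \mathbb{R}^d such that
\sup_{x \in K} \Bigl| f(x) - \sum_{i=1}^{n} a_i \, \sigma\!\bigl( w_i^\top x + b_i \bigr) \Bigr| \;<\; \varepsilon,
% where \sigma is a fixed non-polynomial (e.g. sigmoidal) activation.
% The guarantee lives entirely on the compact set K, so \sin on all of \mathbb{R} is not covered.
```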

4

VirtualHat t1_j9lp6z3 wrote

There was a really good paper a few years ago identifying some biases in how DNNs learn that might explain why they work so well in practice compared to alternatives. Essentially they are biased towards smoother solutions, which is often what is wanted.

This is still an area of active research, though. I think it's fair to say we still don't quite know why DNNs work as well as they do.

1

activatedgeek t1_j9lu7q7 wrote

All model classes have inductive biases. E.g. random forests have the inductive bias of producing axis-aligned region splits. But clearly, that inductive bias is not good enough for image classification, because a lot of the information in the pixels is spatially correlated in ways that axis-aligned regions cannot capture as well as specialized neural networks under the same budget. By budget, I mean things like training time, model capacity, etc.

If we had infinite training time and an infinite number of image samples, then random forests might be just as good as neural networks.

4

JackBlemming t1_j9n4bp2 wrote

This is true. Netflix famously didn't use some complex neural net for choosing shows you'd like, precisely because it didn't scale. Neural nets are expensive, and if you can sacrifice a few percentage points to save hundreds of millions in server fees, that's probably a good trade.

1

currentscurrents t1_j9n8in9 wrote

In theory, either structure can express any solution. But in practice, every structure is better suited to some kinds of data than others.

A decision tree is a bunch of nested if statements. Imagine the complexity required to write an if statement to decide if an array of pixels is a horse or a dog. You can technically do it by building a tree with an optimizer, but it doesn't work very well (there's a literal nested-if example at the end of this comment).

On the other hand, a CNN runs a bunch of learned convolutional filters over the image. This means it doesn't have to learn the 2D structure of images and that pixels tend to be related to nearby pixels; it's already working on a 2D plane. A tree doesn't know that adjacent pixels are likely related, and would have to learn it.

It also has a bias towards hierarchy. As the layers stack upwards, each layer builds higher-level representations to go from pixels > edges > features > objects. Objects tend to be made of smaller features, so this is a good bias for working with images.
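As a literal illustration of the "nested if statements" view, a toy sketch with scikit-learn (my own choice of dataset):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow decision tree and print it as nested if/else rules.
data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)
print(export_text(tree, feature_names=list(data.feature_names)))
# Prints nested rules of the form "|--- feature <= threshold ... |--- class: ...".
```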

1

pyfreak182 t1_j9slq8a wrote

It helps that the math behind backpropagation (i.e. matrix multiplications) is easily parallelizable. The computations in the forward pass are independent across training examples and can be computed in parallel. The same is true for the backward pass, which computes the gradient contribution of each training example in a batch independently.

And we have hardware accelerators like GPUs that are designed to perform large amounts of parallel computations efficiently.

The success of deep learning is just as much about implementation as it is theory.
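A tiny NumPy sketch of that independence (my own example): the forward pass for a whole batch is one matrix multiplication, and it matches processing each example separately, which is why the work parallelizes so well.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 10))      # one layer's weights
batch = rng.standard_normal((32, 128))  # a batch of 32 examples

batched = batch @ W                            # one big matmul for the batch
one_by_one = np.stack([x @ W for x in batch])  # the same thing, per example
print(np.allclose(batched, one_by_one))        # True
```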

1

alterframe t1_ja2f7xu wrote

Part of the answer is probably that DL is not a single algorithm or a class of algorithms, but rather a framework or a paradigm for building such algorithms.

Sure, you can take a SOTA model for ImageNet and apply it to similar image classification problems, by tuning some hyperparameters and maybe replacing certain layers. However, if you want to apply it to a completely different task, you need to build a different neural network.

1