Submitted by GraciousReformer t3_118pof6 in MachineLearning

"Deep learning is the only thing that currently works at scale it's the only class of algorithms that is able to discover arbitrary functions in a reasonable amount of time."

I know of the universal approximation theorem. But is there any mathematical formulation of this statement?




activatedgeek t1_j9jvj8h wrote

For generalization (performing well beyond the training data), there are at least two dimensions: flexibility and inductive biases.

Flexibility ensures that many functions “can” be approximated in principle. That’s the universal approximation theorem. It is a descriptive result and does not prescribe how to find that function. This is not unique to DL: deep random forests, Fourier bases, polynomial bases, and Gaussian processes are all universal function approximators (with some extra technical details).

The part unique to DL is that somehow their inductive biases have helped match some of the complex structured problems including vision and language that makes them generalize well. Inductive bias is a loosely defined term. I can provide examples and references.

CNNs provide the inductive bias to prefer functions that respect translation equivariance (not exactly, only roughly, due to pooling layers).

Graph neural networks provide a relational inductive bias.

Neural networks overall prefer simpler solutions, embodying Occam’s razor, another inductive bias. This argument is made theoretically using Kolmogorov complexity.
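The translation-equivariance point is easy to check numerically. Here's a toy 1-D sketch (my own illustration, not from any of the references above), using circular convolution as the idealized case; the stride-2 max-pool at the end shows where exactness breaks:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(32)   # a 1-D "image"
k = rng.standard_normal(32)   # a filter of the same length

def circ_conv(x, k):
    # circular convolution: the idealized, fully translation-equivariant layer
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

shift = 5
a = circ_conv(np.roll(x, shift), k)  # shift the input, then convolve
b = np.roll(circ_conv(x, k), shift)  # convolve, then shift the output
print(np.allclose(a, b))             # True: convolution commutes with translation

# pooling breaks exact equivariance: pooling the shifted response generally
# differs from shifting the pooled response
pool = lambda v: v.reshape(-1, 2).max(axis=1)  # stride-2 max-pool
print(np.allclose(pool(a), np.roll(pool(circ_conv(x, k)), shift)))  # False here
```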


SodomizedPanda t1_j9jyhem wrote

And somehow, the best answer is at the bottom of the thread.

A small addition: recent research suggests that the implicit bias in DNNs that helps generalization lies not only in the structure of the network but in the learning algorithm as well (Adam, SGD, ...).


red75prime t1_j9k0i84 wrote

Does in-context learning suggest that inductive biases could also be extracted from training data?


hpstring t1_j9kf6te wrote

This is a very good answer! I want to add that apart from generalization, the fact that we have efficient optimization algorithms that can find quite good minima also contributes a lot to the deep learning magic.


-vertigo-- t1_j9kih7k wrote

hmm for some reason the arxiv links are giving 403 forbidden


sanman t1_ja3kkkp wrote

first 2 links are the same - do you have the one for CNNs inductive bias?


GraciousReformer OP t1_j9ljjqu wrote

>inductive biases

Then why does DL have inductive biases and others do not?


activatedgeek t1_j9lu7q7 wrote

All model classes have inductive biases, e.g. random forests have the inductive bias of producing axis-aligned region splits. But clearly, that inductive bias is not good enough for image classification, because a lot of the information in the pixels is spatially correlated in ways that axis-aligned regions cannot capture as well as specialized neural networks can, under the same budget. By budget, I mean things like training time, model capacity, etc.

If we had infinite training time and an infinite number of image samples, then random forests might be just as good as neural networks.
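To make the axis-aligned point concrete, here is a small synthetic example (my own construction, not from the thread): the labels depend on a diagonal boundary, and the best single axis-aligned split, the primitive a tree node can express, tops out well below an oblique cut:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(2000, 2))
y = X[:, 0] > X[:, 1]                 # diagonal class boundary

def best_stump_acc(X, y):
    # best single axis-aligned split (what one node of a tree can express)
    best = 0.0
    for d in range(X.shape[1]):
        for t in np.linspace(-1, 1, 201):
            pred = X[:, d] > t
            best = max(best, (pred == y).mean(), (pred != y).mean())
    return best

print(best_stump_acc(X, y))               # about 0.75: can't track the diagonal
print(((X[:, 0] > X[:, 1]) == y).mean())  # 1.0: one oblique cut is exact
```

A deep tree can of course stack many such splits into a staircase, but it pays for the diagonal with depth and data, which is the "budget" point above.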


currentscurrents t1_j9n3o7u wrote

Sounds like ideally we'd want a model with good inductive biases for meta-learning new inductive biases, since every kind of data requires different biases.


GraciousReformer OP t1_j9lwe7i wrote

Still, why is it that DL has better inductive biases than others?


activatedgeek t1_j9lz6ib wrote

I literally gave an example of how (C)NNs have better inductive bias than random forests for images.

You need to ask more precise questions than just a "why".


GraciousReformer OP t1_j9m1o15 wrote

So it is like an ability to capture "correlations" that cannot be done with random forests.


currentscurrents t1_j9n8in9 wrote

In theory, either structure can express any solution. But in practice, every structure is better suited to some kinds of data than others.

A decision tree is a bunch of nested if statements. Imagine the complexity required to write an if statement to decide if an array of pixels is a horse or a dog. You can technically do it by building a tree with an optimizer; but it doesn't work very well.

On the other hand, a CNN runs a bunch of learned convolutional filters over the image. This means it doesn't have to learn the 2D structure of images and that pixels tend to be related to nearby pixels; it's already working on a 2D plane. A tree doesn't know that adjacent pixels are likely related, and would have to learn it.

It also has a bias towards hierarchy. As the layers stack upwards, each layer builds higher-level representations to go from pixels > edges > features > objects. Objects tend to be made of smaller features, so this is a good bias for working with images.


GraciousReformer OP t1_j9oeeo3 wrote

What are the situations that the bias for the hierarchy is not helpful?


relevantmeemayhere t1_j9ij6cc wrote

Lol. The fact that we have been using generalized linear models in every scientific field for decades should tell you all you need to know about this statement.


adventuringraw t1_j9in5sj wrote

I mean... the statement specifically uses the phrase 'arbitrary functions'. GLMs are a great tool in the toolbox, but the function family they optimize over is very far from 'arbitrary'.

I think the statement mostly means 'find very nonlinear functions of interest when dealing with very large numbers of samples from very high dimensional sample spaces'. GLMs are used in every scientific field, but certainly not for every application. Some form of deep learning really is the only game in town still for certain kinds of problems at least.


relevantmeemayhere t1_j9kin48 wrote

I agree with you. I was just pointing out that saying they are the only solution is foolish, as the quote implied.

This quote could have just been used without much context, so grain of salt.


adventuringraw t1_j9ll3fp wrote

I can see how the quote could be made slightly more accurate. In particular, tabular data in general is still better tackled with something like XGBoost instead of deep learning, so deep learning certainly hasn't turned everything into a nail for one universal hammer yet.


Featureless_Bug t1_j9iy4yq wrote

Haven't heard of GLMs being successfully used for NLP and CV recently. And those are about the only things that would be described as large scale in ML. The statement is completely correct - even stuff like gradient boosting does not work at scale in that sense.


chief167 t1_j9kt5ho wrote

We use gradient boosting at quite a big scale. Not LLM big, but still big. It's just not NLP or CV at all. It's for fraud detection in large transactional tabular datasets. And it outperforms basically all neural network approaches, shallow or deep.


Featureless_Bug t1_j9kuu22 wrote

Large scale is somewhere to the north of 1-2 TB of data. Even if you had that much data, in absolutely most cases tabular data has such a simplistic structure that you wouldn't need that much data to achieve the same performance - so I wouldn't call any kind of tabular data large scale to be frank


relevantmeemayhere t1_j9ki2x1 wrote

Because they are useful for some problems and not others, like every algorithm? Nowhere in my statement did I say they are monolithic in their use across all subdomains of ml

The statement was that deep learning is the only thing that works at scale. It’s not lol. Deep learning struggles in a lot of situations.


Featureless_Bug t1_j9kvek5 wrote

Ok, name one large scale problem where GLMs are the best prediction algorithm possible.


relevantmeemayhere t1_j9kygtx wrote

Any problem where you want things like effect estimates lol. Or error estimates. Or models that generate joint distributions

So, literally a ton of them. Which industries don’t like things like that?


VirtualHat t1_j9j2gwx wrote

Large linear models tend not to scale well to large datasets if the solution is not in the model class. Because of this lack of expressivity, linear models tend to do poorly on complex problems.


relevantmeemayhere t1_j9khp8m wrote

As you mentioned, this is highly dependent on the functional relationship of the data.

You would want to use domain knowledge to determine that.

Additionally, non-linear models tend to have their own drawbacks, lack of interpretability and high variability being some of them.


GraciousReformer OP t1_j9j7iwm wrote

>Large linear models tend not to scale well to large datasets if the solution is not in the model class

Will you provide me a reference?


VirtualHat t1_j9j8805 wrote

Linear models assume that the solution is of the form y = ax + b. If the true solution is not of this form, then the best solution in the class is likely to be a poor one.

I think Emma Brunskill's notes are quite good at explaining this. Essentially the model will underfit as it is too simple. I am making an assumption though, that a large dataset implies a more complex non-linear solution, but this is generally the case.
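A minimal numerical sketch of that underfitting argument (synthetic data; the true relationship is quadratic, so the best line in the class y = ax + b is still poor):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
y = x**2 + 0.1 * rng.standard_normal(500)   # true relationship is quadratic

# best model in the linear class y = ax + b
a, b = np.polyfit(x, y, 1)
lin_mse = np.mean((a * x + b - y) ** 2)

# extend the model class with one nonlinear feature and the fit recovers
quad = np.polyfit(x, y, 2)
quad_mse = np.mean((np.polyval(quad, x) - y) ** 2)

print(lin_mse, quad_mse)  # the line underfits badly; the quadratic is near noise level
```

No amount of extra data fixes the linear fit here; the error floor is set by the model class, not the sample size.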


relevantmeemayhere t1_j9kifhu wrote

Linear models are often preferred for the reasons you mentioned. Under fitting is almost always preferred to overfitting.


VirtualHat t1_j9ll5i2 wrote

Yes, that's right. For many problems, a linear model is just what you want. I guess what I'm saying is that the dividing line between when a linear model is appropriate vs when you want a more expressive model is often related to how much data you have.


GraciousReformer OP t1_j9j8bsl wrote

Thank you. I understand the math. But I meant a real-world example where "the solution is not in the model class."


VirtualHat t1_j9j8uvr wrote

For example, in the Iris dataset, the class label is not a linear combination of the inputs. Therefore, if your model class is all linear models, you won't find the optimal solution, or in this case even a good one.

If you extend the model class to include non-linear functions, then your hypothesis space at least contains a good solution, but finding it might be a bit more tricky.


GraciousReformer OP t1_j9jgdmc wrote

But DL is not a linear model. Then what will be the limit of DL?


terminal_object t1_j9jp51j wrote

You seem confused as to what you yourself are saying.


GraciousReformer OP t1_j9jppu7 wrote

"Artificial neural networks are often (demeneangly) called "glorified regressions". The main difference between ANNs and multiple / multivariate linear regression is of course, that the ANN models nonlinear relationships."


PHEEEEELLLLLEEEEP t1_j9k691x wrote

Regression doesn't just mean linear regression, if that's what you're confused about.


Acrobatic-Book t1_j9k94l4 wrote

The simplest example is the XOR problem (aka either-or). This is also why multilayer perceptrons, the basis of deep learning, were actually created: because a linear model cannot solve it.
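A small sketch of why: the best linear fit to XOR is useless, while a two-layer perceptron with hand-picked weights (one possible choice among many) solves it exactly:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)  # XOR

# best linear model w.x + b via least squares: it predicts 0.5 everywhere
A = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(A @ w)  # [0.5 0.5 0.5 0.5]

# a two-layer perceptron with hand-set weights computes XOR exactly
step = lambda z: (z > 0).astype(float)
h = step(X @ np.array([[1.0, 1.0], [1.0, 1.0]]).T + np.array([-0.5, -1.5]))
out = step(h @ np.array([1.0, -1.0]) - 0.5)  # fires iff exactly one input is on
print(out)  # [0. 1. 1. 0.]
```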


VirtualHat t1_j9lkto4 wrote

Oh wow, super weird to be downvoted just for asking for a reference. r/MachineLearning isn't what it used to be I guess, sorry about that.


BoiElroy t1_j9ioqcz wrote

Yeah, you should always exhaust existing classical methods first before reaching for deep learning.


[deleted] t1_j9ijb65 wrote



sdmat t1_j9izc3q wrote

Not exactly but close enough?


Fancy-Jackfruit8578 t1_j9j5v25 wrote

Because every NN is just basically a big linear function… with a nonlinearity at the end.


hpstring t1_j9jbhwv wrote

This is correct for two-layer NNs, not general NNs.


hpstring t1_j9jb96f wrote

Universal approximation is not enough, you need efficiency to make things work.

DL is the only class of algorithms that beats the curse of dimensionality when discovering a certain (very general) class of high-dimensional functions (something related to Barron space). Correct me if this is not accurate.
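For reference, the Barron-type result alluded to here (stated from memory, so check the original) bounds the $L^2$ error of a one-hidden-layer network with $n$ sigmoidal units, for targets $f$ whose spectral norm $C_f$ is finite, at a rate independent of the input dimension $d$:

```latex
\|f - f_n\|_{L^2(\mu)}^2 \;\le\; \frac{(2 r C_f)^2}{n},
\qquad
C_f = \int_{\mathbb{R}^d} \|\omega\| \, |\hat{f}(\omega)| \, d\omega ,
```

where $\mu$ is supported on a ball of radius $r$. Roughly speaking, fixed linear bases of size $n$ suffer rates that degrade badly with $d$ on the same class, which is the "beats the curse of dimensionality" claim.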


inspired2apathy t1_j9jsbz6 wrote

Is that entirely accurate though? There are all kinds of explicit dimensionality reduction methods. They can be combined with traditional ML models pretty easily for supervised learning. As I understand it, the unique thing DL gives us is a massive embedding that can encode/"represent" something like language or vision.


hpstring t1_j9jxzpm wrote

Well, traditional ML + dimensionality reduction cannot crack e.g. ImageNet recognition.


inspired2apathy t1_j9jzpqw wrote

Other models like PGMs can absolutely be applied to ImageNet, just not for SOTA accuracy.


GraciousReformer OP t1_j9ji7t1 wrote

But why does DL beat the curse? And why is DL the only class?


hpstring t1_j9juk1f wrote

Q1: We don't know yet. Q2: Probably there are other classes but they haven't been discovered or are only at the early age of research.


NitroXSC t1_j9k09wt wrote

> Q2: Probably there are other classes but they haven't been discovered or are only at the early age of research.

I think there are many different classes that would work but current DL is based in large parts on matrix-vector operations which can be implemented efficiently on current hardware.


randomoneusername t1_j9iyf7r wrote

I mean this has two elements in it.

DL is not the only algorithm that works in scale for sure.


[deleted] t1_j9jgblt wrote



randomoneusername t1_j9jkzs7 wrote

The standalone statement you have there is very vague. Can I assume it's talking about NLP or CV projects?

On tabular data, even with non-linear relationships, normal boosting and ensemble algorithms can scale and be top of the game.


bloodmummy t1_j9jzvnr wrote

It strikes me that people who tout DL as a hammer-for-all-nails never touched tabular data in their lives. Go try a couple of Kaggle tabular competitions and you'll soon realise that DL can be very dumb, cumbersome, and data-hungry. Ensemble models, decision-tree models, and even feature-engineered linear regression models still rule there and curb-stomp DL all day long (for most cases).

Tabular data is also still the type of data most used with ML. I'm not a "DL-hater", if there is such a thing; in fact my own research uses DL only. But it isn't a magical wrench, and it won't become one.


Mefaso t1_j9jgvoz wrote

Anything that scales sub-quadratically?

Anything "big-data"


GraciousReformer OP t1_j9jib2n wrote

Then why DL?


suflaj t1_j9jjetb wrote

Because it requires the least amount of human intervention

Also because it subjectively sounds like magic to people who don't really understand it, so it both sells to management and to consumers.

At least it's easier for humans to cope that it is like magic than to accept that a lot of what AI can do is just stuff that is trivial and doesn't require humanity to solve.


chief167 t1_j9jev01 wrote

Define scale

Language models? Sure. Images? Sure. Huge amounts of transaction data to search for fraud? Xgboost all the way lol.

The no-free-lunch theorem: there is no single approach best for every possible problem. Jeez, I hate it when marketing takes over. You learn this principle in the first chapter of literally every data course.


activatedgeek t1_j9jt721 wrote

I think the no free lunch theorem is misquoted here. The NFL also assumes that all datasets from the universe of datasets are equally likely. But that is objectively false: structure is more likely than noise.


chief167 t1_j9ku5mq wrote

I don't think it implies that all datasets are equally likely. I think it only implies that given all possible datasets, there is no best approach to modelling them. All possible != All are equally likely

But I don't have my book with me, and I don't trust the internet, since it seems to lead to random blog posts instead of the original paper (Wikipedia gave a 404 in the footnotes).


activatedgeek t1_j9lnhvv wrote

See Theorem 2 (Page 34) of The Supervised Learning No-Free-Lunch Theorems.

It considers results uniformly averaged over all "f", the input-output mapping, i.e. the function that generates the dataset (this is the noise-free case). It also gives results uniformly averaged over all P(f), distributions over the data-generating functions.

So while you could still have different data-generating distributions P(f), the result is defined over all such distributions uniformly averaged.

The NFL is sort of a worst-case result, and I think it is pretty meaningless and inconsequential for the real world.

Let me know if I have misinterpreted this!


ktpr t1_j9jmqq7 wrote

I feel like ML boosters have recently been coming to this subreddit, making large claims, and then using the ensuing discussion, time, and energy from others to correct their clickbait content at our expense.


yldedly t1_j9j6gk8 wrote

>discover arbitrary functions

Uh, no. Not even close. DL can approximate arbitrary functions on a bounded interval given enough data, parameters and compute.


ewankenobi t1_j9jl91t wrote

I like your wording, did you come up with that definition yourself or is it from a paper?


yldedly t1_j9jorh1 wrote

It's not from a paper, but it's pretty uncontroversial I think - though people like to forget about the "bounded interval" part, or at least what it implies about extrapolation.


[deleted] t1_j9jsgf6 wrote

What is "bounded interval" here?


yldedly t1_j9judc7 wrote

Any interval [a; b] where a and b are numbers. In practice, it means that the approximation will be good in the parts of the domain where there is training data. I have a concrete example in a blog post of mine:
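A toy version of that kind of example (not the actual blog code; a polynomial fit stands in for any flexible approximator trained on a bounded interval):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 200)   # all training data lives in [-3, 3]
y_train = np.sin(x_train)

coef = np.polyfit(x_train, y_train, 9)  # flexible approximator, stand-in for an NN

inside = abs(np.polyval(coef, 1.0) - np.sin(1.0))    # inside the training interval
outside = abs(np.polyval(coef, 8.0) - np.sin(8.0))   # extrapolation
print(inside, outside)  # tiny error inside, huge error outside
```

The fit is excellent wherever there was data, and diverges immediately outside [-3, 3], which is all the universal approximation theorem ever promised.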


[deleted] t1_j9jukfu wrote

Interesting, but that is valid for us as well. So I am not sure this remains true once they learn very general things, like learning itself.


GraciousReformer OP t1_j9jfxvh wrote

Yes but is DL the unique mechanism? Why DL?


yldedly t1_j9jpuky wrote

There are two aspects, scalability and inductive bias. DL is scalable because compositions of differentiable functions make backpropagation fast, and those functions being mostly matrix multiplications make GPU acceleration effective. Combine this with stochastic gradients, and you can train on very large datasets very quickly.
Inductive biases make DL effective in practice, not just in theory. While the universal approximation theorem guarantees that an architecture and weight-setting exist that approximate a given function, the bias of DL towards low-dimensional smooth manifolds reflects many real-world datasets, meaning that SGD will easily find a local optimum with these properties (and when it doesn't, for example on tabular data where discontinuities are common, DL performs worse than alternatives, even if with more data it would eventually approximate a discontinuity).
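The scalability half of this can be sketched in a few lines: minibatch SGD touches only a small slice of the data per step, yet converges on the full objective (illustrative linear least-squares example of my own; in a real network the gradient would come from backprop rather than a closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(n)

w = np.zeros(d)
lr, batch = 0.1, 64
for _ in range(2000):
    idx = rng.integers(0, n, batch)     # one cheap stochastic gradient per step,
    Xb, yb = X[idx], y[idx]             # never the full 10k-point dataset
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch
    w -= lr * grad

print(np.max(np.abs(w - w_true)))       # close to the true weights
```

Each update costs O(batch * d) regardless of n, which is why the same recipe keeps working as datasets grow.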


GraciousReformer OP t1_j9jr4i4 wrote

"for example on tabular data where discontinuities are common, DL performs worse than alternatives, even if with more data it would eventually approximate a discontinuity." True. Is there references on this issue?


yldedly t1_j9jr821 wrote

This one is pretty good:


GraciousReformer OP t1_j9jrhjd wrote

This is a great point. Thank you. So do you mean that DL works for language models only when they get a large amount of data?


GraciousReformer OP t1_j9k1srq wrote

But then what is the difference from the result that NN works better for ImageNet?


yldedly t1_j9k3orr wrote

Not sure what you're asking. CNNs have inductive biases suited for images.


GraciousReformer OP t1_j9k4974 wrote

So it works for images but not for tabular data?


yldedly t1_j9k5n8n wrote

It depends a lot on what you mean by works. You can get a low test error with NNs on tabular data if you have enough of it. For smaller datasets, you'll get a lower test error using tree ensembles. For low out-of-distribution error neither will work.


1bir t1_j9jbi5e wrote

Apparently* decision trees are also capable of universal function approximation.

Whether the algorithms for training them do that as well as the ones for deep NNs in practice is a separate issue.

*Haven't seen (& probably wouldn't understand) a proof.


GraciousReformer OP t1_j9jg13p wrote

Then why not use decision trees instead of DL?


uhules t1_j9jpkun wrote

Before why, ask if. GBDTs are very widely used.


1bir t1_j9jnu2h wrote

>Whether the algorithms for training them do that as well as the ones for deep NNs in practice is a separate issue.

For supervised learning the big problem with decision trees (RFs, GBTs etc) seems to be representation learning


Brudaks t1_j9k6mo0 wrote

Because being a universal function approximator is not sufficient to be useful in practice, and IMHO is not even a particularly interesting property. We don't care whether something can approximate any function; we care whether it approximates the thing needed for a particular task, and in any case being able to approximate it is a necessary but not a sufficient condition.

We care about efficiency of approximation (e.g. a single-layer perceptron is a universal approximator only if you assume an impractical number of neurons). But even more important than how well the function can be approximated with a limited number of parameters is how well you can actually learn those parameters. This differs a lot between models: we don't care how well a model would fit the function with optimal parameters, we care how well it fits with the parameter values we can realistically identify with a bounded amount of computation.

That being said, we do use decision trees instead of DL; for some types of tasks the former outperform the latter and for other types of tasks its the other way around.


VirtualHat t1_j9lp6z3 wrote

There was a really good paper a few years ago that identifies some biases in how DNNs learn which might explain why they work so well in practice compared to alternatives. Essentially, they are biased towards smoother solutions, which is often what is wanted.

This is still an area of active research, though. I think it's fair to say we still don't quite know why DNNs work as well as they do.


BoiElroy t1_j9ipbtg wrote

This is not the answer to your question, but one intuition I like about the universal approximation theorem, which I thought I'd share, is the comparison to a digital image. You use a finite set of pixels, each of which can take on a certain set of discrete values. With a 10 x 10 grid of pixels you can draw a crude approximation of a stick figure. With 1000 x 1000 you can capture a blurry but recognizable selfie. Within the finite pixels and the discrete values they can take, you can essentially capture anything you can dream of - every frame of every movie ever made. Obviously there are other issues later, like whether your model's operational design domain matches the distribution of the training domain, or did you just waste a lot of GPU hours lol.


GraciousReformer OP t1_j9jh7zm wrote

Yes, a sufficiently fine grid will approximate any digital image. But this is an approximation of an image on a grid. How does it lead to approximation by an NN?


DigThatData t1_j9k17rr wrote

it's not. tree ensembles scale gloriously, as do approximations of nearest neighbors. there are certain (and growing) classes of problems for which deep learning produces seemingly magical results, but that doesn't mean it's the only path to a functional solution. It'll probably give you the best solution, but that doesn't mean it's the only way to do things.

in any event, if you want to better understand scaling properties of DL algorithms, a good place to start is the "double descent" literature.


JackBlemming t1_j9n4bp2 wrote

This is true. Netflix famously didn't use some complex neural net for choosing shows you'd like, exactly because it didn't scale. Neural nets are expensive, and if you can sacrifice a few percentage points to save hundreds of millions in server fees, it's probably good.


DigThatData t1_j9nlrii wrote

just to be clear: i'm not saying neural networks don't scale, i'm saying they're not the only class of learning algorithm that scales.


howtorewriteaname t1_j9kmbrv wrote

There's no mathematical formulation of that statement because there's no mathematical reasoning behind it. It's just an opinion (which I believe isn't true, btw).


30299578815310 t1_j9kziil wrote

This is just not true. As others noted, there are other algorithms which are universal approximators and run at scale. The key to the success of DNNs is unknown. One hypothesis is the lottery ticket hypothesis.



VirtualHat t1_j9lm05j wrote

It's worth noting that it wasn't until conv nets that DNNs took off. It's hard to think of a problem that a traditional vanilla MLP solves that can't also be solved with an SVM.


kvutxdy t1_j9l7on1 wrote

The universal approximation theorem only states that DNNs can approximate Lipschitz functions, but not necessarily all functions.


VirtualHat t1_j9loi32 wrote

It should be all continuous functions, but I can't really think of any problems where this would limit the solution. The set of all continuous functions is a very big set!

As a side note, I think it's quite interesting that the theorem doesn't include periodic functions like sin, so I guess it's not quite all continuous functions, just continuous functions with bounded input.


pyfreak182 t1_j9slq8a wrote

It helps that the math behind backpropagation (i.e. matrix multiplications) is easily parallelizable. The computations in the forward pass are independent of each other and can be computed in parallel for different training examples. The same is true for the backward pass, which involves computing the gradients for each training batch independently.

And we have hardware accelerators like GPUs that are designed to perform large amounts of parallel computations efficiently.

The success of deep learning is just as much about implementation as it is theory.
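A minimal illustration of that batching point, for a single linear layer (my own sketch): the forward pass over a whole batch is one matrix multiply, and so is the weight gradient, which is exactly the shape of work GPUs accelerate:

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 256, 32, 16
X = rng.standard_normal((B, d_in))     # a whole batch of examples
W = rng.standard_normal((d_in, d_out))

# forward: all 256 examples in one matrix multiply
H = X @ W

# backward: given upstream gradients dL/dH, the weight gradient for the whole
# batch is again a single matmul (the sum over examples happens inside X.T @ dH)
dH = rng.standard_normal((B, d_out))
dW = X.T @ dH

# same result as looping over examples one at a time, just much slower
dW_loop = sum(np.outer(X[i], dH[i]) for i in range(B))
print(np.allclose(dW, dW_loop))  # True
```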


alterframe t1_ja2f7xu wrote

Part of the answer is probably that DL is not a single algorithm or a class of algorithms, but rather a framework or a paradigm for building such algorithms.

Sure, you can take a SOTA model for ImageNet and apply it to similar image classification problems, by tuning some hyperparameters and maybe replacing certain layers. However, if you want to apply it to a completely different task, you need to build a different neural network.


elmcity2019 t1_j9jm4nq wrote

I have been an applied data scientist for 10 years. I have built over 100k models using Python, Databricks, and DataRobot. I have never seen a DL model outcompete all the other algorithms. Granted, I am largely working with structured business data, but nonetheless DL isn't really competitive.


VirtualHat t1_j9lmkg1 wrote

In my experience, DNNs only help with data that has spatial or temporal structure (audio, video, images, etc.). I once had a large (~10M datapoints) tabular dataset and found that simply taking a random 2K subset and fitting an SVM gave the best results. I think this is usually the case, but people still want DNNs for some reason. If it were a vision problem then, of course, it'd be the other way around.