Submitted by Difficult-Race-1188 t3_z7nt9o in deeplearning

This will change your understanding of Neural Networks forever. The black-box nature of Neural Networks has eluded even the best scientists for more than a decade now. Recent research papers shedding light on the black-box nature of deep learning systems have convinced many researchers that Neural Networks are nothing but a bunch of decision trees in hyperspace. The Godfather of AI, LeCun, went on to say that Neural Networks can't interpolate; all they can do is extrapolate, and that too in a rudimentary fashion, like Decision Trees.

Full Article:

https://medium.com/aiguys/proof-that-neural-networks-are-dumb-1c848163dec3


[Image: Polyhedron in hypothesis space]

2

Comments


xtof54 t1_iy7kgtw wrote

Researchers know that, but it does not help in any way to better understand DNNs. A bunch of DTs is not more explainable than a DNN.

29

Difficult-Race-1188 OP t1_iy7p4ap wrote

It does: if we know that a NN behaves like a DT, then we can design new loss functions that take the internal structure into account. One research area in this regard is Lipschitz regularization. Adding such regularization makes the NN behave more smoothly.
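
A minimal sketch of what such a regularizer could look like (my own illustration, assuming PyTorch and a toy model; this is a gradient-penalty flavour of Lipschitz regularization, not the exact method of any particular paper):

```python
# Sketch: penalize the input-gradient norm so the network's local Lipschitz
# constant stays small, i.e. the learned function is smoother.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))  # toy model

def lipschitz_penalty(model, x, target_norm=1.0):
    x = x.clone().requires_grad_(True)
    out = model(x).sum()
    grads, = torch.autograd.grad(out, x, create_graph=True)
    # Penalize deviation of each sample's input-gradient norm from target_norm.
    return ((grads.norm(dim=1) - target_norm) ** 2).mean()

x = torch.randn(32, 10)
y = torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y) + 0.1 * lipschitz_penalty(model, x)
loss.backward()
```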

−9

freaky1310 t1_iy7ielr wrote

Thanks for pointing out the article, it’s going to be useful for a lot of people.

Anyway, when we refer to the “black box” nature of DNNs we don’t mean “we don’t know what’s going on”, but rather “we know exactly what’s going on in theory, but there are so many simple calculations that it’s impossible for a human being to keep track of them”. Just think of a relatively small ConvNet like AlexNet: it has ~61M parameters, meaning that all the simple calculations (gradient updates and so on) are performed A LOT of times in a single backward pass.
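
As a quick sanity check of that figure (assuming the torchvision definition of AlexNet; the original paper reports about 60M parameters):

```python
# Count AlexNet's parameters with torchvision's reference implementation.
import torchvision

model = torchvision.models.alexnet()
n_params = sum(p.numel() for p in model.parameters())
print(f"AlexNet parameters: {n_params:,}")  # roughly 61 million
```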

Also, DNNs often work with a latent representation, which adds another layer of abstraction for the user: the “reasoning” part happens in a latent space that we don’t know anything about, except some of its properties (and again, if we make the calculations we actually do know exactly what it is, it’s just unfeasible to do them).

To address these points, several research projects have focused on network interpretability, that is, finding ways of making sense of NNs’ reasoning process. Here’s a review written in 2021 regarding this.

11

Difficult-Race-1188 OP t1_iy7pdev wrote

So the paper that discusses the spline theory of DL says that even in the latent representation NNs are incapable of interpolation, and that's a very important thing to know. If we know this, then we can design loss functions that work to better capture the global manifold structure.

−2

ivan_kudryavtsev t1_iy7ncq9 wrote

Maybe a decision tree is a special case of a NN? I mean that a NN is the more generic structure, because it may include arbitrary neuron designs and custom layer designs?

5

Difficult-Race-1188 OP t1_iy7p779 wrote

It might behave in a similar fashion to a DT, but a DT doesn't build abstract feature representations, and that is something important.

−1

RichardBJ1 t1_iy7xvbr wrote

I was interested when I first heard about this concept. People seemed to respond with either thinking it was ground-shaking, or alternatively that it stood to reason that, given enough splits, it would be the case! Do you think, though, that from a practical-usage perspective this doesn't help much, because there are so many decisions? The article has a lot more than just that, though, and a nice provocative title.

3

hp2304 t1_iy8lyyf wrote

Any ML classifier or regressor is basically a function approximator.

The function space isn't continuous but rather discrete, discretized by the dataset points. Hence, increasing the size of the dataset can help increase overall accuracy. This is analogous to the Nyquist criterion: with less data, it's more likely our approximation is wrong. Given the dimensionality of the input space and the range of each input variable, any dataset is tiny by comparison. E.g. for a 224x224 RGB input image, the input space has 256^(224x224x3) possible values, an unimaginably large number; mapping each one to a correct class label (out of 1000 classes) is very difficult for any approximator. Hence, one can never get 100% accuracy.
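
A quick back-of-the-envelope check of that number (plain Python, counting decimal digits via logarithms rather than materializing the integer):

```python
# 256**(224*224*3) possible 8-bit RGB images of size 224x224: how many digits?
import math

n_bits = 224 * 224 * 3 * 8              # total bits per image
n_digits = int(n_bits * math.log10(2)) + 1
print(n_digits)                          # about 362,508 decimal digits
```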

2

Salt-Improvement9604 t1_iy8rho5 wrote

>Any ML classifier or regressor is basically a function approximator.

so what is the point of designing all these different learning algorithms?

All we need is more data and even the simplest linear model with interaction terms will be enough?

0

freaky1310 t1_iy91uxy wrote

TL;DR: Each model tries to solve the problems that affect the current state-of-the-art model.

Theoretically, yes. Practically, definitely not.

I’ll try to explain myself, please let me know if something I say is not clear. The whole point of training NNs is to find an approximator that could provide correct answers to our questions, given our data. The different architectures that have been designed through the years address different problems.

Namely, CNNs addressed the curse of dimensionality: using MLPs and similar architectures wouldn’t scale to “large” images (larger than 64x64, say), because the number of connections in a fully connected layer grows with the product of the input pixels and the neurons of each layer. Convolution was found to provide a nice way of aggregating pixels into “features” (the term used from now on), and CNNs were born.
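
As a rough sketch of that scaling argument (my own example, assuming PyTorch), compare a single fully connected layer on a 224x224 RGB input with a single 3x3 convolution:

```python
# Parameter counts: dense layer on a flattened image vs. a small conv layer.
import torch.nn as nn

dense = nn.Linear(224 * 224 * 3, 1000)   # weights: 150,528 x 1000
conv = nn.Conv2d(3, 64, kernel_size=3)   # weights: 64 x 3 x 3 x 3, shared spatially

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense))  # 150,529,000 parameters
print(count(conv))   # 1,792 parameters
```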

After that, expressiveness became a problem: for example, stacking too many convolutions would erase too much information on one side, and significantly increase inference time on the other. To address this, researchers found recurrent units useful for retaining lost information and propagating it through the network. Et voilà, RNNs were born.

Long story short: each different type of architecture was born to solve the problems of another kind of models, while introducing new issues and limitations at the same time.

So, to go back to your first question: can NNs approximate everything? Not literally everything, but a “wide variety of interesting functions”. In practice, they can try to approximate almost anything you'll need, even though some limitations will always remain.

3

BellyDancerUrgot t1_iy8x636 wrote

If so, I wonder why NNs are so much better on unstructured data while being trickier, and in general less useful, on structured data compared to tree-based and boosted classifiers.

2

Youness_Elbrag t1_iy9lq76 wrote

I think a NN is a general-purpose structure that can learn almost anything from data, depending on the problem, and it can approximate mappings between data distributions; neural automata are a good example.

2

Creepy_Disco_Spider t1_iya0x5i wrote

You can just cite the original paper

2

Difficult-Race-1188 OP t1_iya8hg7 wrote

I've tried adding information from a lot of other resources, not just one paper, and all of them are mentioned in the article.

1

VinnyVeritas t1_iya34at wrote

Correlation is not causation.

2

BrotherAmazing t1_iyaux7r wrote

A deep neural network can approximate any function.

A deep recurrent neural network can approximate any algorithm.

These are mathematically proven facts. Can the same be said about “a bunch of decision trees in hyperspace”? If so, then I would say “a bunch of decision trees in hyperspace” are pretty darn powerful, as are deep neural networks. If not, then I would say the author has made a logical error somewhere along the way in his very qualitative reasoning. Plenty of thought experiments in language with “bulletproof” arguments have led to “contradictions” in the past, only for a subtle logical error to be unveiled when we stop using language and start using mathematics.
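
As a toy illustration of that approximation guarantee (my own sketch, assuming PyTorch), here is a small MLP fit to sin(x) on a compact interval:

```python
# A tiny MLP approximating sin(x) on [-pi, pi], in the spirit of the
# universal approximation theorem (which guarantees such a network exists
# for any desired accuracy on a compact set).
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-math.pi, math.pi, 512).unsqueeze(1)
y = torch.sin(x)

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print(loss.item())  # should become very small after training
```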

2

Difficult-Race-1188 OP t1_iyaxhbe wrote

The argument goes much further: NNs are not exactly learning the data distribution. If they were, the affine-transformation problem would already have been taken care of, and there would be no need for data augmentation by rotating or flipping. Also, approximating any algorithm doesn't necessarily mean the underlying data follows a distribution generated by any known algorithm. And neural networks struggle even to learn simple mathematical functions; all they do in the approximation is make piecewise approximations of the algorithm.

Here's the grokking paper review, which showed that a NN couldn't generalize to this equation:

x³ + xy² + y (mod 97)

Article: https://medium.com/p/9dbbec1055ae

Original paper: https://arxiv.org/abs/2201.02177
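
For reference, a minimal sketch of the dataset behind that operation (my own reconstruction, assuming the usual setup of enumerating all residue pairs mod 97):

```python
# Enumerate all (x, y) pairs for f(x, y) = (x^3 + x*y^2 + y) mod 97, the
# operation the grokking paper reports as failing to generalize.
P = 97
data = [((x, y), (x**3 + x * y**2 + y) % P) for x in range(P) for y in range(P)]
print(len(data))   # 9409 input pairs in total
print(data[:2])    # [((0, 0), 0), ((0, 1), 1)]
```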

1

BrotherAmazing t1_iyazrs4 wrote

Again, they can approximate any function or algorithm. This is proven mathematically.

Just because people are confounded by examples of DNNs that don’t seem to do what they want them to do, and because people do not yet understand how to construct the DNNs that can indeed do these things, does not mean they are “dumb” or limited.

Perhaps you are constructing them wrong. Perhaps the engineers are the dumb ones? 🤷🏼

Sometimes people literally argue, in plain English and not mathematics, that basic mathematically proven concepts are not true.

If you had a mathematical proof that showed DNNs were equivalent to decision trees, or incapable of performing certain tasks, neat! But if you argue that DNNs can’t perform tasks that can be reduced to functions or algorithms, and do it in mere language without mathematical proof, I’m not impressed yet!

2

Difficult-Race-1188 OP t1_iyc8451 wrote

https://arxiv.org/pdf/2210.05189.pdf

Read this paper; it's been proven that neural networks are decision trees, not a mere approximation but exactly equivalent. Third line of the abstract.
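
To make the claimed equivalence concrete, here is a tiny sketch (my own, assuming PyTorch) of the core idea: in a ReLU network, each hidden unit's on/off sign acts like a tree split, and every sign pattern selects one linear "leaf" function of the input:

```python
# A one-hidden-layer ReLU net: the activation pattern (which units fire) plays
# the role of a root-to-leaf path, and on that region the net is exactly linear.
import torch

torch.manual_seed(0)
W1, b1 = torch.randn(4, 2), torch.randn(4)   # hidden layer: 4 ReLU units, 2-D input
w2, b2 = torch.randn(4), torch.randn(1)      # output layer

def forward_with_path(x):
    pre = W1 @ x + b1
    pattern = (pre > 0)                      # the "decisions": which units fire
    # With the pattern fixed, the network reduces to y = w_eff . x + b_eff
    w_eff = (w2 * pattern) @ W1
    b_eff = (w2 * pattern) @ b1 + b2
    return pattern.tolist(), (w_eff @ x + b_eff).item()

x = torch.tensor([0.5, -1.0])
path, y = forward_with_path(x)
print(path, y)   # inputs sharing this pattern share the same linear "leaf"
```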

1

BrotherAmazing t1_iyeq8zq wrote

Interesting—I will have a read when I have time to read and check the math/logic. Thanks!

I do think I am allowed to remain skeptical for now because this was just posted as a pre-print with a single author a month ago and has not been vetted by the community.

Besides, if there is an equivalence between recurrent neural networks, convolutional neural networks, fully connected networks, policies learned with deep reinforcement learning, and all of this regardless of the architecture, how the network is trained, and so on, and there always exists a decision tree that is equivalent, then I would say:

  1. Very interesting

  2. Decision trees are then more flexible and powerful than we give them credit for, not that NNs are less flexible and less powerful than they have been proven to be.

  3. What is it about decision trees that makes people not use them in practice for anything too complicated on full motion video, etc? How does one construct the decision tree “from scratch” via training except by training the NN first, then building a decision tree that represents the NN? I wouldn’t say “they’re the same” from an engineering and practical point of view if one can be trained efficiently and the other cannot, but can only be built once the trained NN already exists.

2