Submitted by 4bedoe t3_yas9k0 in MachineLearning
In many disciplines, practice usually deviates from theory. Is that the same in ML? If so, what are the deviations?
This is a great answer. The simplest approach is often the best in practice.
I am still shocked that DenseNets are not the standard in the industry; with a cross-stage-partial design you also split the gradient flow and get a much, much easier training process. Is it the complexity of implementation that's holding them back, I wonder?
Anybody who has tried to run a DenseNet knows that it requires an absurd amount of memory compared to ResNets.
data > model
Data/labels are usually a much harder problem than the algorithm.
The first things that came to mind are things that just weren't taught much in school - data quality/relevance/sourcing, designing proxies for business metrics, how good is good enough, privacy of training data, etc.
The deviations from theory that have come to mind:
[deleted]
I feel like in practice, in data science, you can rarely justify things theoretically and mostly rely on best practices for certainty of conclusions.
[removed]
A thing called double descent. I still don't believe it, though.
illustration of why it happens in low-dimensions: https://twitter.com/adad8m/status/1582231644223987712
I think the main problem is that all textbooks introduce the bias-variance tradeoff as something close to a theoretical law, while in reality it is just an empirical observation, and we simply hadn't bothered to check this observation across more settings... until now.
There's quite a bit of missing context and misconceptions about this in general, but there is a theoretical backing that is really just a mathematical fact that isn't falsifiable by observation.
Bias, variance, and mean squared error for an estimator can be related quite simply. Say the quantity to be estimated is y, the estimate is x (forgive the lack of LaTeX making more conventional names difficult). Then MSE is E[(y - x)^2] = E[x^2] + E[y^2] - 2 E[xy] = Var[x] + Var[y] - 2Cov[x,y] + (E[x]-E[y])^2, and the first 3 terms are typically simplified by assuming some error structure that makes Cov[x,y] = 0, leading to a decomposition of MSE into 3 terms:
1. Var[y] - the variance of the true underlying quantity, inherent to any estimator.
2. Var[x] - the variance of the estimator regardless of bias (what's typically referred to as "the variance" in the bias-variance terminology).
3. (E[y] - E[x])^2 = Bias[x]^2 - the squared bias of the estimator.
Since (1) is constant across all estimators, that means that the difference in MSE for two estimators comes entirely down to (2) and (3).
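A quick numerical check of the decomposition above (a sketch with made-up numbers: the true quantity y is a constant theta here, so Var[y] = 0 and Cov[x,y] = 0 drop out):

```python
import numpy as np

# Verify MSE = Var[x] + Bias[x]^2 by Monte Carlo.
rng = np.random.default_rng(0)
theta, n, trials = 2.0, 10, 100_000   # arbitrary illustrative values

samples = rng.normal(theta, 1.0, size=(trials, n))
x = 0.8 * samples.mean(axis=1)        # a shrunken (hence biased) estimator of theta

mse = np.mean((x - theta) ** 2)
var = np.var(x)                       # Var[x]
bias = np.mean(x) - theta             # E[x] - E[y]

# the decomposition holds (exactly, up to float rounding, for empirical moments)
assert abs(mse - (var + bias ** 2)) < 1e-9
```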
The bias-variance tradeoff is then:
For a fixed value of mean squared error, decreasing the bias of an estimator must increase the variance by the same amount and vice versa.
An unbiased estimator will have the maximum variance among all estimators with the same MSE. It's entirely possible to improve an estimator's MSE by increasing or decreasing bias. In fact, the two commonly used estimators of sample variance are a great example*.
The most important pieces here are that we're talking about MSE of an estimator, and that for a fixed value of MSE there's a 1:1 tradeoff between bias and variance. It makes no statements about the case when MSE is not held constant, and it's actually very well known in mathematical statistics that biased estimators can provide significant improvements in MSE over unbiased ones. This doesn't hold if you're not considering MSE, or allowing MSE to change. It can be extended to multivariate cases and "estimator" could be anything from a mean of a few samples to the output of a deep belief network.
* Take a = 1/(n-1) * sum(x^2) - 1/(n*(n-1)) * sum(x)^2, and b = (n-1)/n * a. v = Var[x] is the population variance. We have E[a] = v, which means a is unbiased (this is Bessel's correction). And obviously, E[b] = (n-1)/n * v. Thus Bias[b] = E[b] - v = - v / n. Calculation of Var[a] requires knowledge of the 4th moments of x, however, in the case of normally distributed data the MSE of b is less than that of a. And a property that holds in general is that Var[b] = ((n-1)/n)^2 * Var[a], and MSE[b] = Var[b] + (v/n)^2 = 1/n^2*((n-1)^2 Var[a] + v^2). Thus MSE[b] < MSE[a] if (2n - 1)*Var[a] > v^2. This is true for normal distributions and many others.
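That footnote is easy to confirm by simulation (a sketch; the sample size, trial count, and seed are arbitrary). NumPy's `ddof=1` variance is exactly the unbiased a, and `ddof=0` is b:

```python
import numpy as np

# For normal data, the biased estimator b beats the unbiased a on MSE.
rng = np.random.default_rng(0)
v, n, trials = 1.0, 5, 200_000    # true population variance v, sample size n

data = rng.normal(0.0, np.sqrt(v), size=(trials, n))
a = data.var(axis=1, ddof=1)      # Bessel-corrected: E[a] = v (unbiased)
b = data.var(axis=1, ddof=0)      # b = (n-1)/n * a:  E[b] = (n-1)/n * v

assert abs(a.mean() - v) < 0.01                        # a is unbiased
assert abs(b.mean() - (n - 1) / n * v) < 0.01          # Bias[b] = -v/n
assert np.mean((b - v) ** 2) < np.mean((a - v) ** 2)   # but MSE[b] < MSE[a]
```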
Thanks for your clarification - I am sure it is useful for the other readers! But given your knowledge, you might be interested in our latest preprint, which offers a more general bias-variance decomposition: https://arxiv.org/pdf/2210.12256.pdf
Oh interesting! This looks to hit upon a lot of my favorite topics, I'll be taking a more in depth look later.
It's not surprising to me that a decomposition based on Bregman divergences has similar properties as the one for MSE, but the connections through convex conjugates and proper scores is clever.
It's not a theoretical law. But it sure as hell makes intuitive sense, and I can't really imagine how a complex model could fail to overfit. I mean, that's what I mean by a complex model: something that is prone to overfitting. Otherwise, what does model complexity even mean?
>Otherwise, what does model complexity even mean?
People are generally referring to bigger models (#parameters) as more complex.
Come to think of it, redundancy in networks with more parameters can act as a regularizer by giving similar branches an essentially higher learning rate. Let me give you an example of what I have in mind: a simple network with just one parameter, y = wx. You can pass some data through it, calculate the loss, backpropagate to get the gradient, and update the weight with it.
But see what happens if we reparametrize w as w1 + w2: the gradient for each of these is the same as in the one-parameter case, but after the weight-update step we essentially end up moving twice as far, which is equivalent to the original one-parameter case with a 2x larger learning rate.
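That doubling is easy to verify numerically (toy check with made-up numbers, squared loss):

```python
# Model y = w*x with loss (w*x - t)**2; x, t, and lr are arbitrary.
x, t, lr = 3.0, 6.0, 0.01

def grad(w):
    # d/dw of (w*x - t)**2
    return 2.0 * (w * x - t) * x

# Case 1: one parameter, but a DOUBLED learning rate.
w = 0.5
w -= 2 * lr * grad(w)

# Case 2: w reparametrized as w1 + w2; the gradient w.r.t. w1
# and w.r.t. w2 are both equal to grad(w1 + w2).
w1, w2 = 0.25, 0.25               # w1 + w2 starts at the same 0.5
g = grad(w1 + w2)
w1 -= lr * g
w2 -= lr * g

# One update step moves w1 + w2 exactly as far as w with 2x the lr.
assert abs((w1 + w2) - w) < 1e-12
```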
Another thing that could somehow be linked to this phenomenon: on one hand, the parameter space of a 1-hidden-layer neural network grows exponentially with the number of neurons; on the other hand, the number of equivalent minima grows factorially, so at a certain number of neurons the factorial takes over and your optimization problem becomes much simpler, because you are always close to a desired minimum. But I don't know shit about high-dimensional math, so don't quote me on that.
Maybe to help your intuition, consider the following: do more parameters really increase model complexity if they are less fitted? Check out this post: https://twitter.com/tengyuma/status/1545101994150531073
There are a few credible theories proposed as to why this could happen, not just in ML but in the statistics community as well. It's a pretty widespread phenomenon and is confirmed in simulation studies.
MLPs are universal function approximators but it turns out models with more inductive bias like CNNs are more effective for tasks like image classification.
Does that mean that MLPs are not universal function approximators? No.
It's a fact that an MLP is capable of fitting arbitrary functions.
Does anything here deviate from the theory? No.
That's what the theory says
> MLPs are universal function approximators
MLPs with non-polynomial activation functions with either arbitrary width or arbitrary depth have the ability to approximate a function f: S -> R with an arbitrary specified level of error where S is a compact subset of R^n.
Violate any of these assumptions and you lose those guarantees. Any finite MLP will only be able to approximate a subset of functions with the given support for an arbitrary error level. Nothing about their ability in practice contradicts this.
Much like how there exist matrix multiplication algorithms with better than O(n^2.4) running time but the naive O(n^3) algorithm outperforms them for all physically realizable inputs, the effects of finite sizes are very important to consider.
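To make the compact-set point concrete, here is a constructive sketch (hand-built weights, not a trained network): a one-hidden-layer ReLU MLP written down explicitly as the piecewise-linear interpolant of sin on [0, π]. Adding knots (hidden units) drives the sup error down, exactly as the theorem promises:

```python
import numpy as np

# f and the knot count are illustrative choices.
f = np.sin
knots = np.linspace(0.0, np.pi, 51)           # 50 linear segments
slopes = np.diff(f(knots)) / np.diff(knots)   # interpolant slope per segment

def relu_mlp(x):
    # Hidden units relu(x - knot_i); output weights encode slope *changes*,
    # so the sum reproduces the piecewise-linear interpolant of f.
    coeffs = np.concatenate(([slopes[0]], np.diff(slopes)))
    hidden = np.maximum(0.0, x[:, None] - knots[None, :-1])
    return f(knots[0]) + hidden @ coeffs

xs = np.linspace(0.0, np.pi, 2000)
sup_err = np.max(np.abs(relu_mlp(xs) - f(xs)))
assert sup_err < 0.01   # classical interpolation bound ~ (pi/50)^2 / 8
```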
MLP Mixer would like to speak to you
None, and not in other disciplines either.
There are definitely differences between theory and practice in many fields. For example
As I can see, you lack examples 😄
When practice deviates from theory, this usually means that the theory does not well capture the results that people are getting in practice. This does not necessarily mean that the theory is incorrect, but that usually the implications of the theory and its common inductions don't capture the entire picture.
A big example of this was when ML theorists were trying to capture the size of the hypothesis class through its VC dimension. The common theory was that larger neural networks had more parameters and a higher VC dimension, and thus a higher model variance and a higher capacity to overfit. This hypothesis was empirically nullified as scientists found that larger models tended to learn better from data. This was described in a well-known paper that termed the phenomenon "deep double descent."
One last note is that I have yet to see a scientist who doesn’t understand the limitations and fallibility of scientific theories. It is usually the science “enthusiasts” who naively misinterpret their implications and make unfounded claims.
> When practice deviates from theory, this usually means that the theory does not well capture the results that people are getting in practice. This does not necessarily mean that the theory is incorrect, but that usually the implications of the theory and its common inductions don't capture the entire picture.
That's a rather stupid (sorry) interpretation of what the term "deviation from theory" should mean. If one follows your interpretation, you would declare an experiment where someone tries to calculate some fluid dynamics using Maxwell's equations, and then gets results that don't match the measurements, a "deviation from the theory".
That's nonsense, because the theory is obviously just being USED wrongly: the theory around Maxwell's equations never suggested modeling the dynamics of fluids.
Therefore, a more sensible interpretation of a "theory deviating from reality" includes the assumption that the theory is used correctly (meaning theory predictions are compared with measurements that the theory is really made to predict, up to a certain accuracy).
If the theory is applied correctly but its predictions deviate qualitatively from the measured reality, that implies the theory does not model the real mechanics accurately and is therefore wrong (in science we call that falsification).
Regarding the example you gave: you named an ongoing discussion in an open research field. Of course, when humans try to come up with explanations for something (when they develop NEW theories), 99% of the ideas are wrong at first and are still being discussed. Through experiments and falsification, theories are adapted until they are no longer falsified. That means every theory, with time, gets filtered by the scientific process, until more and more experiments confirm it by not falsifying it. At the end of this process, a theory is well established and starts being taught, e.g. to students in universities.
Your example is not one of a well-established theory that the topic creator, as a student, could have learned as "truth" in one of his lectures. And the topic name is "what .... did you LEARN".
The wrong usage of a theory can be pinned down to the assumptions, and these are not a matter of semantics.
In classes I have attended or seen on the internet, I have never seen anybody state as a global rule that larger models, without exception, increase the danger of overfitting or similar. Such topics were discussed at most in the context of "intuition", resp. the teacher just shared his own experiences. And still, that's often true.
But I am open to seeing the example lecture that explicitly teaches that as a general rule, such that it has been falsified later.
[deleted]
I state that every correct (and by that I mean scientific) formulation of assumptions can even be abstracted and formalized, and even incorporated into an automated algorithm yielding the answer whether this assumption is true or not, w.r.t. the theory's assumptions.
Proof: take an arbitrary assumption formulation and convert it to a mathematical formulation. Then use Gödel numbers to formalize it.
If you now say that the conversion to a mathematical formulation can be ambiguous, I would ask you to clearly state the assumptions in a language that is suited for a scientific discussion.
If you're talking about the subsequent slides, I see it introduces one idea to get some guidance in finding the optimal settings, called Bayesian Occam's razor. Occam's razor is a HEURISTIC. That is, so to say, the opposite of a rule/theory.
A property of a heuristic is explicitly that it does not guarantee to yield a true or optimal solution. A heuristic can by definition not be wrong or correct. It's a heuristic: a strategy that has worked for many people in the past and might fail in many cases. A heuristic does not claim to provide a discovered rule or similar.
On the last slide they even address the drawbacks of this heuristic. What more do you expect?
As I expected, this is not an example of a theory stating something that deviates from reality. It's just a HEURISTIC strategy they give you when you want to start with hyperparameter finding but have no clue how. That's when you fall back on heuristics (please look up "heuristic" on Wikipedia), and I bet this proposed heuristic is not the worst you can do even today, when more knowledge has been acquired.
I believe you are looking at the wrong slides. Reddit did something weird with the hyperlink
Then please point me to the right slide by giving the slide number.
It should be from MIT (try copying/pasting the address linked above)
One thing I must add regarding the topic of presentation as "established knowledge":
The lecture you quoted is lecture number 12. It is embedded in a course; there are of course lectures 11, 10, 9, etc. If you check those, which are also accessible by slightly modifying the given link, you see the context of this lecture. Specifically, a bunch of classifiers are explicitly introduced, and the VC-dimension theory in lecture 12 is still valid for those. The course does not address deep networks yet.
So it's a bit unfair to say this lecture teaches you a theory that deviates. It does not deviate for the classifiers introduced there.
Ok found the right one.
Well, generally I must say: good example. I accept it at least as a very interesting example to talk about, worth mentioning in this context.
Nevertheless, it's still valid for all models that are not CNNs, ResNets, or Transformers.
Taking into account that it's based on an old theory (prior to 1990), when these deep networks did not exist yet, one might take into account its limitedness (it doesn't try to model effects taking place during the learning of such complex deep models, which wasn't a topic back then).
So if I were really mean, I would say you can't expect a theory to make predictions about entities (in this case modern deep networks) that had not been invented yet. One could say that the VC-dimension theory's assumptions include the assumption of a "perfect" learning procedure (and therefore exclude any dynamic effects of the learning procedure), which is still valid for decision trees, random forests, SVMs, etc., which have their relevance for many problems.
But since I'm not that mean, I admit that these observations in modern networks do undermine the practicality of the VC-dimension view for modern deep networks of the mentioned types, and that must have been a moderate surprise before anyone tried out whether VC dimensions work for CNNs/ResNets/Transformers. So, good example.
Your theory about what it means for theory to deviate from practice seems to deviate from practice.
[deleted]
Yes, you are right. That is because in most cases it is not the theory that is wrong; it is wrongly understood, artificially extended, or wrongly used.
It is therefore important to distinguish between deviations from theory, and deviations due to misuse/misinterpretation.
Well, in CS theory is mathematical, and mathematical theory cannot be wrong because it's logically derived and makes statements about conclusions that can be made if certain conditions are met. So it would be nonsensical to discuss cases where such a theory is wrong. It stands to reason that the only practical discussion of deviation of practice from theory occurs in the realm of what you call misuse.
Wrong.
For at least two reasons:
There is a difference between someone having merely misunderstood the theory when they could have known better, and someone having "done everything right" up to the limit of science's knowledge at that point in time.
In the first case, it's the "user's" fault; in the second, it's on the current state of the theory. The first one is rubbish; only the second one yields knowledge. If you submit a paper of the first kind, it gets rejected. If you submit one of the second kind, you get attention.
You are just "projecting" everything into one pseudo-abstract (cf. reason 2) timeless dimension, washing away the core of my "theory of theories":
By misuse I would clearly mean the first category, where the "user" fucked it up; the second should NOT be labelled as misuse.
The reason for this partition, as already said: the first category is rubbish, the second one is valuable.
Or another equally good reason: it cannot semantically be misuse if the "user" does not violate any assumption known at that time, because the current "rules" define what misuse is. Misuse is dynamic, not static. Think four-dimensional, Doc Brown!
Even in mathematics itself, only a subset of the disciplines are purely abstract.
One example where this is not the case is, coincidentally, the greatest open mathematical question currently: by what rule are the prime numbers distributed? This is in the field of number theory.
The earliest theory, by Gauß, was that they are log-distributed. His log approximation of the step function of primes was good, but clearly deviates. Since then, many other mathematical theories have been developed, leading up to the Riemann hypothesis.
In this example you see that a mathematical model (here the log function) is APPLIED TO DESCRIBE a natural phenomenon (the distribution of primes). The correctness is not decided on some formal level, but by the question of whether the natural phenomenon can be described accurately enough by the suggested mathematical model.
What you say, that mathematical theories cannot be wrong because they are deduced/proven/whatever, is totally off-topic in this example, and could at most refer to the formal correctness of the theory around the log function involved here, e.g. the rules for adding two logs or similar, which are themselves proven correct.
But that is not of interest here; the log theory is just taken and suggested as a model of a natural phenomenon. The mathematical theory here is that the step function of primes follows a log distribution.
By the way, there are also no conditions to be met; there is just one step function and one candidate theory to predict it.
The answer to the correctness of a theory about the distribution of primes does not lie in some formal deduction, but directly in the difference between the predicted and the real distribution of primes.
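That difference is directly computable (a small sketch; the cutoff 100,000 is arbitrary), comparing the prime-counting step function pi(x) against Gauss's x / ln(x):

```python
import math

def prime_pi(n):
    # pi(n): number of primes <= n, via a sieve of Eratosthenes
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(sieve[p * p :: p]))
    return sum(sieve)

x = 100_000
gauss = x / math.log(x)                           # Gauss's log approximation
rel_err = abs(prime_pi(x) - gauss) / prime_pi(x)

assert prime_pi(1000) == 168   # known value of pi(1000)
assert rel_err < 0.15          # close, but the deviation is visible (~10% here)
```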
In CS theories, of course, you have questions from the CS domain that you want to answer with your mathematical model. Only a small subset of CS theory deals with purely abstract proofs that stand on their own, e.g. in formal verification. The majority has a domain question to be answered, and very often that question is quantitative, such that CS theories predict numbers that can be measured against real data from the phenomenon of interest, which then determines the correctness of the theory.
Jeremy Howard of FastAI says in his course you don't need to worry about getting the architecture and the hyper-parameters right, because a decent library (including theirs) already takes care of it in most cases.
Not a well-established theory; just an implemented software feature being advertised here.
And it's trivially true that a method that tries out more hyperparameters and more models increases the probability of spitting out a configuration that works well for you (the only statement here that comes close to a "theory").
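A toy version of that trivially-true statement, under the simplifying assumption that each random configuration independently "works well" with some probability p (the numbers are made up):

```python
# Probability that at least one of k random configurations works well:
# 1 - (1 - p)**k, which grows monotonically in k.
p = 0.05
probs = {k: 1 - (1 - p) ** k for k in (1, 10, 50)}

assert probs[1] < probs[10] < probs[50]   # more tries, better odds
assert probs[50] > 0.9                    # 1 - 0.95**50 ≈ 0.92
```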
Whether an implemented piece of software does that in a manner satisfactory to a user, or whether that can be achieved at all, is a question in the domain of engineering, not theory.
seba07 t1_itd3x25 wrote
A few random things I learned in practice: "Because it worked" is a valid answer to why you chose certain parameters or algorithms. Older architectures like ResNet are still state of the art for certain scenarios. Execution time is crucial; we often take the smallest models available.