sunbunnyprime t1_j7y86w7 wrote on February 10, 2023 at 6:29 AM

Good question.

An ML Researcher is typically trying to find models which are more powerful in terms of output behavior - whether that be predictive power, generative ability etc.

A Statistical Researcher is typically trying to understand the dataset, the underlying generative distribution, and really dig into what the model’s innards are saying about the data and what you can conclude from it. They’re more likely to want to extract insight about the data itself.

Statisticians tend to be more rigorous about data and more well grounded in my experience, while ML Scientists tend to want to push boundaries and be the person who’s read the latest ML journal piece.

There’s so much you can say and know about something as simple as linear regression. There’s really a lot of fascinating math in there that goes so much deeper than you might expect.

If you’re interested in just using models to predict, there’s not that much of interest in a linear model. If you really want to know what meaning you can extract from what’s going on inside (exactly why it learns the coefficients it does, what the learning dynamics are, what the results mean, etc.), then you might end up writing 10 papers on Lasso.
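To make the "why it learns the coefficients it does" point concrete, here is a minimal sketch for the simplest possible case: Lasso with a single feature and no intercept, where the solution has a closed form via soft-thresholding. The function names, the objective scaling (1/(2n) times the squared error plus `lam * |b|`), and the toy data are assumptions for illustration, not something from the comment above.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values inside [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_1d(x, y, lam):
    """Minimise (1/(2n)) * sum((y_i - b*x_i)**2) + lam * |b| over a single coefficient b.

    Assumes one feature and no intercept (an illustrative simplification).
    """
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n    # average x*y
    scale = sum(xi * xi for xi in x) / n              # average x**2
    return soft_threshold(rho, lam) / scale


# Hypothetical toy data where y = 2 * x exactly.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for lam in (0.0, 1.0, 10.0):
    print(lam, lasso_1d(x, y, lam))
```

With `lam = 0` this is plain least squares; as `lam` grows, the coefficient shrinks and eventually hits exactly zero, which is the sparsity behaviour that makes Lasso interesting to study.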

Both sides are valid. Most ML scientists suck at their jobs I must say though.

JurgenSchmidthuber t1_j7yenpm wrote on February 10, 2023 at 7:50 AM

>while ML Scientists tend to want to push boundaries and be the person who’s read the latest ML journal piece.

Lol easy tell that you're neither in the field nor actually know any "ml scientists"

carlthome t1_j7z22wa wrote on February 10, 2023 at 12:49 PM

Because they didn't say conference paper, you mean?

sunbunnyprime t1_j8bp6uz wrote on February 13, 2023 at 3:00 AM

I’m a principal machine learning scientist at a very well known company and I’m also a kaggle master. You’re reading a lot into a few words I crapped out in a reddit comment.

JurgenSchmidthuber t1_j8ce8bs wrote on February 13, 2023 at 6:49 AM

Lol

themusicdude1997 t1_j7yddvo wrote on February 10, 2023 at 7:33 AM

Care to elaborate on that last sentence?

SwitchOrganic t1_j801prt wrote on February 10, 2023 at 5:06 PM

My guess is ML scientists generally care less about statistical rigor which can lead to poor outcomes due to not properly understanding the data, assumptions, risk involved, etc

Ex: Zillow

BrotherAmazing t1_j86l5g3 wrote on February 12, 2023 at 12:45 AM

Right. I mean, most people suck at their jobs, period though so… 🤷🏼

sunbunnyprime t1_j8bpqov wrote on February 13, 2023 at 3:04 AM

Most ML scientists aren’t actually fluent in the application of the algorithms they use. They have superficial understanding, they’re slow and buggy programmers, write slow code, spend months working on models that should take a few days to put together, overindex on hyperparam selection and tuning, playing with new algorithms, and don’t know how to validate their models and end up deploying garbage that often is literally no better than a coin flip. But they’re great at convincing people that they’re right on the cusp of solving a really big problem and adding a ton of value which buys them enough time to fart around for a few years and then get another job with a 30% raise and then do it all over again.

themusicdude1997 t1_j8fm0tf wrote on February 13, 2023 at 11:10 PM

:D

Ulfgardleo t1_j7y8hdg wrote on February 10, 2023 at 6:33 AM

The difference between stats and ML is as large as between math and applied math. They aim to answer vastly different questions. In ML you don't care about identifiability because you don't care whether there is one gene among 2 million that causes a specific type of cancer. This is not what ML is about. In ML you also very rarely care about tail risk (you should) and hardly at all about calibration (you really should). And identifiability is out of the window as soon as you use neural networks, which prevents you from interpreting your models.
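For readers who have not met calibration before, a rough sketch of one common way to measure it, a binned expected calibration error for binary predictions, might look like the following. The function name, the equal-width binning, and the toy numbers are assumptions for illustration; other binning schemes and definitions exist.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned calibration error for binary predictions.

    probs  : predicted probabilities of the positive class, each in [0, 1]
    labels : observed outcomes, each 0 or 1
    Uses equal-width bins (an illustrative choice).
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by the share of predictions that fall in it.
        ece += (len(members) / n) * abs(mean_prob - observed_rate)
    return ece


# Hypothetical model that says 0.9 but is right only half of the time.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))
```

A model can have good accuracy and still score badly here, which is why calibration is worth checking separately.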

I-am_Sleepy t1_j7ybb41 wrote on February 10, 2023 at 7:07 AM

I don’t think ML researchers don’t care about model calibration or tail risks. It just often doesn’t come up in experimental settings.

It also depends on the objective. If your goal is regression or classification, then tail risk and model calibration might be necessary as supporting metrics.

But for more abstract use cases such as generative modeling, it is debatable whether tail risk and model calibration actually matter. For example, a GAN can experience mode collapse, such that the generated data isn’t as diverse as the original data distribution. But that doesn’t mean the model is totally garbage either.

Also, I don’t think statistics and ML are totally different, because most statistical fundamentals are also ML fundamentals, and many ML metrics are directly derived from fundamental statistics and/or related fields.

Ulfgardleo t1_j7yd02x wrote on February 10, 2023 at 7:28 AM

You are right, but the point I was making is that in ML in general those are not of high importance, and this already holds for rather basic questions like:

"For your chosen learning algorithm, under which conditions does it hold that, in expectation over all training datasets of size n, the Bayes risk is not monotonically increasing with n?"

One would think that this question is of rather central importance. Yet no one cares, and answering this question is non-trivial even for linear classification. Stats cares a lot about this question. While the math behind both fields is the same (all applied math is a subset of math, unless you ask people who identify as one of the two), the communities have different goals.

BrotherAmazing t1_j86kxmq wrote on February 12, 2023 at 12:43 AM

You should say “…between pure mathematics and applied math” IMO. Nit-picky, yes, but more accurate.

Ulfgardleo t1_j87y15c wrote on February 12, 2023 at 8:40 AM

Sorry that was a wrong translation from how we say it over here.

canbooo t1_j7z0lku wrote on February 10, 2023 at 12:34 PM

I agree with the size of the difference yet disagree with the examples, as there is ML research considering all 3 (causal ML, conformal ML/predictions/forecasting, AI safety, reliability, etc.). I think the difference is more like deduction versus induction in a sense, meaning the process of finding the answers is different. Since I'm finishing pooping on corporate time, I will keep this short.

ML: Data -> Method -> Hypothesis -> Answers

Statistics: Hypothesis -> Method -> Data -> Answers

This may be too simplistic and please propose a better distinction but do not postulate that ML does not care about things statistics do.

Illustrious-Bar5621 t1_j7y0iu2 wrote on February 10, 2023 at 5:06 AM

These two should get you started:
https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full

http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/

jimmymvp t1_j7yubak wrote on February 10, 2023 at 11:25 AM

A pretty famous stats professor once told me that he should've switched to ML a long time ago. Now he does ML research, obviously very rigorous. He said that stats is making up questions that are to a large extent not practically useful.

AdFew4357 t1_j7ztran wrote on February 10, 2023 at 4:14 PM

Stats is finding interpretable ways to look at and model data that ML plug and chug CS people don’t do.

jimmymvp t1_j806dx2 wrote on February 10, 2023 at 5:36 PM

Just communicating what I've heard. Nevertheless, I think the whole interpretable ML community (at the very least) would disagree with you on this one :). Reducing ML to "plug and chug" is well... Speaks for itself :D

AdFew4357 t1_j806plm wrote on February 10, 2023 at 5:38 PM

The whole landscape of ML research is a hunt to chase SOTA by tweaking an architecture here or using a different optimizer there and then squeezing out 0.2% accuracy on some well known imaging dataset in an attempt to churn out papers. That’s not science if you ask me.

jimmymvp t1_j83v503 wrote on February 11, 2023 at 1:16 PM

I'm not sure if you have a good overview of ML research if this is your claim. Sounds like you've read too many blog posts on transformers. I'd suggest going through some conference proceedings to get a good overview, there's some pretty rigorous (not just stats) stuff out there. I agree though that there is a substantial subset of research in ML that works towards tweaking and pushing the boundaries of what can be achieved with existing methods, which is for me personally exciting to see! A lot of cool stuff came out of scaling up and tweaking the architectures.

Any_Geologist9302 t1_j7yg7wn wrote on February 10, 2023 at 8:10 AM

That’s kind of an odd question because many statisticians are actively doing research in ML.

[deleted] t1_j7yb6nb wrote on February 10, 2023 at 7:05 AM

[deleted]

Appropriate-Code-940 t1_j7z700v wrote on February 10, 2023 at 1:33 PM

A very simple idea, maybe not correct: ML is more data driven. Statistics is more hypothesis driven. Like 2 different streams, they join into the same river and cannot be separated again.

ml-anon t1_j7z97fq wrote on February 10, 2023 at 1:50 PM

You will find the same thing in ML too and at some point folks might find it quaint that people spent their whole careers dicking about with convnets when they are reduced to a historical footnote by whatever comes after Transformers.

[deleted] t1_j7zjscq wrote on February 10, 2023 at 3:08 PM

[deleted]

slashdave t1_j80hs32 wrote on February 10, 2023 at 6:49 PM

Different goals and different tools

OkCandle6431 t1_j811ol6 wrote on February 10, 2023 at 8:59 PM

Where I'm at 'statistics' is what me and my co-workers call what we do, and 'machine learning' is what goes in the grant application. I'm sure this differs across regions/faculty/industry/whatever.

[deleted] t1_j87x8qq wrote on February 12, 2023 at 8:29 AM

[deleted]

[deleted] t1_j7y0m61 wrote on February 10, 2023 at 5:07 AM

[deleted]

shele t1_j7y8mg8 wrote on February 10, 2023 at 6:34 AM

You cite a paper. The authors write

> Power-law scalings with model and dataset size in density estimation […] may be connected with our results.

AdFew4357 t1_j7yafw0 wrote on February 10, 2023 at 6:56 AM

Statisticians care about inference. ML scientists care about the model specifically.

Any_Geologist9302 t1_j7yf7hk wrote on February 10, 2023 at 7:57 AM

This makes zero sense.

[deleted] t1_j7zt58s wrote on February 10, 2023 at 4:10 PM

[removed]

[deleted] t1_j7xu30y wrote on February 10, 2023 at 4:07 AM

[deleted]

[deleted] t1_j7xvxia wrote on February 10, 2023 at 4:23 AM

[deleted]

currentscurrents t1_j7xv6j3 wrote on February 10, 2023 at 4:16 AM

Stats is tremendously useful, especially when your dataset is small by ML standards. Basically every scientific paper relies on statistics to tell you whether or not their result is meaningful.

ML is great when you have millions of data points, but when you only have a hundred it's not going to help you.

[deleted] t1_j7y325j wrote on February 10, 2023 at 5:32 AM

[deleted]

currentscurrents t1_j7y4073 wrote on February 10, 2023 at 5:42 AM

>Right now basically all progress is with large models,

You mean all progress... in machine learning. A lot of scientific fields necessarily must make do with a smaller number of data points.

You can't test a new drug on a million people, especially in early phase trials. Even outside of medicine, you may have very few samples if you're studying a rare phenomenon.

Statistics gives you tools to make limited conclusions from small samples, and also measure how meaningful those conclusions actually are.
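As one hedged illustration of "limited conclusions from small samples", here is a sketch of an exact two-sample permutation test on the difference in means, which stays valid even with a handful of observations. The function name and the toy measurements are made up for the example; this is just one of many small-sample tools, not a method anyone in the thread specifically proposed.

```python
from itertools import combinations


def permutation_p_value(a, b):
    """Exact two-sided permutation test for a difference in group means.

    Enumerates every way of splitting the pooled data into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = list(a) + list(b)
    n_a = len(a)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(a) - mean(b))
    at_least_as_extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Small tolerance so floating-point ties count as "as extreme".
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Hypothetical measurements: three treated and three control subjects.
print(permutation_p_value([1, 2, 3], [4, 5, 6]))  # 0.1
```

Even with perfectly separated groups, six data points cannot give a p-value below 0.1 here, which is exactly the kind of "how meaningful is this conclusion" statement the comment is describing.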

[deleted] t1_j7y67bi wrote on February 10, 2023 at 6:06 AM

[deleted]

[deleted] t1_j7y9mjs wrote on February 10, 2023 at 6:46 AM

[deleted]

WikiSummarizerBot t1_j7y9nn5 wrote on February 10, 2023 at 6:47 AM

All models are wrong

>All models are wrong is a common aphorism in statistics; it is often expanded as "All models are wrong, but some are useful". The aphorism acknowledges that statistical models always fall short of the complexities of reality but can still be useful nonetheless. The aphorism originally referred just to statistical models, but it is now sometimes used for scientific models in general. The aphorism is generally attributed to the statistician George Box.


psyyduck t1_j7ybb3i wrote on February 10, 2023 at 7:07 AM

Eh. I don’t care enough about this to argue

[deleted] t1_j7ybqh1 wrote on February 10, 2023 at 7:12 AM

[deleted]

Jemimas_witness t1_j7y68en wrote on February 10, 2023 at 6:06 AM

This is only correct for certain problems. Like everything, it has its best use cases. When you only have a hammer, everything looks like a nail.

In medicine, the backbone of clinical trial results that change the field often relies on 2000-3000 patients (datapoints), and groundbreaking achievements in medical practice are often made with simple statistics and simple methods. Go to the New England Journal of Medicine and pick any trial: the weight of their conclusions rests on survival functions, hazard ratios, and chi-squared statistics. Then go look at the funding section: these projects are funded with millions. The only disciplines in medicine with ML-scale datapoints are epidemiology and claims-level data, which strays way into econometrics.
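To show how simple one of these "simple statistics" can be, here is a sketch of Pearson's chi-squared test for a 2x2 table (for example, treatment vs. control against event vs. no event). The function name and the counts are hypothetical and not taken from any trial; the sketch uses no continuity correction and computes the p-value from the 1-degree-of-freedom chi-squared tail, which equals `erfc(sqrt(x / 2))`.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 contingency table.

    table : [[a, b], [c, d]] of observed counts.
    No continuity correction is applied (an illustrative simplification).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: rows are treatment/control, columns are event/no event.
chi2, p = chi_squared_2x2([[10, 20], [20, 10]])
print(round(chi2, 3), round(p, 4))
```

Real trial analyses typically add survival models and pre-registered analysis plans on top, but the core arithmetic is often not much more complicated than this.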

I myself study rare diseases as well as AI/ML applications in medicine and for some projects I’d be stoked to get 80 patients because there just simply aren’t that many around.

[deleted] t1_j7y84nz wrote on February 10, 2023 at 6:28 AM

[deleted]

trutheality t1_j7xvn75 wrote on February 10, 2023 at 4:20 AM

Actually the opposite. Stats is how you design studies, which is what governments, the economy, pharma, the medical field, and most sciences run on.

ML is just used for predictive modeling in low-stakes situations and fun tech demos.

Any_Geologist9302 t1_j7yfm9t wrote on February 10, 2023 at 8:02 AM

Statistics are used literally everywhere, including in applications that fall under the ML umbrella. What do you think people have been doing with data for the last century?
