
shumpitostick t1_ivfeiqq wrote

The author has an embarrassingly bad understanding of statistics and machine learning and makes a very unclear argument. In fact, the opposite is true. The more data we have, the easier it is to find meaningful patterns. Variance, which is what causes most spurious patterns, decreases with the square root of the number of samples. The more data points you have, the more likely it is that a true relationship between variables will pass whatever hypothesis test you apply (i.e., produce a small p-value). More data therefore allows us to set higher standards for hypothesis testing.
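
A quick simulation makes both points concrete (a rough sketch of my own, not from the article; the 0.2 effect size and the p < 0.001 threshold are arbitrary choices): the spread of the estimated slope shrinks roughly as 1/sqrt(n), and a true effect clears even a strict threshold more and more reliably as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# True relationship: y = 0.2*x + noise. At each sample size, check the spread
# of the estimated slope and how often the slope's p-value clears p < 0.001.
for n in [50, 500, 5000]:
    slopes, rejections = [], 0
    for _ in range(200):
        x = rng.normal(size=n)
        y = 0.2 * x + rng.normal(size=n)
        res = stats.linregress(x, y)
        slopes.append(res.slope)
        rejections += res.pvalue < 0.001
    print(f"n={n:5d}  slope std (~1/sqrt(n)): {np.std(slopes):.4f}  "
          f"power at p<0.001: {rejections / 200:.2f}")
```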

However, most of the arguments in the article don't even support this main thesis. Instead, the author talks about loosely related things: some low-quality studies that found weird correlations, and the replication crisis, which is a complex problem with many causes, none of which is an abundance of data.

115

ShadowStormDrift t1_ivfus80 wrote

What about confounding variables?

For example, looking for trends across a whole government: hard. Looking for trends WITHIN individual government departments: easier. (Two departments might trend in opposite directions and cancel each other out when pooled together.)
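
That pooling effect is essentially Simpson's paradox. A toy sketch (made-up departments and numbers, purely illustrative): each department shows a positive trend on its own, but the pooled data trends the other way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Department A: low x, high y. Department B: high x, low y.
# Within each department the relationship between x and y is positive.
x_a = rng.uniform(0, 5, 200)
y_a = 10 + 1.0 * x_a + rng.normal(size=200)
x_b = rng.uniform(10, 15, 200)
y_b = -10 + 1.0 * x_b + rng.normal(size=200)

print("Dept A slope:", round(stats.linregress(x_a, y_a).slope, 2))  # ~ +1
print("Dept B slope:", round(stats.linregress(x_b, y_b).slope, 2))  # ~ +1
pooled = stats.linregress(np.concatenate([x_a, x_b]),
                          np.concatenate([y_a, y_b]))
print("Pooled slope:", round(pooled.slope, 2))                      # negative
```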

16

shumpitostick t1_ivfwf6z wrote

That gets us into the realm of causal inference. This is not really what the author was talking about, but yes, it's a field that has a bunch of additional challenges. In this case, more data points might not help, but collecting data about additional variables might. In any case, getting more data will pretty much never cause your model to be worse.
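
To make the "additional variables" point concrete, here's a minimal sketch with made-up numbers (mine, not the author's): with a confounder left out, the estimated effect of x on y is badly biased; collect that variable and adjust for it, and the estimate recovers the true value.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Confounder z drives both the treatment x and the outcome y.
# The true causal effect of x on y is 1.0.
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = 1.0 * x + 3.0 * z + rng.normal(size=n)

X_naive = np.column_stack([np.ones(n), x])        # z unmeasured
X_adjusted = np.column_stack([np.ones(n), x, z])  # z collected and adjusted for
b_naive = np.linalg.lstsq(X_naive, y, rcond=None)[0]
b_adjusted = np.linalg.lstsq(X_adjusted, y, rcond=None)[0]
print("effect of x without z:", round(b_naive[1], 2))     # ~2.2, biased
print("effect of x with z:   ", round(b_adjusted[1], 2))  # ~1.0
```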

34

ajt9000 t1_ivics5n wrote

The main way it's gonna make a statistical model worse is by increasing the computational power needed to run it. That's not an argument about the quality of the model's results, though. I agree the author's understanding of statistics is really bad.

7

shumpitostick t1_ivlb6zi wrote

I was oversimplifying a bit in my comments. There is the curse of dimensionality. And in causal inference, if you just throw every variable in as a confounder, your model can also get worse, because you end up blocking causal (front-door) paths, e.g. by adjusting for mediators. But if you know what you're doing, it shouldn't be a problem. And I haven't met any ML practitioner or statistician who doesn't realize the importance of getting to understand your data and making proper modelling decisions.
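
A minimal sketch of the over-adjustment problem (toy data of my own): here the "extra" variable is actually a mediator on the path x -> m -> y, so adjusting for it blocks the causal path and drives the estimated effect of x to zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# x affects y only through the mediator m (x -> m -> y); total effect is 1.0.
x = rng.normal(size=n)
m = 1.0 * x + rng.normal(size=n)
y = 1.0 * m + rng.normal(size=n)

X_total = np.column_stack([np.ones(n), x])
X_overadjusted = np.column_stack([np.ones(n), x, m])  # m treated as a "confounder"
b_total = np.linalg.lstsq(X_total, y, rcond=None)[0]
b_over = np.linalg.lstsq(X_overadjusted, y, rcond=None)[0]
print("effect of x, no adjustment:        ", round(b_total[1], 2))  # ~1.0
print("effect of x, also adjusting for m: ", round(b_over[1], 2))   # ~0.0
```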

1

val_tuesday t1_iviyo6a wrote

If I understand the point of the article, it really isn't talking about "more data" in the sense of a bigger dataset to train the model on a given task. He is commenting that, given a nonsense task (like predicting criminality from a photo of a face), you can find some data to train your model on. You might then be seduced by your results into thinking that the task makes sense and that your model does something meaningful. That it found a modern theory of skull shapes or whatever, when all it really did was classify mugshots.

In other words he is not addressing the cutting edge of AI research, but rather the wide eyed practitioners (and - inevitably - charlatans) who will gladly sell a model for an impossible task.

6

dontworryboutmeson t1_ivh62zg wrote

My girlfriend sent me this article and I responded with the exact same stance before reading your comment. This dude is a pseudo-expert.

4

OceanoNox t1_ivjf74f wrote

I've read that at least a big part of the replication crisis is cherry-picking of data by the researchers themselves. What would you say the other reasons are? Badly documented protocols/methodologies?

2

shumpitostick t1_ivlg54a wrote

My take on the replication crisis is that it's something like 60% bad incentives, 35% bad statistics, and 5% malice. Bad incentives means the whole journal system, which rewards getting positive results and doesn't deeply scrutinize methodology and source data, the lack of incentives for preregistration, the existence of poor-quality journals, etc.

Bad statistics is mostly the fact that people treat p < 0.05 as true and p > 0.05 as a worthless result, and use it as a threshold for publishing, rather than as the crude statistical tool it really is. Plus just a generally bad understanding of statistics among most social scientists. I'm currently doing research in causal inference, developing methodology that can be used in social science, and it's embarrassing how slow social scientists are to adopt tools from causal inference. In economics, applications are usually 10-20 years behind the research, but in psychology, for example, they often don't even attempt any kind of causal identification and then suggest that their studies somehow show causality.
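
As a sketch of how that threshold distorts the literature (my own toy simulation, not from this thread): run many underpowered studies of a small true effect and "publish" only the significant ones, and the published effect sizes come out several times larger than the truth.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Small true effect, small underpowered studies, and a "publish only if
# p < 0.05" filter: the published effect sizes end up badly inflated.
true_effect, n_per_study, n_studies = 0.1, 30, 5000
published = []
for _ in range(n_studies):
    sample = rng.normal(loc=true_effect, scale=1.0, size=n_per_study)
    _, p = stats.ttest_1samp(sample, 0.0)
    if p < 0.05:
        published.append(sample.mean())

print("true effect:                  ", true_effect)
print("mean published effect:        ", round(float(np.mean(published)), 2))
print("fraction of studies published:", round(len(published) / n_studies, 2))
```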

Malice is scientists just outright faking data or cherry-picking. But even that is tied to the incentive structure. We should normalize publishing negative results.

2

OceanoNox t1_ivmhma8 wrote

Thank you for your insight. I am in materials engineering, and I emphasize having representative data, but I have heard at conferences that the results shown are sometimes the top outliers rather than the average.

I completely agree about publishing negative results. Many times I have wondered how many people have tried the same idea, only to find that it didn't work, and then did not or could not publish it. And so another team will spend effort and money because nothing was ever reported.

1

Clean-Inevitable538 t1_ivhfbhw wrote

This answer is a perfect example of what the OG author is talking about. The response does seem to come from a knowledgeable person and seems well constructed, but it does not address the point the author is making. They are observant enough to state that the author's argument is unclear, which in reality means that they did not understand it fully... which is great at showing how two separate theories of truth work for different people. Where the author is probably coming from some sort of relativism, the redditor comes from a theory where truth is objective, and so claims not that the OG author's argument is difficult to understand but that the argument is unclear, under the premise that they know what constitutes a clear argument. :D

Three takeaways:

  1. The paradox of big data is that the more data we ransack for patterns, the more likely it is that what we find will be worthless or worse.
  2. The real problem today is not that computers are smarter than us, but that we think that computers are smarter than us and trust them to make decisions for us that they should not be trusted to make.
  3. In the age of Big Data and powerful computers, human wisdom, commonsense, and expertise are needed more than ever.
−1

visarga t1_ivinkvl wrote

  1. Take a look at the neural scaling laws paper, figures 2 and 3 especially. Experiments show that more data and more compute are better. It's been a thing for a couple of years already; the paper, authored by OpenAI, has 260 citations. (A rough sketch of the data-scaling form follows after this list.)

  2. If you work with AI, you know it always makes mistakes. Just like when you're using Google Search, you know you often have to work around its problems. Checking that models don't make mistakes is big business today, called "human in the loop". There is awareness of model failure modes. Not to mention that even generative AIs like Stable Diffusion require lots of prompt massaging to work well.

  3. sure
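
A rough sketch of the data-scaling relationship (assuming the commenter means the OpenAI scaling-laws paper; the power-law form is from that line of work, but the constants below are illustrative placeholders of mine, not the paper's fitted values):

```python
# Power-law scaling of loss with dataset size D, in the spirit of the
# neural scaling laws literature: L(D) = (D_c / D) ** alpha_D.
# alpha_D and D_c are illustrative placeholders, not fitted values.
alpha_D, D_c = 0.095, 5.4e13
for D in [1e6, 1e8, 1e10, 1e12]:
    loss = (D_c / D) ** alpha_D
    print(f"dataset size {D:.0e} -> loss {loss:.2f}")  # loss falls as D grows
```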

9

thereissweetmusic t1_ivj4ggd wrote

As a layman, I find that your supposed alternative interpretation of the article's arguments makes them sound quite simplistic and not at all difficult to understand. Reductive, even. Which makes me suspect your suggestion that OP didn't understand the article came directly from your butthole.

  1. Ok, you've just claimed the opposite of what OP claimed, and provided far less evidence (i.e. none) to back it up compared to what OP provided.

  2. This sounds like it has nothing to do with having more or less data.

  3. Ditto

2

Clean-Inevitable538 t1_ivj6u8m wrote

I am a layman as well, but as far as I understand the article, since it talks about meaning and relation, the variance mentioned by the commenter is not relevant. And I can see how it could be misconstrued as relevant when talking about meaning. It depends on whether "meaningful" is understood as the data extrapolation itself or as its correlation to factual application.

3

shumpitostick t1_ivldyfg wrote

I understand that those are the takeaways, but where is the evidence? The author just jumps between some vaguely related topics as if they were evidence, while what he's really doing is spinning a narrative, and the narrative is wrong.

About the takeaways:

  1. As I explained in my comment, this is not true.
  2. Who thinks that way? Everybody I know, both laymen and people in the field of Data Science and Machine Learning, has a healthy skepticism of AI.
  3. Having worked as a data scientist, I can attest that data scientists check their algorithms, use common sense, and put an emphasis on understanding their data.

Honestly, the article just reads to me like it was written by a boomer, theory-focused economist who's upset about the quantitative, statistics-heavy turn his field has taken. There is a certain (old) school of economists who prefer theoretical models and take a rationalist over an empirical approach. The problem with their approach is that the theoretical models they build rest on assumptions that often turn out to be wrong. They use "common sense" rather than relying on data, but the world is complex and many "common sense" assumptions don't actually hold.

0