Comments


AllanfromWales1 t1_ivf54lg wrote

"The world is complex, but if we pretend it's simple we can make patterns.."

13

eliyah23rd t1_ivf9vsf wrote

The author's argument seems to be:

  1. There are many people writing machine learning papers without understanding core statistical principles.
  2. The best explanation for this is that there is so much data that there are no valid methods for distinguishing valid correlations from accidental ones.
  3. Therefore, big data will produce nothing of much value from now on, since we have too much data already.

There are many procedures in place that give some protection against overfitting. Random pruning is one of them.

GPT-3 (and its siblings) and DALL-E 2 (and its) would not be possible without scraping a significant fraction of all the textual data available (DALL-E obviously combines this with images). They overcome overfitting using hundreds of billions of parameters, and the counts keep climbing. The power requirements of training these systems alone are mind-boggling.

Much of the medical data that is fed into learning systems is absurdly underfitted. Imagine a (rather dystopian) world where all the health indicators of all people taking specific drugs were fed into learning systems. A doctor might one day know whether a specific drug will be effective for you specifically.

There is much yet to learn. To make a falsifiable prediction: corporations will greedily seek to increase their data input for decades to come. Power needs will continue to grow. This will be driven by the success (in their own value terms) of their procedures, not by blind adherence to false assumptions, as the author seems to suggest.

21

BernardJOrtcutt t1_ivfbthl wrote

Please keep in mind our first commenting rule:

> Read the Post Before You Reply

> Read/listen/watch the posted content, understand and identify the philosophical arguments given, and respond to these substantively. If you have unrelated thoughts or don't wish to read the content, please post your own thread or simply refrain from commenting. Comments which are clearly not in direct response to the posted content may be removed.

This subreddit is not in the business of one-liners, tangential anecdotes, or dank memes. Expect comment threads that break our rules to be removed. Repeated or serious violations of the subreddit rules will result in a ban.


This is a shared account that is only used for notifications. Please do not reply, as your message will go unread.

1

shumpitostick t1_ivfeiqq wrote

The author has an embarrassingly bad understanding of statistics and machine learning and makes a very unclear argument. In fact, the opposite is true: the more data we have, the easier it is to find meaningful patterns. Variance, which is what causes most spurious patterns, decreases with the square root of the number of samples. The more data points you have, the more likely it is that a true relationship between variables will pass whatever hypothesis test you apply (e.g., clear a p-value threshold). More data therefore allows us to set higher standards for hypothesis testing.
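
To make that concrete, here is a minimal sketch (toy numbers of my own, not from the article): a small but real difference in means is detected more and more reliably as the sample size grows, precisely because the standard error shrinks like 1/sqrt(n).

```python
# Toy numbers, not from the article: a small but real effect passes a
# significance test more often as n grows, because the standard error of
# the mean shrinks like 1/sqrt(n).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_shift = 0.1        # small real difference between the two groups
repeats = 200           # repeat each "study" to estimate its power

for n in (100, 1_000, 10_000, 100_000):
    rejections = 0
    for _ in range(repeats):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_shift, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        rejections += p < 0.05
    print(f"n={n:>7}: real effect detected in {rejections / repeats:.0%} of runs")
```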

However, most of the arguments in the article don't even support this main hypothesis. Instead, the author talks about unrelated things, such as some low-quality studies that found weird correlations, and about the replication crisis, which is a complex problem with many causes, none of which is an abundance of data.

115

FrankDrakman t1_ivfgi0j wrote

Yes, I agree it's a delusion. The more data we have, the easier it is to find the patterns. The new data tools are so powerful that it's easy to winnow through fields of chaff to find a few grains of wheat. And don't be fooled by what's commercially available.

Ten years ago, I went to a conference where one of the speakers described how they had successfully used the Qlik BI tool to extract opinions from natural speech.

For those not in the field, natural speech is extremely hard to catalog. For example, an older kind of system might have read "Trump was not the best president" and, because "best", "president" and "Trump" were in the same sentence, concluded this was a favourable opinion, when clearly it is not. That's just a simple example; it gets much worse.

But this guy was able to show us that his company's product had overcome those limitations. When the Q&A came around, he was asked who was using it, and he gave us the standard "I'd tell you, but then I'd have to kill you" line. Except I don't think he was joking.

As I said, that was ten years ago. Data science advances by leaps and bounds each year. I'm pretty sanguine about our ability to keep up with the datums.

10

JoostvanderLeij t1_ivfgqsv wrote

Actually, it is: the bigger the data set, the more patterns you can find, but the less meaning those patterns have. It is a well-known issue with Big Data.

9

FrankDrakman t1_ivfi5hq wrote

Not at all. As an engineer, I understand we're building models based on our incomplete understanding. As we learn more, we refine our models, but they are always only models and, as such, necessarily simpler than the real world, because they are based on principles abstracted from the real world, not on the real world itself.

There's no 'pretending' involved. We know they are models, we know they are only approximations, and we also know the approximations are good enough to get the results we want. And with that, we built the society you see around us.

Why do you sneer at the process that has resulted in immense wealth and better lives for billions of people?

16

JustAPerspective t1_ivfr760 wrote

Maybe all the cherry-picking and misinformation that gets churned out when unethical individuals start playing with that data. Or when inaccurate data is relied upon as factual rather than speculative.

9

ShadowStormDrift t1_ivfus80 wrote

What about confounding variables?

For example: looking for trends across governments is hard; looking for trends WITHIN government departments is easier (two different departments might trend in opposite directions and cancel each other out when pooled together).
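
A toy sketch of that pooling effect (department names and numbers invented for illustration): each department has a clear within-department trend, but the pooled slope comes out close to zero.

```python
# Department names and numbers are invented for the example: two clear
# within-department trends that roughly cancel out once the data is pooled.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
years = np.arange(2010, 2020)

def make_dept(name, base, trend):
    budget = base + trend * (years - 2010) + rng.normal(0, 1, len(years))
    return pd.DataFrame({"dept": name, "year": years, "budget": budget})

df = pd.concat([
    make_dept("health", base=100, trend=+2.0),     # growing
    make_dept("transport", base=100, trend=-2.0),  # shrinking
])

def slope(d):
    return np.polyfit(d["year"], d["budget"], 1)[0]

print("pooled trend:   ", round(slope(df), 2))     # close to 0
for name, group in df.groupby("dept"):
    print(f"{name} trend:", round(slope(group), 2))
```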

16

shumpitostick t1_ivfwf6z wrote

That gets us into the realm of causal inference. This is not really what the author was talking about, but yes, it's a field that has a bunch of additional challenges. In this case, more data points might not help, but collecting data about additional variables might. In any case, getting more data will pretty much never cause your model to be worse.
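
A minimal sketch of that point (all numbers made up): with a confounder Z driving both X and Y, collecting more rows of (X, Y) alone leaves the same biased estimate, while measuring the extra variable Z and adjusting for it recovers the true effect.

```python
# All numbers made up: Z drives both X and Y, so a regression of Y on X
# alone stays biased no matter how many rows we collect, while adding the
# measured confounder Z recovers the true effect of X (here, 1.0).
import numpy as np

rng = np.random.default_rng(2)
n = 100_000                                   # plenty of data points
z = rng.normal(size=n)                        # confounder
x = 2.0 * z + rng.normal(size=n)
y = 1.0 * x + 3.0 * z + rng.normal(size=n)    # true effect of x on y is 1.0

def ols_x_coef(design, target):
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta[1]                            # coefficient on x

naive = ols_x_coef(np.column_stack([np.ones(n), x]), y)       # biased, ~2.2
adjusted = ols_x_coef(np.column_stack([np.ones(n), x, z]), y) # ~1.0

print("naive estimate of x's effect:   ", round(naive, 2))
print("adjusted estimate of x's effect:", round(adjusted, 2))
```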

34

AllanfromWales1 t1_ivg0jlb wrote

> We know they are models, we know they are only approximations, and we also know the approximations are good enough to get the results we want.

As someone who works in an engineering discipline, I think you are naive to assume that all engineers know this. Many I have dealt with simply follow algorithms and give little or no thought to what underlies them. I'd also suggest that if we had (if it were even possible to have) more complex models, the world would not be running headlong towards catastrophe as we speak.

6

Dominion1995 t1_ivg7xji wrote

So just give up all your data to throw them off the trail? I find that difficult to fathom.

0

Snufflepuffster t1_ivgfnwq wrote

Computer science is moving so fast that the other disciplines are kinda lost when it comes to it.

0

Iucidium t1_ivgmcj5 wrote

Kojima nailed it in 2001.

1

ascendrestore t1_ivgqnrj wrote

Isn't it equally a delusion to reason that by restricting data (i.e., constraining it, adding a lens or a sampling decision) the patterns produced are true patterns and not merely the effect of sampling?

6

sentientlob0029 t1_ivh7jdz wrote

I can imagine that would be made easier using AI.

−1

resfan t1_ivhcj2y wrote

Just like Kojima said, we need filters.

1

Clean-Inevitable538 t1_ivhfbhw wrote

This answer is a perfect example of what the OG author is talking about. The response does seem to come from a knowledgeable person and seems well constructed, but it does not address the point the author is making. The commenter is observant enough to state that the author's argument is unclear, which in reality means that they did not understand it fully... Which is great at showing how two separate theories of truth work for different people. Where the author is probably coming from some sort of relativism, the redditor comes from a theory where truth is objective, and so claims not that the OG author's argument is difficult to understand but that the argument is unclear, under the premise that they know what constitutes a clear argument. :D

Three takeaways:

  1. The paradox of big data is that the more data we ransack for patterns, the more likely it is that what we find will be worthless or worse.
  2. The real problem today is not that computers are smarter than us, but that we think that computers are smarter than us and trust them to make decisions for us that they should not be trusted to make.
  3. In the age of Big Data and powerful computers, human wisdom, commonsense, and expertise are needed more than ever.

−1

fuq-daht t1_ivhj7kv wrote

Is OP TRIPPIN or what😱?

0

xstoopkidx t1_ivi4uka wrote

A bit tangential… but I’ve always thought that the more data is collected, the less likely we are to be randomly targeted based on said data, but the more likely we might be intentionally targeted for said data. Meaning, if everyone has their SSN leaked on the internet, you might be less likely to be randomly targeted as the number of leaked SSNs approaches the US population. However, you might be more likely to be intentionally targeted based on other demographic factors (assets, income, location, etc.).

Another example: if Alexa listens to every word that everyone says, the recordings seem to approach meaninglessness, unless you begin to target particular pieces of information to isolate particular groups. Then it becomes increasingly beneficial. The key is knowing how to sift through that information as it is amassed.

2

metaphysics137 t1_ivi7t9i wrote

Low-precision signals are crowding out high-precision signals, according to the good profs at the HEC finance faculty.

2

ajt9000 t1_ivics5n wrote

The main way it's going to make a statistical model worse is by increasing the computational power needed to run it. That's not an argument about the quality of the model results, though. I agree the author's understanding of statistics is really bad.

7

visarga t1_ivinkvl wrote

  1. Take a look at neural scaling laws, figures 2 and 3 especially. Experiments show that more data and more compute are better. It's been a thing for a couple of years already; the paper, authored by OpenAI, has 260 citations.

  2. If you work with AI you know it always makes mistakes, just as with Google Search you know you often have to work around its problems. Keeping models from making mistakes is big business today, called "human in the loop". There is awareness of model failure modes. Not to mention that even generative AIs like Stable Diffusion require lots of prompt massaging to work well.

  3. sure

9

visarga t1_ivioifb wrote

> They overcome overfitting using hundreds of billions of parameters

Increasing model size usually increases overfitting. The opposite effect comes from increasing the dataset size.
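
A toy illustration of that trade-off (polynomial regression with invented numbers, nothing from the models under discussion): a high-capacity model overfits a small dataset, and the train/test gap shrinks as the dataset grows even though the model stays the same size.

```python
# Invented numbers: a degree-12 polynomial fit to noisy sine data. With few
# points the train error is tiny but the test error blows up (overfitting);
# growing the dataset closes the gap while the model capacity stays fixed.
import numpy as np

rng = np.random.default_rng(3)

def noisy_sine(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

def train_test_mse(n_train, degree=12):
    x_tr, y_tr = noisy_sine(n_train)
    x_te, y_te = noisy_sine(2_000)
    coeffs = np.polyfit(x_tr, y_tr, degree)   # may warn about conditioning
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    return mse(x_tr, y_tr), mse(x_te, y_te)

for n in (20, 200, 2_000, 20_000):
    tr, te = train_test_mse(n)
    print(f"n={n:>6}  train MSE={tr:.3f}  test MSE={te:.3f}")
```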

6

visarga t1_ivipogr wrote

In 2012 NLP was in its infancy. We were using recurrent neural nets called LSTMs but they could not handle long range contextual dependencies and were difficult to scale up.

In 2017 we got a breakthrough with the paper "Attention is all you need"; suddenly long-range context and fast, scalable training were possible. By 2020 we got GPT-3, and this year there are over 10 alternative models, some open-sourced. They were all trained on an amazing volume of text and exhibit signs of generality in their abilities. Today NLP can solve difficult problems in code, math and natural language.
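
For anyone curious, here is a minimal sketch of the scaled dot-product attention at the heart of that paper (single head, no masking; the shapes are arbitrary illustration values):

```python
# Single-head scaled dot-product attention, no masking; shapes are arbitrary
# illustration values, not taken from any real model.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # every position attends to every position,
    weights = softmax(scores)         # which is what gives direct long-range context
    return weights @ v

rng = np.random.default_rng(4)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))
print(attention(x, x, x).shape)       # self-attention: (6, 8)
```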

2

visarga t1_ivir19q wrote

No, models are tools; it's how you wield them that matters. What I've noticed is that models tend to attract activist types who have an agenda to push, so they try to control them. Not just in AI, but also in economics and other fields.

0

visarga t1_ivircgg wrote

Real data is biased and unbalanced. The "long tail" is hard to learn; there are papers, for example, that try to rebalance training for those rare classes. Unfortunately, most datasets follow a power law, so they have many rare classes.
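
A toy sketch of what that looks like (class counts invented, Zipf-like by construction), plus the simple inverse-frequency reweighting that rebalancing work often starts from:

```python
# Class counts invented (Zipf-like by construction) to show the long tail,
# plus simple inverse-frequency class weights as a basic mitigation.
import numpy as np

n_classes = 1_000
ranks = np.arange(1, n_classes + 1)
counts = np.round(10_000 / ranks).astype(int)   # count ~ 1/rank

head_share = counts[:10].sum() / counts.sum()
rare = int((counts < 20).sum())
print(f"top 10 classes cover {head_share:.0%} of all examples")
print(f"{rare} of {n_classes} classes have fewer than 20 examples each")

# Weight each class inversely to its frequency so rare classes count more.
weights = counts.sum() / (n_classes * counts)
print("weight of the most common class:", round(weights[0], 4))
print("weight of the rarest class:     ", round(weights[-1], 2))
```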

0

val_tuesday t1_iviyo6a wrote

If I understand the point of the article, it really isn't talking about "more data" in the sense of a bigger dataset to train the model on a given task. He is commenting that, given a nonsense task (like predicting criminality from a photo of a face), you can find some data to train your model. You might then be seduced by your results into thinking that the task makes sense and that your model does something meaningful; that it found a modern theory of skull shapes or whatever, when all it really did was classify mugshots.

In other words he is not addressing the cutting edge of AI research, but rather the wide eyed practitioners (and - inevitably - charlatans) who will gladly sell a model for an impossible task.

6

DanielM1618 t1_ivj40yg wrote

Taleb has said that like 10 years ago.

1

thereissweetmusic t1_ivj4ggd wrote

As a layman, I find that your supposed alternative interpretation of the article's arguments makes them sound quite simplistic and not at all difficult to understand. Reductive, even. Which makes me suspect your suggestion that OP didn't understand the article came directly from your butthole.

  1. Ok, you’ve just claimed the opposite of what OP claimed, and provided far less evidence (ie none) to back it up compared to what OP provided.

  2. This sounds like it has nothing to do with having more or less data.

  3. Ditto

2

Clean-Inevitable538 t1_ivj6u8m wrote

I am a layman as well, but as far as I understand the article, since it talks about meaning and relation, the variance mentioned by the commenter is not relevant. And I can see how it could be misconstrued as relevant when talking about meaning. It depends on whether "meaningful" is understood as the data extrapolation itself or as its correlation to factual application.

3

OceanoNox t1_ivjf74f wrote

I read before that at least a big part of the replication crisis is cherry-picking of data by researchers themselves. What would you say are other reasons? Badly recorded protocols/methodologies?

2

FrankDrakman t1_ivjjg3h wrote

Yes, I agree the recent breakthroughs are staggering, and NLP is moving along rapidly. But my point still stands: this guy's firm had it working well in 2012, and it was being secretly used by the US government.

1

eliyah23rd t1_ivjri3f wrote

Thank you for your reply.

Perhaps I phrased it poorly. You are correct, of course, that increasing model size tends to increase overfitting in the normal sense. Overfitting in this case means a failure of generalization, which would also lead to bad results on new data.

I spoke in the context of this article, which claimed that spurious generalizations are found. LLMs scale two things up in parallel in order to produce the amazing results that they do: they increase both the quantity of data and the number of parameters.

1

iiioiia t1_ivkfmyy wrote

> Why do you sneer at the process that has resulted in immense wealth and better lives for billions of people?

I am suspicious of anyone who speaks of their industry and every single practitioner within it as being purely rational, or essentially flawless. Of course, this "wasn't what you meant", but that's kind of my complaint.

Another aspect: presumably you're on Hacker News - I've observed people there "telling it how it is" for way over a decade, so I have a decent amount of exposure to how (a substantial sampling of) tech people think across a wide variety of ideas (including how thinking styles change depending on the topic), and how confident they can be in various beliefs (perceived as knowledge) they hold.

1

iiioiia t1_ivkuvgo wrote

It is the best we have done, but is it the best we could have done?

And if we never ask ourselves such questions, and take them seriously, might it be possible that the best we do is always below what we could have done?

For some context: as a thought experiment, consider two streams of reality: the current one, versus one where the scientific method wasn't discovered, wasn't widely adopted, wasn't taken seriously, etc. Might there be a substantial difference between these two realities in the year 2022?

3

shumpitostick t1_ivlb6zi wrote

I was oversimplifying my comments a bit. There is the curse of dimensionality. And in causal inference, if you just treat every variable as a confounder, your model can also get worse, because you're blocking forward paths. But if you know what you're doing, it shouldn't be a problem. And I haven't met any ML practitioner or statistician who doesn't realize the importance of getting to understand your data and making proper modelling decisions.
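
A minimal sketch of "blocking a forward path" (all numbers made up): here Z is a mediator on the path X -> Z -> Y rather than a confounder, so controlling for it removes the mediated part of the effect and the total-effect estimate comes out wrong.

```python
# All numbers made up: Z is a mediator on the path X -> Z -> Y, not a
# confounder. "Controlling" for it blocks that forward path, so the
# estimate of X's total effect (truly 1.5 here) comes out wrong.
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
x = rng.normal(size=n)
z = 1.0 * x + rng.normal(size=n)              # mediator
y = 0.5 * x + 1.0 * z + rng.normal(size=n)    # total effect of x on y = 1.5

def ols_x_coef(design, target):
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta[1]

total = ols_x_coef(np.column_stack([np.ones(n), x]), y)        # ~1.5, correct
blocked = ols_x_coef(np.column_stack([np.ones(n), x, z]), y)   # ~0.5, direct effect only

print("total-effect estimate:           ", round(total, 2))
print("estimate after controlling for z:", round(blocked, 2))
```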

1

shumpitostick t1_ivldyfg wrote

I understand that those are the takeaways, but where is the evidence? The author just jumps between some vaguely related topics as if they were evidence, while what he's really doing is spinning some kind of narrative, and the narrative is wrong.

About the takeaways:

  1. As I explained in my comment, this is not true.
  2. Who thinks that way? Everybody I know, both laymen and people in the field of Data Science and Machine Learning, has a healthy skepticism of AI.
  3. Having worked as a data scientist, I can attest that data scientists check their algorithms, use common sense, and put an emphasis on understanding their data.

Honestly, the article just reads to me as a boomer theory-focused economist who's upset about the turn towards the quantitative, statistics-heavy approach that his field has taken. There is a certain (old) school of economists who prefer theoretical models and take a rationalist over an empirical approach. The problem with their approach is that the theoretical models they build use assumptions that often turn out to be wrong. They use "common sense" rather than relying on data, but the world is complex and many "common sense" assumptions don't actually hold.

0

shumpitostick t1_ivlg54a wrote

My take on the replication crisis is that it is something like 60% bad incentives, 35% bad statistics and 5% malice. Bad incentives means the whole journal system, which incentivizes getting good results and does not deeply scrutinize methodology and source data; the lack of incentives for preregistration; the existence of poor-quality journals; etc.

Bad statistics is mostly people interpreting p < 0.05 as true and p > 0.05 as a worthless result, and using it as a threshold for publishing, rather than treating it as the crude statistical tool that it really is. Plus just a generally bad understanding of statistics by most social scientists. I'm currently doing some research in causal inference, developing methodology that can be used in social science, and it's embarrassing how slow social scientists are to adopt tools from causal inference. In economics, applications are usually 10-20 years behind the research, but in psychology, for example, they often don't even attempt any kind of causal identification, yet suggest that their studies somehow show causality.
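
A rough simulation of why that threshold habit matters (effect rarity, power and study counts all invented): if only p < 0.05 results get published and true effects are rare, a sizable share of the published record ends up being false positives.

```python
# Effect rarity, power and study counts are all invented: when only p < 0.05
# gets published and true effects are rare, a large share of the published
# record is false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n_studies = 5_000
n_per_group = 30
true_effect_rate = 0.1          # only 10% of tested hypotheses are real
effect_size = 0.5

published_real, published_false = 0, 0
for _ in range(n_studies):
    is_real = rng.random() < true_effect_rate
    shift = effect_size if is_real else 0.0
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(shift, 1.0, n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:                # the "publishable" threshold
        if is_real:
            published_real += 1
        else:
            published_false += 1

published = published_real + published_false
print(f"published: {published} studies, "
      f"{published_false / published:.0%} of them false positives")
```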

Malice is scientists just outright faking data or cherry-picking. But even that is tied to the incentive structure. We should normalize publishing negative results.

2

SelfAwareMachine t1_ivlm6mo wrote

I like to understand this as "The more data we have, the more obvious it becomes that our core assumptions are flawed."

And there isn't a single ML schema that isn't built on terribly flawed assumptions, the most critical being how we classify data in the first place.

1

OceanoNox t1_ivmhma8 wrote

Thank you for your insight. I am in material engineering, and I emphasize having representative data, but I have heard at conferences that the results shown are sometimes the top outliers, outside of the average.

I completely agree about the publication of negative results. Many times I have wondered how many people have tried the same idea, only to find out it didn't work and did not or could not publish it. And thus another team will spend effort and money because nothing was ever reported.

1