
pyepyepie t1_j14a34r wrote

Why do so many papers put the emphasis on performance comparisons and ignore the model's behavior?

Background - My first ML project was done around 2016-2017. It's funny to say, but SOTA for NLP was nowhere near what it is today, so even though I am relatively new to the field, I have watched transformers completely change the world, not only the world of NLP.

Now, I am nowhere near a research scientist - my experience is in implementing things - but I have read a fair number of NLP papers (during work and a little for grad school), and I see many papers that are improvements on a specific task, using "cheap tricks" or just fine-tuning a newer model (BERT version 100X), to get better quantitative performance.

That being said, I have yet to see a situation where 96% vs. 95% accuracy (hopefully reported with more detail, but not always) on datasets that are often imbalanced is a meaningful signal - or even ethical to report as an improvement - without statistical significance tests and qualitative analysis.
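To make that concrete, here is a minimal sketch of the kind of significance check I mean - a paired bootstrap over the same test set (the per-example correctness arrays `correct_a` and `correct_b` are hypothetical placeholders, not from any real paper):

```python
import numpy as np

def paired_bootstrap(correct_a, correct_b, n_boot=10_000, seed=0):
    """Paired bootstrap over a shared test set: how often does model B's
    apparent improvement over model A vanish when we resample the test set
    with replacement?"""
    rng = np.random.default_rng(seed)
    correct_a = np.asarray(correct_a, dtype=float)  # 1 = correct, 0 = wrong
    correct_b = np.asarray(correct_b, dtype=float)
    n = len(correct_a)
    observed_diff = correct_b.mean() - correct_a.mean()
    vanished = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if correct_b[idx].mean() - correct_a[idx].mean() <= 0:
            vanished += 1
    return observed_diff, vanished / n_boot  # (accuracy gap, approximate p-value)

# Hypothetical usage with per-example results from two models on one test set:
# diff, p = paired_bootstrap(correct_a, correct_b)
```

If a 1-point accuracy gap comes with a large p-value here, I don't see how it can honestly be called an improvement.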

Again, if I look at myself as someone who builds a product, I can't see when I would ever want to use "the best" model if I don't know how it fails - which means I would take a 93%-accuracy model over a 95% one if I can understand it better (even if only because the paper was more explicit and the other model is a complete black box).

My question to the smarter & more experienced people here (probably a large portion of the subreddit) is: what is the counter to my argument? Do you see qualitative improvements to models (e.g., classification with less bias, better grounding of language models) as more or less important than quantitative ones? And, honestly, do you ever read papers that just improved SOTA without introducing significant novel ideas? If so, why do you read them (I can see a few reasons but would like to hear more)?

1

trnka t1_j1heqwa wrote

In actual product work, it's rarely sufficient to look at a single metric. If I'm doing classification, I typically check accuracy, balanced accuracy, and the confusion matrix, among other things, to judge the quality of the model. Other factors like interpretability/explainability, RAM, and latency also play into whether I can actually use a different model, and those depend on the use case as well.
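For a rough idea of what that first pass looks like, here's a minimal sketch with scikit-learn (the labels and predictions below are just hypothetical placeholders):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Hypothetical gold labels and model predictions from a validation set
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 0, 1, 1, 0, 0, 1]

print("accuracy:         ", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```

On an imbalanced dataset the balanced accuracy and the confusion matrix will often tell a very different story than plain accuracy.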

I would never feel personally comfortable deploying a model if I haven't reviewed a sample of its typical errors. But many people deploy models without that and rely on metrics alone. In that case it's all the more important to get your top-level metric right, or to get product-level metrics right and watch for trends in, say, user churn.
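That review step doesn't need to be fancy - something like this sketch is usually enough (the texts, labels, and predictions are hypothetical placeholders):

```python
import random

def sample_errors(texts, y_true, y_pred, k=25, seed=0):
    """Pull a random sample of misclassified examples for manual review."""
    errors = [(t, yt, yp) for t, yt, yp in zip(texts, y_true, y_pred) if yt != yp]
    random.seed(seed)
    return random.sample(errors, min(k, len(errors)))

# Hypothetical usage:
# for text, gold, pred in sample_errors(val_texts, val_labels, val_preds):
#     print(f"gold={gold} pred={pred} | {text}")
```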

> Do you see qualitative improvements of models as more or less important in comparison to quantitative?

I generally view quantitative metrics as more important, though I think I value qualitative feedback much more than most in the industry. For the example of bias, I'd say that if it's something your employer values, there should be a metric for it. Not that I like having metrics for everything, but having a metric forces you to be specific about what it means.
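As one concrete example of what such a metric could look like, here's a minimal sketch of a per-group accuracy gap - a simple, imperfect way to put a number on one notion of bias (the group labels are hypothetical placeholders, and this is just one of many possible fairness metrics):

```python
import numpy as np

def per_group_accuracy_gap(y_true, y_pred, groups):
    """Accuracy per demographic group, plus the largest gap between groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    accs = {g: (y_pred[groups == g] == y_true[groups == g]).mean()
            for g in np.unique(groups)}
    return accs, max(accs.values()) - min(accs.values())
```

Once it's a number, you can track it, set a threshold on it, and argue about whether it captures what you actually mean by "bias" - which is the point.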

I'll also acknowledge that there are many qualitative perspectives on quality that don't have metrics *yet*.

> do you ever read papers that just improved SOTA without introducing significant novel ideas?

In my opinion, yes. If your question is why I read them: it's because I don't know whether they contribute useful new ideas until after I've read the paper.

Hope this helps - I'm not certain I understood all of the question, but let me know if I missed anything.

2