Submitted by BB4evaTB12 t3_zff5mh in MachineLearning

Continuing my analysis of errors in widely used LLM benchmarks (post on Google's GoEmotions here): I analyzed HellaSwag and found that 36% of it contains errors.

For example, here's a prompt and set of possible completions from the dataset. Which completion do you think is most appropriate? See if you can figure it out through the haze of typos and generally nonsensical writing.

Men are standing in a large green field playing lacrosse. People is around the field watching the game. men

  • are holding tshirts watching int lacrosse playing.
  • are being interviewed in a podium in front of a large group and a gymnast is holding a microphone for the announcers.
  • are running side to side of the ield playing lacrosse trying to score.
  • are in a field running around playing lacrosse.
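For readers unfamiliar with the format: each HellaSwag item is a context plus four candidate endings, exactly one of which is labeled correct, and models are scored by multiple-choice accuracy. A minimal sketch of how such an item might be represented and scored (the field names here are illustrative, not the exact dataset schema):

```python
# Illustrative HellaSwag-style item: a context, four candidate endings,
# and the index of the ending labeled as correct. Field names are
# hypothetical, not the dataset's actual schema.
item = {
    "context": (
        "Men are standing in a large green field playing lacrosse. "
        "People is around the field watching the game. men"
    ),
    "endings": [
        "are holding tshirts watching int lacrosse playing.",
        "are being interviewed in a podium in front of a large group "
        "and a gymnast is holding a microphone for the announcers.",
        "are running side to side of the ield playing lacrosse trying to score.",
        "are in a field running around playing lacrosse.",
    ],
    "label": 3,  # gold ending index (kept as in the original dataset example)
}

def accuracy(predicted_indices, gold_labels):
    """Multiple-choice accuracy: fraction of items where the model's
    chosen ending index matches the gold label."""
    correct = sum(p == g for p, g in zip(predicted_indices, gold_labels))
    return correct / len(gold_labels)
```

The point of the blog post is that when the gold `label` itself is wrong (or all four endings are garbled), this accuracy number stops measuring what we think it measures.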

I'll keep it spoiler-free here, but the full blog post goes into detail on this example (and others) and explains why they are so problematic.

Link: https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this-popular-llm-benchmark-contains-errors


Comments


leondz t1_izckeuh wrote

This happens all the time and it's awful. Please put this up on arXiv.


Different_Fig4002 t1_izce9pb wrote

There are similar labeling issues in some popular emotion/sentiment datasets used widely for sentiment analysis. Stuff like "yay I love cold food..." is labeled as Positive emotion when it's obviously negative sarcasm.


BB4evaTB12 OP t1_izgu9jj wrote

Totally! We may be thinking of the same example from the GoEmotions dataset, where they mislabeled "Yay, cold McDonald's. My favorite." as Love.


Mefaso t1_izdjye3 wrote

The fact that BIG-bench includes a kanji ASCII-art classification task is pretty funny.

But I guess if you want to have over a hundred tasks in a benchmark, you run out of ideas at some point.


Jean-Porte t1_ize03mv wrote

A good thing about BIG-bench is that Google performed careful human evaluations, and they report the results of the best humans as well as the average human accuracy.
