Submitted by BB4evaTB12 t3_zff5mh in MachineLearning

Continuing my analysis of errors in widely used LLM benchmarks (post on Google's GoEmotions here): I analyzed HellaSwag and found that 36% of it contains errors.

For example, here's a prompt and set of possible completions from the dataset. Which completion do you think is most appropriate? See if you can figure it out through the haze of typos and generally nonsensical writing.

Men are standing in a large green field playing lacrosse. People is around the field watching the game. men

  • are holding tshirts watching int lacrosse playing.
  • are being interviewed in a podium in front of a large group and a gymnast is holding a microphone for the announcers.
  • are running side to side of the ield playing lacrosse trying to score.
  • are in a field running around playing lacrosse.
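For readers unfamiliar with the format: each HellaSwag item is a context plus four candidate endings, exactly one of which is labeled correct, and models are scored by multiple-choice accuracy. A minimal sketch of how such an item might be represented and scored (the field names here are illustrative, not the exact dataset schema):

```python
# Illustrative HellaSwag-style item: a context, four candidate endings,
# and the index of the ending labeled as correct. Field names are
# hypothetical, not the dataset's actual schema.
item = {
    "context": (
        "Men are standing in a large green field playing lacrosse. "
        "People is around the field watching the game. men"
    ),
    "endings": [
        "are holding tshirts watching int lacrosse playing.",
        "are being interviewed in a podium in front of a large group "
        "and a gymnast is holding a microphone for the announcers.",
        "are running side to side of the ield playing lacrosse trying to score.",
        "are in a field running around playing lacrosse.",
    ],
    "label": 3,  # gold ending index (kept as in the original dataset example)
}

def accuracy(predicted_indices, gold_labels):
    """Multiple-choice accuracy: fraction of items where the model's
    chosen ending index matches the gold label."""
    correct = sum(p == g for p, g in zip(predicted_indices, gold_labels))
    return correct / len(gold_labels)
```

The point of the blog post is that when the gold `label` itself is wrong (or all four endings are garbled), this accuracy number stops measuring what we think it measures.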

I'll keep it spoiler-free here, but the full blog post goes into detail on this example (and others) and explains why they are so problematic.

Link: https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this-popular-llm-benchmark-contains-errors


Comments


leondz t1_izckeuh wrote

This happens all the time and it's awful. Please put this up on arXiv.


Different_Fig4002 t1_izce9pb wrote

There are similar labeling issues in some popular emotion/sentiment datasets used widely for sentiment analysis. Stuff like "yay I love cold food..." is labeled as Positive emotion when it's obviously negative sarcasm.


BB4evaTB12 OP t1_izgu9jj wrote

Totally! We may be thinking of the same example from the GoEmotions dataset, where they mislabeled "Yay, cold McDonald's. My favorite." as Love.


Mefaso t1_izdjye3 wrote

The fact that BIG-bench includes a kanji ASCII-art classification task is pretty funny.

But I guess if you want to have over a hundred tasks in a benchmark, you run out of ideas at some point.


Jean-Porte t1_ize03mv wrote

A good thing about BIG-bench is that Google performed careful human evaluations, and they report the results of the best humans as well as the average human accuracy.
