egrefen t1_iznhuqz wrote

I am surprised and disappointed to be reading so many of the responses on this post. A lot of them amount to something like “Hey guys this paper says coins don’t understand language, but hey I shouted HEADS at a coin while flipping it three times and it came up heads twice so I don’t think it’s fair to say coins don’t understand us”.

I’m not saying LLMs are coins here, but for Pete’s sake some people need to calibrate their expectations given how a random baseline performs on this task, and understand the need for systematic evaluation. And no, having tried a few examples you pulled from the paper on ChatGPT is not systematic evaluation.
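To make the coin point concrete, here is a minimal sketch (mine, not from the paper), assuming a binary yes/no task where a random guesser is right on each item with probability 0.5. A couple of anecdotal wins are entirely consistent with chance; only a systematic evaluation over many examples separates a model from the coin.

```python
# Minimal sketch: how often a purely random guesser "passes" a handful of
# anecdotal trials on a binary yes/no task. Correct answers ~ Binomial(n, 0.5).
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability a random guesser gets at least k of n binary items right."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# The coin-flip anecdote: 2 or more "correct" out of 3 happens half the time by chance.
print(prob_at_least(2, 3))      # 0.5
# Even 7/10 on cherry-picked examples is unremarkable for a random baseline.
print(prob_at_least(7, 10))     # ~0.17
# Over a few hundred systematically sampled examples, chance-level success collapses.
print(prob_at_least(180, 300))  # ~0.0003
```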

Then there are the responses along the lines of “the experiments aren’t fair because zero-shot doesn’t make sense/they should have used chain of thought/some other feature”. First, some of these auxiliary experiments are done and contextualised in the paper. Second, this general argument is addressed in the paper, but to add colour here: this benchmark tests a very simple aspect of conversational implicature. Humans are very good at resolving completely new conversational implicatures in novel contexts, which distinguishes this class of implicatures from conventional implicatures. We ourselves do this zero-shot every day, in nearly every conversation. Testing this zero-shot, and noting that maximising the likelihood of text at any scale does not capture this phenomenon but that instruction-following finetuning somehow does move the dial, is an important finding, because it indicates how we can get there.

Ultimately, we all want LLMs to be zero-shot communicators in line with human performance, so that they have extrinsic utility that matches human abilities. I will never understand some people’s rush to skip the actual message and method of a paper in their quest to see the state of affairs they want where it might not exist, and in doing so make themselves more vulnerable to confirmation bias, and less able to drive actual progress towards a goal we all share.