Submitted by mrx-ai t3_zgr7nr in MachineLearning
egrefen t1_iznhuqz wrote
I am surprised and disappointed to be reading so many of the responses on this post. A lot of them amount to something like “Hey guys this paper says coins don’t understand language, but hey I shouted HEADS at a coin while flipping it three times and it came up heads twice so I don’t think it’s fair to say coins don’t understand us”.
I’m not saying LLMs are coins here, but for Pete’s sake some people need to calibrate their expectations given how a random baseline performs on this task, and understand the need for systematic evaluation. And no, having tried a few examples you pulled from the paper on ChatGPT is not systematic evaluation.
Then there are the responses along the lines of "the experiments aren't fair because zero shot doesn't make sense/should have used chain of thought/some other feature". First, some of these auxiliary experiments are done and contextualised in the paper. Second, this general argument is addressed in the paper, but to add colour here: this benchmark tests a very simple aspect of conversational implicature. Humans are very good at resolving completely new conversational implicatures in novel contexts, which distinguishes this class of implicatures from conventional implicatures. We ourselves do this zero shot every day, in nearly every conversation. Testing this zero shot, noting that maximising the likelihood of text at any scale does not capture this phenomenon, but that instruction-following finetuning somehow does move the dial, is an important finding, because it indicates how we can get there.
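For readers who haven't seen the benchmark, a zero-shot binary implicature probe looks roughly like the sketch below. This is only an illustration of the setup: the item, the prompt wording, and the hypothetical query_llm call are assumptions, not the paper's exact template.

```python
# Rough sketch of a zero-shot binary implicature item.
# The item and prompt wording are illustrative, not the paper's exact template.

item = {
    "question": "Did you leave fingerprints at the scene?",
    "response": "I wore gloves.",
    "implicature": "no",  # the implied answer a human recovers with no examples
}

prompt = (
    "In the following exchange, does the response mean yes or no?\n"
    f"Question: {item['question']}\n"
    f"Response: {item['response']}\n"
    "Answer with yes or no:"
)

# answer = query_llm(prompt)  # hypothetical API call, not a real library
# correct = answer.strip().lower().startswith(item["implicature"])
```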
Ultimately, we all want LLMs to be zero-shot communicators in line with human performance for them to have extrinsic utility that matches human abilities. I will never understand some people's rush to skip the actual message and method of a paper in their quest to see the state of affairs they want where it might not exist, and in doing so make themselves more vulnerable to confirmation bias, and less able to drive actual progress towards a goal we all share.
RevolutionaryGear647 t1_izo1lkc wrote
Well said. I think a lot of people have an unrealistic roadmap of AI development and will go through heaps of logical fallacies in order to "be right".
The progress has been massive, but I think identifying where our current limitations are is what's gonna enable us to progress even further.
Competitive-Rub-1958 t1_izo1n60 wrote
Assuming those objections were directed at my comment (as they seem to address it directly), and setting aside the antagonistic tone: I never claimed my quick test was a systematic evaluation, nor am I reaching conclusions from a few examples. That framing misrepresents the general consensus on this paper.
I wholeheartedly agree with you that LLMs should understand implicature zero-shot, but there are certain nuances here that seem to be getting ignored. What I was going for is simply this:
1> The paper should have compared its alternative prompt templates to CoT, especially since you explicitly mention CoT. The idea is quite clear - look at this paper for instance. Complex tasks that involve disambiguating a chain of events ("I wore gloves" -> gloves cover the fingers -> they don't expose fingerprints -> therefore, the answer is Y/N) benefit greatly from CoT. It may seem like an insignificant demand, maybe even some reviewer-2 vibes here, but it seems reasonable to expect that a method which works on almost every task should have been tested here, merely out of scientific curiosity to observe the outcome had this template been incorporated.
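For concreteness, a one-shot CoT prompt for this kind of item might look something like the sketch below. The worked rationale and the second item are made up for illustration; this is just the standard CoT pattern, not a template from the paper.

```python
# Sketch of a chain-of-thought prompt: one worked rationale, then a new item
# for the model to continue. All wording here is illustrative.

cot_prompt = (
    "Question: Did you leave fingerprints at the scene?\n"
    "Response: I wore gloves.\n"
    "Reasoning: gloves cover the fingers, so nothing touched would carry "
    "fingerprints. Therefore the implied answer is: no.\n\n"
    "Question: Are you coming to the party tonight?\n"
    "Response: I have to finish a report by tomorrow morning.\n"
    "Reasoning:"
)

# The model is expected to continue the reasoning chain and end with yes/no,
# which would then be parsed and scored against the gold label.
```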
2> And more importantly: when you prompt your model k-shot, it does NOT reveal any context whatsoever about the actual target question. When you few-shot, you give it completely independent examples of how to perform the task at hand, with no bearing on the actual question you ask (roughly as sketched at the end of this point). So it would perceive "gloves" and the concept of fingerprints independently of the provided examples, which could be about bananas and groceries. Yet few-shot prompting primes the LLM to better understand the task; there is a lot of literature exploring this interesting phenomenon (mostly attributed to a mix of ICL and statistical patterns).
This extremely important point wasn't mentioned in the paper at all; a few-shot setup doesn't invalidate the comparison against human communicators, because the demonstrations reveal nothing about the target item. Hence why I quoted above the ~5% difference in accuracy between the average human and the few-shot LLM.
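To make that distinction concrete, a k-shot prompt in this sense might be assembled roughly as below. The demonstration items and template are made up for illustration and are not taken from the paper.

```python
# Sketch of a k-shot prompt: the demonstrations only show the *format* of the
# task and share no content with the target question. All items are made up.

demonstrations = [
    ("Are there any bananas left?", "I went grocery shopping this morning.", "yes"),
    ("Did you enjoy the film?", "I walked out halfway through.", "no"),
]

target = ("Did you leave fingerprints at the scene?", "I wore gloves.")

parts = ["Does the response mean yes or no?\n"]
for question, response, label in demonstrations:
    parts.append(f"Question: {question}\nResponse: {response}\nAnswer: {label}\n")
parts.append(f"Question: {target[0]}\nResponse: {target[1]}\nAnswer:")

few_shot_prompt = "\n".join(parts)
# Note that the demonstrations say nothing about gloves or fingerprints; they
# only prime the model for the task format.
```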
Lastly, no one's claiming ChatGPT is perfect. All I said was that I would like to see the benchmark run on the latest iteration of RLHF models to see how they fare. It was in no way meant to denigrate the authors or the paper at hand, or to claim that ChatGPT can somehow perform tasks that GPT-3/InstructGPT cannot.
leliner t1_izobd14 wrote
Just to respond to 2 (I disagree with 1, which Ed has already covered extensively in another comment): I agree with you! It primes the model for the task, which might be a fairer comparison than zero-shot to humans, who are otherwise motivated; we do not currently know (e.g. see Andrew Lampinen's paper on the topic, https://arxiv.org/abs/2210.15303).
We argue in the paper, and here, that ~5% is significant; additionally, on the subset of particularised examples the difference is ~9%. The actual significance of this gap, in terms of whether it would be noticeable to humans in some kind of Turing-style comparison, is an endeavour for future work. I personally expect it to be noticeable and, to reiterate, this is a very simple type of binary conversational implicature; it remains to be seen how models fare on more complex implicature.