
Competitive-Rub-1958 t1_izo1n60 wrote

Assuming those objections were directed at my comment (they seem to address it directly), and setting aside the antagonistic tone: my evaluation was not unsystematic, nor am I reaching conclusions from a few examples; that misrepresents what the general consensus on this paper is.

I wholeheartedly agree with you that LLMs should understand implicature zero-shot, but there are certain nuances here that seem to be ignored. What I was getting at is simply this:

1> The paper should have compared its alternative prompt templates to CoT, especially since it explicitly mentions CoT. The idea is quite clear; look at this paper, for instance. Complex tasks that involve disambiguating a chain of inferences ("I wore gloves" -> gloves cover the fingers -> they don't leave fingerprints -> therefore, the answer is Y/N) benefit greatly from CoT. It may seem like an insignificant demand, maybe even with some reviewer-2 vibes, but it seems reasonable to expect that a method which works on almost every task would have been tested here, if only out of scientific curiosity to see how the results change with that template.
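For concreteness, here is a minimal sketch of the kind of CoT template I have in mind for one of these binary implicature items; the wording, the `build_cot_prompt` helper, and the example utterances are my own illustration, not the paper's actual prompts:

```python
# A rough sketch (my own wording, not the paper's template) of a zero-shot
# CoT prompt for a binary implicature item: the model is nudged to spell out
# the chain "gloves cover the fingers -> no fingerprints -> answer: no".

def build_cot_prompt(question: str, response: str) -> str:
    """Wrap a question/response pair in a chain-of-thought style template."""
    return (
        f'Esther asked: "{question}"\n'
        f'Juan responded: "{response}"\n'
        'Does Juan\'s response mean "yes" or "no"?\n'
        "Let's think step by step before answering."
    )

print(build_cot_prompt("Did you leave fingerprints?", "I wore gloves."))
```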

2> And more importantly: when you prompt your model k-shot, you do NOT reveal any context whatsoever about the actual target question. When you few-shot, you give the model completely independent examples of how to perform the task at hand, with no bearing on the question you actually ask. So it would encounter "gloves" and the concept of fingerprints independently of the provided examples, which could be about bananas and groceries (see the sketch below). Yet few-shot prompting primes the LLM to understand the task better; there is plenty of literature exploring this phenomenon (mostly attributed to a mix of in-context learning and statistical patterns).

This extremely important point wasn't addressed in the paper at all; the fact that the model needs a few-shot prompt does not make the comparison unfair, so the few-shot results still count as evidence about whether LLMs are human-aligned communicators. Hence why I quoted above that there is only a ~5% difference in accuracy between the average human and the few-shot LLM.
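To make the no-leakage point concrete, here is a rough sketch of a k-shot prompt whose in-context examples demonstrate the task but share no content with the target question; the examples and wording are mine, not the benchmark's actual few-shot template:

```python
# Hypothetical k-shot prompt: the solved examples (bananas, a report) teach the
# task format, but reveal nothing about gloves or fingerprints in advance.

FEW_SHOT_EXAMPLES = [
    ("Are there any bananas left?", "I just got back from the grocery store.", "yes"),
    ("Is the report finished?", "I've been in meetings all day.", "no"),
]

def build_few_shot_prompt(question: str, response: str) -> str:
    """Prepend unrelated solved examples, then pose the unsolved target item."""
    parts = [
        f'Q: "{q}"\nA: "{r}"\nDoes the response mean yes or no? {label}'
        for q, r, label in FEW_SHOT_EXAMPLES
    ]
    parts.append(f'Q: "{question}"\nA: "{response}"\nDoes the response mean yes or no?')
    return "\n\n".join(parts)

print(build_few_shot_prompt("Did you leave fingerprints?", "I wore gloves."))
```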

Lastly, no one is claiming ChatGPT is perfect. All I said was that I would like to see the benchmark run on the latest iteration of RLHF models to see how they fare. It was in no way meant to denigrate the authors or the paper at hand, or to claim that ChatGPT can somehow perform tasks that GPT-3/InstructGPT cannot.


leliner t1_izobd14 wrote

Just to respond to point 2 (I disagree with point 1, which Ed has already covered extensively in another comment): I agree with you! Few-shot prompting primes the model for the task, which might be a fairer comparison to humans, who are otherwise motivated; we do not currently know (see, e.g., Andrew Lampinen's paper on the topic: https://arxiv.org/abs/2210.15303).

We argue in the paper, and here, that ~5% is significant; additionally, on the subset of particularised examples the difference is ~9%. How significant this gap really is, in terms of whether it would be noticeable to humans in some kind of Turing-style comparison, is an endeavour for future work. I personally expect it to be noticeable, and, to reiterate, this is a very simple type of binary conversational implicature; it remains to be seen how these models fare on more complex implicature.
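As a rough illustration of how one might check whether a gap of that size could be explained by chance on a finite test set, here is a two-proportion z-test; the accuracies and sample sizes below are made up for the sketch, not the paper's actual numbers:

```python
# Hypothetical two-proportion z-test for an accuracy gap; all numbers here are
# placeholders, not the figures reported in the paper.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(acc_a: float, n_a: int, acc_b: float, n_b: int) -> float:
    """Two-sided p-value for the difference between two observed accuracies."""
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (acc_a - acc_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# e.g. a 5-point gap (86% vs. 81%) over 600 items per condition (invented counts)
print(two_proportion_z_test(0.86, 600, 0.81, 600))
```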
