leliner

leliner t1_izobd14 wrote

Just to respond to 2 (I disagree with 1, which Ed already covered extensively in another comment): I agree with you! It primes the model for the task, which might be a fairer comparison than zero-shot to humans, who are otherwise motivated. We do not currently know (see e.g. Andrew Lampinen's paper on the topic: https://arxiv.org/abs/2210.15303).

We argue in the paper, and here, that ~5% is significant, and additionally on the subset of particularised examples the difference is ~9%. Whether this gap will be noticeable to humans in some kind of Turing-style comparison is a question for future work. I personally expect it to be noticeable. Also, to reiterate, this is a very simple type of binary conversational implicature; it remains to be seen how models fare on more complex implicature.

3

leliner t1_iznom12 wrote

We did test against ChatGPT. We cannot fully compare it to humans or to the experimental setup used in the paper (and certainly not as comprehensively as using 9 prompts on 600 examples). Preliminary results show there's still a gap with humans, especially on particularised examples (see the last paragraph of section 4.1 in the paper). Feel free to try CoT; it is definitely something we have thought about. For a response to that, I refer to Ed's comment: https://www.reddit.com/r/MachineLearning/comments/zgr7nr/comment/iznhuqz/?context=1.

3

leliner t1_izno4vq wrote

As other people have been pointing out, myself included on Twitter, anecdotal evidence on one example tells us nothing. We try 9 different prompts on 600 examples of implicature, we do few-shot prompting with up to 30 examples in-context (filling the context window), and we try a contrastive framing of the question. I think you are misunderstanding the paper. Already at the time of publishing the paper, the introductory examples in the abstract were answered properly by OpenAI's models; that does not change the story. Additionally, ChatGPT does much better than Davinci-2 (and -3), but still has a gap with humans, especially on the particularised examples subset (last paragraph of section 4.1 in the paper).
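To make the contrast with a single anecdotal example concrete, here is a minimal sketch of what averaging accuracy over several prompt templates and a k-shot context could look like. It is not the paper's code: the `Example` fields, the `TEMPLATE` wording, and the `predict` callable (a stand-in for whatever model call you use) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    utterance: str    # what the first speaker asks
    response: str     # the indirect reply
    implicature: str  # gold label for the binary task: "yes" or "no"


# Hypothetical template; a real study would use several differently worded ones.
TEMPLATE = 'Question: "{utterance}"\nResponse: "{response}"\nMeaning: {answer}'


def build_prompt(template: str, shots: Sequence[Example], query: Example) -> str:
    """Put k solved examples in-context, then the query with the answer left blank."""
    blocks = [
        template.format(utterance=ex.utterance, response=ex.response, answer=ex.implicature)
        for ex in shots
    ]
    blocks.append(
        template.format(utterance=query.utterance, response=query.response, answer="").rstrip()
    )
    return "\n\n".join(blocks)


def accuracy(predict: Callable[[str], str], template: str,
             shots: Sequence[Example], test_set: Sequence[Example]) -> float:
    """Fraction of test examples where the predicted label matches the gold label."""
    correct = sum(
        predict(build_prompt(template, shots, ex)) == ex.implicature for ex in test_set
    )
    return correct / len(test_set)


def mean_accuracy(predict: Callable[[str], str], templates: Sequence[str],
                  shots: Sequence[Example], test_set: Sequence[Example]) -> float:
    """Average accuracy over all templates, so no single wording decides the result."""
    return sum(accuracy(predict, t, shots, test_set) for t in templates) / len(templates)
```

Passing an empty `shots` list gives the zero-shot case; a longer list gives the few-shot case, limited in practice by the model's context window.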

4