leliner t1_iznqdj4 wrote
Reply to comment by rePAN6517 in [R] Large language models are not zero-shot communicators by mrx-ai
leliner t1_iznom12 wrote
Reply to comment by Competitive-Rub-1958 in [R] Large language models are not zero-shot communicators by mrx-ai
We did test against ChatGPT. We cannot fully compare to humans or to the experimental setup used in the paper (certainly not as comprehensively as using 9 prompts on 600 examples). Preliminary results show there is still a gap with humans, especially on the particularised examples (see the last paragraph of section 4.1 in the paper). Feel free to try CoT; it is definitely something we have thought about, and for a response to that I refer to Ed's comment https://www.reddit.com/r/MachineLearning/comments/zgr7nr/comment/iznhuqz/?context=1.
leliner t1_izno4vq wrote
Reply to comment by TheDeviousPanda in [R] Large language models are not zero-shot communicators by mrx-ai
As other people have been pointing out, myself included on Twitter, anecdotal evidence from one example tells us nothing. We try 9 different prompts on 600 examples of implicature; we do few-shot prompting with up to 30 examples in-context (filling the context window); and we try a contrastive framing of the question. I think you are misunderstanding the paper. Already at the time of publication, OpenAI's models answered the introductory examples from the abstract correctly, which does not change the story. Additionally, ChatGPT does much better than Davinci-2 (and -3), but it still has a gap with humans, especially on the particularised examples subset (see the last paragraph of section 4.1 in the paper).
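For concreteness, the few-shot evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual code: the prompt template, the example dialogues, and the function names are all invented here; the real evaluation uses 9 templates over 600 examples.

```python
# Hypothetical sketch of k-shot prompt construction and binary scoring
# for conversational implicature. Template and dialogues are invented.

TEMPLATE = (
    'Esther asked "{utterance}" and Juan responded "{response}".\n'
    "Does Juan mean yes or no? Answer: {answer}"
)

def build_prompt(few_shot_examples, test_example):
    """Prepend k solved examples, then the unsolved test example."""
    shots = [TEMPLATE.format(**ex) for ex in few_shot_examples]
    query = TEMPLATE.format(
        utterance=test_example["utterance"],
        response=test_example["response"],
        answer="",  # left blank for the model to complete
    ).rstrip()
    return "\n\n".join(shots + [query])

def score(model_completion, gold_label):
    """Binary accuracy: does the completion begin with the gold label?"""
    return model_completion.strip().lower().startswith(gold_label.lower())
```

The model's completion of the final blank "Answer:" is then scored against the gold yes/no label, and accuracy is averaged over templates and examples.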
leliner t1_izobd14 wrote
Reply to comment by Competitive-Rub-1958 in [R] Large language models are not zero-shot communicators by mrx-ai
Just to respond to point 2 (I disagree with point 1, which Ed has already covered extensively in another comment): I agree with you! It primes the model for the task, which might be a fairer comparison than zero-shot to humans, who are otherwise motivated; we do not currently know (e.g. see Andrew Lampinen's paper on the topic https://arxiv.org/abs/2210.15303).
We argue in the paper, and here, that ~5% is significant; additionally, on the subset of particularised examples the difference is ~9%. Whether this gap would actually be noticeable to humans in some kind of Turing-style comparison is an endeavour for future work. I personally expect it to be noticeable, and, to reiterate, this is a very simple type of binary conversational implicature; it remains to be seen how these models fare on more complex implicature.