Submitted by mrx-ai t3_zgr7nr in MachineLearning

Paper: Large language models are not zero-shot communicators (arXiv)

Abstract:

Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), most perform close to random. Models adapted to be "aligned with human intent" perform much better, but still show a significant gap with human performance. We present our findings as the starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse.

Authors: Laura Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, Edward Grefenstette

150

Comments


Competitive-Rub-1958 t1_izil2ps wrote

I feel this paper could've been written significantly more clearly and fairly. While I do understand that the authors wanted to create a punchy title declaring "poor" 0-shot performance, it reads a bit like LLMs can't understand context or reason very well (this is just my impression and opinion though).

From Section 4.2: the average human gets 86.2% correct; the best LLM gets 80.6% with natural-language prompts and 81.7% with a structured prompt, both few-shot.

My main gripe is that disambiguating implicature is fundamentally a reasoning task. Due to the inherent ambiguity, you have to create multiple hypotheses and test them to see which fits the best. With enough context, that task becomes simpler.

So they should've evaluated with chain-of-thought (CoT) prompting. They even mention CoT in the paper and try other prompt templates as alternatives to it - but don't test with CoT itself? This is a very recent paper, with some famous authors. We've seen CoT help on almost all tasks - and even turn inverse scaling U-shaped. I don't see why this task gets a pass.
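To make that concrete, a zero-shot CoT version of the abstract's gloves example might look something like the sketch below (this is my own rough sketch against the legacy OpenAI Completions API; the wrapper text and parameters are assumptions, not the paper's templates):

```python
# Rough sketch: direct vs. zero-shot chain-of-thought prompting for one
# implicature example. Uses the legacy OpenAI Completions API (openai-python
# < 1.0); the prompt wording is illustrative, not the paper's template.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

example = ('Esther asked "Did you leave fingerprints?" and '
           'Juan responded "I wore gloves".')
question = "Does Juan mean yes or no?"

# Direct prompt: ask for the answer straight away.
direct_prompt = f"{example}\n{question}\nAnswer:"

# Zero-shot CoT prompt: ask the model to reason before answering.
cot_prompt = f"{example}\n{question}\nLet's think step by step."

for name, prompt in [("direct", direct_prompt), ("cot", cot_prompt)]:
    completion = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=64,
        temperature=0,
    )
    print(name, "->", completion.choices[0].text.strip())
```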

If someone tests this against ChatGPT to further confirm the RLHF hypothesis, and against CoT, I will be satisfied that understanding implicature 0-shot is indeed hard for LLMs.

88

RomanRiesen t1_izkj431 wrote

I was about to write "neither title nor abstract manage to 1-shot communicate their ideas or research to me" but it felt mean so I didn't. Also haven't read the paper yet.

14

egrefen t1_iznuuxy wrote

While your quip is as witty as it is potentially mean-spirited, I’d love to understand what about the title and abstract you actually found unclear.

2

leliner t1_iznom12 wrote

We did test against ChatGPT. Cannot fully compare to humans or the experimental setup used in the paper (especially not as comprehensively as using 9 prompts on 600 examples). Preliminary results show there's still a gap with humans, especially on particularised examples (see the last paragraph of Section 4.1 in the paper). Feel free to try CoT, definitely something we have thought about, and for a response to that I refer to Ed's comment https://www.reddit.com/r/MachineLearning/comments/zgr7nr/comment/iznhuqz/?context=1.

3

hadaev t1_izikjgj wrote

86

liquiddandruff t1_izil5eu wrote

Right? It's likely their forced attempts to make the model respond yes/no eliminate a sort of "show your work" behavior. They'd likely get better responses if they let it answer free-form.

40

mootcat t1_izile70 wrote

Yeah, this doesn't reflect my experience with more recent chat-centric LLMs. LaMDA and ChatGPT are quite capable of reading between the lines and understanding the causality of less direct scenarios. They are far from perfect, but are still remarkably competent.

34

Flag_Red t1_izitq04 wrote

From the paper, the best LLMs still get ~60% accuracy zero shot, and ~70% accuracy few shot (up to ~80% fully prompt engineered). Remember that a coin flip would achieve 50% accuracy. There's a lot of room for confirmation bias here.

16

CommunismDoesntWork t1_izj03bg wrote

ChatGPT came out after this paper was written. We're at the point where models are improving faster than we can evaluate them lol

25

egrefen t1_iznve2f wrote

Does ChatGPT actually do better than DaVinci-2?

1

hadaev t1_iziyd0p wrote

Don't trust my sample, try yourself.

5

Flag_Red t1_izjhefl wrote

Just did. I tried 5 prompts from the paper (adjusted to QA format so that ChatGPT can respond) and ChatGPT got 3/5 of them correct.

Example:

> Esther asked “Have you found him yet?” and Juan responded “They’re still looking”. Has the person been found?

> It is unclear if the person has been found.

8

abecedarius t1_izjij38 wrote

I tried this now with one change: adding "Explain Juan's answer" to follow the prompt-scheme that started this thread.

> Esther asked “Have you found him yet?” and Juan responded “They’re still looking”. Explain Juan's answer. Has the person been found?

> Juan's answer suggests that the person being searched for has not yet been found. It appears that the search is ongoing, and the person has not yet been located.

(I didn't put "explain the answer" at the end because I expect that to do worse on average. That pattern of prompt tends to make GPT blurt out an answer first without thinking, and then rationalize it.)

5

Flag_Red t1_izjo1xv wrote

Yeah, it's totally clear from "let's think step by step"-style prompt engineering that LLMs have the capability to understand this stuff. I'm confident that a few models down the line we'll have this stuff sorted zero-shot with no prompt engineering.

The interesting part is why this kind of prompt engineering is necessary. Why is this sort of capability seemingly lagging behind others that are more difficult for humans? ELI5-style explanations, for example, are very hard for humans, but LLMs seem to excel at them. In what ways are these tasks different, and what does that tell us about the difference between LLMs and our own brains? Also, why does the ordering of the sentences in the prompt matter so much?

7

liquiddandruff t1_izkq9l5 wrote

One naive explanation is that since ChatGPT is at its core a text predictor, prompting it in a way that minimizes leaps of logic (i.e., making each inference step build slowly so as to prevent it from jumping to conclusions) lets it respond more coherently and correctly.

2

soraki_soladead t1_izjq5j8 wrote

It seems obvious that the ambiguity comes from the framing of the question. The model has no way of knowing if the person has been found or when the question was posed to Juan. However, if you ask the model to explain Juan’s answer, that is a very different request.

1

aussie_punmaster t1_izmuf8q wrote

But it’s an ambiguity humans easily navigate, understanding the implications of the question. So still a fair test for mine.

2

soraki_soladead t1_iznhiei wrote

Sure but in the context of ChatGPT and how it was trained this isn’t a surprising result.

1

lostmsu t1_j053o14 wrote

I don't understand what you are talking about. As I mentioned above, the correct conclusion from Juan's formulation of the answer is "unclear", since, going by his own phrasing, Juan does not know whether the implied others who are still looking have found the person yet.

1

aussie_punmaster t1_j061388 wrote

The goal here is to make the rational inference. Not to be the world’s biggest logic pedant.

Ask 100 humans that question and 99 will make the rational conclusion they haven’t been found yet.

1

lostmsu t1_j07o5bu wrote

>Ask 100 humans that question and 99 will make the rational conclusion they haven’t been found yet.

I disagree, and the fact that humans will do what you say only tells me how AI might be ahead. 100 humans are not an indication of truth in any way even if they all agree.

1

aussie_punmaster t1_j0ax8je wrote

Disagree if you like. You’re wrong.

Imagine you’re coming back from a search where you’ve found a lost boy. The mum asks “Have they found him?” And you reply “They’re still looking”…

This happens never. Because the clear implication of that conversation is the boy isn’t found.

0

lostmsu t1_j0d8fsl wrote

Man, this statement is not a negation of my statement, nor does it imply a negation of my statement, so it does not prove anything.

You somehow think being "the biggest logic pedant" is a downside. I can assure you logic pedantry correlates positively with pretty much every success metric you could imagine, except those that depend hard on the average folk being able to comprehend what one is saying. More so in a science-related discussion like this one.

Don't you see the irony of two of us arguing about the correctness of "unclear" answer being the definite proof that "unclear" is the correct answer?

0

aussie_punmaster t1_j0e61hj wrote

Being the biggest logic pedant is a downside when you deliberately limit your understanding and probability of acting correctly based on a reasonable assumption of truth, all for the sake of purity.

If you live your life treating exchanges like this as ambiguous, your chance of survival is reduced. It will lead you to inaction, or to actions, to your detriment.

This exchange has a very clear subtext that the child hasn’t been found. No one keeps looking after the child is found. It takes an absolute excess of logic to argue that they didn’t specifically say the child hadn’t been found. If you had been out looking for someone’s child, came back knowing they’d been found, and said “they’re still looking”, you’d be lucky not to be shot if they found out later that you’d known and only said that.

P.S. I think you’ll find this level of logical pedantry only correlates with being a douche

P.P.S. No, it’s not ironic, because someone of your almighty logical calibre should identify that that’s bollocks. I say 1 + 1 = 2 is clear, you say it’s not. Well, obviously it must be unclear if one of us considers it not, you say? No, you’re just wrong.

0

lostmsu t1_j0mqde6 wrote

> limit your understanding

ROFL. One, in making that statement you assume you're right, but that's the matter in question, so this argument is circular. Two, the opposite of that is called "jumping to conclusions".

> limit your ... probability of acting correctly

Unsubstantiated BS. When the transmitted information is "unclear", nothing prevents one from acting as if it were "no" or "yes". That's what damn "unclear" means. On the contrary, assuming it means "no" is the limiting factor in that particular scenario.

> This exchange has a very clear subtext the child hasn’t been found.

Dude, if it is clear to you and not clear to me, it damn literally means it is unclear, because the people disagree on the interpretation. Your reading is missing the implicit "last time I met the group of people who are searching", which could possibly be minutes ago, hours ago, or even yesterday.

> I think you’ll find this level of logical pedantry only correlates with being a douche

Oh, now we switch to personal attacks? How about I call you a moron, because you can't grasp that if two seemingly not-stupid people disagree about a statement, it cannot possibly be "clear"?

> I say 1 + 1 = 2 is clear, you say it’s not. Well obviously it must be unclear if one of us considered it not you say

I can see that you fail to separate slightly complicated abstractions. For instance, in your example you confuse objective truth and the information that a message conveys.

1

aussie_punmaster t1_j0nea6k wrote

>>Dude, if it is clear to you and not clear to me, it damn literally means it is unclear, because the people disagree on the interpretation. Your reading is missing the implicit "last time I met the group of people who are searching", which could possibly be minutes ago, hours ago, or even yesterday.

The absence of the lines you mention is part of the inference. If there is a meaningful gap between when the person sourced their information and when they’re reporting it, the expectation is that it is included. If we’re talking about a lost child and my information is hours out of date, I don’t just say “They’re still looking”, I say “They were still looking when I last heard five hours ago”. It’s truly inconceivable that, with a child missing, that’s the way the discussion would go with outdated information.

>> Oh, now we switch to personal attacks? How about I call you a moron, because you can't grasp that if two seemingly not-stupid people disagree about a statement, it cannot possibly be "clear"?

One person disagreeing is not a sufficient threshold for declaring something unclear. Otherwise nothing would ever be clear. Survey some people, see what answers you get.

>> I can see that you fail to separate slightly complicated abstractions. For instance, in your example you confuse objective truth and the information that a message conveys.

I’m not saying the two examples are the same. I was taking the argument to the absurd to show that one person’s unclear doesn’t invalidate a truth. It ignores the possibility of a person being incorrect.

1

lostmsu t1_j1t7nph wrote

> If we’re talking about a lost child

Now you are just making things up.

> my information is hours out of date I don’t just say

This depends on the context of the dialog, which in this case is not present. E.g. this could be a conversation about events happening elsewhere only tangentially relevant to the conversation participant(s). For a specific example consider that dialog being about the disappearance of MH370 flight.

> One person disagreeing is not a sufficient threshold for declaring something unclear.
>
> I was taking the argument to the absurd to show that one person’s unclear doesn’t invalidate a truth.

It normally would not be, but we are not two randomly selected people, and neither of us is crazy nor do we argue in bad faith.

1

aussie_punmaster t1_j1w351b wrote

Well you can just answer “we can’t be sure” to every question in life then.

Scenario 2:

Bob: “Are there any apples left?”
Fred: “There are 2 in the fruit bowl”

Question - How many apples are there?
lostmsu - we can’t be sure. Maybe Fred looked at the fruit bowl yesterday, and since then perhaps someone else took one.

This is the logic you are selling. Obviously I’m not going to be able to convince you though. I’d suggest we leave it here, although I would encourage you to survey some friends. See if you find anyone else who agrees with you.

0

lostmsu t1_j1x7gfr wrote

>lostmsu - we can’t be sure. Maybe Fred looked at the fruit bowl yesterday

I mean. I mean. Did you read the last sentence? I am selling the logic that if two sane, non-stupid people in good faith disagree, then it is unclear. In your example, lostmsu is a fruit of your imagination. You can't be sure that fruit is sane and non-stupid. Here the argument is that we are in the ML subreddit context, and we both understand the topic at hand, which raises the chances of both of us matching the criteria to near 100%.

In this context, if I started disagreeing with 1+1=2, you should at least start to suspect that, e.g., I'm on to something.

1

lostmsu t1_j053ewq wrote

From the standpoint of logic, this answer looks correct to me. If Juan's answer had been "I am still looking", then the answer to "Has the person been found?" would be "No", but as formulated, "unclear" is correct.

1

TheDeviousPanda t1_izio2gc wrote

This is just literally not true - every model on beta.openai.com and ChatGPT answers this question correctly. Either the experimental setup is contrived or I’m just completely misunderstanding the paper.

23

leliner t1_izno4vq wrote

As other people have been pointing out, myself included on Twitter, anecdotal evidence on one example tells us nothing. We try 9 different prompts on 600 examples of implicature, we do few-shot prompting including up to 30 examples in-context (filling the context window), and we try a contrastive framing of the question. I think you are misunderstanding the paper. Already at the time of publishing the paper, the introductory examples in the abstract were properly answered by OpenAI's models; that does not change the story. Additionally, ChatGPT does much better than Davinci-2 (and -3), but still has a gap with humans, especially on the particularised examples subset (last paragraph of Section 4.1 in the paper).

4

timscarfe t1_izikxup wrote

We interviewed Laura at NeurIPS last week here -- https://youtu.be/5Yd28ssDutA

18

Flag_Red t1_izimx17 wrote

I just watched your video a couple of hours ago. It's interesting seeing people repeat the same criticisms of the paper, based on misunderstandings, that Laura addresses in it.

14

rePAN6517 t1_izilxsq wrote

The paper only tested against InstructGPT 175B / text-davinci-002. They did not test against ChatGPT or text-davinci-003.

If they had, I think the paper would obviously be titled "Large language models are zero-shot communicators"

14

CommunismDoesntWork t1_izj06ql wrote

Yeah, we're at the point where models are improving faster than we can evaluate them lol

10

egrefen t1_iznmjuu wrote

Those models weren’t released at the time of writing. I would love it if these models significantly moved the dial on this benchmark, as that would confirm the direction we see with Davinci. Curious to hear why you are so confident, though.

1

shadowknight094 t1_izj34x6 wrote

Can anyone here explain what zeroshot is? New to NLP. Let me know if this is not the right place to ask that though.

10

jcasper t1_izj4owg wrote

Zero shot means a large language model (LLM) is performing a task without seeing any examples of the task being done. One shot or few shot gives some examples of the task in the prompt before the task.
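For example, a minimal sketch (the wording is made up for illustration and is not the paper's prompt template):

```python
# Toy illustration of zero-shot vs. few-shot prompting for the implicature
# task; the example wording is my own, not the paper's exact template.

task = ("Esther asked 'Have you found him yet?' and Juan responded "
        "'They're still looking'. Is the answer to Esther's question yes or no?")

# Zero-shot: the model sees only the task itself, with no worked examples.
zero_shot_prompt = task + "\nAnswer:"

# Few-shot: the same task, prefixed with a few solved examples of the task.
few_shot_examples = (
    "Esther asked 'Did you leave fingerprints?' and Juan responded "
    "'I wore gloves'. Is the answer to Esther's question yes or no?\nAnswer: no\n\n"
    "Esther asked 'Do you want some cake?' and Juan responded "
    "'I'm on a diet'. Is the answer to Esther's question yes or no?\nAnswer: no\n\n"
)
few_shot_prompt = few_shot_examples + zero_shot_prompt

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)
```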

10

soraki_soladead t1_izjpxd6 wrote

FWIW, it is very difficult to know whether the model has seen the task (or similar tasks) before, due to the nature of the data collection.

I feel like "zero-shot" / "few-shot" have taken on a much less rigorous meaning when applied to LLMs.

15

FutureIsMine t1_izkmv5n wrote

Hmm, to some extent I think you might be onto something; there could be something related. On the other hand, seeing enough data means you've seen bits and pieces of a zero-shot task, so it's not exactly "seen", but it's not a brand-new, novel task either - the model is just piecing it together from multiple other tasks.

2

Same_Smoke6922 t1_izkdzgw wrote

Giving some examples in the prompt is zero-shot. Few shot means using few examples in the training.

2

egrefen t1_iznhuqz wrote

I am surprised and disappointed to be reading so many of the responses on this post. A lot of them amount to something like “Hey guys this paper says coins don’t understand language, but hey I shouted HEADS at a coin while flipping it three times and it came up heads twice so I don’t think it’s fair to say coins don’t understand us”.

I’m not saying LLMs are coins here, but for Pete’s sake some people need to calibrate their expectations given how a random baseline performs on this task, and understand the need for systematic evaluation. And no, having tried a few examples you pulled from the paper on ChatGPT is not systematic evaluation.

Then there are the responses along the lines of “the experiments aren’t fair because zero-shot doesn’t make sense / they should have used chain of thought / some other feature”. First, some of these auxiliary experiments are done and contextualised in the paper. Second, this general argument is addressed in the paper, but to add colour here: this benchmark tests a very simple aspect of conversational implicature. Humans are very good at resolving completely new conversational implicatures in novel contexts, which distinguishes this class of implicatures from conventional implicatures. We ourselves do this zero-shot every day, in nearly every conversation. Testing this zero-shot, and noting that maximising the likelihood of text at any scale does not capture this phenomenon but that instruction-following finetuning somehow does move the dial, is an important finding, because it indicates how we can get there.

Ultimately, we all want LLMs to be zero-shot communicators in line with human performance for them to have extrinsic utility that matches human abilities. I will never understand some people’s rush to skip the actual message and method of a paper in their quest to see the state of affairs they want where it might not exist, and in doing so make themselves more vulnerable to confirmation bias, and less able to drive actual progress towards a goal we all share.

10

RevolutionaryGear647 t1_izo1lkc wrote

Well said. I think a lot of people have an unrealistic roadmap of AI development and will go through heaps of logical fallacies in order to “be right”.

The progress has been massive, but I think identifying where our current limitations are is what’s gonna enable us to progress even further.

6

Competitive-Rub-1958 t1_izo1n60 wrote

Assuming those objections were directed towards my comment (as they seem to directly address it), and brushing over the antagonistic tone: I am not claiming that your evaluation was not systematic, nor am I reaching conclusions from a few examples - that's a misrepresentation of what the general consensus towards this paper is.

I wholeheartedly agree with you that LLMs should 0-shot understanding implicature, but there are certain nuances here that seem to be ignored. What I was going for is simply this:

1> The paper should have compared their alternative prompt templates to CoT, especially if you explicitly mention CoT. The idea is quite clear - look at this paper, for instance. Complex tasks that usually involve disambiguating a chain of events ("I wore gloves" -> gloves cover the fingers -> they don't leave fingerprints -> therefore, the answer is Y/N) benefit greatly from CoT. It may seem like an insignificant demand, maybe even some reviewer-2 vibes here, but it seems reasonable to expect that a method that works on almost every task should have been tested here - merely out of scientific curiosity to observe the outcome had this template been incorporated.

2> More importantly: when you prompt your model k-shot, it does NOT reveal any context whatsoever about the actual target question. When you few-shot, you give it completely independent examples of how to perform the task at hand, with no bearing on the actual question you ask. So it would perceive "gloves" and the concept of fingerprints independently of the provided examples, which could be about bananas and groceries. Yet few-shot prompting primes the LLM to better understand this task; there is a lot of literature exploring this interesting phenomenon (mostly attributed to a mix of ICL and statistical patterns).

This extremely important point wasn't mentioned in the paper at all; few-shot prompting doesn't actually invalidate the claim that LLMs are not human-aligned communicators. Hence why I quoted above the ~5% difference in accuracy between the average human and the few-shot LLM.

Lastly, no one's claiming ChatGPT is perfect. All I mentioned was that I would like to see the benchmark tested on the latest iteration of RLHF models and see how it fares. It was in no way meant to denigrate the authors or the paper at hand, or to claim that ChatGPT can somehow perform tasks that GPT-3/InstructGPT cannot.

1

leliner t1_izobd14 wrote

Just to respond to 2 (I disagree with 1, and Ed already covered that extensively in another comment): I agree with you! It primes the model for the task, which might be a fairer comparison to humans, who are otherwise motivated, than zero-shot; we do not currently know (e.g. see Andrew Lampinen's paper on the topic https://arxiv.org/abs/2210.15303).

We argue in the paper, and here, that ~5% is significant; additionally, on the subset of particularised examples the difference is ~9%. The actual significance of this gap, in terms of whether it will be noticeable to humans in some kind of Turing-style comparison, is an endeavour for future work. I personally expect it to be, first of all, noticeable and, second of all, to re-iterate, this is a very simple type of binary conversational implicature; it remains to be seen how models fare on more complex implicature.

3

Acceptable-Cress-374 t1_izigh23 wrote

Would this improve with some prompt engineering? Could you perhaps use the LLM to first provide itself some context and then answer the question (in what becomes a few-shot attempt)? In other words, is it worth training for 0-shot, or can we use the LLM to self-provide some context and answer the prompt in self-learned few-shot? Does my question even make sense?

6

mrx-ai OP t1_izijamw wrote

You might want to look at p. 8 of the paper. The authors evaluate three different models (GPT-3-175B, InstructGPT-3-175B, and text-davinci-002) using different prompt templates, but none of the models shows improved performance. The variance of the results for text-davinci-002 is particularly high, and the best prompt template only achieves 74.5% accuracy.
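For intuition, accuracy per prompt template could be computed with something like the sketch below (helper names and the yes/no mapping are hypothetical, not the authors' code):

```python
# Rough sketch of how accuracy per prompt template could be computed; the
# helper names and the yes/no mapping are illustrative, not the paper's code.
from typing import Callable, List, Tuple

def score_template(
    build_prompt: Callable[[str, str], str],   # (utterance, question) -> prompt text
    examples: List[Tuple[str, str, str]],      # (utterance, question, gold "yes"/"no")
    query_model: Callable[[str], str],         # wraps an LLM call, returns raw completion
) -> float:
    correct = 0
    for utterance, question, gold in examples:
        completion = query_model(build_prompt(utterance, question)).strip().lower()
        # Map the free-form completion onto a binary prediction.
        prediction = "yes" if completion.startswith("yes") else "no"
        correct += int(prediction == gold)
    return correct / len(examples)

# Usage idea: run this for each prompt template over the 600 test examples and
# compare the spread; a large spread means high sensitivity to prompt wording.
```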

6

MrTacobeans t1_izpt1y6 wrote

I just tried this on character.ai and, although the replies were long-winded since character.ai prefers sending bricks of text, it answered this question perfectly, both on a copy from above and on a more natural-language version that I tried to trip the bot up with. In both cases the bot passed with flying colors.

1

nildeea t1_izkove7 wrote

Idgaf, ChatGPT now writes 95% of my code and it's excellent.

−4