Comments

killver t1_jdzbsaz wrote

I think the tricky thing about actually validating zero-shot capabilities is again a question of in-sample vs. out-of-sample. Which of these samples has ChatGPT actually already seen?

5

Disastrous_Elk_6375 t1_jdzccfq wrote

Is zero-shot really the strength of GPT, and especially ChatGPT? From my (limited) experience interacting with ChatGPT, the value seems to come from prompt understanding and adaptation to my follow-up prompts / corrections. In the context of an assistant, I'm OK with priming the conversation first if it handles the subsequent requests better.

1

stimulatedecho t1_jdzxtb6 wrote

"complex reasoning is perhaps the most interesting feature of these models right now and it is unfortunately mostly absent from this survey"

Bingo. It is also the hardest to quantify; it's one of those "I know it when I see it" sort of behaviors. It is easy to imagine how one might harness that ability to reason to solve all sorts of problems, including (but certainly not limited to) improving benchmark performances. I think that is what has a lot of people excited.

1

rshah4 t1_jdzyo8u wrote

Nice work! -- How did the results compare when using ChatGPT zero-shot versus few-shot? I have noticed that with LLMs you can often get an improvement from few-shot learning (giving the model a few examples in the prompt).
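For anyone unfamiliar, few-shot prompting just means packing a few labeled examples into the prompt itself before the actual query. A minimal sketch of what that looks like (assuming the 0.x-era `openai` Python client; the task, labels, and example reviews are made up for illustration):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Few-shot prompt: two labeled examples, then the query we want answered.
few_shot_prompt = (
    "Classify the sentiment of each review as positive or negative.\n\n"
    "Review: The plot was predictable and the acting flat.\n"
    "Sentiment: negative\n\n"
    "Review: A moving story with terrific performances.\n"
    "Sentiment: positive\n\n"
    "Review: I walked out halfway through.\n"
    "Sentiment:"
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,  # deterministic output for classification
)
print(response["choices"][0]["message"]["content"].strip())
```

The zero-shot version is the same call with the two worked examples deleted; the comparison in the survey is essentially between those two prompt styles.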

I am not surprised that we don't see much of an improvement over GPT-3 on traditional NLP tasks. It seems much of OpenAI's focus is not on these benchmarks but on making the results more useful to people (all the instruction tuning / RLHF work).

https://arxiv.org/pdf/2209.12356.pdf
https://arxiv.org/pdf/2301.13848.pdf

Also, for real-world use, ChatGPT doesn't need to beat a fine-tuned SOTA model. ChatGPT is much easier to use than fine-tuning a more traditional model.

1

matus_pikuliak OP t1_je0am88 wrote

Only some papers used few-shot prompting; it was usually beneficial, and in some cases it helped beat the SOTA.

Yeah, OpenAI definitely does not care about these benchmarks, but I think they are still useful for seeing how capable the models are. I find it hard to imagine that the models could be used in some applications if they cannot reliably do even the simple tasks evaluated by these benchmarks.

1

rshah4 t1_je0crbz wrote

I agree, these baselines are useful. What I think we should push for is more human baselines for these benchmarks. That would help us figure out how far we have left to go.

1