rshah4 t1_jdzyo8u wrote

Nice work! -- How did the results compare between ChatGPT zero-shot and few-shot? I have noticed that with LLMs you can often get an improvement from few-shot learning (giving the model a few examples in the prompt).
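
To make the distinction concrete, here is a minimal sketch of zero-shot vs. few-shot prompting with the OpenAI Python client (pre-1.0 API); the sentiment task and examples are purely illustrative, not from the paper:

```python
import openai  # assumes `pip install openai` and OPENAI_API_KEY set in the environment

# Zero-shot: only the task description, no examples.
zero_shot = [
    {"role": "user", "content": "Classify the sentiment of: 'The movie was a waste of time.'"},
]

# Few-shot: the same task, preceded by a few worked examples.
few_shot = [
    {"role": "user", "content": "Classify the sentiment of: 'I loved every minute.'"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Classify the sentiment of: 'The plot made no sense.'"},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Classify the sentiment of: 'The movie was a waste of time.'"},
]

response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=few_shot)
print(response["choices"][0]["message"]["content"])
```

The only difference is the extra worked examples prepended to the conversation; no weights change, so it is cheap to try both variants on the same task.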

I am not surprised that we don't see much improvement over GPT-3 on traditional NLP tasks. It seems much of OpenAI's focus is not on these benchmarks but on making the results more useful to people (all the instruction-tuning / RLHF work).

https://arxiv.org/pdf/2209.12356.pdf
https://arxiv.org/pdf/2301.13848.pdf

Also, for real-world use, ChatGPT doesn't necessarily need to beat a fine-tuned SOTA model. ChatGPT is much easier to use than having to fine-tune a more traditional model.


matus_pikuliak OP t1_je0am88 wrote

Only some of the papers used few-shot prompting. It was usually beneficial, and in some cases it even helped beat the SOTA.

Yeah, OpenAI definitely does not care about these benchmarks, but I think they are still useful for seeing how capable the models are. I find it hard to imagine the models being used in some applications if they cannot reliably handle even the simple tasks these benchmarks evaluate.


rshah4 t1_je0crbz wrote

I agree, these baselines are useful. What I think we should push for is more human baselines for these benchmarks. That would help us figure out how far we have left to go.
