matus_pikuliak OP t1_je0am88 wrote

Only some papers used few-shot prompting. When they did, it was usually beneficial, and sometimes it helped beat the SOTA.

Yeah, OpenAI definitely does not care about these benchmarks, but I think they are still useful for seeing how capable the models are. I find it hard to imagine the models being used in some applications if they cannot reliably do even the simple tasks evaluated by these benchmarks.

1

rshah4 t1_je0crbz wrote

I agree, these baselines are useful. What I think we should push for is more human baselines for these benchmarks. That would help figure out how far we have left to go.

1