master3243 t1_j12nmgc wrote
Reply to comment by Purplekeyboard in [R] Nonparametric Masked Language Modeling - MetaAi 2022 - NPM - 500x fewer parameters than GPT-3 while outperforming it on zero-shot tasks by Singularian2501
There's no way for a paper to just include a table of "real-world comparison with GPT-3".
There needs to be (for now) some benchmark that systematically tests for the things we care about. Which is exactly why I deeply respect researchers dedicated to creating better and more useful benchmarks: their work immensely accelerates the field, while they mostly don't get the attention they (IMO) deserve.
Purplekeyboard t1_j12uk1s wrote
But what I'm asking is, how well do the benchmarks match real-world performance? Because I've seen claims that other language models were supposedly close or equal to GPT-3 on this or that benchmark, but try interacting with them and the difference is striking. It's like the difference between talking to a college grad student and talking to the meth-addled homeless guy who shouts at lampposts.
valdanylchuk t1_j137hla wrote
From the paper:
>Extension for generation. It is currently non-trivial to use NPM for generation, since it is the encoder-only model. Future work can explore autoregressive generation as done in Patel et al. (2022) or use NPM for editing (Schick et al., 2022; Gao et al., 2022).
So, don't expect to talk to it just yet.
yaosio t1_j17p2bx wrote
There was a thread a while back about one benchmark being filled with spelling errors, grammar errors, and wrong answers. In many cases there were multiple correct answers, but one was picked as the correct answer for no particular reason. Creating a benchmark for the subjective task of "is this text good?" seems to be pretty hard. It's even harder when the people creating the benchmark have a poor grasp of language.
If I were to ask a language model "Describe an apple.", there are many correct answers, none more correct than the others. Multiple independent humans would have to go over the answers and make subjective decisions on whether the LLM answered well. This becomes much more difficult with better LLMs, because the prompts and answers have to become more complex, which makes reviewing the answers harder and more time consuming.
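As a rough sketch of what that kind of human review could look like (the 1-5 rating scale, the annotator counts, and the function names here are all hypothetical, not from any paper or benchmark):

```python
from statistics import mean
from itertools import combinations

# Hypothetical setup: three annotators each rate a model answer from 1 (bad) to 5 (good).
ratings = {
    "Describe an apple.": {
        "answer_a": [5, 4, 4],  # scores from annotators 1-3
        "answer_b": [2, 3, 1],
    },
}

def aggregate(ratings_per_answer):
    """Average the independent ratings and report how much annotators disagree."""
    scores = {}
    for answer, rs in ratings_per_answer.items():
        avg = mean(rs)
        # Simple disagreement measure: mean absolute difference between annotator pairs.
        disagreement = mean(abs(a - b) for a, b in combinations(rs, 2))
        scores[answer] = (avg, disagreement)
    return scores

for prompt, per_answer in ratings.items():
    print(prompt)
    for answer, (avg, dis) in aggregate(per_answer).items():
        print(f"  {answer}: mean={avg:.2f}, disagreement={dis:.2f}")
```

Even this toy version shows the scaling problem: every new prompt/answer pair needs several more human judgments, and the disagreement number only tells you the annotators don't agree, not who is right.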
__Maximum__ t1_j136j97 wrote
How about BIG-bench?