Purplekeyboard t1_j12lik7 wrote
Ok, but how does it compare in the real world to GPT-3?
master3243 t1_j12nmgc wrote
There's no way for a paper to just include a table of "real-world comparison with GPT-3".
For now, there needs to be some benchmark that systematically tests for the things we care about. Which is exactly why I deeply respect researchers dedicated to creating better and more useful benchmarks: their work immensely accelerates the field, yet they mostly don't get the attention they (IMO) deserve.
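Concretely, most benchmarks reduce to a scoring loop like this (a rough sketch in Python, with `query_model` as a hypothetical stand-in for whatever API or local inference call is being tested):

```python
from typing import Callable

def accuracy_on_benchmark(
    examples: list[dict],                 # each: {"input": str, "target": str}
    query_model: Callable[[str], str],    # hypothetical model-call stand-in
) -> float:
    """Score a model by exact match against reference answers."""
    correct = 0
    for ex in examples:
        prediction = query_model(ex["input"]).strip()
        correct += prediction == ex["target"].strip()
    return correct / len(examples)
```

The hard part isn't the loop, it's choosing `examples` and a scoring rule that actually reflect what we care about.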
Purplekeyboard t1_j12uk1s wrote
But what I'm asking is, how well do the benchmarks match real-world performance? Because I've seen claims that other language models were supposedly close or equal to GPT-3 on this or that benchmark, but try interacting with them and the difference is striking. It's like the difference between talking to a college grad student and talking to the meth-addled homeless guy who shouts at lampposts.
valdanylchuk t1_j137hla wrote
From the paper:
>Extension for generation. It is currently non-trivial to use NPM for generation, since it is the encoder-only model. Future work can explore autoregressive generation as done in Patel et al. (2022) or use NPM for editing (Schick et al., 2022; Gao et al., 2022).
So, don't expect to talk to it just yet.
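If you want to see what "encoder-only" means in practice, here's a rough sketch using an ordinary masked LM from HuggingFace as a stand-in (roberta-base, not NPM itself; NPM adds nonparametric retrieval, but the interface limitation is the same):

```python
from transformers import pipeline

# Encoder-only models predict masked tokens; they have no
# left-to-right sampling loop for open-ended generation.
fill = pipeline("fill-mask", model="roberta-base")
print(fill("The capital of France is <mask>."))
# -> ranked candidate fills for the single masked position,
#    which is why chat-style generation needs the extensions
#    the paper mentions.
```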
yaosio t1_j17p2bx wrote
There was a thread a while back about one benchmark being filled with spelling errors, grammar errors, and wrong answers. In many cases there were multiple correct answers, but one was picked as the correct answer for no particular reason. Creating a benchmark for the subjective task of "is this text good?" seems to be pretty hard. It's even harder when the people creating the benchmark have a poor grasp of language.
If I were to ask a language model to "describe an apple," there are many correct answers, none more correct than the others. Multiple independent humans would have to go over the answers and make subjective decisions on whether the LLM answered well. This becomes much more difficult with better LLMs, because the prompts and answers have to become more complex, which makes reviewing the answers harder and more time-consuming.
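And even just aggregating those human judgments is its own problem. A toy sketch of majority voting (everything here is hypothetical):

```python
from collections import Counter

def majority_label(ratings: list[str]) -> str:
    """Aggregate independent human ratings by majority vote."""
    counts = Counter(ratings)
    top, freq = counts.most_common(1)[0]
    # A tie means the raters themselves couldn't agree, which is
    # exactly the ambiguity with open-ended answers.
    if sum(1 for c in counts.values() if c == freq) > 1:
        return "no consensus"
    return top

print(majority_label(["good", "good", "bad"]))  # good
print(majority_label(["good", "bad"]))          # no consensus
```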
__Maximum__ t1_j136j97 wrote
How about BIG-bench?
blose1 t1_j12voe0 wrote
GPT-3 is yesterday's news. The SOTA is ChatGPT, and it runs circles around real-world GPT-3 on every possible task.
RealGrande t1_j12zfl6 wrote
ChatGPT is a fine-tuned version of GPT-3 (well, GPT-3.5, but pretty much the same barring some improvements).
blose1 t1_j14q7ul wrote
Have you actually tried both on the same tasks? It seems like a lot of people here read a paper and some blog post and draw their conclusions without ever using the tool. I've used both on the same tasks and compared them on hundreds of real-world cases. Yes, it's fine-tuned GPT-3, but with human-assisted RL, and it runs circles around GPT-3 in question answering, COT, and code generation.
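Doing the comparison yourself is basically just this (a hypothetical sketch; `ask_gpt3` and `ask_chatgpt` are placeholders for however you access each model, not real client code):

```python
def compare(prompts, ask_gpt3, ask_chatgpt):
    """Run the same prompts through both models and tally preferences."""
    wins = {"gpt3": 0, "chatgpt": 0, "tie": 0}
    for p in prompts:
        print(f"PROMPT:  {p}")
        print(f"GPT-3:   {ask_gpt3(p)}")
        print(f"ChatGPT: {ask_chatgpt(p)}")
        choice = input("winner? [gpt3/chatgpt/tie] ").strip()
        wins[choice if choice in wins else "tie"] += 1
    return wins
```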
oathbreakerkeeper t1_j15vgip wrote
What's COT?
Think_Olive_1000 t1_j1cvpnz wrote
Chain of thought
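i.e., you show the model a worked, step-by-step exemplar before your question. A minimal sketch (the exemplar is the classic one from Wei et al., 2022):

```python
# Minimal chain-of-thought prompt: one worked exemplar teaches the
# model to reason step by step before giving its final answer.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many tennis balls
does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A:"""
# Sent to the model as-is, this elicits step-by-step reasoning
# instead of a bare (and often wrong) guess.
```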
ShowerVagina t1_j13gxva wrote
GPT-3 is still the best for general use, or for story writing. NovelAI is good, but still not as good as GPT-3.
blose1 t1_j14qfir wrote
Have you compared both yourself on question answering, COT, and code generation?
mtocrat t1_j14di0f wrote
How is that relevant?