Purplekeyboard t1_j12lik7 wrote
Ok, but how does it compare in the real world to GPT-3?
master3243 t1_j12nmgc wrote
There's no way for a paper to just include a table of "real-world comparison with GPT-3".
For now, there needs to be some benchmark that systematically tests for the things we care about. Which is exactly why I deeply respect researchers dedicated to creating better and more useful benchmarks: their work immensely accelerates the field, yet they mostly don't get the attention they (IMO) deserve.
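Concretely, most benchmarks reduce to a scoring loop like this (a rough sketch in Python, with `query_model` as a hypothetical stand-in for whatever API or local inference call is being tested):

```python
from typing import Callable

def accuracy_on_benchmark(
    examples: list[dict],                 # each: {"input": str, "target": str}
    query_model: Callable[[str], str],    # hypothetical model-call stand-in
) -> float:
    """Score a model by exact match against reference answers."""
    correct = 0
    for ex in examples:
        prediction = query_model(ex["input"]).strip()
        correct += prediction == ex["target"].strip()
    return correct / len(examples)
```

The hard part isn't the loop, it's choosing `examples` and a scoring rule that actually reflect what we care about.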
Purplekeyboard t1_j12uk1s wrote
But what I'm asking is, how well do the benchmarks match real-world performance? Because I've seen claims that other language models were supposedly close or equal to GPT-3 on this or that benchmark, but try interacting with them and the difference is striking. It's like the difference between talking to a college grad student and talking to the meth-addled homeless guy who shouts at lampposts.
valdanylchuk t1_j137hla wrote
From the paper:
>Extension for generation. It is currently non-trivial to use NPM for generation, since it is the encoder-only model. Future work can explore autoregressive generation as done in Patel et al. (2022) or use NPM for editing (Schick et al., 2022; Gao et al., 2022).
So, don't expect to talk to it just yet.
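If you want to see what "encoder-only" means in practice, here's a rough sketch using an ordinary masked LM from HuggingFace as a stand-in (roberta-base, not NPM itself; NPM adds nonparametric retrieval, but the interface limitation is the same):

```python
from transformers import pipeline

# Encoder-only models predict masked tokens; they have no
# left-to-right sampling loop for open-ended generation.
fill = pipeline("fill-mask", model="roberta-base")
print(fill("The capital of France is <mask>."))
# -> ranked candidate fills for the single masked position,
#    which is why chat-style generation needs the extensions
#    the paper mentions.
```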
yaosio t1_j17p2bx wrote
There was a thread a while back about one benchmark being filled with spelling errors, grammar errors, and wrong answers. In many cases there were multiple correct answers, but one was picked as the correct answer for no particular reason. Creating a benchmark for the subjective task of "is this text good?" seems to be pretty hard. It's even harder when the people creating the benchmark have a poor grasp of language.
If I were to ask a language model to "describe an apple," there are many correct answers, none more correct than the others. Multiple independent humans would have to go over the answers and make subjective decisions on whether the LLM answered well. This becomes much more difficult with better LLMs, because the prompts and answers have to become more complex, which makes reviewing the answers harder and more time-consuming.
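And even just aggregating those human judgments is its own problem. A toy sketch of majority voting (everything here is hypothetical):

```python
from collections import Counter

def majority_label(ratings: list[str]) -> str:
    """Aggregate independent human ratings by majority vote."""
    counts = Counter(ratings)
    top, freq = counts.most_common(1)[0]
    # A tie means the raters themselves couldn't agree, which is
    # exactly the ambiguity with open-ended answers.
    if sum(1 for c in counts.values() if c == freq) > 1:
        return "no consensus"
    return top

print(majority_label(["good", "good", "bad"]))  # good
print(majority_label(["good", "bad"]))          # no consensus
```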
__Maximum__ t1_j136j97 wrote
How about BIG-bench?
blose1 t1_j12voe0 wrote
GPT-3 is yesterday's news. The SOTA is ChatGPT, and it runs circles around real-world GPT-3 on every possible task.
RealGrande t1_j12zfl6 wrote
ChatGPT is a fine-tuned version of GPT-3 (well, GPT-3.5, but pretty much the same barring some improvements).
blose1 t1_j14q7ul wrote
Have you actually tried both on the same tasks? It seems like a lot of people here read a paper and some blog post and draw their conclusions without ever using the tool. I've used both on the same tasks and compared them on hundreds of real-world cases. Yes, it's fine-tuned GPT-3, but with human-assisted RL, and it runs circles around GPT-3 in question answering, COT, and code generation.
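Doing the comparison yourself is basically just this (a hypothetical sketch; `ask_gpt3` and `ask_chatgpt` are placeholders for however you access each model, not real client code):

```python
def compare(prompts, ask_gpt3, ask_chatgpt):
    """Run the same prompts through both models and tally preferences."""
    wins = {"gpt3": 0, "chatgpt": 0, "tie": 0}
    for p in prompts:
        print(f"PROMPT:  {p}")
        print(f"GPT-3:   {ask_gpt3(p)}")
        print(f"ChatGPT: {ask_chatgpt(p)}")
        choice = input("winner? [gpt3/chatgpt/tie] ").strip()
        wins[choice if choice in wins else "tie"] += 1
    return wins
```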
oathbreakerkeeper t1_j15vgip wrote
What's COT?
Think_Olive_1000 t1_j1cvpnz wrote
Chain of thought
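i.e., you show the model a worked, step-by-step exemplar before your question. A minimal sketch (the exemplar is the classic one from Wei et al., 2022):

```python
# Minimal chain-of-thought prompt: one worked exemplar teaches the
# model to reason step by step before giving its final answer.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many tennis balls
does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A:"""
# Sent to the model as-is, this elicits step-by-step reasoning
# instead of a bare (and often wrong) guess.
```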
ShowerVagina t1_j13gxva wrote
GPT-3 is still the best for general use, or for story writing. NovelAI is good, but still not as good as GPT-3.
blose1 t1_j14qfir wrote
Have you compared both yourself on question answering, COT, and code generation?
mtocrat t1_j14di0f wrote
How is that relevant?