Submitted by Singularian2501 t3_zr2en7 in MachineLearning

Paper: https://arxiv.org/abs/2212.01349

Github: https://github.com/facebookresearch/NPM

Abstract:

>Existing language models (LMs) predict tokens with a softmax over a finite vocabulary, which can make it difficult to predict rare tokens or phrases. We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus. We show that NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval. Zero-shot evaluation on 9 closed-set tasks and 7 open-set tasks demonstrates that NPM outperforms significantly larger parametric models, either with or without a retrieve-and-generate approach. It is particularly better at dealing with rare patterns (word senses or facts) and at predicting rare or nearly unseen words (e.g., non-Latin script).
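
The core idea, roughly: instead of a softmax over a fixed vocabulary, the model scores spans of an external corpus, representing the [MASK] with two query vectors (one for the start of the phrase, one for the end). Below is a minimal toy sketch of that prediction step, with hypothetical names and shapes and a brute-force scan in place of the paper's fast similarity index; it is not the authors' code.

```python
import numpy as np

def fill_mask(q_start, q_end, corpus_vecs, corpus_tokens,
              top_k=50, max_span=8):
    """Toy nonparametric mask filling: choose the corpus span whose
    start/end token embeddings best match the two [MASK] query
    vectors, rather than taking a softmax over a fixed vocabulary."""
    start_scores = corpus_vecs @ q_start   # (N,) score of each corpus token as a span start
    end_scores = corpus_vecs @ q_end       # (N,) score of each corpus token as a span end
    best_score, best_span = -np.inf, None
    for i in np.argsort(start_scores)[-top_k:]:  # top-k candidate start positions
        for j in range(i, min(i + max_span, len(corpus_tokens))):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_score, best_span = score, corpus_tokens[i:j + 1]
    return best_span
```

The real model searches the corpus with an approximate nearest-neighbor index rather than this brute-force loop, which is how it stays speed-competitive with larger parametric models (see the discussion below).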


271

Comments


Dankmemexplorer t1_j123o1b wrote

time to train gpt-4 on my mom's laptop

58

farmingvillein t1_j12s4mn wrote

Unfortunately it's still really slow to run (for now); however:

> the speed of NPM is still on par with the speed of significantly larger parametric models that NPM outperforms

15

yaosio t1_j15h0xa wrote

They also say there's room for improvement, but they didn't explore that in this paper. Just think: one day we'll have the power of ~~the sun~~ GPT-3 in the palm of our hand. Could be really soon, could be far away, but it's coming.

4

red75prime t1_j1899a0 wrote

GPT-3: Sure, I can tell you the power output of the sun. It would be 3.8 x 10^26 W, or 3.234 kW. I'm glad to help.

2

rjromero t1_j12aza8 wrote

> We use the model architecture and initial weights of RoBERTa large (Liu et al., 2019), consisting of 354M parameters. Training is done for 100,000 steps, using thirty-two 32GB GPUs.

354M parameters? At FP32 that's about 1.4 GB. It's tiny.
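
A quick back-of-envelope check of that figure (plain arithmetic, not from the paper):

```python
params = 354_000_000       # RoBERTa-large parameter count
fp32_bytes = 4 * params    # 4 bytes per parameter at FP32
print(f"{fp32_bytes / 1e9:.1f} GB")  # -> 1.4 GB (decimal; ~1.3 GiB)
```

Note the weights aren't the whole footprint here: at inference time NPM also needs the dense index over the reference corpus it retrieves from.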

55

vwings t1_j13pguc wrote

It was expected, right? A retrieval system should be much more efficient than storing phrases in neural net weights as GPT does...

6

CatalyzeX_code_bot t1_j11aphv wrote

Found relevant code at https://github.com/facebookresearch/NPM + all code implementations here

--

To opt out from receiving code links, DM me

7

Singularian2501 OP t1_j11bgj5 wrote

The GitHub link is broken. That's also the reason I didn't include it in the post. The paper is not mine! I also searched Papers with Code, but they don't have a working link either.

Edit: the link is working now: https://github.com/facebookresearch/NPM

14

Purplekeyboard t1_j12lik7 wrote

Ok, but how does it compare in the real world to GPT-3?

6

master3243 t1_j12nmgc wrote

There's no way for a paper to just include a table of "real-world comparison with GPT-3".

For now, there needs to be some benchmark that systematically tests for the things we care about. That's exactly why I deeply respect researchers dedicated to creating better and more useful benchmarks: their work immensely accelerates the field, yet they mostly don't get the attention they (IMO) deserve.

31

Purplekeyboard t1_j12uk1s wrote

But what I'm asking is: how well do the benchmarks match real-world performance? Because I've seen claims that other language models were supposedly close to or equal to GPT-3 on this or that benchmark, but try interacting with them and the difference is striking. It's like the difference between talking to a college grad student and talking to the meth-addled homeless guy who shouts at lampposts.

12

valdanylchuk t1_j137hla wrote

From the paper:

>Extension for generation. It is currently non-trivial to use NPM for generation, since it is an encoder-only model. Future work can explore autoregressive generation as done in Patel et al. (2022) or use NPM for editing (Schick et al., 2022; Gao et al., 2022).

So, don't expect to talk to it just yet.

7

yaosio t1_j17p2bx wrote

There was a thread a while back about one benchmark being filled with spelling errors, grammar errors, and wrong answers. In many cases there were multiple correct answers, but one was picked as the correct answer for no particular reason. Creating a benchmark for the subjective task of "is this text good?" seems to be pretty hard. It's even harder when the people creating the benchmark have a poor grasp of language.

If I were to ask a language model "Describe an apple," there are many correct answers, none more correct than the others. Multiple independent humans would have to go over the answers and make subjective decisions on whether the LLM answered well. This becomes much more difficult with better LLMs, because the prompts and answers have to become more complex, which makes reviewing the answers harder and more time-consuming.

1

blose1 t1_j12voe0 wrote

GPT-3 is yesterday's news; the SOTA is ChatGPT, and it runs circles around real-world GPT-3 on every possible task.

−18

RealGrande t1_j12zfl6 wrote

ChatGPT is a fine-tuned version of GPT-3 (well, GPT-3.5, but pretty much the same barring some improvements).

16

blose1 t1_j14q7ul wrote

Have you actually tried both on the same tasks? It seems like a lot of people here read a paper and some blog and draw their conclusions without even using the tool. I've used both on the same tasks, compared on hundreds of real-world cases, and yes, it's fine-tuned GPT-3, but with human-assisted RL, and it runs circles around GPT-3 in question answering, CoT, and code generation.

2

ShowerVagina t1_j13gxva wrote

GPT-3 is still the best for general use, or for story writing. NovelAI is good, but still not as good as GPT-3.

1

blose1 t1_j14qfir wrote

Have you compared both yourself on question answering, CoT, and code generation?

1

machinelearner77 t1_j1437x9 wrote

Looks like cool stuff... but if you put a code link in the abstract and publish your paper, it should be a functioning link...

3

[deleted] t1_j15c4qw wrote

[deleted]

3

PengsoonThePenguin t1_j16rrtg wrote

I guess an easy explanation is that the model works solely from retrieval over the corpus. Every prediction has to be explained by the corpus.

3

drd13 t1_j1h3gvy wrote

Similarly to T5 (and BERT), the model is pre-trained by predicting randomly masked spans of words. However, the way these spans are predicted is different.

In T5, masked words are generated one by one autoregressively (i.e., a softmax over the vocabulary generates each word in turn). Here, a set of candidate spans covering the whole training corpus is created first, and the model looks at all the candidate spans and chooses the one it thinks is best (trained with a contrastive loss).
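
A minimal sketch of what such an in-batch contrastive objective can look like (illustrative PyTorch with hypothetical names, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_vecs, span_vecs, positive_idx, tau=0.07):
    """query_vecs:   (B, d) representations of the masked positions
       span_vecs:    (M, d) candidate span representations drawn from the
                     same batch -- the in-batch approximation to full
                     corpus retrieval mentioned in the abstract
       positive_idx: (B,)   index of each query's gold span in span_vecs"""
    logits = query_vecs @ span_vecs.T / tau       # (B, M) similarities, temperature-scaled
    return F.cross_entropy(logits, positive_idx)  # pull the gold span up, push the rest down
```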

2

gbfar t1_j16478a wrote

I see lots of potential applications for this. I wonder if we could reliably control text generation by tweaking the reference corpus.

1