Submitted by Singularian2501 t3_y4tp4b in MachineLearning

Paper: https://arxiv.org/abs/2205.05131

Github: https://github.com/google-research/google-research/tree/master/ul2

Blog: https://ai.googleblog.com/2022/10/ul2-20b-open-source-unified-language.html

Abstract:

>Existing pre-trained models are generally geared towards a particular class of problems, and to date there is still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective on self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. We further introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments comparing multiple pre-training objectives and find that our method pushes the Pareto frontier, outperforming T5- and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks spanning language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long-text reasoning, structured knowledge grounding, and information retrieval. Our model also achieves strong in-context learning results, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. We also show that UL2 20B works well with chain-of-thought prompting and reasoning. We release Flax-based T5X model checkpoints for the 20B model at https://github.com/google-research/google-research/tree/master/ul2.
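For readers wondering what Mixture-of-Denoisers actually looks like, here is a minimal, illustrative Python sketch of the idea: sample one of several denoising paradigms per example (regular span corruption, prefix-LM-style sequential denoising, or "extreme" corruption) and prepend a mode token so the model knows which paradigm it is in. The specific span lengths, corruption rates, sentinel format, and uniform mixing below are simplified stand-ins, not the paper's exact hyperparameters or implementation.

```python
# A minimal, illustrative sketch of the Mixture-of-Denoisers (MoD) idea:
# each training example is corrupted by one of several denoiser configs
# (R = regular span corruption, S = prefix-LM style, X = extreme corruption),
# and a mode token is prepended. The span lengths, corruption rates, and
# uniform mixing here are simplified stand-ins, not the paper's settings.
import random

DENOISERS = {
    "[R]": dict(mean_span=3,  corrupt_rate=0.15),   # T5-style span corruption
    "[X]": dict(mean_span=12, corrupt_rate=0.50),   # "extreme" denoising
    "[S]": None,                                    # sequential / prefix-LM
}

def span_corrupt(tokens, mean_span, corrupt_rate):
    """Mask random spans, returning (inputs with sentinels, span targets)."""
    n_to_mask = max(1, int(len(tokens) * corrupt_rate))
    inputs, targets, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        if n_to_mask > 0 and random.random() < corrupt_rate:
            span = max(1, int(random.expovariate(1 / mean_span)))
            span = min(span, n_to_mask, len(tokens) - i)
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(tokens[i:i + span])
            i += span
            n_to_mask -= span
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

def mod_example(tokens):
    """Sample a denoiser, apply it, and prepend its mode token."""
    mode = random.choice(list(DENOISERS))
    if mode == "[S]":                      # prefix-LM: predict the suffix
        split = random.randint(1, len(tokens) - 1)
        inputs, targets = tokens[:split], tokens[split:]
    else:
        cfg = DENOISERS[mode]
        inputs, targets = span_corrupt(tokens, **cfg)
    return [mode] + inputs, targets

if __name__ == "__main__":
    toks = "the quick brown fox jumps over the lazy dog".split()
    print(mod_example(toks))
```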

190

Comments

rmsisme t1_isg3nmz wrote

I'd like to see the results of an amateur implementation of this.

18

hosjiu t1_isi4pyg wrote

I share the same point of view as you.

1

visarga t1_isij2xr wrote

I'm wondering what the minimum hardware is to run this model. Is this really a portable alternative to GPT-3?

10

cwhaley112 t1_ispnv6f wrote

If you mean GPU, then 20B parameters × 2 bytes per parameter (assuming fp16) = 40 GB of VRAM for the weights alone.

4
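A quick back-of-the-envelope sketch of the estimate above (an illustration, not an official requirement figure): it only counts the memory for holding the weights at inference time, so activations, the KV cache during decoding, and any optimizer state for fine-tuning come on top.

```python
# Rough VRAM estimate for holding UL2 20B's weights in memory at inference
# time. Activations, KV cache, and optimizer state (for fine-tuning) are
# NOT included, so treat these numbers as a lower bound.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory in (decimal) gigabytes needed just to store the parameters."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 20e9  # UL2 20B

for dtype, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{dtype}: ~{weight_memory_gb(N_PARAMS, nbytes):.0f} GB")
# fp32: ~80 GB, fp16/bf16: ~40 GB, int8: ~20 GB (if quantization holds up)
```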

massimosclaw2 t1_ishdjbw wrote

I wonder how this will perform on out-of-distribution stuff and on remembering obscure references like "Alfred Korzybski" (as GPT-3 does) and what they're related to, or whether 20B parameters is too small to memorize enough.

6

EducationalCicada t1_isjcz0x wrote

Is there a website that keeps track of all the models being released by the major AI labs?

I guess this sub has them all, but I'm looking for a neater presentation.

4

SquareRootsi t1_isjsnk6 wrote

I haven't vetted this yet, but it looks pretty well done at first glance. It compares multiple models across multiple tasks, so you can home in on your specific needs.

https://gem-benchmark.com/results

I think Hugging Face has something similar, but I haven't found all the info on a single page that's easy to compare; you kind of have to bounce around between various model cards, tasks, and metrics pages to piece together similar info.

2

freezelikeastatue t1_isho5v9 wrote

Somebody must've listened to my comment about the originating data being all fucked up.

1