Submitted by Singularian2501 t3_ya5ofj in MachineLearning

Paper: https://arxiv.org/abs/2210.11416

Github: https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints

Abstract:

>Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
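
The released Flan-T5 checkpoints can be prompted zero-shot with plain natural-language instructions. A minimal sketch, assuming the T5X checkpoints are also mirrored on the Hugging Face Hub as google/flan-t5-* (the model size and prompt below are illustrative, not from the paper):

```python
# Minimal sketch: zero-shot instruction prompting with a released Flan-T5 checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # swap for flan-t5-xl / flan-t5-xxl if you have the memory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The task is phrased as a natural-language instruction, matching the paper's prompting setup.
prompt = (
    "Answer the following question by reasoning step by step. "
    "If there are 3 cars and each car has 4 wheels, how many wheels are there in total?"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```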


Comments


rehrev t1_it9f0a4 wrote

Who actually tries to predict the SOTA level in the future, especially in language modeling?


cygn t1_itbsjr4 wrote

Is it possible to run it on consumer-grade GPUs (3090) with 24 GB of RAM?


LetterRip t1_itchnjl wrote

I assume you mean 24 GB of VRAM? DeepSpeed, with enough CPU RAM and offloading to disk as needed, might let you run it. Note that 540B parameters is more than 2 TB in float32. Even at 8-bit you are looking at roughly 540 GB, and consumer hardware typically tops out at 128 GB of RAM, so the vast majority of the model would have to be offloaded to disk. The size can probably be reduced a lot by combining quantization and compression, but you will either have to do the work yourself or wait until someone else does.
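
For reference, a quick back-of-the-envelope check of those numbers (weights only, ignoring activations and any KV cache):

```python
# Back-of-the-envelope memory footprint of the weights alone.
params = 540e9  # PaLM 540B

for dtype, bytes_per_param in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:,.0f} GB")

# float32: ~2,160 GB   bfloat16: ~1,080 GB   int8: ~540 GB
# All of these dwarf a 24 GB GPU and the ~128 GB RAM ceiling of typical consumer boards,
# hence the need to offload most of the model to disk.
```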


farmingvillein t1_itefjav wrote

> Note that 540B parameters is more than 2 TB in float32

They only provide checkpoints up to the 11B model, however (unless I'm reading things wrong), so this is a moot point at the moment.
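
For the checkpoints that actually are released, the 11B Flan-T5-XXL is a more realistic target for a 24 GB card. A rough sketch with 8-bit weights, assuming the checkpoint is mirrored on the Hugging Face Hub as google/flan-t5-xxl and that bitsandbytes and accelerate are installed (memory headroom not guaranteed):

```python
# Rough sketch: Flan-T5-XXL (11B params, ~44 GB in float32) squeezed onto a single 24 GB GPU
# via int8 weight quantization. Assumes recent transformers + bitsandbytes + accelerate.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate place layers on GPU (and CPU, if it doesn't fit)
    load_in_8bit=True,   # bitsandbytes int8 weights, roughly a quarter of the float32 size
)

inputs = tokenizer("Translate to German: How old are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```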
