Submitted by Seankala t3_119onf8 in MachineLearning

The ELECTRA paper introduces a small version that has around 15M parameters. MobileBERT and TinyBERT also have around the same number of parameters.

Are there any other language models out there that are smaller? Would it be possible to further distill large models into smaller variants?

41

Comments


adt t1_j9neq5w wrote

There should be quite a few models smaller than 15M params. What's your use case? A lot of the 2022-2023 optimizations mean that you can squish models onto modern GPUs now (e.g. int8 quantization).
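(For a rough sense of what int8 buys you, here's a minimal sketch of post-training dynamic quantization in PyTorch; the checkpoint name is just an example of a small model, not a recommendation.)

```python
# Sketch: dynamic int8 quantization of a small Transformer encoder.
# Assumes torch and transformers are installed; the checkpoint is only an example.
import os
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("google/electra-small-discriminator")

# Convert Linear layer weights to int8 (activations stay fp32 at runtime).
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Compare serialized sizes as a rough proxy for memory footprint.
torch.save(model.state_dict(), "fp32.pt")
torch.save(quantized.state_dict(), "int8.pt")
print(os.path.getsize("fp32.pt") / 1e6, "MB fp32")
print(os.path.getsize("int8.pt") / 1e6, "MB int8")
```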

Designed to fit on a standard GPU, DeepMind's Gato was bigger than I thought, with a starting size of 79M params.

Have you seen the BERT compression paper that gets the model down to ~7MB? It lists some 1.2M-6.2M param models:

https://arxiv.org/pdf/1909.11687.pdf

My table shows...

https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit#gid=1158069878

*looks at table*

The smallest seems to be Microsoft Pact, at ~30M params. Ignore that! Transformers are supposed to be wide and deep, I suppose, so it makes sense...

Many of the text-to-image models use smaller LLMs.

Also check HF; they now have 130,000 models of different sizes (as of Feb/2023):

https://huggingface.co/models

Includes a tiny-gpt2: https://huggingface.co/sshleifer/tiny-gpt2

And t5-efficient-tiny ('has 15.58 million parameters and thus requires ca. 62.32 MB of memory in full precision (fp32) or 31.16 MB of memory in half precision (fp16 or bf16).'):

https://huggingface.co/google/t5-efficient-tiny
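If you want to sanity-check any of these, here's a quick sketch (assuming the transformers library) that loads a checkpoint, counts parameters, and reproduces the fp32/fp16 memory estimates quoted above:

```python
# Sketch: load tiny checkpoints from the Hub and estimate their memory footprint.
from transformers import AutoModel

for name in ["sshleifer/tiny-gpt2", "google/t5-efficient-tiny"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    # 4 bytes per param in fp32, 2 bytes per param in fp16/bf16.
    print(f"{name}: {n_params / 1e6:.2f}M params, "
          f"~{n_params * 4 / 1e6:.1f} MB fp32, ~{n_params * 2 / 1e6:.1f} MB fp16")
```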

Edit: I thought of Anthropic's toy models, but they were not really LLMs. They did train a 10M model during scaling research (paper), but the model hasn't been released.

25

Seankala OP t1_j9npctd wrote

Thanks for the detailed answer! My use case is that the company I work at currently uses image-based models for e-commerce purposes, but we want to use text-based models as well. The image-based model(s) already take up around 30-50M parameters, so I didn't want to just bring in a 100M+ parameter model. Even 15M seems quite big.

5

currentscurrents t1_j9nqcno wrote

What are you trying to do? Most of the cool features of language models only emerge at much larger scales.

5

Seankala OP t1_j9nqmf5 wrote

That's true for all of the models. I don't really need anything cool, though; all I need is a solid model that can perform simple tasks like text classification or NER well.

4

Friktion t1_j9oxnz6 wrote

I have some experience with FastText for e-commerce product classification. It's super lightweight and performs well as an MVP.
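Something like this minimal sketch, assuming the fasttext package and a training file in its __label__ format (the file names and labels here are placeholders):

```python
# Sketch: supervised fastText classifier for product titles.
# Training file format: one example per line, e.g.
#   __label__shoes red leather running sneakers size 42
import fasttext

model = fasttext.train_supervised(
    input="products.train.txt",  # placeholder path
    lr=0.5, epoch=25, wordNgrams=2,
)
print(model.predict("wireless bluetooth noise cancelling headphones"))
model.save_model("product_classifier.bin")
```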

5

cantfindaname2take t1_j9qov0f wrote

For simple NER tasks, simpler models might work too, like conditional random fields (CRFs). The crfsuite package has a very easy-to-use implementation, and it uses a C library under the hood for model training.
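A rough sketch of that, using the sklearn-crfsuite wrapper around crfsuite (the feature function, sentences, and labels here are just toy examples):

```python
# Sketch: linear-chain CRF for token-level NER with sklearn-crfsuite.
import sklearn_crfsuite

def token_features(sent, i):
    # Toy feature set; real systems add richer features (casing, affixes, gazetteers, ...).
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# One sentence = list of tokens; labels use a BIO scheme.
sentences = [["Acme", "wireless", "mouse"], ["Sony", "headphones"]]
labels = [["B-BRAND", "O", "O"], ["B-BRAND", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))
```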

1

chogall t1_j9nioqb wrote

> but they were not really LLMs.

What's the definition of LLM?

2

Seankala OP t1_j9np9ae wrote

I guess at least 100M+ parameters? I like to think of the BERT-base model as being the "starting point" of LLMs.

3

FluffyVista t1_j9ottk1 wrote

probably

1

Yahentamitsi t1_j9xi4q4 wrote

That's a good question! I'm not sure if there are any pretrained language models with fewer parameters, but you could always try training your own model from scratch and see how small you can get it.
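For example (just a sketch with arbitrary numbers, assuming the transformers library), you can instantiate a tiny GPT-2-style config from scratch and see how few parameters it comes out to:

```python
# Sketch: a deliberately tiny GPT-2-style model trained from scratch.
# All hyperparameters here are arbitrary illustrations, not recommendations.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=8000,   # a small, domain-specific tokenizer keeps this low
    n_positions=128,
    n_embd=128,
    n_layer=2,
    n_head=2,
)
model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```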

1