Submitted by Vegetable-Skill-9700 t3_121a8p4 in MachineLearning

Databricks' open-source LLM Dolly performs reasonably well on many instruction-based tasks while being ~25x smaller than GPT-3, challenging the notion that bigger is always better.

From my personal experience, the quality of the model depends a lot on the fine-tuning data rather than just the sheer size. If you choose your fine-tuning data carefully, you can fine-tune a smaller model to perform better than the state-of-the-art GPT-X on your task. The future of LLMs might look more open-source than we imagined three months back.
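To make the fine-tuning point concrete, here is a minimal sketch of parameter-efficient fine-tuning with LoRA on a curated instruction dataset, using the Hugging Face transformers and peft libraries; the base checkpoint and hyperparameters below are placeholders, not a recommendation:

```python
# Minimal LoRA fine-tuning sketch. The base model, rank and dropout values
# are illustrative placeholders; swap in your own small open checkpoint and
# your curated instruction/response pairs.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "EleutherAI/pythia-2.8b"  # any small open checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA freezes the base weights and trains small low-rank adapters,
# so the trainable parameter count is a tiny fraction of the full model.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the weights

# ...then tokenize the curated instruction data and train with
# transformers.Trainer or a plain PyTorch loop.
```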

Would love to hear everyone's opinions on how they see the future of LLMs evolving. Will it be a few players (OpenAI) cracking AGI and conquering the whole world, or a lot of smaller open-source models that ML engineers fine-tune for their own use cases?

P.S. I am kinda betting on the latter and building UpTrain, an open-source project which helps you collect that high-quality fine-tuning dataset.

101

Comments


soggy_mattress t1_jdl4zkg wrote

I think of the 100B-parameter models as analogous to the first room-sized computers built in the 1940s and '50s. The pattern seems to be to first prove a concept, no matter how inefficiently, and then optimize it as much as possible.

188

Blacky372 t1_jdl62vl wrote

GPT-J-6B with instruction finetuning will surely never be better than GPT-4. With RLHF you may reach similar response quality in some contexts for some types of instruction, but you will never match the vast amount of proprietary data that ClosedAI fed into a probably 250+B-parameter model, with specialized expert data from literally 50 experts in various fields who worked on the response quality in their domain. This cannot be surpassed easily, unfortunately. But maybe future open-source models will reach similar capabilities with advanced training techniques. I would definitely hope so.

56

blueSGL t1_jdl756z wrote

> with specialized expert data from literally 50 experts in various fields who worked on the response quality in their domain.

Sounds like a future goal for Open Assistant.

If one were being unethical... create a bot to post the current Open Assistant answers to technical questions in small specialist subreddits and wait for Cunningham's Law to come into effect. (I'm only half joking)

20

Sorry-Balance2049 t1_jdl7yn7 wrote

The Databricks blog post doesn't really show much evaluation of the model, only cherry-picked examples. It's more of a "hey, we did this!" blog post.

21

Vegetable-Skill-9700 OP t1_jdl8hh5 wrote

Agreed, it won't generalize as well as GPT-4, but it could achieve similar performance on a specialized task (say, answering technical questions about a certain topic, or writing social media posts for a certain entity, etc.).

3

ttkciar t1_jdl8i7w wrote

LLaMa-7B output is abysmally horrible. We might need less than 100B, but not too much less.

27

Zealousideal_Low1287 t1_jdlm2c0 wrote

It seems that, contrary to conventional wisdom, models with more parameters learn more efficiently. My personal ‘hunch’ is that training large models and then applying some form of distillation may become the standard thing to do.
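For reference, the most common form of distillation trains a small student to match the teacher's softened output distribution; a minimal PyTorch sketch, with the temperature value purely illustrative:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then minimize the KL
    # divergence so the student copies the teacher's relative preferences
    # across the whole vocabulary, not just its top-1 prediction.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```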

7

wojtek15 t1_jdlpai0 wrote

Exactly. I have seen many inaccurate claims, e.g. that LLaMA-7B with Alpaca is as capable as ChatGPT. From my testing, even the much bigger LLaMA-30B with Alpaca is far worse than ChatGPT; it can't even get the simplest programming and common-knowledge tasks right, while the GPT-3-based ChatGPT gets them right without any problem every time. I have not tried LLaMA-65B with Alpaca yet, because it has not been trained yet AFAIK, but I doubt it will be very different. The GPT-3-based ChatGPT is 175B; maybe some 100B model can match it, but not a 6B or 7B model. If someone claims that, they clearly don't know what they are talking about.

27

Yardanico t1_jdls342 wrote

Yeah, I think there's a lot of overhype around "running ChatGPT-grade language models on consumer hardware". They can "follow" instructions the same way ChatGPT does, but obviously those models know far, far less than the ClosedAI models do, and of course they'll hallucinate much more.

Although it's not an entirely bad thing; at least the community will innovate more, so we might get something interesting out of this "push" in the future :)

17

LeN3rd t1_jdls5jy wrote

How big do models need to be before certain capabilities emerge? That is the actual question here, isn't it? Do smaller models perform as well on all tasks, or just the ones they are trained for?

2

shanereid1 t1_jdlt38a wrote

Have you read about the lottery ticket hypothesis? It was a paper from a few years ago which showed that within a fully connected neural network there exists a smaller subnetwork that performs equally well, even when the subnetwork is as small as a few percent of the size of the original network. AFAIK they only demonstrated this for MLPs and CNNs. It's almost certain that the power of these LLMs can be distilled in some fashion without significantly degrading performance.
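As a rough illustration of that idea, one round of lottery-ticket-style magnitude pruning in PyTorch might look like the sketch below; the tiny MLP and the 80% sparsity level are made up for the example:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
initial_state = {k: v.clone() for k, v in model.state_dict().items()}

# ... train the dense model here ...

# Zero out the 80% smallest-magnitude weights in each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)

# Lottery-ticket step: rewind the surviving weights to their initial values
# (the pruning masks stay in place), then retrain the sparse subnetwork.
with torch.no_grad():
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            module.weight_orig.copy_(initial_state[f"{name}.weight"])
```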

32

jabowery t1_jdm16ig wrote

Algorithmic information theory: the smallest model that memorizes all the data is optimal. "Large" is only there because of the need to expand in order to compress. Think decompressing gz in order to compress with bz2. Countering over-fitting with over-informing (bigger data) yields interpolation, sacrificing extrapolation.

If you understand all of the above you'll be light years beyond the current ML industry including the political/religious bias of "algorithmic bias experts".

0

badabummbadabing t1_jdm1poy wrote

Well, if you apply all of those tricks that these smaller models use (to get decent performance) AND increase the parameter count, can you get an even better model? Who knows, "Open"AI might already be applying them.

The question is not: "Do fewer than 100B parameters suffice to get a model that performs 'reasonably' for a March 2023 observer?"

Chinchilla scaling rules tell us an upper bound on the number of parameters we can expect to still yield an improvement given the amount of available training data (PaLM is too big, for instance), but even that only tells us half of the story: how good can our models get if we make do with sub-optimal training efficiency (see LLaMA)? What is the influence of data quality/type? What if we train (gasp) for multiple epochs on the same training set?

5

harharveryfunny t1_jdm3bm4 wrote

It seems most current models don't need the number of parameters that they have. DeepMind did a study of model size vs. number of training tokens and concluded that for each doubling of the parameter count the number of training tokens also needs to double, and that a model like GPT-3, trained on 300B tokens, would really need to be trained on 3.7T tokens (more than a 10x increase) to take full advantage of its size.

To test their scaling law, DeepMind built the 70B-parameter Chinchilla model, trained it on the predicted optimal 1.4T (!) tokens, and found that it outperforms GPT-3.

https://arxiv.org/abs/2203.15556
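As a back-of-the-envelope check, the Chinchilla result is often summarized as roughly 20 training tokens per parameter; this is an approximation of the paper's fitted scaling law, not an exact rule, but it lands in the same ballpark as the 3.7T and 1.4T figures above:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    # ~20 tokens/parameter is a rule-of-thumb reading of the paper's fit.
    return n_params * tokens_per_param

print(chinchilla_optimal_tokens(175e9) / 1e12)  # GPT-3 size: ~3.5T tokens
print(chinchilla_optimal_tokens(70e9) / 1e12)   # Chinchilla:  ~1.4T tokens
```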

2

_Repeats_ t1_jdm3h7a wrote

For enterprise use cases, you might need only a small model in the 1-3 billion parameter range that answers specific queries. For general knowledge, it remains to be seen how small you can make them through retraining.

6

Disastrous_Elk_6375 t1_jdm4h39 wrote

> I have seen many inaccurate claims, e.g. LLaMa-7B with Alpaca being as capable as ChatGPT

I believe you might have misunderstood the claims in Alpaca. They never stated it is as capable as ChatGPT; they found (and you can confirm this yourself) that it accurately replicates the instruction tuning. That is, for most of the areas in the fine-tuning set, a smaller model will output in the same style as davinci. And that's amazing progress from the raw outputs of the raw models.

20

WonderFactory t1_jdm4pk1 wrote

How long, though, before LLMs perform at the same level as experts in most fields? A year, two, three? When you get to that point you can generate synthetic data that is the same quality as human-produced data. The Reflexion paper mentioned in another thread claims that giving GPT-4 the ability to test the output of its code produces expert-level coding performance. That output could be used to train an open-source model.
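A toy sketch of the generate-test-refine loop described here; `llm` and `run_tests` are hypothetical placeholders for a completion API and a unit-test harness, not Reflexion's actual implementation:

```python
def self_refining_codegen(task, llm, run_tests, max_rounds=3):
    # Draft a solution, run it against tests, and feed the failures back to
    # the model so it can revise its own output.
    code = llm(f"Write a Python function for this task:\n{task}")
    for _ in range(max_rounds):
        passed, errors = run_tests(code)
        if passed:
            return code
        code = llm(f"Task:\n{task}\n\nYour code:\n{code}\n\n"
                   f"It failed these tests:\n{errors}\n\nPlease fix it.")
    return code
```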

6

noobgolang t1_jdm7pvm wrote

Big is not always better (˵ ͡° ͜ʖ ͡°˵)

2

gamerx88 t1_jdmql8y wrote

The answer is probably no. DeepMind's Chinchilla paper shows that many of those 100B+ LLMs are oversized for the amount of data used to pre-train them.

3

gamerx88 t1_jdmr4n2 wrote

My observations are similar to yours, but I think Stanford's claim was that it rivalled text-davinci-003's dialogue or chat capabilities, and only in a single turn setting.

2

Cherubin0 t1_jdmt5el wrote

I think having the particular knowledge inside the model is a bad approach. It would make much more sense for the model to know how to search for, and reason about, the data it finds.

0

currentscurrents t1_jdmyjrb wrote

Bigger models are more sample-efficient, so they should need less data.

But didn't the Chinchilla paper say bigger models need more data? Yes, but that's only true because right now compute is the limiting factor. They're intentionally trading more data for less model size.

As computers get faster and models bigger, data will increasingly become the limiting factor, and people will trade off in the opposite direction instead.

7

currentscurrents t1_jdmzphs wrote

That's true, but only for the given compute budget used in training.

Right now we're really limited by compute power, while training data is cheap. Chinchilla and LLaMA are intentionally trading more data for less compute. Larger models still perform better than smaller ones given the same amount of data.

In the long run I expect this will flip; computers will get very fast and data will be the limiting factor.

3

londons_explorer t1_jdn0t7k wrote

Paper after paper has shown that bigger models outperform smaller ones.

Sure, you can use tricks to make a small model work better. But apply those same tricks to a big model, and it works even better.

7

gamerx88 t1_jdn1dd3 wrote

> In the long run I expect this will flip; computers will get very fast and data will be the limiting factor.

I agree, but I think data is already a limiting factor today, with the largest models (that are public knowledge) at 175B. The data used to train these models supposedly already covers a majority of the open internet.

1

fiftyfourseventeen t1_jdnhbn0 wrote

OpenAI is also doing a lot of tricks behind the scenes, so it's not really fair to just type the same thing into both; they are getting nowhere near the same prompt. LLaMA is promising, but it just needs to be properly instruction-tuned.

2

drinkingsomuchcoffee t1_jdnhxri wrote

Huge models are incredibly wasteful and unoptimized. Someday, someone is going to sit down and create an adaptive algorithm that expands or contracts a model during the training phase and we're going to laugh at how stupid we were.

7

Impressive-Ad6400 t1_jdnjakm wrote

Uhm, this is probably incorrect as an analogy, but do we humans actually need those ~86 billion neurons in our brains?

I mean, there are lots of people who have lost a brain hemisphere for different reasons, and yet, they live happy lives.

However, what they lose is flexibility. This means they have a hard time when faced with new situations and have difficulty adapting to them.

I can't be certain, but it's possible that the number of parameters in large language models accounts for their flexibility. That is why you can throw anything at ChatGPT and it will answer, within the scope given by its restrictions.

I'm not sure either if enlarging the number of parameters will give us emergent properties or if it will only slow down data processing. Blue whales have immense brains, but they aren't necessarily smarter than us. And this is because a larger brain means larger distances for neurons to connect, slower response times and increased energetic expenditure.

I could be wrong, though. Electronic brains don't have the same limitations of physical brains, so maybe increasing their size won't affect their output.

4

farmingvillein t1_jdnuvnf wrote

> I believe you might have misunderstood the claims in Alpaca. They never stated it is as capable as ChatGPT; they found (and you can confirm this yourself) that it accurately replicates the instruction tuning. That is, for most of the areas in the fine-tuning set, a smaller model will output in the same style as davinci.

This is a misleading summary of the paper.

They instruction-tune and then compare Alpaca against GPT-3.5, and say that Alpaca is about equal on the tasks they compare (which, to be clear, is not equivalent to a test of "broad capability").

Yes, you are right that they don't claim it is categorically more capable than ChatGPT, but they do state that their model is approximately as capable as GPT-3.5 (which is of course not a 1:1 match for ChatGPT) on the diverse set of tasks tested.

It is very much not just a paper showing that you can make it output in the same "style".

4

farmingvillein t1_jdnwda6 wrote

> But apply those same tricks to a big model, and it works even better.

In general, yes, although there are many techniques that help small models that do not help large ones.

That said, I agree with your overall point. I think the only reason we won't see model sizes continue to inflate is if 1) there are substantial underlying architecture discoveries (possible!) or 2) we really hit problems with data availability. But synthetic + multi-modal data probably gives us a ways to go there.

2

londons_explorer t1_jdo4kj3 wrote

Think how many hard drives there are in the world...

All of that data is potential training material.

I think a lot of companies/individuals might give up 'private' data in bulk for ML training if they get a viable benefit from it (for example, having a version of ChatGPT with perfect knowledge of all my friends and neighbours, what they like and do, etc. would be handy)

2

LahmacunBear t1_jdo7k0w wrote

Here’s a thought: the original GPT-3, at 175B with the best data of the time thrown at it, performed as it did. With ChatGPT's training tricks, suddenly the same size performs magnitudes better. I doubt that current LLMs are fully efficient, i.e. just as with GPT-3 to 3.5, we can keep getting much better results at the same size, and therefore today's results from much smaller models.

1

blose1 t1_jdoj8kl wrote

GPT models struggle with out-of-distribution programming tasks, which means they can't create novel ideas; I tested this myself many times and it's not a prompt-engineering issue. I think LLMs could act as great teachers but not researchers: teachers teach what we already know, researchers create the novel knowledge that teachers then use.

7

ganzzahl t1_jdovu3h wrote

I'm also very interested in this – does anyone have papers similar to Chinchilla, but without the training FLOPs restriction, and instead comparing identical dataset sizes?

An aside: I feel like I remember some older MT papers where LSTMs outperformed Transformers for some low resource languages, but I think that's outdated – using transfer learning, multilingual models and synthetic data, I'm fairly certain Transformers always outperform nowadays.

1

YoloSwaggedBased t1_jdp9cge wrote

I can't find it now, but I've read a paper that essentially proposed this, at least for inference. You essentially have a model output and a task loss after every n layers of the model. At training time you produce outputs up to the end of the architecture, and then at inference time you use some heuristic to decide how much accuracy you're willing to sacrifice for layer-wise model reduction.
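A very rough sketch of that early-exit idea (not the specific paper's method): a small head after every block, with inference stopping as soon as one head is confident enough. The architecture, pooling, and threshold are invented for illustration and assume a batch size of one:

```python
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=256, n_layers=12, n_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        # One classification head per block; at training time every head
        # would get its own task loss, as described above.
        self.heads = nn.ModuleList(nn.Linear(dim, n_classes) for _ in range(n_layers))

    def forward(self, x, confidence=0.9):
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits = head(x.mean(dim=1))       # mean-pool tokens, then classify
            if logits.softmax(dim=-1).max() >= confidence:
                return logits                  # confident enough: exit early
        return logits                          # fell through every layer
```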

2

drinkingsomuchcoffee t1_jdpg1cb wrote

The problem is that learned features aren't factored nicely into a minimal set of parameters. For example, identifying whether an image is a cat may take thousands of parameters spread over n layers when it could actually be expressed with 10 parameters over fewer layers. A small model does this automatically, as it's physically constrained; a large model has no such constraint, so it is wasteful. There are probably many ways to get the best of both worlds at training time, but it's by no means an easy problem, and the current distillation or retraining methods feel clunky. We actually want the big model to use all its parameters efficiently and not waste them, and it is likely wasting them if much more compact models can get similar results. It's probably extremely wasteful if it takes an order of magnitude more size to gain a few percentage points of improvement. Compare that to biological entities, where an order-of-magnitude size increase results in huge cognitive improvements.

3

PilotThen t1_jdpn8eb wrote

I'm down the rabbit hole of finding the best model to build on and learn with this weekend.

Currently poking at PygmalionAI/pygmalion-1.3b

Beware: the different-size Pygmalion models are fine-tuned from different pretrained models, so they have inherited different licenses.

I like my results with 6b better, but 1.3b has the better license (AGPL-3.0).
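If anyone else wants to poke at it, a minimal sketch of loading that checkpoint from the Hugging Face Hub with transformers; the prompt and generation settings are arbitrary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "PygmalionAI/pygmalion-1.3b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```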

1

PilotThen t1_jdpnoul wrote

I didn't find a paper but I think that is sort of what EleutherAI was doing with their pythia models.

You'll find the models on Hugging Face, and I'd say they are also interesting from an open-source perspective because of their license (Apache-2.0).

(Also, Open Assistant seems to be building on top of them.)

2

Poseidon_22 t1_jdpyo9u wrote

Apparently, for a linear improvement in accuracy, we would need exponentially more parameters. GPT-4, with more than 1 trillion parameters, would need to be trained on 6,700 GPUs for a whole year!

1

minhrongcon2000 t1_jdr6xtv wrote

Right now, yes! Most recently published papers (like Chinchilla, GPT, etc.) show a scaling law relating the amount of data to the number of params in a model. If you want no-brainer training with little preprocessing, bigger models are mostly better. However, if you have sufficient data, the number of params needed may be reduced. That said, I feel like the number of parameters needed decreases really slowly as the data size grows, so we still somehow need larger models (of course, this also depends on the scenario where you apply the LLM; for example, you don't really need that big a model for an e-commerce app).

2

andreichiffa t1_jdvojfg wrote

It's a common result from the domain of flat minima: to train well, the model needs to be overparametrized, to avoid getting stuck in local minima and to smooth the loss landscape.

However, the overparameterization needed at the training stage can be trimmed away at the inference stage.

1

CacheMeUp t1_jdxvq8t wrote

Perhaps the challenge is not the size of the internet (it's indeed big and easy to generate new content), but rather the uniqueness and novelty of the information. Anecdotally, looking at the first page of Google results often shows various low-informativeness webpages, where only a few sentences provide information and the rest is boilerplate, disclaimers, generic advice or plain spam.

1