Meddhouib10 t1_jcptalr wrote

What are the techniques to make such large models run on low resources?

15

simpleuserhere OP t1_jcpttav wrote

This model is 4-bit quantized, so it takes less RAM (model size is around 4 GB).
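
For intuition, here's a toy round-to-nearest sketch of 4-bit group quantization in Python (GPTQ, which these models actually use, is much smarter about which way to round, but the storage arithmetic is the same):

```python
import numpy as np

# Toy 4-bit quantization with one scale per group of weights.
def quantize_4bit(w, group_size=64):
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0   # int4 range is [-8, 7]
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_4bit(w)
err = np.abs(w - dequantize(q, s, w.shape)).max()
print(f"max abs error: {err:.4f}")

# Storage: 7B params * 4 bits ≈ 3.5 GB plus per-group scales,
# which is roughly where the ~4 GB figure comes from.
```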

27

timedacorn369 t1_jcqg4v6 wrote

What is the performance hit with various levels of quantization?

10

starstruckmon t1_jcrbf0m wrote

You can see some benchmarks here:

https://github.com/qwopqwop200/GPTQ-for-LLaMa

11

Taenk t1_jcs53iw wrote

The results for LLaMA-33B quantised to 3 bits are rather interesting: that would be an extremely potent LLM capable of running on consumer hardware. Pity that there are no test results for the 2-bit version.
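
Rough back-of-the-envelope for the weights alone (ignoring group scales, activations, and the KV cache, and treating 33B as a nominal count):

```python
# Approximate weight-only memory footprint at different bit widths.
def weight_gb(params_billion, bits):
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (16, 4, 3, 2):
    print(f"33B @ {bits}-bit: ~{weight_gb(33, bits):.1f} GB")

# 16-bit ≈ 66 GB, 4-bit ≈ 16.5 GB, 3-bit ≈ 12.4 GB, 2-bit ≈ 8.2 GB,
# so 3-bit is about where 33B starts to fit on a 16-24 GB consumer GPU.
```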

3

starstruckmon t1_jcswg1g wrote

I've heard from some experienced testers that the 33B model is shockingly bad compared to even the 13B one, despite what the benchmarks say, and that we should either use the 65B one (very good, apparently) or stick to 13B/7B. Not because of any technical reason, but because of the random luck/chance involved in training these models and the resulting quality.

I wonder if there's any truth to it. If you've tested it yourself, I'd love to hear what you thought.

5

Taenk t1_jctdmvi wrote

I haven’t tried the larger models, unfortunately. However, I wonder how the model could be "shockingly bad" despite having almost three times the parameter count.

2

starstruckmon t1_jcte34d wrote

🤷

Sometimes models just come out crap. Like BLOOM, which has almost the same number of parameters as GPT-3 but is absolute garbage in any practical use case. Like a kid from two smart parents who turns out dumb. Just blind chance.

Or they could be wrong. 🤷

3

baffo32 t1_jcronvh wrote

- offloading and accelerating (moving some parts to memory mapped disk or gpu ram, this can also make for quicker loading)

- pruning (removing parts of the model that didn’t end up impacting outputs after training; see the sketch after this list)

- further quantization below 4 bits

- distilling to a mixture of experts?

- factoring and distilling parts out into heuristic algorithms?

- finetuning to specific tasks (e.g. distilling/pruning out all information related to non-relevant languages or domains); this would likely make it very small

EDIT:

- numerous techniques published in papers over the past few years

- distilling into an architecture not limited by e.g. the constraint of being feed-forward
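
To make one of these concrete, here is a toy magnitude-pruning pass in PyTorch (just an illustration of the idea; pruning a LLaMA-scale model only saves memory if the zeros are then stored in a sparse or structured format):

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    # Zero out the smallest-magnitude weights, keeping the rest unchanged.
    k = max(1, int(weight.numel() * sparsity))
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

w = torch.randn(4096, 4096)
pruned = magnitude_prune(w, sparsity=0.5)
print(f"fraction zeroed: {(pruned == 0).float().mean():.2f}")
```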

3

Art10001 t1_jcwfyw8 wrote

I heard MoE is bad. I have no sources sadly.

1

baffo32 t1_jcxqr2i wrote

I visited CVPR last year and people were saying that MoE was mostly what was being used; I haven’t tried these things myself, though.
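
For anyone unfamiliar, a mixture of experts just routes each token through a small subset of "expert" sub-networks, so only a fraction of the parameters are active per input. A minimal top-1 gated layer looks something like this (a toy sketch, not how any particular production model does it):

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    # Toy top-1 mixture of experts: a gate picks one expert MLP per token.
    def __init__(self, dim=64, hidden=256, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, dim)
        weights = self.gate(x).softmax(dim=-1)  # (tokens, n_experts)
        top_w, top_idx = weights.max(dim=-1)    # one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_w[mask, None] * expert(x[mask])
        return out

moe = TinyMoE()
tokens = torch.randn(8, 64)
print(moe(tokens).shape)  # torch.Size([8, 64])
```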

1