
luaks1337 t1_jc320gp wrote

With 4-bit quantization you could run something that compares to text-davinci-003 on a Raspberry Pi or smartphone. What a time to be alive.


Disastrous_Elk_6375 t1_jc3e9ao wrote

With 8-bit this should fit on a 3060 12GB, which is pretty affordable right now. If this works as well as they state it's going to be amazing.


atlast_a_redditor t1_jc3jzcf wrote

I know nothing about this stuff, but I'd rather have the 4-bit 13B model for my 3060 12GB. I've read somewhere that quantisation has less effect on larger models.


disgruntled_pie t1_jc4ffo1 wrote

I’ve successfully run the 13B parameter version of Llama on my 2080TI (11GB of VRAM) in 4-bit mode and performance was pretty good.
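
For anyone wanting to sanity-check which model fits on which card, here is a rough back-of-the-envelope sketch. It assumes memory is dominated by the weights (parameters × bits ÷ 8) and ignores activations, context cache, and any per-block scale overhead, so real usage will be somewhat higher. The 7B size is an assumption for illustration; only the 13B figure comes from the comments above, and `weight_memory_gb` is a made-up helper name.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: ignores activations, context cache and quantization
    metadata, so treat the result as a lower bound.
    """
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# 7B is an assumed model size for illustration; 13B is the one discussed above.
for params in (7, 13):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{params}B @ {bits}-bit: ~{gb:.1f} GB of weights")
```

By this estimate a 13B model at 4-bit needs about 6.5 GB for weights, which leaves headroom on an 11 GB or 12 GB card, while the same model at 8-bit (about 13 GB) would not fit.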


pilibitti t1_jc56vv5 wrote

hey do you have a link for how one might set this up?


disgruntled_pie t1_jc5g6or wrote

I’m using this project:

The project’s Github wiki has a page on llama that explains everything you need.


pdaddyo t1_jc5uoly wrote

And if you get stuck check out /r/oobabooga


FaceDeer t1_jc3k2oi wrote

I'm curious, there must be a downside to reducing the bits, mustn't there? What does intensively jpegging an AI's brain do to it? Is this why Lt. Commander Data couldn't use contractions?


luaks1337 t1_jc3p8oq wrote

Backpropagation needs a lot of numerical precision, so we use 16- or 32-bit weights while training. However, post-training quantization seems to have very little impact on the results. There are different ways to quantize, but apparently llama.cpp uses the most basic one and it still works like a charm. Georgi Gerganov (maintainer) wrote a tweet about it but I can't find it right now.
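
To give a feel for what a "basic" post-training quantization can look like, here is a minimal sketch of round-to-nearest quantization with one scale per block of weights. This is a generic illustration and not a copy of llama.cpp's actual format; the function names, the block contents, and the symmetric 4-bit range of -7..7 are all assumptions made for the example.

```python
def quantize_block(weights, bits=4):
    """Round-to-nearest symmetric quantization of one block of floats.

    Returns small integers in [-qmax, qmax] plus a single float scale.
    Assumption: one shared scale per block, chosen from the largest
    absolute weight in that block.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0    # avoid dividing by zero
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


# Made-up weights, purely for illustration.
weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.002, -0.78, 0.45]
q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized ints:", q)
print("scale:", scale)
print("max abs error:", max_err)
```

Each weight ends up at most half a quantization step (`scale / 2`) away from its original value, which is one intuition for why the model's outputs can survive this kind of rounding.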