Comments


RoaRene317 t1_jdfnzna wrote

My suggestion is to use 8-bit or 4-bit quantization. You can also use automatic device mapping in Transformers, which can partially offload the model to your CPU (warning: it uses a lot of system memory [RAM]).
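A minimal sketch of that setup, assuming a LLaMA-style causal LM; the checkpoint name is a placeholder, and 8-bit loading requires the `bitsandbytes` and `accelerate` packages:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your/llama-7b-checkpoint"  # placeholder, not a specific repo

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate split layers across GPU and CPU
    load_in_8bit=True,   # 8-bit weights via bitsandbytes
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Layers that don't fit on the GPU end up in system RAM, which is why the comment warns about memory use.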


Civil_Collection7267 t1_jdfogwq wrote

You can use 4-bit LLaMA 13B or 8-bit LLaMA 7B with the Alpaca LoRA; both are very good. If you need help, this guide explains everything.
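For reference, attaching a LoRA adapter on top of an 8-bit base model is a few lines with the `peft` library; both repo IDs here are illustrative placeholders rather than the specific weights meant above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "your/llama-7b-checkpoint"  # placeholder base model
lora_id = "your/alpaca-lora-7b"       # placeholder Alpaca LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, device_map="auto", load_in_8bit=True
)
model = PeftModel.from_pretrained(base, lora_id)  # wrap the base model with the LoRA adapter
```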


ggf31416 t1_jdesxc0 wrote

With memory offloading and 8-bit quantization, you may be able to run the 13B model, but slowly. The 7B model will be faster.
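A sketch of how that might look, using `max_memory` to cap GPU usage so the remainder of the 13B model spills into CPU RAM; the limits below are illustrative for a 10 GB card:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your/llama-13b-checkpoint",             # placeholder
    device_map="auto",
    load_in_8bit=True,
    max_memory={0: "9GiB", "cpu": "30GiB"},  # leave headroom for activations on GPU 0
)
```

Anything offloaded to CPU has to be shuttled to the GPU on each forward pass, which is where the slowdown comes from.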


suflaj t1_jdf3j2k wrote

Unless you plan on quantizing your model or loading it layer by layer, I'm afraid 2B parameters is the most you'll get. 10 GB of VRAM is not really enough for CV nowadays, let alone NLP. With quantization, you can barely run the 7B model.

4-bit doesn't matter at the end of the day, since it's not supported out of the box, unless you intend to implement it yourself.
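As a rough sanity check on those claims, here is the weight memory alone at each precision (activations, KV cache, and framework overhead come on top):

```python
# Back-of-the-envelope weight memory for LLaMA-class models.
for params_b in (7, 13):
    for name, bytes_per_param in (("fp16", 2), ("int8", 1), ("int4", 0.5)):
        gib = params_b * 1e9 * bytes_per_param / 1024**3
        print(f"{params_b}B @ {name}: {gib:.1f} GiB")
# 7B needs ~13 GiB in fp16 but ~6.5 GiB in int8, which is why 8-bit
# just squeezes onto a 10 GB card and fp16 does not.
```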
