pitrucha t1_jckiv1q wrote

Any plans to quantize it? I saw that someone managed to do so with the 65B LLaMA model and push it from 120 GB down to 30 GB.

24

juliensalinas OP t1_jcktk8o wrote

This is maybe something I'll focus on in the future. But for the moment I find this fp16 version well suited for small budgets, as it runs on a 16GB GPU, while the native fp32 version of GPT-J requires at least 24GB of VRAM.
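For reference, a fp16 load with Transformers looks roughly like this; it's just a sketch, and the `EleutherAI/gpt-j-6B` checkpoint with its `float16` revision is an assumption here, not necessarily the exact fp16 version I'm referring to:

```python
# Sketch: loading GPT-J in half precision so it fits on a ~16GB GPU.
# Assumes the public EleutherAI/gpt-j-6B repo and its float16 branch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision="float16",          # fp16 weights branch of the repo
    torch_dtype=torch.float16,   # keep weights in half precision (~12GB)
).to("cuda")

inputs = tokenizer("GPT-J is a", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```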

Also, thanks to the bitsandbytes integration in HF Transformers, you can run the model in 8-bit: https://huggingface.co/blog/hf-bitsandbytes-integration
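A minimal sketch of that 8-bit path (requires `bitsandbytes` and `accelerate` to be installed; the checkpoint name is again an assumption):

```python
# Sketch: int8 loading via the bitsandbytes integration in Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # let accelerate place layers on the available GPU(s)
    load_in_8bit=True,   # quantize weights to int8 at load time
)
```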

12