Submitted by poppear t3_11ozl85 in MachineLearning
I put together this plain PyTorch implementation of LLaMA (I just substituted the fairscale layers with the native ones and converted the weights accordingly) that can be more easily run in different environments.
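For anyone curious what the conversion step involves, here is a minimal sketch of merging the official model-parallel shards into a single state dict. The shard filenames, parameter-name patterns, and split dimensions are my assumptions about the fairscale layout, not code taken from the repo.

```python
import torch

# Hypothetical shard filenames following the official release layout.
SHARD_PATHS = ["consolidated.00.pth", "consolidated.01.pth"]

def merge_dim(name):
    """Assumed fairscale sharding: ColumnParallelLinear splits along dim 0,
    RowParallelLinear and ParallelEmbedding along dim 1, norms are replicated."""
    if "norm" in name or name == "rope.freqs":
        return None  # replicated across shards, keep a single copy
    if any(key in name for key in ("wo.weight", "w2.weight", "tok_embeddings")):
        return 1
    return 0  # column-parallel layers: wq, wk, wv, w1, w3, output

shards = [torch.load(path, map_location="cpu") for path in SHARD_PATHS]
merged = {}
for name, tensor in shards[0].items():
    dim = merge_dim(name)
    merged[name] = tensor if dim is None else torch.cat([s[name] for s in shards], dim=dim)

torch.save(merged, "consolidated.merged.pth")
```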
The big problem with the official implementation is that in order to run the 65B version you need 8 GPUs no matter what, to run the 30B version you need 4, and so on. In reality, you can easily fit the 65B version on 2 A100s with 100GB of VRAM.
vanilla-llama solves this problem. You just need enough total memory, and the model will be loaded across all available GPUs.
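To illustrate the "fill whatever GPUs you have" idea, here is a minimal sketch that gets a similar effect with Hugging Face accelerate, dispatching layers across devices by free memory. This is not vanilla-llama's own code, and the checkpoint path is a placeholder.

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import LlamaConfig, LlamaForCausalLM

CKPT_DIR = "path/to/converted-llama-65b"  # placeholder path

config = LlamaConfig.from_pretrained(CKPT_DIR)
with init_empty_weights():
    model = LlamaForCausalLM(config)  # meta tensors only, no memory allocated yet

# device_map="auto" fills each visible GPU up to its free memory and only
# spills the remainder to CPU/disk if the GPUs are too small.
model = load_checkpoint_and_dispatch(
    model,
    CKPT_DIR,
    device_map="auto",
    no_split_module_classes=["LlamaDecoderLayer"],
    dtype=torch.float16,
)
```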
LoaderD t1_jbw3640 wrote
> In reality, you can easily fit the 65B version on 2 A100s with 100GB of VRAM.
Ughhh are you telling me I have to SSH into my DGX A100 instead of just using my local machine with 1 A100? (Satire, I am a broke student)
Appreciate the implementation and transparency. I don't think many people realize how big a 65B-parameter model is, since there's no associated cost to downloading one.
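For a rough sense of scale (my own back-of-envelope numbers, not from the post), the weights alone in fp16 are already well over 100 GB before any activation or KV-cache memory:

```python
# Back-of-envelope: weight memory only, excluding activations and KV cache.
params = 65e9           # 65B parameters
bytes_per_param = 2     # float16
print(f"{params * bytes_per_param / 1e9:.0f} GB")     # ~130 GB
print(f"{params * bytes_per_param / 2**30:.0f} GiB")  # ~121 GiB
```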