
karyo t1_jb03jq0 wrote

The first question is kinda difficult. DeepSpeed, ZeRO, and Megatron all play into it. There's a reason somebody recently said there are only 200 people in the world right now who can pull it off.

For the second question:

4090s just won't cut it. Nvidia fused off P2P this generation, so unless you have an embarrassingly parallel pipeline (which current LLMs aren't), they are not useful. Problem is, the Ada A6000 was also severely restricted P2P-wise.

If you're doing LLMs at billion-parameter scale, you gotta get V100s, A100s, or H100s.
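If you want to sanity-check whether your cards actually expose P2P, here's a minimal PyTorch sketch (assuming at least two visible GPUs; I haven't run it on every setup):

```python
import torch

# Report whether peer-to-peer access is exposed between each pair of visible GPUs.
# On consumer Ada cards (e.g. 4090s) this typically comes back False.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"P2P cuda:{i} -> cuda:{j}: {ok}")
```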

2

ChristmasInOct OP t1_jb2cwwf wrote

Thanks for the response. Do you recall where you read the "only 200 people" bit? I'll take a look around for it as well; seems like the surrounding conversation could be interesting.

P2P is not so much of a limitation as long as you can fit the entire model / pipeline into a single card's VRAM though, correct?

So for example, if you have a 7B-parameter model at FP16 and it's around 14 GB, presumably you should be safe with 24 GB of VRAM?
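Roughly the back-of-the-envelope math I'm assuming here (weights only, ignoring KV cache and activation overhead):

```python
# Weights-only VRAM estimate; real usage adds KV cache, activations,
# and framework overhead on top of this.
def weight_memory_gb(n_params: float, bytes_per_param: float = 2) -> float:
    """bytes_per_param: 4 for FP32, 2 for FP16/BF16, 1 for INT8."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9))  # ~14.0 GB of weights for a 7B model in FP16
```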

Thanks again for your time.

1

karyo t1_jb2qo4e wrote

For inference? Yes. Look at the EleutherAI Transformer Math page. Also, others are trying out LLaMA right now, so check them out.
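If I'm remembering that page right, its rule of thumb is roughly total inference memory ≈ 1.2 × the weight memory, with the extra ~20% covering activations and CUDA overhead. A sketch of that heuristic (treat the factor as approximate, not a guarantee):

```python
# Rough total-inference-memory heuristic, as I recall it from the
# EleutherAI transformer-math writeup; the 1.2x factor is approximate.
OVERHEAD = 1.2

def inference_memory_gb(n_params: float, bytes_per_param: float = 2) -> float:
    return n_params * bytes_per_param / 1e9 * OVERHEAD

print(inference_memory_gb(7e9))  # ~16.8 GB for a 7B FP16 model -> fits in 24 GB
```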

1