ustainbolt t1_je7plqi wrote

For a 65b model you are probably going to have to parallelise the model parameters. See this link. As for training, it would be best to use a VM (any provider will work; Lambda and vast.ai are cheap). I would recommend a 4x (or 8x) A100 machine. I'm sure you can find more information about all of this.

31
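To illustrate what "parallelising the model parameters" means, here is a toy tensor-parallel sketch in plain PyTorch. It simulates two shards on CPU; in practice a framework (e.g. Megatron-LM, DeepSpeed, or PyTorch's own distributed APIs) does this across real GPUs, and all the names below are illustrative, not from the linked material:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim_in, dim_out = 8, 6
full = nn.Linear(dim_in, dim_out, bias=False)

# Column-parallel split: each "device" holds half of the output rows of W.
w0, w1 = full.weight.chunk(2, dim=0)  # shapes: (3, 8) each

x = torch.randn(4, dim_in)
# Each shard computes its slice of the output independently...
y0 = x @ w0.t()
y1 = x @ w1.t()
# ...and an all-gather (here just a concat) reassembles the full output.
y_parallel = torch.cat([y0, y1], dim=1)

# The sharded computation matches the unsharded layer.
assert torch.allclose(y_parallel, full(x))
```

Each shard only ever stores its own slice of the weights, which is why this lets a 65b model fit across 4 or 8 GPUs when it won't fit on one.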

wrossmorrow t1_je7vy2p wrote

+1 for lambda labs

8

ustainbolt t1_je7xtcw wrote

I love lambda. More reliable than vast.ai, and WAY cheaper than AWS/GCP/Azure.

8

Nhabls t1_je9598b wrote

Every time I logged on to Lambda Labs in the past year, all their instances were full. Not that available in my experience.

5

badabummbadabing t1_je9cdf7 wrote

They just closed their Series B funding, so they should scale up their resources soon.

1

itsyourboiirow t1_jecqc1d wrote

This is the only downside I've found. Sometimes it's too darn hard to find an instance.

1

learn-deeply t1_je9eovt wrote

Tensor parallel (aka model parallel) with model checkpointing works better than FSDP (though they can be used in conjunction) in my experience. FSDP is easier to work with, though.

1
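For context on the checkpointing half of that trade-off: "model checkpointing" here usually means activation checkpointing, which saves memory by recomputing activations during the backward pass instead of storing them. A minimal PyTorch sketch (module names and sizes are made up for illustration):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        return self.net(x)


class CheckpointedModel(nn.Module):
    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            # Don't store this block's activations; recompute them
            # on the fly during the backward pass.
            x = checkpoint(blk, x, use_reentrant=False)
        return x


model = CheckpointedModel()
x = torch.randn(8, 64, requires_grad=True)
out = model(x)
out.sum().backward()  # gradients flow through the recomputed activations
```

This is orthogonal to how the parameters are sharded, which is why it combines with either tensor parallelism or FSDP.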