Fellow machine learning enthusiast here!

I want to train a large NLP model and I'm wondering whether it's worth using Google Cloud's TPUs for it. I already have an Nvidia RTX 3060 Laptop GPU with 8.76 TFLOPS, but I was unable to find out what the exact performance of Google's TPU v3 and v4 is (in TFLOPS, to be able to compare them).

I know TPUs are a ton faster (I think the factor is 12x) and more optimized for machine learning than GPUs, but I'm still wondering whether it's worth it to just build a graphics card rig for the long term, since the pricing and estimation seem unclear to me and I cannot see how much I'm paying per TFLOP.

Has anyone done the numbers on price/performance and hourly cost? Also is there any factor I missed? Thanks a lot in advance!

Comments

spaccetime t1_izsnm7n wrote on December 11, 2022 at 3:41 PM

8x NVIDIA A100 = $25/hour

TPU v3-4 = $8/hour

TPU v4-4 = $12/hour

When training BERT on 27B tokens I measured faster training times when using the TPU.

Nvidia's GPUs are great for Deep Learning, but DL is not what they are designed for. They have CUDA cores or even RT cores. You pay extra for being good at rendering, but you don't use this, or use it just a little, when training deep learning models.

Google’s TPU is engineered only for Deep Learning. The MXU is unrivaled.

For short-term usage take the TPU, and for the long term a DGX Station or another cluster.

TPU is not for experimental usage. Use it only when you are sure that your model, data and parameterization make sense.

JanGehlYacht t1_izuzh6o wrote on December 12, 2022 at 1:14 AM

Adding to this: TPUs are heavily optimized for transformer architectures, since Google uses them heavily. You'll also see that the TPU has a different stack: it uses XLA as its compiler, so it applies many more compiler optimizations for ML training on the fly (things like op fusion are very beneficial in CUDA, and they come almost for free with XLA). It also scales to very large models easily because of its multi-host high-speed interconnects (you'll see most NLP models trained on GPUs top out at ~340B params, while Google published a paper on a 540B model almost a year ago).

You should evaluate this yourself for your workload though: if you are training a large NLP model you'll be paying a lot anyway. So you should really train your model for an hour on each option you're considering and compare what QPS you get for the $ you pay. Then you can pick a stack and get going to train the model fully.
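The one-hour comparison described above could be scored roughly like this. It is only a sketch: `TrialRun`, `samples_per_dollar` and `rank_by_value` are made-up names for illustration, "samples processed" stands in for whatever throughput measure (QPS, tokens, steps) you actually log, and the numbers are placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class TrialRun:
    """One short benchmark run on a candidate hardware option."""
    name: str
    samples_processed: int  # training samples seen during the trial
    wall_seconds: float     # wall-clock duration of the trial
    price_per_hour: float   # rental price in USD per hour (must be > 0)


def samples_per_dollar(run: TrialRun) -> float:
    """Throughput normalized by cost: samples processed per USD spent."""
    cost = run.price_per_hour * run.wall_seconds / 3600.0
    return run.samples_processed / cost


def rank_by_value(runs):
    """Sort options from best to worst samples-per-dollar."""
    return sorted(runs, key=samples_per_dollar, reverse=True)


# Hypothetical measurements, NOT real benchmark results.
runs = [
    TrialRun("option_a", samples_processed=3_600_000, wall_seconds=3600, price_per_hour=25.0),
    TrialRun("option_b", samples_processed=1_600_000, wall_seconds=3600, price_per_hour=8.0),
]

for run in rank_by_value(runs):
    print(f"{run.name}: {samples_per_dollar(run):,.0f} samples per dollar")
```

With these placeholder inputs the slower but cheaper option comes out ahead, which is the kind of result raw speed numbers alone would hide.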

Most likely, given that a TPU's only job is ML, you'll find that it is heavily optimized for most ML workloads.

pommedeterresautee t1_izu96au wrote on December 11, 2022 at 10:02 PM

Why do you say TPU is not for experimental usage?

spaccetime t1_izw55mj wrote on December 12, 2022 at 7:41 AM

Yes, just as /u/Mrgod2u82 mentioned - it’s expensive.

You should debug and prepare your model on a less expensive machine - your experimental and development machine - and then run the top model with all the data on the TPU - your production-grade machine.

For example, we trained BERT for 4 days. If we hadn't paid enough attention when setting up the training, we could have spent another $800 just on experimenting, which is too expensive for us. Of course, some companies like Google Brain and OpenAI probably don't care about cost minimization. There you can use a TPU as your daily workstation. 😄

Use one machine for development and one for the heavy-and-long training.
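A quick sanity check on the cost of a repeated 4-day run. This is a minimal sketch that assumes the run was billed at the $8/hour TPU rate quoted earlier in the thread; the comment does not say which machine was actually used, and `run_cost` is a hypothetical helper name.

```python
def run_cost(days: float, price_per_hour: float) -> float:
    """Cost in USD of renting an accelerator around the clock for `days` days."""
    return days * 24 * price_per_hour


# 4 days is from the comment above; the $8/hour rate is an ASSUMPTION
# borrowed from the price list earlier in the thread.
print(run_cost(4, 8.0))  # 768.0
```

Under that assumption one extra 4-day run lands close to the roughly $800 mentioned above.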

Mrgod2u82 t1_izujb7i wrote on December 11, 2022 at 11:14 PM

Guessing because you're paying for it? No point in paying if you're not confident it makes sense to pay. All depends on how deep one's pockets are I suppose.

Shardsmp OP t1_izy3s9z wrote on December 12, 2022 at 6:31 PM

thanks for the answer. what do you recommend for experimental usage?

spaccetime t1_j01j8j7 wrote on December 13, 2022 at 12:16 PM

I'd love to have a Dell Precision with an A5500 as a daily workstation, but our hardware budget can't afford it. 😀

For us, so far, anything with fp32 tensor cores and 16 GB of VRAM has been sufficient for developing and debugging our models (mainly RNNs, Transformers, CNNs and GANs), but the moment we want to train on millions of samples with a higher batch size we have to switch to a faster machine or cluster.

Deep-Station-1746 t1_izrr83v wrote on December 11, 2022 at 10:23 AM

If you are looking to maximize the TFLOPS per $, just use vast.ai. Unless you have enterprise-level VRAM needs, vast will likely be much cheaper than anything a cloud provider lists.

leepenkman t1_izrr972 wrote on December 11, 2022 at 10:23 AM

I recommend getting a box with a 3090 Ti or better; it's much faster than a laptop GPU. On a 24 GB VRAM machine I can train a 3B model or run inference on an 11B one, so training is much more memory-intensive than inference. I also recommend looking into TRC (TPU Research Cloud), which will give you free TPUs for a month, though it still won't end up being completely free. Cloudflare R2 also sounds good for storing models, but storage/transfer costs aren't really what matters during experimental stuff anyway.

Thanks, also check out https://text-generator.io, as it's really efficient to try the pretrained models first instead of attempting complex training

Shardsmp OP t1_izy3k9t wrote on December 12, 2022 at 6:29 PM

thank you!

norcalnatv t1_izshj9v wrote on December 11, 2022 at 2:56 PM

If you own the laptop it’s always going to be cheaper to use that than going to the cloud.

HateRedditCantQuitit t1_iztdm8c wrote on December 11, 2022 at 6:37 PM

> but I'm still wondering whether it's worth it to just build a graphics card rig for the long term.

Pretty much never, assuming it's for personal use.

If you're going to use this rig exclusively for ML, then maybe it still makes sense. The calculation becomes simple: (cost to buy + energy cost per hour * hours of use before it no longer fits your needs) vs. the cloud cost for the same hours. If you use it enough for this to make sense, you might also be surprised how quickly you outgrow it (e.g. maybe you'll want to run some experiments in parallel sometimes, or you'll want to use models bigger than this thing's VRAM in a year or a few).

If you want to use it for non-ML use, then no, just use the cloud. If you're using it enough that the above calculation says to buy, then you won't actually get to use it for non-ML use, which will just annoy the hell out of you.
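The buy-versus-cloud calculation above can be turned into a break-even estimate. This is a rough sketch under simplifying assumptions (constant power draw, constant prices, no resale value, no maintenance costs); `break_even_hours` and all input numbers are hypothetical.

```python
def break_even_hours(purchase_cost, power_kw, energy_price_per_kwh, cloud_price_per_hour):
    """Hours of use after which owning the rig is cheaper than renting.

    Returns None when the cloud is never more expensive per hour than
    the rig's own electricity cost, i.e. buying never pays off.
    """
    own_hourly = power_kw * energy_price_per_kwh  # electricity cost per hour
    saving_per_hour = cloud_price_per_hour - own_hourly
    if saving_per_hour <= 0:
        return None
    return purchase_cost / saving_per_hour


# Hypothetical inputs: a $2000 rig drawing 0.5 kW at $0.50/kWh,
# compared against a cloud instance at $1.25/hour.
hours = break_even_hours(2000.0, 0.5, 0.5, 1.25)
print(f"break-even after about {hours:.0f} hours of use")
```

If the rig stops fitting your needs before that many hours of actual use, renting was the cheaper choice.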

Shardsmp OP t1_izy4io0 wrote on December 12, 2022 at 6:35 PM

haha ty

puppet_pals t1_izrpa5d wrote on December 11, 2022 at 9:54 AM

Unfortunately, there are too many variables at play to give you a set-in-stone answer.

herokocho t1_izukksc wrote on December 11, 2022 at 11:23 PM

TPU is massively better price/performance at the cluster scale in practice due to better interconnect leading to better utilization, but worse price/performance at the single-node scale.

Shardsmp OP t1_izwhsfm wrote on December 12, 2022 at 10:42 AM

Is there any data to back this up?
How do I know exactly where the line is, i.e. at what scale it becomes more worthwhile to use a TPU?

herokocho t1_izxnzhd wrote on December 12, 2022 at 4:50 PM

not aware of any good comparisons out there, this is all anecdata from looking at profiler traces when training diffusion models and noticing that I was communication bottlenecked even on TPUs, so on GPUs it would be much worse.

it's usually better to use TPU as soon as you'd have to use multiple GPU nodes, and basically always better at v4-128 scale and above (v4-128 has 2x faster interconnect than anything smaller).

VirtualHat t1_izrki2g wrote on December 11, 2022 at 8:45 AM

I would also like to know the answer to this...