spaccetime t1_izsnm7n wrote on December 11, 2022 at 3:41 PM

8x NVIDIA A100 = $25/hour

TPU v3-4 = $8/hour

TPU v4-4 = $12/hour

When training BERT on 27B tokens I measured faster training times when using the TPU.

Nvidia’s GPUs are great for deep learning, but deep learning is not what they were primarily designed for. They have CUDA cores and even RT cores. You pay extra for rendering capability, but you use it little or not at all when training deep learning models.

Google’s TPU is engineered only for Deep Learning. The MXU is unrivaled.

For short-term usage take the TPU; for long-term usage, a DGX Station or another cluster.

TPU is not for experimental usage. Use it only when you are sure that your model, data and parameterization make sense.

JanGehlYacht t1_izuzh6o wrote on December 12, 2022 at 1:14 AM

Adding to this: TPUs are heavily optimized for transformer architectures, since Google uses them heavily. You'll also see that the TPU has a different stack: it uses XLA as its compiler, so it gets many more compiler optimizations for ML training on the fly (things like op fusion are very beneficial in CUDA, and they come almost for free with XLA). It also scales to very large models easily because of its multi-host high-speed interconnects (you'll see most NLP models trained on GPUs are capped at ~340B params, while Google published a paper on a 540B model almost a year ago).

You should evaluate this yourself for your workload, though. If you are training a large NLP model, you'll be paying a lot anyway, so you should really train your model for an hour on each option you're considering and compare what QPS you get for the $ you pay. Then you can pick a stack and train the model fully.
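A minimal sketch of that comparison in Python, assuming you have already measured throughput yourself in a one-hour trial on each option. The hourly prices are the ones quoted at the top of this thread; the throughput values, the `Trial` class and the `best_value` helper are placeholders made up for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    name: str
    samples_per_second: float  # throughput you measured in your own trial run
    dollars_per_hour: float    # hourly price of the machine

    def samples_per_dollar(self) -> float:
        # Samples processed in one hour, divided by what that hour costs.
        return self.samples_per_second * 3600 / self.dollars_per_hour


def best_value(trials):
    # Pick the option that processes the most samples per dollar.
    return max(trials, key=lambda t: t.samples_per_dollar())


if __name__ == "__main__":
    # Prices are from this thread; throughputs are PLACEHOLDERS, not measurements.
    trials = [
        Trial("8x A100", samples_per_second=1000.0, dollars_per_hour=25.0),
        Trial("TPU v3-4", samples_per_second=400.0, dollars_per_hour=8.0),
        Trial("TPU v4-4", samples_per_second=550.0, dollars_per_hour=12.0),
    ]
    for t in trials:
        print(f"{t.name}: {t.samples_per_dollar():,.0f} samples per dollar")
    print("best value:", best_value(trials).name)
```

Swap in your own measured throughput before drawing any conclusion; with the placeholder values the ranking means nothing.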

Likely, given that a TPU's only job is ML, you'll find that it is heavily optimized for most ML workloads.

pommedeterresautee t1_izu96au wrote on December 11, 2022 at 10:02 PM

Why do you say TPU is not for experimental usage?

spaccetime t1_izw55mj wrote on December 12, 2022 at 7:41 AM

Yes, just as /u/Mrgod2u82 mentioned - it’s expensive.

You should debug and prepare your model on a less expensive machine - your experimental and development machine - and then run the final model with all the data on the TPU - your production-grade machine.

For example, we trained BERT for 4 days. If we hadn’t paid enough attention when setting up the training, we could have spent another $800 just on experimenting, which is too expensive for us. Of course, companies like Google Brain and OpenAI probably don’t care about cost minimization. There you can use a TPU as your daily workstation.😄
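The arithmetic behind that figure can be sketched as below. The thread does not say which machine the 4-day run used; assuming the TPU v3-4 at the $8/hour quoted in the first post, one run costs a bit under $800, which is in line with the amount mentioned above. `run_cost` is a helper name made up for this sketch.

```python
def run_cost(days: float, dollars_per_hour: float) -> float:
    # Total cost of keeping one machine busy for the given number of days.
    return days * 24 * dollars_per_hour


# ASSUMPTION: TPU v3-4 at $8/hour (price from the first post); the thread
# does not state which machine the BERT run actually used.
one_run = run_cost(days=4, dollars_per_hour=8.0)
print(one_run)  # 768.0
```

Every failed or misconfigured full run adds roughly this much again, which is the reason to debug on a cheaper machine first.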

Use one machine for development and one for the heavy-and-long training.

Mrgod2u82 t1_izujb7i wrote on December 11, 2022 at 11:14 PM

Guessing because you're paying for it? No point in paying if you're not confident it makes sense to pay. All depends on how deep one's pockets are I suppose.

Shardsmp OP t1_izy3s9z wrote on December 12, 2022 at 6:31 PM

thanks for the answer. what do you recommend for experimental usage?

spaccetime t1_j01j8j7 wrote on December 13, 2022 at 12:16 PM

I’d love to have a Dell Precision with an A5500 as my daily workstation, but our hardware budget can’t afford it. 😀

For us, so far, anything with fp32 tensor cores and 16 GB of VRAM has been sufficient for developing and debugging our models, mainly RNNs, Transformers, CNNs and GANs, but the moment we want to train on millions of samples with a larger batch size, we have to switch to a faster machine or cluster.