
suflaj t1_j6zq1k9 wrote

Well, one reason I can think of is custom kernels. To really get the most out of your model's performance, you will likely be optimizing the kernels you use for your layers, sometimes fusing them. A GPU can't adapt to that as well. The best you can do there is use TensorRT to optimize for a specific model of GPU, but why do that when you can implement, e.g., the optimal CNN kernel directly in hardware on an FPGA? On a GPU you can only work with the hardware the GPU shipped with.
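
To make "fusing" concrete, here is a minimal PyTorch sketch (my own illustration, with arbitrary layer sizes) of one common inference-time fusion: folding a BatchNorm layer into the convolution that precedes it. TensorRT does this kind of graph rewriting automatically for a given GPU; on an FPGA you would go further and bake the fused operation into the hardware itself.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the conv weights (inference only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv = nn.Conv2d(3, 16, 3, padding=1).eval()
bn = nn.BatchNorm2d(16).eval()          # eval mode: BN uses its running statistics
x = torch.randn(1, 3, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True
```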

That being said, this is about processing, not necessarily about scaling it up. And maybe it makes sense for inference, where it would be nice to have a processor built specifically to run a given architecture, one that doesn't necessarily process things in large batches.

But for training, obviously nothing is going to beat a GPU/TPU cluster, because of pricing and the seemingly infinite scaling of GPUs. If money is not a problem, you can always just buy more GPUs and your training will be faster. Parallelization will probably not make your inference faster, though, since the "deep" in DL refers to a long serial chain of processing, and that's where a hardware implementation of the optimized model makes sense.
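
As a rough sketch of that split (an assumed PyTorch setup, purely illustrative): data parallelism copies the model to every GPU and splits each batch across them, so training throughput grows with the number of cards, while a single sample still has to traverse every layer in order.

```python
import torch
import torch.nn as nn

# A "deep" toy model: 50 layers that must run one after another.
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(50)])

device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    # Replicates the model on every GPU and splits each batch across them:
    # samples/second scales with the number of GPUs, per-sample latency does not.
    model = nn.DataParallel(model)
model = model.to(device)

batch = torch.randn(256, 1024, device=device)
out = model(batch)  # the batch dimension is parallel; the 50 layers are still sequential
```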

Ideally, though, you'd want a TPU, not FPGA processors. TPUs are cheaper and you can use them for research as well.

5

Open-Dragonfly6825 OP t1_j72s5ov wrote

One question: what do you mean by "kernels" here? Is it the CNN operation you apply to the layers? (As I said, I am not familiar with deep learning, and "kernel" means something else when talking about GPU and FPGA programming.)

I know about TPUs and I understand they are the "best solution" for deep learning. However, I did not mention them since I won't be working with them.

Why wouldn't GPU parallelization make inference faster? Isn't inference composed mainly of matrix multiplications as well? Maybe I don't understand very well how GPU training is performed and how it differs from inference.

1

suflaj t1_j731s6u wrote

I mean kernels in the sense of functions.

> Why wouldn't GPU parallelization make inference faster?

Because most DL models are deep, not particularly wide. As I've explained already, deep means a long serial chain, which is not parallelizable beyond data parallelism (which doesn't speed up inference) and model parallelism (which is generally not implemented and has heavy IO costs).
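
Here's a toy illustration of that serial chain (my own sketch, arbitrary sizes):

```python
import torch
import torch.nn as nn

layers = [nn.Linear(512, 512) for _ in range(100)]  # a deep stack, nothing wide about it

x = torch.randn(1, 512)        # one inference request, batch size 1
for layer in layers:           # 100 dependent steps: layer k needs layer k-1's output
    x = torch.relu(layer(x))   # extra GPUs can't shorten this critical path;
                               # they can only serve more requests at once
```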

Wide models, and how they can be made equivalent to deep ones, are unexplored, although they are theoretically just as expressive.

1

Open-Dragonfly6825 OP t1_j73258d wrote

Ok, that makes sense. Just wanted to confirm I understood it well.

Thank you.

2