Submitted by Open-Dragonfly6825 in deeplearning
I've worked for some years developing scientific applications for GPUs. Recently we've been trying to integrate FPGAs into our technologies, and consequently I've been trying to understand what they are useful for.
I've found many posts here and there that claim FPGAs are better suited than GPUs to accelerate Deep Learning/AI workloads (for example, this one by Intel). However, I don't understand why that would be the case. I think the problem is that all those posts try to explain what an FPGA is and how it differs from a GPU, so that people who work on Deep Learning understand why FPGAs are better suited. Nevertheless, my position is exactly the opposite: I know quite well how a GPU works and what it is good for, and I know well enough how an FPGA works and how it differs from a GPU, but I do not know enough about Deep Learning to understand why Deep Learning applications would benefit more from the special features of FPGAs than from the immense parallelism GPUs offer.
As far as I know, an FPGA will never beat a traditional GPU in terms of raw parallelism (or, if it does, it will be much less cost efficient). Thus, when it comes to matrix multiplications, i.e. the main operation in Deep Learning models, or convolutions, GPUs can work on much bigger matrices in parallel. The only explanation I can think of is that traditional Deep Learning applications don't necessarily use such big matrices, but rather smaller ones that can also be fully parallelized on FPGAs and benefit greatly from custom-hardware optimizations (optimized matrix multiplications/tensor operations, reduced-precision values such as FP16, deep-pipeline parallelism, ...). However, given the recent rise in popularity of very complex models (GPT-3, DALL-E, and the like), which boast millions or even billions of parameters, it is hard to imagine that popular Deep Learning models work with matrices small enough for fully parallel architectures to be synthesized on FPGAs.
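For concreteness, this is the kind of reduced-precision matrix multiply I have in mind (a minimal sketch assuming PyTorch and a CUDA-capable GPU; the 4096x4096 shapes are arbitrary):

```python
# Minimal sketch (assumes PyTorch and a CUDA GPU): a reduced-precision (FP16)
# matrix multiply, the core operation discussed above. Shapes are arbitrary.
import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# On recent GPUs this dispatches to fixed-function matrix units; on an FPGA the
# same operation would instead be mapped to a custom, pipelined datapath.
c = a @ b
```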
What am I missing? Any insight will be greatly appreciated.
EDIT: I know TPUs are a thing and are regarded as "the best option" for deep learning acceleration. I will not be working with them, however, so I am not interested in knowing the details on how they compare with GPUs or FPGAs.
suflaj wrote
Well, one reason I can think of is custom kernels. To really get the most out of your model's performance, you will likely be optimizing the kernels you use for your layers, sometimes fusing them. A GPU can't adapt to that as well. The best you can do is use TensorRT to optimize for a specific model of GPU, but why do that when you can implement, e.g., the optimal CNN kernel directly in hardware on an FPGA? On a GPU you can only work with the hardware that came with the GPU.
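To make the fusion point concrete, here is a rough sketch (assuming PyTorch and eval-mode inference; the layer sizes are arbitrary) of folding a BatchNorm into the preceding convolution, the kind of transformation TensorRT applies per GPU and which an FPGA design would simply bake into the datapath:

```python
# Rough sketch (assumes PyTorch, eval-mode inference): fold BatchNorm into the
# preceding Conv2d, a typical operator fusion. Layer sizes are arbitrary.
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=True).eval()
bn = nn.BatchNorm2d(32).eval()

# Fold the BN statistics into the conv weights/bias so inference runs one op, not two.
scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)          # per output channel
fused = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=True).eval()
fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
fused.bias.data = (conv.bias.data - bn.running_mean) * scale + bn.bias.data

x = torch.randn(1, 16, 64, 64)
assert torch.allclose(bn(conv(x)), fused(x), atol=1e-5)  # same result, one kernel
```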
That being said, this is in regard to processing, not necessarily scaling it up. And maybe it makes sense for inference, where it would be nice to have a processor built specifically to run one architecture and which doesn't necessarily process things in large batches.
But for training, obviously nothing is going to beat a GPU/TPU cluster because of pricing and the seemingly infinite scaling of GPUs. If money is not a problem, you can always buy more GPUs and your training will be faster. Parallelization will probably not make your inference faster, though, since the "deep" in DL refers to the long serial chain of processing, and that's where a hardware implementation of the optimized model makes sense.
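A quick way to see that (a sketch assuming PyTorch and a CUDA GPU; the depth/width numbers are arbitrary): batching scales throughput, but a single sample still has to traverse every layer in order, so its latency barely moves:

```python
# Sketch (assumes PyTorch + CUDA; depth/width are arbitrary) of why batching
# helps throughput but not single-sample latency: layers run one after another.
import time
import torch
import torch.nn as nn

depth, width = 200, 1024
model = nn.Sequential(*[nn.Linear(width, width) for _ in range(depth)]).cuda().eval()

def latency(batch_size, iters=20):
    x = torch.randn(batch_size, width, device="cuda")
    with torch.no_grad():
        for _ in range(3):                      # warm-up
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

print("batch 1:  ", latency(1))    # dominated by the serial chain of 200 layers
print("batch 256:", latency(256))  # far more work, yet similar wall-clock time
```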
Ideally, though, you'd want a TPU, not FPGA processors. TPUs are cheaper and you can use them for research as well.