Submitted by Open-Dragonfly6825 t3_10s3u1s in deeplearning

I've worked for some years developing scientific applications for GPUs. Recently we've been trying to integrate FPGAs into our technologies, and consequently I've been trying to understand what they are useful for.

I've found many posts here and there claiming that FPGAs are better suited than GPUs for accelerating Deep Learning/AI workloads (for example, this one by Intel). However, I don't understand why that would be the case. I think the problem is that all those posts try to explain what an FPGA is and how it differs from a GPU, so that people who work on Deep Learning understand why FPGAs are better suited. Nevertheless, my position is exactly the opposite: I know quite well how a GPU works and what it is good for, I know well enough how an FPGA works and how it differs from a GPU, but I do not know enough about Deep Learning to understand why Deep Learning applications would benefit more from the special features of FPGAs than from the immense parallelism GPUs offer.

As far as I know, an FPGA will never beat a traditional GPU in terms of raw parallelism (or, if it does, it will be much less cost efficient). Thus, when it comes to matrix multiplications (the main operation in Deep Learning models) or convolutions, GPUs can work with much bigger matrices in parallel. The only explanation I can think of is that traditional Deep Learning applications don't necessarily use such big matrices, but rather smaller ones that can also be fully parallelized on FPGAs and benefit greatly from custom-hardware optimizations (optimized matrix multiplications/tensor operations, working with reduced-bit values such as FP16, deep-pipeline parallelism, ...). However, given the recent rise in popularity of very complex models (GPT-3, DALL-E, and the like), which boast millions or even billions of parameters, it is hard to imagine that popular deep learning models work with small matrices for which fully parallel architectures can be synthesized on FPGAs.
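
To make concrete what I mean by the immense parallelism of GPUs, here is a toy sketch (my own illustration, assuming PyTorch and a CUDA-capable GPU; the sizes are arbitrary) of the kind of large, reduced-precision matrix multiplication I have in mind:

```python
# Toy illustration: a single large FP16 matrix multiplication, the kind of
# operation a GPU parallelizes internally across thousands of cores/tensor cores.
import torch

a = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)

c = a @ b                    # one kernel launch; the GPU handles the parallelism
torch.cuda.synchronize()     # wait for the asynchronous kernel to finish
print(c.shape)               # torch.Size([8192, 8192])
```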

What am I missing? Any insight will be greatly appreciated.

EDIT: I know TPUs are a thing and are regarded as "the best option" for deep learning acceleration. I will not be working with them, however, so I am not interested in knowing the details on how they compare with GPUs or FPGAs.

15

Comments


suflaj t1_j6zq1k9 wrote

Well, one reason I can think of is custom kernels. To really get the most out of your model's performance, you will likely be optimizing the kernels you use for your layers, sometimes fusing them. A GPU can't adapt to that as well. The best you can do is use TensorRT to optimize for a specific model of GPU, but why do that when you can create, e.g., the optimal CNN kernel in hardware on an FPGA? On a GPU you can only work with the hardware that came with the GPU.
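
To give a rough idea of what I mean by fusing kernels, here is a sketch assuming PyTorch and its TorchScript JIT (the function names are made up for illustration). On a GPU, the JIT may at best fuse the pointwise ops into fewer kernel launches; on an FPGA you could implement the whole fused block as a single hardware pipeline.

```python
# Rough sketch of kernel fusion. Each line in the eager version launches its
# own GPU kernel; scripting lets the JIT potentially fuse the pointwise ops.
import torch

def eager_block(x, bias):
    y = x + bias        # kernel 1
    y = torch.relu(y)   # kernel 2
    return y * 2.0      # kernel 3

fused_block = torch.jit.script(eager_block)  # JIT may fuse the pointwise chain

x = torch.randn(1024, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")
out = fused_block(x, bias)
```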

That being said, this is in regard to processing, not necessarily scaling it up. And maybe it makes sense for inference, where it would be nice to have a processor built specifically to run some architecture, one that doesn't necessarily process things in large batches.

But for training, obviously nothing is going to beat a GPU/TPU cluster because of pricing and the seemingly infinite scaling of GPUs. If money is not a problem you can always just buy more GPUs and your training will be faster. But parallelization will probably not make your inference faster, since the "deep" in DL refers to the long serial chain of processing, and that's where a hardware implementation of the optimized model makes sense.
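
To be clear, "just buy more GPUs" means data parallelism over the batch. A minimal sketch, assuming PyTorch and a machine with more than one CUDA GPU; this scales training throughput, not single-sample inference latency:

```python
# Minimal data-parallel training step: each GPU processes a slice of the batch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model = nn.DataParallel(model).cuda()    # replicate across all visible GPUs
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(256, 512).cuda()         # the batch is split across the GPUs
target = torch.randint(0, 10, (256,)).cuda()

loss = nn.functional.cross_entropy(model(x), target)
loss.backward()
optimizer.step()
```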

Ideally, though, you'd want a TPU, not FPGA processors. TPUs are cheaper and you can use them for research as well.

5

BellyDancerUrgot t1_j6zyiqm wrote

I’ll be honest, I don’t really know what FPGAs do (I reckon they are an ASIC for matrix operations?) or how they do it, but tensor cores already provide optimization for matrix/tensor operations, and FP16 and mixed precision have been available for quite a few years now. Ada and Hopper even enable insane performance improvements for FP8 operations. Is there any real, verifiable benchmark that compares training and inference time of the two?
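
For reference, using tensor cores with mixed precision is basically a two-line change these days. A sketch assuming PyTorch's AMP API:

```python
# Mixed-precision training step on tensor cores using PyTorch AMP.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()       # rescales gradients for FP16 safety

x = torch.randn(64, 1024, device="cuda")

with torch.cuda.amp.autocast():            # matmuls run in FP16 on tensor cores
    loss = model(x).pow(2).mean()          # dummy loss for illustration

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```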

On top of that, there’s the obvious CUDA monopoly that Nvidia keeps a tight leash on. Without software even the best hardware is useless, and almost everything is optimized to run on the CUDA backend.

0

yannbouteiller t1_j70o6y3 wrote

FPGAs are theoretically better than GPUs for deploying Deep Learning models simply because they are theoretically better than anything at doing anything. In practice, though, you never have enough circuitry on an FPGA to efficiently deploy a large model, and they are not targeted by the main Deep Learning libraries, so you have to do the whole thing by hand: quantizing your model, extracting its weights, coding each layer in embedded C/VHDL/etc., and doing most of the hardware optimization by hand. It is tedious enough that plug-and-play solutions like GPUs/TPUs are preferable in most cases, including embedded systems.
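
For the quantize-and-extract-weights part alone, the flow looks roughly like this (a sketch assuming PyTorch's dynamic quantization; the real work of hand-coding the layers in HLS/VHDL only starts after this):

```python
# Rough sketch of the "quantize and extract weights" step of an FPGA flow.
# The extracted INT8 arrays would then be hard-coded into the hardware design.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Post-training dynamic quantization of the Linear layers to INT8
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Walk the quantized modules and pull out the raw integer weights
for name, module in qmodel.named_modules():
    if hasattr(module, "weight") and callable(module.weight):
        w_int8 = module.weight().int_repr()   # raw INT8 values of the layer
        print(name, tuple(w_int8.shape), w_int8.dtype)
```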

4

TheDailySpank t1_j70szn3 wrote

I tried to get into FPGAs back in the day but found the hardware/software/über-nerd level of knowledge to be way out of my league. Dove into the whole IP thing/levels of logic and found it way above my level of autism.

0

alex_bababu t1_j716j6g wrote

Do they say FPGAs are better for training, or for inference with already trained models?

Training, I don't know. For inference, I would say FPGAs are better.

3

AzureNostalgia t1_j7199dd wrote

Don't listen to anyone saying FPGAs are better than GPUs in AI. They don't know the platforms well enough.

FPGAs are obsolete for AI (training AND inference) and there are many reasons for that: less parallelism, less power efficiency, no scaling, they run at like 300 MHz at best, and they don't have the ecosystem and support GPUs have (i.e. support for models and layers). Even the reduced-precision "advantage" they had is long gone; GPUs can do 8-bit and even FP8 now. Maybe the largest FPGA (for example, a Xilinx Alveo card) can be compared with a small embedded Jetson Xavier in AI (you can compare the performance results from each company to see for yourself).

Wonder why there are no FPGAs in MLPerf (an AI benchmark which has become the standard)? Yeah, you guessed it. Even Xilinx realized how bad FPGAs are for AI and stopped production for this reason. They created the new Versal series, which are not even FPGAs; they are more like GPUs (specifically, they work like Nvidia Tensor Cores for AI).

To sum up, FPGAs are worse in everything when compared with GPUs. Throughput, latency, power efficiency, performance/cost, you name it. Simple as that.

2

Open-Dragonfly6825 OP t1_j72om7m wrote

Maybe I missed it, but the posts I read don't specify that. Some scientific works claim that FPGAs are better than GPUs both for training and inference.

Why would you say they are better only for inference? Wouldn't a GPU be faster for inference too? Or is it just that inference doesn't require high speed, so FPGAs are chosen for their energy efficiency?

1

Open-Dragonfly6825 OP t1_j72ow1i wrote

It is definitely hard to get started with FPGAs. High-Level Synthesis tools such as OpenCL have eased the effort in recent years, but it is still particularly... different from regular programming. It requires more thoughtfulness, I would say.

2

Open-Dragonfly6825 OP t1_j72pzlc wrote

FPGAs are reconfigurable hardware accelerators. That is, you could theoretically "synthesize" (implement) any digital circuit on an FPGA, provided the FPGA has a high enough amount of "resources".

This would let the user deploy custom hardware solutions for virtually any application, which could be way more optimized than software solutions (including those using GPUs).

You could implement tensor cores or a TPU using an FPGA. But, obviously, an ASIC is faster and more energy efficient than its equivalent FPGA implementation.

Linking to what you say: besides all the "this is just theory, in practice things are different" caveats of FPGAs, programming GPUs with CUDA is way, way easier than programming FPGAs as of today.

2

Open-Dragonfly6825 OP t1_j72qtao wrote

That actually makes sense. FPGAs are very complex to program, even though the gap between software and hardware programming has been narrowed by High-Level Synthesis (e.g. OpenCL). I can see how it is just easier to use a GPU that is simpler to program, or a TPU that already has compatible libraries that abstract away the low-level details.

However, FPGAs have been increasing in area and available resources in recent years. Is that still not enough circuitry?

1

Open-Dragonfly6825 OP t1_j72s5ov wrote

One question: what do you mean by "kernels" here? Is it the CNN operation you apply to the layers? (As I said, I am not familiar with Deep Learning, and "kernels" means something else when talking about GPU and FPGA programming.)

I know about TPUs and I understand they are the "best solution" for deep learning. However, I did not mention them since I won't be working with them.

Why wouldn't GPU parallelization make inference faster? Isn't inference composed mainly of matrix multiplications as well? Maybe I don't understand very well how GPU training is performed and how it differs from inference.

1

Open-Dragonfly6825 OP t1_j72yyst wrote

Could you elaborate on some of the points you make? I have read the opposite of what you say regarding the following points:

  • Many scientific works claim that FPGAs have similar or better power (energy) efficiency than GPUs in almost all applications.
  • FPGAs are considered a good AI technology for embedded devices where low energy consumption is key. Deep Learning models can be trained somewhere else, using GPUs, and, theoretically, inference can be done on the embedded devices using the FPGAs, for good speed and energy efficiency. (Thus, FPGAs are supposedly well-suited for inference.)
  • Modern high-end (data center) FPGAs target 300 MHz as a base clock speed. It is not unusual for designs to achieve clock speeds higher than 300 MHz. Not much higher, though, unless you heavily optimize the design and use some complex tricks to boost the clock speed.

The comparison you make about the largest FPGA being comparable only to small embedded GPUs is interesting. I might look more into that.

1

suflaj t1_j731s6u wrote

I mean kernels in the sense of functions.

> Why wouldn't GPU parallelization make inference faster?

Because most DL models are deep, and not exactly wide. As I've explained already, deep means a long serial chain. That is not parallelizable outside of data parallelism, which doesn't speed up inference, and model parallelism (generally not implemented, and with heavy IO costs).
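
A toy sketch of why depth is the problem (assuming PyTorch): within a single sample, each layer needs the previous layer's output, so there is nothing to parallelize across layers no matter how much hardware you throw at it.

```python
# A "deep" model is a long serial chain: layer i+1 cannot start until
# layer i has produced its output, regardless of how many GPUs you have.
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(256, 256) for _ in range(100)])  # deep, not wide

x = torch.randn(1, 256)    # a single inference request (batch size 1)
for layer in layers:       # strictly sequential dependency chain
    x = torch.relu(layer(x))
print(x.shape)
```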

Wide models and how they become equivalent to deep ones are unexplored, although they are theoretically just as expressive.

1

AzureNostalgia t1_j732f33 wrote

The claim that FPGAs have better power efficiency than GPUs is a relic of the past. In the real world and industry (and not in scientific papers written by PhDs), GPUs achieve way higher performance. The simple reason is that FPGAs as devices are way behind in architecture, compute capacity and capabilities.

A very simple way to see my point is this. Check one of the largest FPGAs from Xilinx, the Alveo U280 (https://www.xilinx.com/products/boards-and-kits/alveo/u280.html#specifications). It can theoretically achieve up to 24.5 INT8 TOPS of AI performance, and it's a 225 W card. Now check an embedded GPU on a similar process node (in nm), the AGX Xavier (https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-agx-xavier/). Check the specs at the bottom: up to 22 TOPS in a 30 W device. That's why FPGAs are obsolete. I have countless examples like that, but you get the idea.
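
Just to put numbers on it, from the spec sheets quoted above:

```python
# Back-of-the-envelope efficiency from the quoted spec-sheet numbers.
alveo_u280_tops, alveo_u280_watts = 24.5, 225   # Xilinx Alveo U280, INT8
agx_xavier_tops, agx_xavier_watts = 22.0, 30    # Nvidia Jetson AGX Xavier, INT8

print(alveo_u280_tops / alveo_u280_watts)   # ~0.11 TOPS/W
print(agx_xavier_tops / agx_xavier_watts)   # ~0.73 TOPS/W, roughly 6-7x better
```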

2

BrotherAmazing t1_j73k2x8 wrote

It’s very specific to what you are doing. GPUs are absolutely superior, hands down, for the kind of R&D and early offline prototyping I do, when you consider all the practical business aspects of efficiency, cost, flexibility, and practicality given our business's and staff's pedigree and history.

2

alex_bababu t1_j73ofte wrote

You probably know much more than me. My thought was: for inference you don't need the computational power for backpropagation. The model is fixed, and you can find an efficient way to program an FPGA to run it.

Basically like an ASIC. And also more energy efficient.

You could map the model onto the FPGA in such a way that you would not need to store intermediate results in memory.
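
Roughly what I mean, as a minimal sketch assuming PyTorch: in inference mode there are no gradients or saved activations to keep around, which is exactly what a fixed FPGA/ASIC pipeline can exploit by streaming data straight through.

```python
# Inference-only forward pass: no gradients or intermediate activations
# are retained for backpropagation, unlike during training.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

x = torch.randn(1, 128)

with torch.no_grad():      # forward pass only, no backprop bookkeeping
    y = model(x)

print(y.shape)
```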

2

Open-Dragonfly6825 OP t1_j74ntpw wrote

Hey, maybe it's true that I know my fair share about acceleration devices. But, until you mentioned it, I had actually forgotten about backpropagation, which is something basic for deep learning. (Or, rather than forgetting it, I just hadn't thought about it.)

Now that you mention it, it makes so much sense why FPGAs might be better suited, but only for inference.

1

Open-Dragonfly6825 OP t1_j74oes8 wrote

I guess the suitability of the acceleration devices changes depending on your specific context of development and/or application. Deep learning is such a broad field with so many applications, it is reasonable that different applications benefit more from different accelerators.

Thank you for your comment.

2