Exarctus t1_iycjag7 wrote on November 30, 2022 at 11:36 AM

Reply to comment by kaskoosek in Does anyone uses Intel Arc A770 GPU for machine learning? [D] by labloke11

PyTorch has a ROCm distribution so most modernish AMD cards should be fine…

Ronny_Jotten t1_iyd1p42 wrote on November 30, 2022 at 2:30 PM

There are many issues with ROCm. "AMD cards should be fine" is misleading. For example, you can get Stable Diffusion to work, but not Dreambooth, because it has dependencies on specific CUDA libraries, etc.:

Training memory optimizations not working on AMD hardware · Issue #684 · huggingface/diffusers

Also, you must be running Linux. AMD cards can be useful, especially with 16 GB VRAM starting in the RX 6800, but currently will be extra effort, and just won't work in some cases.

Exarctus t1_iyd2ety wrote on November 30, 2022 at 2:36 PM

My comment was aimed more towards ML scientists (the vast majority of whom are linux enthusiasts) who are developing their own architectures.

Translating CUDA to HIP is also not particularly challenging, as there are tools available which do this for you.

Ronny_Jotten t1_iyd6ouv wrote on November 30, 2022 at 3:07 PM

> My comment was aimed more towards ML scientists (the vast majority of whom are linux enthusiasts) who are developing their own architectures.

Your original comment implied that ROCm works "fine" as a drop-in replacement for CUDA. I don't think that's true. I'm not an ML scientist, but nobody develops in a vaccum. There are generally going to be dependencies on various libraries. The issue with Dreambooth I mentioned involves this for example:

ROCM Support · Issue #47 · TimDettmers/bitsandbytes

While it should be possible to port it, someone has to take the time and effort to do it. Despite the huge popularity of Dreambooth, nobody has. My preference is to use AMD, and I'm happy to see people developing for it, but it's only "fine" in limited circumstances, compared to Nvidia.

Exarctus t1_iyd7r5i wrote on November 30, 2022 at 3:15 PM

I am an ML scientist. And the statement you're making about AMD GPUs only "being fine in limited circumstances" is absolutely false. Any network that you can create for a CUDA-enabled GPU can also be ported into an AMD-enabled GPU when working with PyTorch with a single code line change.

The issues arise when developers of particular external libraries that you might want to use only develop for one platform. This is **only** an issue when these developers make customized CUDA C implementations for specific part of their network, but don't use HIP for cross-compatibility. This is not an issue if the code is pure PyTorch.

This is not an issue with AMD, it's purely down to laziness (and possibly ill-experience) of the developer.

Regardless, whenever I work with AMD GPUs and implement or derive from other people work, it does sometimes include extra development time to convert e.g any customized CUDA C libraries that have been created by the developer to HIP libraries, but this in itself isn't too difficult as there are conversion tools available.

Ronny_Jotten t1_iydddfe wrote on November 30, 2022 at 3:53 PM

> the statement you're making about AMD GPUs only "being fine in limited circumstances" is absolutely false

Sorry, but there are limitations to the circumstances in which AMD cards are "fine". There are many real-world cases where Nvidia/CUDA is currently required for something to work. The comment you replied to was:

> Limited use in neural network applications at present due to many application's CUDA requirements (though the same could be said of AMD)

It was not specificaly about "code that is pure PyTorch", nor self-developed systems, but neural network applications in general.

It's fair of you to say that CUDA requirements can be met with HIP and ROCm if the developer supports it, though there are numerous issues and flaws in ROCm itself. But there are still issues and limitations in some circumstances, where they don't, as you've just described yourself! You can say that's due to the "laziness" of the developer, but it doesn't change the fact that it's broken. At the least it requires extra development time to fix, if you have the skills. I know a lot of people would appreciate it if you would convert the bitsandbytes library! Just because it could work, doesn't mean it does work.

The idea that there's just no downside to AMD cards for ML, because of the existence of ROCm, is true only in limited circumstances. "Limited" does not mean "very few", it means that ROCm is not a perfect drop-in replacement for CUDA in all circumstances; there are issues and limitations. The fact that Dreambooth doesn't run on AMD proves the point.