Comments

ThatInternetGuy t1_iycfakq wrote

Stick to Nvidia if you don't want to waste your time researching non-Nvidia solutions.

However, it's worth noting that many researchers and devs just stick to renting cloud GPUs anyway. Training usually needs something like an A100 40GB or at least a T4 16GB.

57

kaskoosek t1_iyc16fh wrote

> Limited use in neural network applications at present due to many applications' CUDA requirements (though the same could be said of AMD)

This is what I read from a Newegg review.

35

labloke11 OP t1_iyc2aoy wrote

Intel has oneAPI extensions for PyTorch, sklearn and TensorFlow. Not sure how well they work; any experiences?
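
From skimming the docs, usage of the PyTorch one looks roughly like this (untested on my end; it assumes the XPU build of `intel_extension_for_pytorch` is installed per Intel's instructions, and the exact API may differ between releases):

```python
# Rough sketch of training on an Arc card via Intel Extension for PyTorch.
# Assumes the XPU build of intel_extension_for_pytorch is installed per
# Intel's docs; treat the exact API as illustrative, it varies by release.
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

device = "xpu" if torch.xpu.is_available() else "cpu"
model = model.to(device)
if device == "xpu":
    # ipex.optimize applies Intel-specific layout/kernel optimizations
    model, optimizer = ipex.optimize(model, optimizer=optimizer)

x = torch.randn(32, 128, device=device)
loss = model(x).sum()
loss.backward()
optimizer.step()
```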

22

Exarctus t1_iycjag7 wrote

PyTorch has a ROCm distribution so most modernish AMD cards should be fine…

−1

Ronny_Jotten t1_iyd1p42 wrote

There are many issues with ROCm. "AMD cards should be fine" is misleading. For example, you can get Stable Diffusion to work, but not Dreambooth, because it has dependencies on specific CUDA libraries, etc.:

Training memory optimizations not working on AMD hardware · Issue #684 · huggingface/diffusers

Also, you must be running Linux. AMD cards can be useful, especially with the 16 GB of VRAM starting with the RX 6800, but they currently require extra effort and just won't work in some cases.

15

Exarctus t1_iyd2ety wrote

My comment was aimed more towards ML scientists (the vast majority of whom are Linux enthusiasts) who are developing their own architectures.

Translating CUDA to HIP is also not particularly challenging, as there are tools available which do this for you.

−11

Ronny_Jotten t1_iyd6ouv wrote

> My comment was aimed more towards ML scientists (the vast majority of whom are Linux enthusiasts) who are developing their own architectures.

Your original comment implied that ROCm works "fine" as a drop-in replacement for CUDA. I don't think that's true. I'm not an ML scientist, but nobody develops in a vacuum. There are generally going to be dependencies on various libraries. The issue with Dreambooth I mentioned involves this, for example:

ROCM Support · Issue #47 · TimDettmers/bitsandbytes

While it should be possible to port it, someone has to take the time and effort to do it. Despite the huge popularity of Dreambooth, nobody has. My preference is to use AMD, and I'm happy to see people developing for it, but it's only "fine" in limited circumstances, compared to Nvidia.

12

Exarctus t1_iyd7r5i wrote

I am an ML scientist. And the statement you're making about AMD GPUs only "being fine in limited circumstances" is absolutely false. Any network you can create for a CUDA-enabled GPU can also be ported to an AMD GPU when working with PyTorch, with a single line of code changed.

The issues arise when developers of particular external libraries that you might want to use only develop for one platform. This is **only** an issue when those developers write customized CUDA C implementations for specific parts of their network but don't use HIP for cross-compatibility. This is not an issue if the code is pure PyTorch.
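
To illustrate, a minimal sketch of what I mean by pure PyTorch; on a ROCm build, AMD GPUs are exposed through the usual `torch.cuda` interface, so the device line below is about the only thing you'd ever touch:

```python
# Minimal sketch of the "pure PyTorch" case: a ROCm build of PyTorch exposes
# AMD GPUs through the torch.cuda interface, so the same script runs on an
# Nvidia or AMD box; what changes is the installed wheel (CUDA vs ROCm).
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
).to(device)

x = torch.randn(8, 64, device=device)
print(model(x).shape)  # identical on CUDA and ROCm backends
```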

This is not an issue with AMD; it's purely down to the laziness (and possibly inexperience) of the developer.

Regardless, whenever I work with AMD GPUs and implement or derive from other people's work, it does sometimes take extra development time to convert, e.g., any customized CUDA C libraries the developer has written into HIP libraries, but this in itself isn't too difficult as there are conversion tools available.

−11

Ronny_Jotten t1_iydddfe wrote

> the statement you're making about AMD GPUs only "being fine in limited circumstances" is absolutely false

Sorry, but there are limitations to the circumstances in which AMD cards are "fine". There are many real-world cases where Nvidia/CUDA is currently required for something to work. The comment you replied to was:

> Limited use in neural network applications at present due to many application's CUDA requirements (though the same could be said of AMD)

It was not specifically about "code that is pure PyTorch", nor about self-developed systems, but about neural network applications in general.

It's fair of you to say that CUDA requirements can be met with HIP and ROCm if the developer supports it, though there are numerous issues and flaws in ROCm itself. But in the circumstances where they don't, there are still issues and limitations, as you've just described yourself! You can say that's due to the "laziness" of the developer, but it doesn't change the fact that it's broken. At the least, it requires extra development time to fix, if you have the skills. I know a lot of people would appreciate it if you would convert the bitsandbytes library! Just because it could work doesn't mean it does work.

The idea that there's just no downside to AMD cards for ML, because of the existence of ROCm, is true only in limited circumstances. "Limited" does not mean "very few", it means that ROCm is not a perfect drop-in replacement for CUDA in all circumstances; there are issues and limitations. The fact that Dreambooth doesn't run on AMD proves the point.

8

trajo123 t1_iyciqd9 wrote

For users, it's quite expensive that Nvidia has such a monopoly on ML/DL compute acceleration. People replying with "don't bother, just use Nvidia&CUDA" only make the problem worse ...music for Nvidia's ears.
I would say, by all means try it out and share your experience, just be aware that it's likely going to be more hassle than using Nvidia&CUDA.

33

r_linux_mod_isahoe t1_iyclnxu wrote

no, it's AMD who fucked up. The whole ROCm is an afterthought. Hire a dev, make pytorch work on all modern AMD GPUs, then we'll talk. For now this is somehow a community effort.

26

serge_cell t1_iycp9ri wrote

For that, AMD would first have to make a proper implementation of OpenCL. People complain all the time: slowdowns, crashes, lack of portability. This has been going on for 10 years already and it isn't getting better.

9

Ronny_Jotten t1_iyd43te wrote

> People replying with "don't bother, just use Nvidia&CUDA" only make the problem worse

No, they don't "only make it worse". It's good advice to a large proportion of people who just need to get work done. AMD/Intel need to hear that, and step up, by providing real, fully-supported alternatives, not leaving their customers to fool around with half-working CUDA imitations. ML is such an important field right now, and they've dropped the ball.

18

ReginaldIII t1_iycqo8c wrote

> People replying with "don't bother, just use Nvidia&CUDA" only make the problem worse ...music for Nvidia's ears.

My job is to get a trained model out the door so we can run experiments.

My job is not to revolutionize the frameworks and tooling available so that competing hardware can be made a feasible alternative for everyone.

There are only so many hours in the day. I get paid for a very specific job. I have to work within the world that exists around me right now.

15

philthechill t1_iycso44 wrote

If you're in a commercial setting, your job is to get market-beating learning done at minimal cost. OP says these things might have a revolutionary cost per learning value, so yeah, it is within your job parameters to look at specs, pricing and tool support at the very least. Ignoring technological revolutions is definitely one way companies end.

−7

ReginaldIII t1_iyctmv8 wrote

90% of the time my job is to be a small and consistently performing cog in a much bigger machine, because I am there to help drive downstream science outcomes for other scientists (often in a different discipline).

We need to get X done within Y timeframe.

> "Let's consider upending our infrastructure and putting millions of pounds' worth of existing and battle-proven code and hardware up in flux so we can fuck around seeing if Intel has actually made a viable GPU-like product on their umpteenth attempt"

... is not exactly an easy sell to my board of governance.

I was in the first wave of people who got access to Xeon Phi Knights Corner co-processor cards. Fuck my life, did we waste time on that bullshit. The driver support was abysmal, even with Intel's own ICC compiler and their own MPI distribution.

11

philthechill t1_iyctys1 wrote

Yeah fair.

2

ReginaldIII t1_iydnekt wrote

Also worth considering how many years it is going to take to offset the sizeable cost of such a migration.

Forget the price of the hardware, how long is it going to take to offset the cost of the programming and administration labour to pull off this sort of move?

What about maintenance? We've got years of experience with Nvidia cards in datacentres, we understand the failure modes pretty well, we understand the tooling needed to monitor and triage these systems at scale.

What guarantees do I have that if I fill my racks with this hardware they won't be dying or catching on fire within a year?

What guarantees do I have that Intel won't unilaterally decide this is a dead cat for them and scrap the project, like they have for almost every GPU-adjacent project they've had?

3

AtomKanister t1_iycydix wrote

> "Let's consider upending our infrastructure and putting millions of pounds' worth of existing and battle-proven code and hardware up in flux so we can fuck around seeing if Intel has actually made a viable GPU-like product on their umpteenth attempt"

That's exactly how innovation is made, and missing out on this in crucial moments is how previously big players become irrelevant in the blink of an eye. See: Kodak, Blockbuster, Sears, Nokia.

It's valid to be skeptical of new developments (because a lot of them will be dead ends), but overdo it and you're setting yourself up for disaster.

−9

hgoel0974 t1_iyczwvn wrote

Setting up infrastructure that relies on a GPU which can't do what you need yet, and isn't optimized for it either, is certainly innovative, but not in the way you're thinking.

6

ReginaldIII t1_iydl00l wrote

> That's exactly how innovation is made

It's also how companies overextend and go out of business.

1

hgoel0974 t1_iycyx88 wrote

For most users of ML frameworks, results take priority and there isn't much they can do about AMD's shit software and unreliable support. Plus even 4090s aren't really that expensive relative to what ML people make.

That said, Intel might actually be able to compete once their drivers have caught up. Unlike AMD, who seems to have systemic issues (not to mention fatal design flaws in ROCm in general), Intel just needs time because they clearly rushed the devices out before the drivers were fully ready.

1

slashdave t1_iydpou1 wrote

Power costs dwarf hardware costs, by miles. Come up with a power-efficient GPU, and we'll talk.

1

AerysSk t1_iycz3fs wrote

No, dealing with Nvidia dependencies is already enough of a hassle. My department sticks with Nvidia.

18

staros25 t1_iyd748z wrote

Yes, I’ve been using one for about a month now.

8

labloke11 OP t1_iydkwc0 wrote

And?

3

staros25 t1_iydwrac wrote

So far I’m happy with it.

Intel publishes extensions for PyTorch and TensorFlow. I've been working with PyTorch, so I just needed to follow these instructions to get everything set up.
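
As a rough illustration, the post-install sanity check looks something like the snippet below (the `xpu` namespace mirrors `torch.cuda`; exact names may vary between extension releases):

```python
# Quick post-install sanity check (illustrative; the xpu namespace mirrors
# torch.cuda and details may differ slightly between extension releases).
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers "xpu")

print("torch:", torch.__version__, "ipex:", ipex.__version__)
print("XPU available:", torch.xpu.is_available())
if torch.xpu.is_available():
    print("Device count:", torch.xpu.device_count())
    print("Device name:", torch.xpu.get_device_name(0))
```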

This was a replacement for my GTX 1070. I don't have any direct benchmarks, but the memory increase alone allowed me to train some models I had issues with before.

For “pros”, I’d say the performance for the price point is pretty money. Looking at NVIDIA GPUs that have 16+ GB of memory, you’d need a 3070 which looks to be in the $600-$700 range. The setup took me an evening to get everything figured out, but it wasn’t too bad.

For “cons”, it's still a new GPU and there are a couple of open issues. So far I haven't run into any dealbreakers. Probably the biggest drawback is that Intel needs to release their extension pinned to a specific release of PyTorch / TensorFlow. I think the TensorFlow extension works with the newest version; the PyTorch extension currently supports v1.10 (1.13 is current).

All in all, I think it's a solid choice if you're OK diving into the Intel ecosystem. While their extensions aren't nearly as plug-and-play as CUDA, you can tell Intel really does take open source seriously by the amount of engagement on GitHub. Plus, at $350 you can almost buy two for the cost of a 3070.

15

Week_Cold t1_iydby4r wrote

Buy Nvidia, and use the time to earn money. Debugging with AMD is tough.

6

kaskoosek t1_iyc0goy wrote

I'm interested in this.

4

deepneuralnetwork t1_iyckmuv wrote

Wouldn't waste my time on non-Nvidia at this point, honestly.

4

mentatf t1_iyegqxd wrote

Yes, fuck monopolies

4

solimaotheelephant3 t1_iydg7gy wrote

Ignoring software issues, I wonder about tensor cores… Arc might be good in gaming benchmarks, but for ML all that matters is tensor cores, I believe.

2

iamquah t1_iydjm59 wrote

I went to one of their launch events and saw an NN being trained live. Having said that, it was an SNN, and I was a little surprised as to why they chose to do that instead of a standard NN.

I felt that it looked appealing, but my biggest problem with it was that you needed a special version of PyTorch (and TensorFlow, I think?), which always worries me. It's not easy to pull the two repos together, and I'd rather not have an entirely separate repo just for my GPU, especially when the two can diverge.

2

retrorays t1_iyej3ly wrote

Yes, we need more players in this space. Besides, Nvidia's driver support is freaking abysmal. They don't open up any of their drivers; it's an ultra-closed ecosystem.

2

learn-deeply t1_iyctgh0 wrote

The Arc GPU only has 16GB; it would be worth giving it a shot if it had 24GB+ like the 3090/4090 does, imo.

1

devingregory_ t1_iydm0pz wrote

There was PlaidML, which worked with OpenCL. It was getting serious. Then it was bought by Intel.

I was searching for framework alternatives because I had an ATI 270 at the time. ROCm does not support the 270, only the 270X and later. In the end, I bought an Nvidia card.

Nvidia is serious about ML; I don't think the others took ML and software support as seriously.

1

ivan_kudryavtsev t1_iycjqd9 wrote

Just take a look at the list of Intel products with R.I.P. status to get the answer. The only thing Intel guarantees will live on is the x86_64 CPU.

−4