Submitted by laprika0 t3_yj5xkp in MachineLearning

Hi, what's the latest state of affairs with prototyping ML on Apple silicon, especially M1 Macbook Pro (or M2 if you can see the future)?

EDIT I'm interested in what it's like on both GPU and just CPU

I need to be able to run ML code incl training and inference, but it doesn't need to be efficient as it's just for local validation/prototyping (esp. unit testing) before I do the significant training elsewhere.

I want to know what it's like doing ML dev day to day. If it's still finicky then it's probably not worth it. Unfortunately I still don't know if I'm going to be using TensorFlow, PyTorch or JAX (EDIT but I'll be using one of them).

This question has been asked a few times, but not recently and things can change fast.

EDIT: I'll be working in industry as an ML engineer.




BlazeObsidian t1_ium8sru wrote

I haven’t tried out the performance yet, but it appears PyTorch now supports the Apple silicon processors as a separate device named ‘mps’, similar to CUDA for Nvidia GPUs. There is also a TensorFlow plugin that can be installed separately to take advantage of the Apple chips.
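For what it's worth, device selection follows the same pattern as with CUDA. A minimal sketch (assuming PyTorch 1.12+, where the `mps` backend landed; on older versions `torch.backends.mps` doesn't exist, which the `getattr` guard covers):

```python
import torch

def pick_device() -> torch.device:
    """Prefer Apple's Metal backend when present, else CUDA, else CPU."""
    mps = getattr(torch.backends, "mps", None)  # only present in torch >= 1.12
    if mps is not None and mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device()
x = torch.randn(8, 16, device=device)  # tensor lands on whichever device was picked
print(device.type)
```

The rest of the training code stays device-agnostic: move the model with `.to(device)` and everything else is unchanged.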


papinek t1_iums091 wrote

Works very well. I use Stable Diffusion on a Mac M1, and using mps it's blazing fast.


caedin8 t1_iunmrya wrote

By blazing fast he means as fast as a GTX 1060. My 3070 is 5x faster than my M1 Pro.


papinek t1_iunqbum wrote

Well, on my M1 it takes 20 seconds to generate an image with SD at 30 steps using mps. Using the CPU it would be many minutes per image. So I would say it works well on the M1.


caedin8 t1_iunqou5 wrote

I don't even think you can use CPU to make images using stable diffusion, but maybe you can.

Yeah, my M1 Pro takes about 25-30 seconds per image; some of that has to do with configuration. But my RTX 3070 cranks them out in about 4 to 5 seconds per image.


papinek t1_iunrbv0 wrote

Yes, you can switch to CPU, which then takes like 5-10 minutes per picture. So the gain from using mps is big.


Hobit104 t1_iunyudt wrote

That wasn't their point though.


BlazeObsidian t1_iumubsj wrote

Did you run into memory issues ? I assumed it wouldn’t work with only 8 gigs unified memory.


papinek t1_iunm0b5 wrote

I have 32GB and have never run into issues. Alongside SD I run Photoshop, IntelliJ IDEA and Chrome with 20 tabs, and it has always been enough.


BlazeObsidian t1_iunmgfn wrote

Hmm. Might give it a try. Usually I use Colab. If there isn’t much of a difference during inference, local is better.


TheEdes t1_iuqcu3t wrote

PyTorch supports it, but there are still some bugs here and there; you might also find that a function or its gradient isn’t implemented yet on some architectures.


vade t1_iumm84n wrote

So there are a few ways to think about this, and some things to know:

A) Apple Neural Engine is designed for inference workloads and not back prop or training as far as I’m aware.

B) This means only GPU or CPU for training for DL

C) You can get partial GPU acceleration using PyTorch and TensorFlow, but neither is fully optimized or really competitive.

D) You can accept the training wheels (pun intended) and train simple models using the Create ML GUI, which has about as good M-series GPU support as you’ll get, but is woefully out of date for many classes of problems and doesn’t support arbitrary layers, losses, optimizers, etc. It’s a black box.

E) You can use the Create ML API to get a tad more control, but not much more.

If you’re interested in Core ML for inference, I will say from experience that model conversion is non-trivial if you want performance, as some layers don’t always convert appropriately and shapes can’t always be deduced, depending on the model’s source code.

Also CoreML inference in python doesn’t properly support batching. I’m not joking.

All in all, if you get simple shit working it’s fast, but if you want anything remotely nuanced or not out of the box, you’re fucked unless you want to write custom Metal re-implementations of things like NMS so you can get access to outputs Apple's layers don’t supply.

Source: banging my head against the fucking wall


ThatInternetGuy t1_iummggq wrote

It's not that worth it.

People with RTX cards will often rent cloud GPU instances for training when inference/training requires more than 24GB of VRAM, which it often does. Also, sometimes we just need to shorten the training time with 8x A100s, so... yeah, renting seems to be the only way to go.


moist_buckets t1_iumgkeu wrote

I’ve found TensorFlow to work on the GPU if you install everything correctly. PyTorch didn’t work on the GPU for my project because some functions aren’t supported yet. It does work fine with the CPU though.
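For reference, a sketch of the commonly documented route for the GPU setup on Apple silicon (exact requirements have shifted between macOS/TF releases; some earlier versions also required a conda-installed `tensorflow-deps` package, so check the current instructions):

```shell
# Create a clean environment, then install Apple's TF build plus the Metal plugin
python -m venv tf-metal && source tf-metal/bin/activate
pip install tensorflow-macos   # Apple's macOS build of TensorFlow
pip install tensorflow-metal   # the Metal GPU acceleration plugin

# Quick sanity check that the GPU is visible to TensorFlow:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

If the last command prints an empty list, TensorFlow is falling back to CPU only.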


urstrulyabhiram t1_iunqcqo wrote

what are the steps of installation 😅 I use a Mac M1 and am unable to use TensorFlow


suflaj t1_ium3372 wrote

ML is easy since it's mostly on the CPU. DL still remains shit, unless your definition of prototyping is verifying that the shapes match, that the network can do backprop, and that it can save weights at the end of an epoch.
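That kind of shapes/backprop/checkpoint smoke test is at least cheap to automate on any CPU. A minimal sketch in PyTorch (the `nn.Sequential` stand-in model is hypothetical; swap in the real network under test):

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in; replace with the real model under test.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

x = torch.randn(8, 16)           # fake batch: only the shapes need to match
y = torch.randint(0, 4, (8,))

loss = F.cross_entropy(model(x), y)
loss.backward()                  # verifies the graph is differentiable

# every parameter received a gradient
assert all(p.grad is not None for p in model.parameters())

# weights survive a save/load round trip
path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
torch.save(model.state_dict(), path)
model.load_state_dict(torch.load(path))
```

This is exactly the "validation before the real training elsewhere" workflow OP describes, and it runs in seconds on CPU, so it works fine as a unit test.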

Things are not going to change fast unless Macs start coming with Nvidia CUDA capable GPUs.


laprika0 OP t1_ium613q wrote

Thanks. You differentiate ML from DL. Can you say what you mean by that in this context? Is working with DL a different experience than working with e.g. probabilistic modelling? Or do you mean e.g. tensorflow, pytorch, jax vs pandas, numpy, scikit-learn?


TheDeviousPanda t1_ium7iy4 wrote

Scikit-learn, numpy, pandas, xgboost etc. are totally fine to do on CPU, which is great on MacBooks. PyTorch, TensorFlow, JAX? Forget it. If anyone in the lab asks for help debugging on their local machine because the cluster is down, I just ignore it. Impossible to do prototyping on a Mac.


suflaj t1_ium7xcv wrote

ML is a superset of DL. Working on the two is very different, almost as if most ML rules and theory straight up do not apply to modern DL.


suedepaid t1_iumqiu8 wrote

Depends on exactly what “prototyping” means to you.

For just pushing layers around and stuff it’s fine because you can just use CPU and verify that your model compiles and batches flow, etc etc.

For like “train for 5 epochs and tweak hyperparams” it’s tough. You can wait out CPU-only training, or sometimes you can use the GPU in PyTorch, which is great when it works. But, like, the PyTorch LSTM layer is literally implemented wrong on MPS (that’s what the M1 GPU backend is called, the equivalent of “CUDA”). So you’ll get shape errors lol.
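One partial workaround for unimplemented MPS ops is PyTorch's documented `PYTORCH_ENABLE_MPS_FALLBACK` environment variable, which routes unsupported ops to the CPU (with a warning) instead of raising. It doesn't fix ops that are implemented *incorrectly*, only missing ones, and it has to be set before torch is imported. A sketch:

```python
import os

# Must be set before torch is imported; ops without an MPS kernel then
# fall back to CPU instead of raising NotImplementedError.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
import torch.nn as nn

mps = getattr(torch.backends, "mps", None)  # only present in torch >= 1.12
device = torch.device("mps") if mps is not None and mps.is_available() else torch.device("cpu")

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True).to(device)
out, _ = lstm(torch.randn(4, 10, 8, device=device))
print(out.shape)  # torch.Size([4, 10, 16])
```

The fallback makes things slower where it kicks in, so it's more of a "keep prototyping" switch than a performance fix.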

Basically it’s fairly unstable to train on anything but CPU.


Thalesian t1_iun1w8t wrote

TBH the value of the M1 Macs is the RAM. Because the CPU, GPU, and RAM are all on the same chip, you don’t have VRAM, just RAM. Not such a big deal on an M1 with 16 GB, but something to consider when thinking about the kinds of models you can build on an M1 Max with 64 GB or an M1 Ultra with 128 GB. The unified memory framework Apple uses is weird, and I haven’t seen aggressive testing of its limits. For example, does the M1 Ultra give the GPU 128 GB, or 64 GB per M1 Max component? Either way, if it is slower than a 3090, just use absurd batch sizes paired with a slightly faster learning rate.

That said, the big problem I’ve found is with OpenMP. It doesn’t adjust well to the dynamic CPU scheduling Apple wants, and this leads to frequent crashes and/or low thread counts on anything more complex than an M1/M2. With the RAM-CPU link, XGBoost is an absolute beast, until it crashes within a few minutes. There may be some leeway in setting the thread count lower, but hopefully development will catch up with the new chipsets, particularly since Intel is also pursuing the efficiency/performance core division.


new_name_who_dis_ t1_iumsvkq wrote

It depends. I use PyTorch and mostly it works. But sometimes I see a cool library, download it, try to run it, and get a C-level error saying "unsupported hardware", in which case I need to run that code on Linux.

I think it should be fine, since your laptop should just be a client for doing deep learning, not the server. Whenever you have problems, you can just test on a Linux machine.

I've personally never written code myself that throws the unsupported-hardware error, so it must be some specially accelerated code that only works with Intel or whatever. But yeah, this hasn't been an issue for writing code, only for trying to use other people's code (and even then it's pretty rare; it usually only happens when you clone from, like, Nvidia or something).


utopiah t1_iun6a16 wrote

This is not my field but I find this question genuinely surprising.

Why would one even consider this unless prototyping from the actual jungle?

In any other situation, where you have even just a 3G connection, delegating to the cloud (or your own on-premise machines available online behind a VPN) seems much more efficient as soon as you have any inference, and even more so training, to run.

Why do I find the question itself surprising? Because ML is a data-based field, so the question can be answered with a spreadsheet. Namely, your "model" would optimize for faster feedback, so you learn about your problem more quickly, with your hardware but also your time as the costs. If you do spend X hours tinkering with an M1 (or M2, or even "just" a 4090) versus an A100 in a random cloud, e.g. AWS or a local OVH, booting a generalist distribution like Ubuntu, versus dedicated setups, or even higher-level services like HuggingFace on their own infrastructure, then IMHO that does give you some insight.

Everything else seems anecdotal because others might not have your workflow.

TL;DR: no, unless they are minuscule models, and of course it's fine if you use it to SSH into remote machines, but IMHO you have to figure it out yourself, as we all have different needs.

PS: to clarify, and not to sound like an opinionated idiot: even though it's not my field, I have run and trained dozens of models locally and remotely.


muxamilian t1_iuniww3 wrote

There's Apple's "TensorFlow Metal Plugin", which allows running TensorFlow on Apple silicon's graphics chip. However, it's basically unusably buggy; I'd recommend staying away from it.

For example, tf.sort only sorts up to 16 values and overwrites the rest with -0. Apparently this hasn't been fixed in over a year.

Also, tf.random always returns the same random numbers.


lqstuart t1_iuouqng wrote

I'm a little confused by some of the answers here, not sure if it's because this sub skews towards academia/early career or maybe I'm just out of touch.

Pretty much anywhere in the industry your $3000 M1 Mac is going to be used for either opening up a tab in Chrome or, at most, the rigorous task of SSHing into a Linux VM. There are basically two reasons:

  • Large, public companies typically don't allow their user/production data to exist on any machine that has internet access--you can get in very deep shit or in some cases even fired for having it stored locally--so that's game over for a MBP automatically.
  • Most companies of any size will have some need for distributed training. That means you need a platform to schedule training jobs, or else your GPUs will sit idle and you'll go bankrupt. That means maintaining compatible versions of TF/PT, CUDA, and GCC, which eventually means building their own distribution of TF/PT. They're not going to bother building a separate distro for two different flavors of MBP floating around in addition to production. Often, your model code doesn't work in the absence of the platform SDKs, because, for example, they need to be able to authenticate with S3 or HDFS or wherever the data is stored.

I'm not sure that any company besides Apple will ever invest in Apple's M1 shit specifically, but nobody uses TPUs and that doesn't stop Google from pushing it every chance they get. However, a lot of the industry is getting more and more pissed at NVIDIA, which may in turn open things up to local development in the future.


laprika0 OP t1_iup4bqt wrote

These are interesting points. I think it depends on where in the stack I'll be. At my last place I spent most of my time building and testing abstract ML functionality that I never deployed to production myself (other teams did that) and could be tested on a CPU in a reasonable amount of time. I can imagine the "other team" worked with the restrictions you mention. In my next role, I may well wear both hats.


lqstuart t1_iuq94uz wrote

The roles where you do a little of both are the most fun! I used to do the algo work, now I work entirely on the infra side of things at one of the larger corps. We support some massive teams that have their own platform team between the AI devs and us, and also some smaller teams where the AI devs do it all themselves and just talk to us directly.

In all cases, where I am now and in my previous infra-only role, the AI teams were kinda stuck on our Linux shit for the reasons I described--specifically, you need to write stuff differently (or use an SDK that's tightly coupled to the underlying compute) for distributed training so there's no real point running it locally.

I personally REALLY miss the ability to develop and test locally with a real IDE, so I hope something changes--however, the trend is heading towards better remote development, not making stuff work on Mac.


Tricky_Nail_6659 t1_iupi571 wrote

TensorFlow works quite well on Apple silicon, since you can use the GPU.


NeffAddict t1_iumqkal wrote

Keras works well on my M1 Max Mac Studio. It hasn’t struggled with anything I throw at it.


5death2moderation t1_iumr1v1 wrote

As someone who actually owns an M1 and has a job running large models in the cloud: it's not nearly as bad as I was expecting. mps support in PyTorch is growing every day; most recently I have been able to finetune various sentence transformers and GPT-J at reasonable speeds (before pushing to GPUs in the cloud). If I were choosing the laptop, I would obviously go with Linux + GPU, but our mostly clueless executive chose the M1. The upside of the M1 is that I can use the 64GB of system memory for loading models, whereas the most GPU memory I could get in an Nvidia laptop is 16-24GB.


C0hentheBarbarian t1_iutruwo wrote

Hey, I was facing issues with sentence transformers and M1 (some missing layers not implemented for MPS). Could you tell me how you are getting around that?


suricatasuricata t1_iumxisa wrote

I tried fine-tuning GPT-2 on 1000 examples on my M1; I think it was supposed to take 10 hours versus 30 minutes on a V100 on GCloud. Inference was comparable. I think there is a way out by installing the right drivers, but honestly, what would be the point of that for someone who works in industry? I am not planning to run production code trained on a MBP, so I might as well develop cloud-first.


MateTheNate t1_iungke3 wrote

I have a Mac Studio with the M1 Max, and it is about a 2x boost over my Intel i5 MBP on CPU. I couldn't get Metal to work with TF; I think it's half precision only, and I didn't want to convert my model.

I think getting a M1 laptop and using Colab is probably a good choice. I could justify the paid subscription version if I needed the GPU compute.


1infiniteloop t1_iunkkee wrote

TensorFlow models seem to work well on the GPU (M1 Pro, 32 GB memory). With PyTorch I have had difficulty getting models to train on the GPU, and CPU training for PyTorch models has been very slow.


darklinux1977 t1_iupcn1b wrote

The M series is a generation or two behind the comparable Nvidia GPUs.


wojak386 t1_iuqh44r wrote

From my experience:

Inference is good enough; I am using a Mac mini M1 for it.
Training is hell.
Also, a lot of things are still not working, for example in PyTorch, so be prepared to fight hard with it.
It is cheaper and better to go with some serious, well-supported hardware.

My current setup - I have "expensive" local server for ML and stuff, I am connecting to it and work remotely, while my laptops are light and cheap.


C0hentheBarbarian t1_iuqlo3z wrote

I’ve been using an M1 for prototyping and have found a couple of issues with some PyTorch models. It’s a buggy mess at times, and even the CPU fallback sometimes doesn’t work. Quite a few things aren't implemented yet, and they show up decently often, as you can see in that GitHub issue.


extracheez t1_iuqw9zn wrote

I'm prototyping on an M1 Mac for work and it's not great. One of the main issues is that if you work in a team with multiple chip architectures, it adds a whole layer of complexity to dependency management just to collaborate.