ThisIsMyStonerAcount t1_irx8urr wrote

So, in case you're not aware, matrix-matrix multiplication (GEMM) is THE workhorse of every BLAS implementation. I'm not too familiar with the Accelerate framework, but the really good implementations (e.g. MKL from Intel, or OpenBLAS) are extremely heavily optimized (as in: people have worked on this professionally for years as their main job). You're very unlikely to get close to their performance, and you shouldn't feel bad if they beat you by a lot.

I'd suggest giving OpenBLAS a whirl if you want the absolute top achievable speed. It's the best free BLAS implementation out there. For learning, googling "cache optimized gemm" will give you good starting points on the techniques used to get SOTA performance in matrix-matrix multiplication.
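
To give a rough idea of what "cache optimized" means in practice, here's a minimal sketch of loop tiling (blocking), which is usually the first technique those search results cover. It's plain Python so it's easy to read, which also means it won't show any real speedup; the point is only the loop structure. The function names and the block size of 32 are my own placeholders, not anything taken from OpenBLAS or MKL. Real implementations tune block sizes to the cache hierarchy and add packing and hand-written SIMD kernels on top.

```python
def matmul_naive(a, b):
    """Textbook triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=32):
    """Same result, computed tile by tile.

    Working on block x block tiles means each tile of A, B and C is
    reused many times while it is still in cache (in a compiled
    language). block=32 is an arbitrary placeholder, not a tuned value.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):            # tile rows of C / A
        for p0 in range(0, k, block):        # tile the shared dimension
            for j0 in range(0, m, block):    # tile columns of C / B
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], b[p]
                        # Innermost loop walks rows of B and C contiguously.
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    # A tiny block size forces partial tiles, so the edge handling is exercised.
    print(matmul_blocked(a, b, block=2))
    print(matmul_blocked(a, b, block=2) == matmul_naive(a, b))
```

If you port this to C or Swift and benchmark it against the naive loop on large matrices, you'd typically expect the blocked version to do better, but how much depends on your hardware and the block size, and it will still be far behind a tuned BLAS.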