Ulfgardleo t1_ir7508n wrote
Reply to comment by ReginaldIII in [R] Discovering Faster Matrix Multiplication Algorithms With Reinforcement Learning by EducationalCicada
No, because these algorithms are terribly inefficient to implement as SIMD. They have nasty data access patterns and need many more FLOPS once you also take additions into account (just the final step of adding the intermediate products into the result matrix already takes more than twice the additions of a standard matmul for the results shown here).
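To see where the additions come from, here is a quick back-of-the-envelope sketch (classical Strassen bookkeeping, not the algorithms from the paper, which trade even more additions per saved multiplication): one level of Strassen on 2x2 blocks does 7 block products plus 18 block additions, so at small sizes the total FLOP count is actually worse than the standard algorithm.

```python
# Back-of-the-envelope FLOP counts: standard matmul vs. one level of classical
# Strassen (7 block products + 18 block additions). Illustrative only; the
# algorithms in the paper have different (larger) addition counts.

def standard_ops(n: int) -> tuple[int, int]:
    """(multiplications, additions) of the standard n x n matmul."""
    return n**3, n**2 * (n - 1)

def one_level_strassen_ops(n: int) -> tuple[int, int]:
    """One Strassen level on 2x2 blocks of size n/2, standard matmul inside."""
    m = n // 2
    mul, add = standard_ops(m)            # cost of one block product
    return 7 * mul, 7 * add + 18 * m * m  # plus 18 extra block additions

for n in (2, 4, 16, 64):
    sm, sa = standard_ops(n)
    tm, ta = one_level_strassen_ops(n)
    print(f"n={n:3d}  standard: {sm + sa:7d} flops   strassen: {tm + ta:7d} flops "
          f"(adds {sa} -> {ta})")
```

For n=2 and n=4 the one-level scheme needs more total operations than the standard algorithm; the pure FLOP count only breaks even around n=16.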
neanderthal_math t1_ir7l0k3 wrote
In practice, do libraries like CUDA and MKL do matrix multiplication the standard way, or do they use fancy decompositions?
I remember when I was young, the ATLAS library would look at your hardware, run a bunch of matmuls, and figure out what the "optimal" configuration would be for your system.
Ulfgardleo t1_ir7lytl wrote
All standard unless the matrices are very large. ATLAS just picks different kernels that "only" change the order of operations to maximize CPU utilization.
Red-Portal t1_ir7xeyo wrote
The funny thing is that the lesson of ATLAS and OpenBLAS was that matrix multiplication hand-optimized down to the assembly level is still the best way to squeeze out performance.
harharveryfunny t1_ira5afy wrote
cuDNN supports Winograd on CUDA cores (not sure about Tensor cores) for convolution, but only for certain filter sizes such as 3x3.
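For anyone curious what the Winograd trick looks like, here is a minimal 1D sketch of F(2,3), which computes two outputs of a 3-tap filter with 4 multiplications instead of 6. cuDNN's 3x3 kernels use the 2D analogue of this; the code below is just the textbook transform, not cuDNN's implementation.

```python
# Minimal sketch of 1D Winograd F(2,3): two outputs of a 3-tap filter with
# 4 multiplications instead of 6 (textbook transform, not cuDNN's code).
import numpy as np

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d, g = np.random.rand(4), np.random.rand(3)
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```

Note that the filter-side transforms can be precomputed once per filter, which is part of why this pays off for fixed 3x3 convolution kernels.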
Thorusss t1_ir9pbcd wrote
So why is matrix multiplication faster with it?
>Leveraging this diversity, we adapted AlphaTensor to specifically find algorithms that are fast on a given hardware, such as Nvidia V100 GPU, and Google TPU v2. These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware, which showcases AlphaTensor’s flexibility in optimising arbitrary objectives.
Are you saying it would be slower if it had to multiply multiple matrices of the same dimension one after the other?
Ulfgardleo t1_ir9xy3t wrote
You seem to be confused.
- Experiment 1 uses small 5x5 matrices, not block matrices, and there they only count the number of multiplications. These are not faster than SIMD implementations of 5x5 matrix mults, otherwise they would have shown that off proudly.
- Experiment 2 was about 4x4 block matrices. But here "10-20% faster than the commonly used algorithms" is actually an overstatement of the results: for GPUs, their implementation is only about 5% faster than their default JAX implementation of Strassen. The difference on TPU could just mean that their JAX compiler sucks for TPUs. (//Edit: by now I low-key assume the 10-20% refers to standard cBLAS, because I do not get 20% over Strassen for any result in Figure 5 (and how could they, when they never even get more than 20% improvement over cBLAS).)
- They do not cite any of the papers concerned with efficient implementations of Strassen, especially the memory-efficient scheme from 1994 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.6887). It is unclear whether a GPU implementation of that would be faster, since they do not even discuss the GPU implementation of their Strassen variant. They do not claim better asymptotic complexity for their algorithm, so we are completely reliant on their implementation of Strassen being sensible.
mgostIH t1_ir7euf7 wrote
You can apply it at the top-level call of your matrix multiplication and do everything inside the standard way; you still get the efficiency gain, since these algorithms also work on block matrices.
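As a sketch of what that looks like (classical Strassen used for illustration; the paper's 4x4 algorithms follow the same block idea with different coefficients): split once into 2x2 blocks, form the seven block products with ordinary matmuls, and combine.

```python
# One level of classical Strassen on 2x2 blocks; the block products themselves
# are done the standard way via @. Illustration only, not the paper's algorithm.
import numpy as np

def strassen_one_level(A, B):
    """A, B: (n, n) with n even; returns A @ B using 7 block products."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

A, B = np.random.rand(512, 512), np.random.rand(512, 512)
assert np.allclose(strassen_one_level(A, B), A @ B)
```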
Ulfgardleo t1_ir7m5md wrote
Is it? I could not see from the paper whether they assume non-commutative multiplication in their small-matrix optimization.
//Edit: they do give a 4x4 block-matrix algorithm, but the gains are less than 5% over the existing Strassen algorithm.