programmerChilli t1_j7tpwd7 wrote

Lots of things. You can see a flamegraph here:

The dispatcher is about 1us, but there's a lot of other work that needs to happen: inferring dtypes, error checking, building the op, allocating output tensors, etc.
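To make the overhead sources above concrete, here's a toy sketch of what a framework does around the actual kernel. This is not PyTorch's real dispatcher; all names (`add_op`, `promote_dtype`) are made up for illustration, but the steps mirror the list above: error checking, dtype inference, output allocation, then the kernel itself.

```python
# Toy model of per-op framework overhead. Every call pays for
# checking, dtype inference, and allocation before any math runs.

def promote_dtype(a_dtype, b_dtype):
    # Toy dtype inference: promote to the "wider" of the two dtypes.
    order = {"int32": 0, "float32": 1, "float64": 2}
    return a_dtype if order[a_dtype] >= order[b_dtype] else b_dtype

def add_op(a, a_dtype, b, b_dtype):
    # 1. Error checking: shapes must match.
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    # 2. Infer the output dtype from the inputs.
    out_dtype = promote_dtype(a_dtype, b_dtype)
    # 3. Allocate the output "tensor".
    out = [0] * len(a)
    # 4. Finally, the actual kernel: elementwise add.
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out, out_dtype

result, dtype = add_op([1, 2, 3], "int32", [0.5, 0.5, 0.5], "float32")
```

For a tiny op, steps 1-3 can cost as much time as step 4, which is why per-op overhead matters so much for small tensors.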


programmerChilli t1_iufqn15 wrote

Well, you disclosed who you are, but that's pretty much all you did :P

The OP asked a number of questions, and you didn't really answer any of them. You didn't explain what BentoML can offer, you didn't explain how it can speed up inference, you didn't really even explain what BentoML is.

Folks will tolerate "advertising" if it comes in the form of interesting technical content. However, you basically just mentioned your company and provided no technical content, so it's just pure negative value from most people's perspective.


programmerChilli t1_is7vgbp wrote

I mean... it's hard to write efficient matmuls :)

But... recent developments (e.g. CUTLASS and Triton) do allow NN frameworks to write efficient matmuls, so I think you'll start seeing them being used to fuse other operators with matmuls :)

You can already see some of that being done in projects like AITemplate.
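To sketch what "fusing other operators with matmuls" means: instead of writing the matmul result to memory and then reading it back to apply, say, a bias and ReLU, the fused version applies the epilogue while each output element is still in registers. This is a pure-Python toy under my own assumptions, not CUTLASS or Triton code; the function names are illustrative.

```python
# Unfused vs. fused matmul epilogue (bias + ReLU), toy version.

def matmul_then_relu(A, B, bias):
    # Unfused: materialize C = A @ B, then a second pass over C
    # for bias + ReLU (extra memory traffic for C).
    n, k, m = len(A), len(B), len(B[0])
    C = [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
         for i in range(n)]
    return [[max(C[i][j] + bias[j], 0.0) for j in range(m)]
            for i in range(n)]

def fused_matmul_relu(A, B, bias):
    # Fused: apply bias + ReLU to each accumulator before storing it,
    # so the intermediate C is never written out.
    n, k, m = len(A), len(B), len(B[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            out[i][j] = max(acc + bias[j], 0.0)  # epilogue, fused in
    return out

A = [[1.0, -2.0], [0.0, 3.0]]
B = [[1.0, 0.0], [0.0, 1.0]]   # identity, so A @ B == A
bias = [0.5, -10.0]
assert fused_matmul_relu(A, B, bias) == matmul_then_relu(A, B, bias)
```

The fused version saves a full write and re-read of the intermediate matrix, which is exactly the kind of memory-traffic win that matters at inference time.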

I will note one other thing though - fusing operators with matmuls is not as big of a bottleneck in training; this optimization primarily helps in inference.