Comments


kkchangisin t1_j5gcgbe wrote

Nice work! Triton already looks good but have you tried optimizing with the Triton Model Analyzer?

https://github.com/triton-inference-server/model_analyzer

For various models I use with Triton, I've found that the output model formats and configurations it produces can provide drastically improved performance, whether the goal is throughput, latency, or something else.
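
If it helps, the profiling step is roughly a one-liner (the repository path and model name below are just placeholders, not the ones from this repo):

    model-analyzer profile \
        --model-repository /path/to/model_repository \
        --profile-models <model_name>

It sweeps candidate configurations (instance counts, dynamic batching settings, etc.) against a running Triton instance and reports measured throughput and latency for each one.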

Hopefully I get some time soon to try it out myself!

Again, nice work!

5

op_prabhuomkar OP t1_j5i7oyj wrote

Thank you for the feedback. I am looking forward to trying Triton's Model Analyzer, possibly with different batch sizes and also FP16! Let's see how that goes :)
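
For FP16 I'm thinking of something along these lines when exporting the TorchScript model (a torchvision ResNet is used here purely as a stand-in for whatever model the repo actually benchmarks):

    import torch
    import torchvision

    # Stand-in model; the weights don't matter for a latency/throughput test.
    model = torchvision.models.resnet50().eval().cuda().half()

    # Trace with an FP16 example input so the saved model expects half precision.
    example = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.half)
    traced = torch.jit.trace(model, example)
    traced.save("model.pt")

The matching config.pbtxt would also need its input/output data types switched to TYPE_FP16.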

2

kkchangisin t1_j5if8hc wrote

Depending on how much time I have, there just might be a PR coming your way 😀…

Triton is really a somewhat hidden gem - the implementation and the toolkit surrounding it are pretty impressive!

2

NovaBom8 t1_j5h30af wrote

Very cool, great work!!

In the context of running .pt (or any other device-agnostic file formats), I'm guessing dynamic batching is the reason for Triton's superior throughput?

3

kkchangisin t1_j5ijvdy wrote

Looking at the model configs in the repo, there's definitely dynamic batching going on.

I think what’s really interesting is that even with the default dynamic batching parameters, the response times are superior and very consistent.
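
For reference, enabling it with defaults is just an empty dynamic_batching block in config.pbtxt, roughly like this (the name and platform are placeholders, not necessarily what the repo uses):

    name: "benchmark_model"
    platform: "pytorch_libtorch"
    max_batch_size: 8

    # An empty block enables dynamic batching with default queue delay and preferred sizes.
    dynamic_batching { }

preferred_batch_size and max_queue_delay_microseconds are the knobs Model Analyzer can then sweep to squeeze out more.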

3

Ok_Two6167 t1_j5jrd8u wrote

Hello u/op_prabhuomkar,

That's a super cool test! Any chance you can compare it to the HTTP API as well?

1

op_prabhuomkar OP t1_j5k0h1j wrote

It’s actually easier to do for HTTP; I will probably take that as a TODO. Thanks for the suggestion!
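
The HTTP path should be almost a drop-in swap of the client module; a minimal sketch (the model and tensor names here are made up, the real ones come from the model config):

    import numpy as np
    import tritonclient.http as httpclient

    # Triton's default HTTP port is 8000 (gRPC is 8001).
    client = httpclient.InferenceServerClient(url="localhost:8000")

    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inp = httpclient.InferInput("input__0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)

    result = client.infer(model_name="benchmark_model", inputs=[inp])
    print(result.as_numpy("output__0").shape)

Same request/response flow as the gRPC client, just tritonclient.http instead of tritonclient.grpc.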

1