ibmw t1_iu7zr9l wrote
At my previous company we used the Nvidia stack: Triton + ONNX + ONNX Runtime. It works well, but it took some engineering, because the models we used were not fully supported by ONNX, and we did some work so we could swap out some components (for example, replacing the Python/conda env with a more generic and faster solution).
In addition, we had some models that ran on CPU (without OpenVINO; we didn't actually have time to test it), and we used a k8s cluster to deploy and handle scaling. It worked, but we still needed to improve the inference time to match the use cases... I don't know if they have managed to tackle this part since I left.
Finally, we ran some benchmarks (Triton, KServe, TorchServe, SageMaker), and Triton (with the engineering work) gave us the best results for throughput. Throughput was our target, but I know we could have done the same for latency.
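To make the throughput vs. latency distinction concrete, here is a minimal sketch of the kind of harness you could use for such a comparison. It is not the benchmark we actually ran: `fake_infer` is a hypothetical stand-in for a call to whichever server you are testing, and `benchmark`, the request shape, and the concurrency value are all assumptions for illustration.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_infer(batch):
    """Hypothetical stand-in for a request to a model server."""
    time.sleep(0.001)  # pretend the server takes about 1 ms per request
    return [x * 2 for x in batch]


def benchmark(infer, requests, concurrency):
    """Send all requests with a fixed number of workers and report
    overall throughput plus per-request latency statistics."""

    def timed(batch):
        start = time.perf_counter()
        infer(batch)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, requests))
    wall = time.perf_counter() - wall_start

    items = sum(len(batch) for batch in requests)
    ordered = sorted(latencies)
    return {
        "throughput_items_per_s": items / wall,
        "mean_latency_s": statistics.mean(latencies),
        # Simple nearest-rank style p95, good enough for a rough comparison.
        "p95_latency_s": ordered[int(0.95 * (len(ordered) - 1))],
    }


if __name__ == "__main__":
    stats = benchmark(fake_infer, [[1, 2, 3]] * 200, concurrency=8)
    for name, value in stats.items():
        print(f"{name}: {value:.4f}")
```

Raising the concurrency or the batch size will usually push throughput up while per-request latency gets worse, which is why you need to pick one as the target before comparing servers.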
big_dog_2k OP t1_iu86mb7 wrote
Thanks! It sounds like investing time in ONNX and using Triton is the best bet.