
machineko t1_ja4jubd wrote

Inference acceleration involves accuracy / latency / cost trade-offs, plus how much money and engineering time you're willing to spend to speed things up. Is your goal real-time inference? Can you tolerate a 2-3% accuracy hit? What compute will the model run on? Are you on the cloud with access to whatever GPUs you want? For example, certain inference optimization techniques only run on newer, more expensive GPUs.
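To make that last point concrete, here's a minimal sketch (assuming PyTorch with CUDA installed; the A100/FP8 notes in the comments are general facts, not specifics of your setup) for checking what the target GPU actually supports before picking an optimization path:

```python
# Check the target GPU's compute capability, since some optimizations
# (e.g. FP8 kernels, certain TensorRT tactics) require newer architectures.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{name}: compute capability {major}.{minor}")
    # A100 is 8.0 (Ampere); FP8 tensor cores arrive with 8.9/9.0 (Ada/Hopper).
else:
    print("No CUDA GPU visible")
```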

For a highly scalable, low-latency deployment, you'd probably want to start with model compression. Once you have a compressed model, you can optimize inference further with TensorRT and/or other compilers and kernel libraries (rough sketch below). Happy to share more thoughts; feel free to reply here or DM me with more details.
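A minimal sketch of that "compress, then compile" flow, assuming PyTorch and torch-tensorrt are installed and a CUDA GPU is available. The toy model and input shapes are placeholders, not your actual workload:

```python
import torch
import torch.nn as nn
import torch_tensorrt

# Placeholder model standing in for the real network.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).eval()

# Step 1: compression. Here, post-training dynamic INT8 quantization of the
# Linear layers (pruning or distillation are alternative compression choices).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Step 2: inference optimization with a compiler such as TensorRT.
# Dynamic-quantized modules run on CPU, so here we compile an FP16 variant
# of the model for the GPU instead.
trt_model = torch_tensorrt.compile(
    model.half().cuda(),
    inputs=[torch_tensorrt.Input((1, 1024), dtype=torch.half)],
    enabled_precisions={torch.half},
)

with torch.no_grad():
    out = trt_model(torch.randn(1, 1024, dtype=torch.half, device="cuda"))
    print(out.shape)
```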

1

_learn_faster_ OP t1_ja6zovh wrote

We have GPUs (e.g. A100s) but can only use one GPU per request (no multi-GPU inference). We're also willing to take a small accuracy hit.

Let me know what you think would be best for us.

When you say compression, do you mean things like pruning and distillation?
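For context, here's a minimal sketch of what one of those options, unstructured magnitude pruning, looks like in PyTorch (the layer size and 50% sparsity are purely illustrative; distillation and structured pruning are other common choices):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4096, 4096)

# Zero out the 50% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Make the pruning permanent (removes the mask, bakes zeros into the weight).
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity: {sparsity:.2%}")
```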

1