Comments


nikola-b t1_j9mdw5s wrote

Might not be what you want, but you can use our hosted flan-t5 models at deepinfra.com. That way you can just call them as an API, even flan-t5-xxl. Disclaimer: I work at Deep Infra.
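For illustration, here is a minimal sketch of what calling a hosted model over HTTP could look like. The endpoint path, token variable, and payload shape below are assumptions for the sketch, not Deep Infra's documented API, so check their docs for the actual interface.

```python
import os
import requests

# Hypothetical endpoint and payload: the real Deep Infra API may differ; see deepinfra.com docs.
API_TOKEN = os.environ["DEEPINFRA_API_TOKEN"]  # assumed env var name
url = "https://api.deepinfra.com/v1/inference/google/flan-t5-xxl"  # assumed path

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"input": "Translate to German: The house is wonderful."},
)
response.raise_for_status()
print(response.json())  # response schema depends on the service
```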

1

guillaumekln t1_j9nfl9t wrote

You can also check out the CTranslate2 library, which supports efficient inference of T5 models, including 8-bit quantization on CPU and GPU. There is a usage example in the documentation.

Disclaimer: I’m the author of CTranslate2.
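A rough sketch of that workflow, based on the documented T5 support: convert the Hugging Face checkpoint with the CTranslate2 converter (with int8 quantization), then run generation through a `Translator`. Model names and paths here are placeholders; see the CTranslate2 documentation for the exact example.

```python
# First convert the checkpoint (int8 quantization), e.g.:
#   ct2-transformers-converter --model google/flan-t5-xl --output_dir flan-t5-xl-ct2 --quantization int8
import ctranslate2
import transformers

translator = ctranslate2.Translator("flan-t5-xl-ct2", device="cuda")  # or device="cpu"
tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-t5-xl")

input_text = "Translate English to German: The house is wonderful."
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))

results = translator.translate_batch([input_tokens])

output_tokens = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens)))
```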

5

machineko t1_ja4jubd wrote

Inference acceleration involves accuracy / latency / cost trade-offs, plus how much money and time you are willing to spend to speed things up. Is your goal real-time inference? Can you tolerate a 2-3% accuracy hit? What compute resource will the model run on? Is it in the cloud, and do you have access to GPUs? Certain inference optimization techniques, for example, will only run on newer and more expensive GPUs.

For a highly scalable, low-latency deployment, you'd probably want to do model compression. Once you have a compressed model, you can optimize inference further using TensorRT and/or other compilers and kernel libraries. Happy to share more thoughts; feel free to reply here or DM me with more details.
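As one concrete example of a compression step (the commenter may also mean pruning or distillation), the sketch below loads a flan-t5 checkpoint with 8-bit weight quantization via Hugging Face transformers and bitsandbytes, trading a small accuracy hit for roughly half the GPU memory. The checkpoint name is a placeholder; TensorRT compilation would be a separate step not shown here.

```python
# Sketch: 8-bit weight quantization as one example of model compression.
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-xl"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",   # place layers on the available GPU
    load_in_8bit=True,   # int8 weights via bitsandbytes: ~2x less memory, small accuracy hit
)

inputs = tokenizer(
    "Summarize: The quick brown fox jumps over the lazy dog.",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```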

1

_learn_faster_ OP t1_ja6zovh wrote

We have GPUs (e.g. an A100) but can only use one GPU per request (no multi-GPU). We are also willing to take a bit of an accuracy hit.

Let me know what you think would be best for us.

When you say compression, do you mean things like pruning and distillation?

1