Submitted by johnhopiler t3_11a8tru in MachineLearning

Let's assume for a minute one has:

  • the necessary compute instances
  • enough $ to cough up to rent those instances somewhere

What are the latest "easy" solutions to get OPT, BLOOMZ and flan-t5 hosted as API endpoints?

I spent about 2 weeks trying to get seldon-core and MLServer to work with their Hugging Face wrapper. But I've lost hope at this point. There are so many parameters and tweaks one has to be mindful of, and I feel like I'm acting as a very crude operating-system replacement when I pass a device_map to a Python function to tell it how much RAM to use on which instance. In what world could Windows 95 manage 4 DIMMs of RAM, but in 2023 we cannot auto-assign model data to the right GPUs?
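For reference, the "crude" approach I'm describing looks roughly like this. A minimal sketch using accelerate's device_map="auto", assuming a multi-GPU box and a checkpoint that fits across the available devices (the model name is just an example):

```python
# Minimal sketch: let accelerate decide where the model's layers live,
# instead of hand-writing a device_map. Model name is an example only.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" spreads layers over the visible GPUs (and spills to
# CPU/disk if needed) rather than requiring a manual per-layer mapping.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

inputs = tokenizer("Translate to German: Hello, world.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```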

So. What's the "right way" to do this? I am aware of

Any pointers would be appreciated. We have a goal to get 2-3 models up and running as API endpoints in 2 weeks and I have a lot of ppl waiting for me to get this done...


Edit:

I am talking about self-hosted solutions where the inference input & output are "under your control".


Edit:

What about K8s + a Ray cluster + alpa.ai? After reading up on Ray (which feels like a Spark cluster for ML), this seems like the most industrialised version of everything I've seen so far; a rough serving sketch is below.
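If we go that route, the serving layer would presumably be Ray Serve. A minimal sketch of what a deployment could look like (the model, replica count, and GPU request are my assumptions, not a vetted config):

```python
# Minimal Ray Serve sketch: one GPU replica wrapping a transformers pipeline.
from ray import serve
from starlette.requests import Request
from transformers import pipeline


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class FlanT5:
    def __init__(self):
        # Model name is a placeholder; any text2text checkpoint works the same way.
        self.pipe = pipeline("text2text-generation", model="google/flan-t5-large", device=0)

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        out = self.pipe(payload["prompt"], max_new_tokens=64)
        return {"text": out[0]["generated_text"]}


app = FlanT5.bind()
# On an existing Ray / KubeRay cluster: `serve run my_module:app`
```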

10

Comments


Desticheq t1_j9qo0mu wrote

Hugging Face actually allows a fairly easy deployment process for models trained with their framework
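And if self-hosted is a hard requirement, even wrapping a transformers pipeline in a small web service gets you an endpoint. A minimal sketch (FastAPI, the route, and the model name are my choices, not the only option):

```python
# Minimal sketch of a self-hosted endpoint around a Hugging Face pipeline.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text2text-generation", model="google/flan-t5-large", device_map="auto")


class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64


@app.post("/generate")
def generate(req: GenerateRequest):
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"text": out[0]["generated_text"]}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```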

8

CKtalon t1_j9r2k9j wrote

Probably FasterTransformer with Triton Inference Server
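Once the FasterTransformer backend is serving, the client side is just a Triton request. A rough sketch with Triton's Python HTTP client (the model name "fastertransformer" and the tensor names are placeholders that must match your config.pbtxt):

```python
# Rough sketch of querying a FasterTransformer model behind Triton.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Token IDs and their length; values here are dummies for illustration.
input_ids = np.array([[818, 262, 3726, 373, 262, 1573]], dtype=np.uint32)
lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "UINT32"),
    httpclient.InferInput("sequence_length", list(lengths.shape), "UINT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(lengths)

result = client.infer("fastertransformer", inputs)
print(result.as_numpy("output_ids"))
```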

3

Desticheq t1_j9xiv9l wrote

Well, in terms of "out-of-the-box," I'm not sure what else could be better. AWS, Azure, or Google basically give you empty compute units, and you'd have to configure all the "Ops" stuff yourself: networking, security, load balancing, etc. That's not that difficult if you do it once in a while, but for a "test-it-and-forget-it" project it might be too much.

2