Submitted by fgp121 t3_yhjpo2 in MachineLearning

I'm building an application that runs AI model inference on GPU servers. Based on the demand profile (number of requests and GPU utilization), I want to autoscale the GPU servers up/down.

I don't want to use Kubernetes for orchestration/autoscaling, as it is overkill for my application, which is pretty experimental right now.

Also, I don't need the full MLOps lifecycle management, as I'm using an open-source model that doesn't need frequent updates.

All I'm currently looking for is suggestions on how I should go about implementing a simple approach for scaling GPU servers based on incoming demand (e.g. requests/min or GPU utilization).
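To make it concrete, here's the rough shape of the control loop I'm imagining. Everything below is a placeholder sketch, not working infrastructure: the metric functions and `set_server_count` are stubs standing in for whatever metrics source and cloud API I end up using.

```python
import random
import time

MIN_SERVERS, MAX_SERVERS = 1, 8
REQS_PER_SERVER = 60      # assumed capacity: requests/min one GPU server can absorb
SCALE_UP_UTIL = 0.80      # scale up above 80% average GPU utilization
SCALE_DOWN_UTIL = 0.30    # scale down below 30% average GPU utilization

# --- placeholders: swap these for a real metrics source and cloud API ---
def get_requests_per_min() -> float:
    return random.uniform(0, 400)       # stub metric

def get_avg_gpu_util() -> float:
    return random.uniform(0.0, 1.0)     # stub metric

current_servers = 1

def set_server_count(count: int) -> None:
    global current_servers
    print(f"scaling from {current_servers} to {count} servers")
    current_servers = count
# -------------------------------------------------------------------------

def desired_count(current: int, reqs_per_min: float, gpu_util: float) -> int:
    # Size primarily on request rate, then nudge based on GPU utilization.
    target = max(1, round(reqs_per_min / REQS_PER_SERVER))
    if gpu_util > SCALE_UP_UTIL:
        target = max(target, current + 1)
    elif gpu_util < SCALE_DOWN_UTIL:
        target = min(target, current - 1)
    return min(MAX_SERVERS, max(MIN_SERVERS, target))

while True:
    target = desired_count(current_servers, get_requests_per_min(), get_avg_gpu_util())
    if target != current_servers:
        set_server_count(target)
    time.sleep(60)  # evaluate once a minute to avoid thrashing
```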

3

Comments


alibrarydweller t1_iuedvyb wrote

You might look at Nomad -- it manages containers like K8s, but it's significantly simpler. We run GPU jobs on it, although we don't currently autoscale.
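If you do end up wanting autoscaling, you can drive it yourself with a small script against Nomad's job scale HTTP endpoint. Rough sketch below, written from memory, so treat the endpoint and payload as assumptions to verify against the Nomad API docs for your version (the address, job, and group names are made up):

```python
# Untested sketch: bumping a Nomad job group's count via the HTTP API.
import requests

NOMAD_ADDR = "http://127.0.0.1:4646"   # default Nomad address; adjust to yours
JOB_ID = "gpu-inference"               # hypothetical job name
GROUP = "workers"                      # hypothetical task group name

def scale_group(count: int) -> None:
    resp = requests.post(
        f"{NOMAD_ADDR}/v1/job/{JOB_ID}/scale",
        json={
            "Count": count,
            "Target": {"Group": GROUP},
            "Message": "scaled from external autoscaling script",
        },
        timeout=10,
    )
    resp.raise_for_status()

scale_group(3)  # e.g. scale the group to three GPU workers
```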

3

EnvironmentOptimal98 t1_iuepik7 wrote

PM me with your project details, and I'll give you a ton of tips if you're not making a competing project.

2

Crazy-Space5384 t1_iuf6325 wrote

Virtualized or bare metal? Running on a cloud provider or on your own premises?

1

m98789 t1_iug9ma9 wrote

I think the simplest approach is just to set up GPU-enabled VMs with your cloud provider's auto-scale option (like scale sets), which can respond to HTTP traffic "triggers" to create more or fewer of the same VMs in a pool.

When a VM comes online, it runs an auto-start action that pulls and runs your container, joining the load-balanced pool of workers.

As a starting point to learn more about this approach (Azure link, but other providers are similar):

https://azure.microsoft.com/en-us/products/virtual-machine-scale-sets/#overview

I suggest VMs as the simplest approach rather than your cloud provider's serverless container instance infra, because that usually lacks GPU support or offers it only in limited or experimental form, which adds complexity. A VM approach is about as simple as it gets.
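If you'd rather drive the instance count yourself instead of relying only on the built-in autoscale rules, the scale set capacity can also be set programmatically. Here's a rough sketch with the azure-mgmt-compute Python SDK; method names can shift between SDK versions and the resource names are placeholders, so treat it as a starting point rather than a recipe:

```python
# Rough sketch: setting a VM scale set's instance count with the Azure Python SDK.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "your-subscription-id"   # placeholder
RESOURCE_GROUP = "gpu-inference-rg"        # placeholder
VMSS_NAME = "gpu-workers"                  # placeholder

client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

def set_capacity(count: int) -> None:
    vmss = client.virtual_machine_scale_sets.get(RESOURCE_GROUP, VMSS_NAME)
    vmss.sku.capacity = count
    # begin_create_or_update returns a poller; .result() waits for completion
    client.virtual_machine_scale_sets.begin_create_or_update(
        RESOURCE_GROUP, VMSS_NAME, vmss
    ).result()

set_capacity(4)  # e.g. grow the pool to four GPU VMs
```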

1