Submitted by skypilotucb t3_yxui76 in MachineLearning

Announcing SkyPilot - an open-source framework to run ML and Data Science jobs on any cloud, seamlessly and cost-effectively. I’m a developer on the project and would love to hear your feedback.

GitHub: https://github.com/skypilot-org/skypilot

SkyPilot is motivated by the challenges in reducing cloud spend for ML workloads

Using the cloud for ML and Data Science is plenty hard. Trying to cut your costs makes it even harder:

  • Want to use spot instances? That can add weeks of work to handle preemption.
  • Want to stop leaving machines up when they’re idle? You’ll need to spin them up and down repeatedly, including environment and data setup and wrap-up.
  • Want to queue jobs for an overnight run? You’ll need to implement job and log management.
  • Want to leverage price differences between regions and cloud providers? You’ll need to re-architect all the features above for each cloud!

SkyPilot automates the heavy lifting of running jobs on the cloud

  • Reliably provision a cluster, with automatic failover to other locations if capacity or quota errors occur
  • Sync user code and files (from local, or cloud buckets) to the cluster
  • Manage job queueing and execution (a minimal usage sketch follows this list)
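
For a concrete picture, here is a minimal sketch using SkyPilot's Python API. The training script, requirements file, GPU request, and cluster name are placeholders I made up for illustration; the YAML + CLI interface in the repo expresses the same thing:

```python
import sky

# Describe the job: environment setup, the command to run, and the local
# working directory to sync to the cluster. All values are placeholders.
task = sky.Task(
    setup='pip install -r requirements.txt',
    run='python train.py',
    workdir='.',
)
task.set_resources(sky.Resources(accelerators='V100:1'))

# Provision a cluster (failing over to other zones/regions/clouds on capacity
# or quota errors), sync the workdir, and run the task. Further jobs sent to
# the same cluster are queued and executed by SkyPilot's job queue.
sky.launch(task, cluster_name='bert-dev')
```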

SkyPilot substantially reduces your cloud bills, often by over 3x

  • Automatically find the cheapest zone/region/cloud that offers the requested resources (~2x cost savings)
  • Managed spot provides ~3–6x cost savings by using spot instances, with automatic recovery from preemptions
  • Autostop automatically cleans up idle clusters, the top contributor to avoidable cloud overspending (spot and autostop usage is sketched below)
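
As a rough sketch of how the spot and autostop pieces fit together (I am assuming the `use_spot` flag on `Resources` and the `idle_minutes_to_autostop` argument to `launch`; check the docs for the exact names):

```python
import sky

# Placeholder task, as above, but requesting a spot V100 instead of on-demand.
task = sky.Task(run='python train.py', workdir='.')
task.set_resources(sky.Resources(accelerators='V100:1', use_spot=True))

# Launch, and have the cluster stop itself after 10 idle minutes so a
# forgotten cluster doesn't keep accruing charges (the autostop feature).
sky.launch(task, cluster_name='spot-dev', idle_minutes_to_autostop=10)
```

The managed spot feature described in the post (automatic recovery from preemptions) sits on top of this; in the CLI it is invoked with `sky spot launch`, if I recall the command correctly.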

Here’s an example of using SkyPilot to train BERT on spot instances, transparently handling preemptions across regions and clouds and reducing cost by 3x:

https://i.imgur.com/Ujy251r.gif

Comments

Fast-for-a-starfish t1_iwtwxpk wrote

Very interesting! Do you know how much the training of BERT (in the gif) cost?

skypilotucb OP t1_iwvqqgj wrote

Thanks for your question! Training BERT with SkyPilot's managed spot feature cost $18.40 and took 21 hours. Running the same job with on-demand AWS instances cost $61.20 (>3x more) and took 20 hours.

Note that both jobs were run on the same GPU (V100), and the cost and time reported for SkyPilot include the data transfer costs for moving checkpoints and all the overhead associated with restarting jobs.

bmsan-gh t1_iwu6v4s wrote

Very interesting project! Thanks for sharing.

Datasets today can get really big; I could imagine cases where 20-100 GB archives of data would need to be downloaded for training. So you might get download waiting times from tens of minutes up to a few hours.

Do you factor in, or have you thought about factoring into your cost metrics, the overhead created by data transfers? (My reasoning might not be correct, but I am assuming that you also need to pay for the time you spend downloading your data to a new provider.)

skypilotucb OP t1_iwvr4s0 wrote

That's a great question! SkyPilot uses an optimizer to make cost-aware decisions on where to run tasks and when to move data. It accounts for both data egress costs and the time taken to transfer data.

To avoid long download times, SkyPilot also allows direct access to cloud object stores (S3/GCS) by mounting them as a file system on your VM.

With this mounting feature, you can read/write directly to an object store as you would access regular files on your machine, without having to download to disk first. The cost of downloading files thus gets amortized over the execution of your job, and our users have reported it's usually not a bottleneck, since it can be parallelized with other steps to effectively hide the download time (e.g., you can prefetch the data for the next minibatch directly from S3 while the current batch runs on the GPU).
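
To make the pattern concrete, here is a rough sketch of the consumer side (the bucket, the `/data/shards` mount point, and the `.pt` shard layout are hypothetical; the mount itself would be declared on the task, e.g. via a `file_mounts` entry with `mode: MOUNT`). The mounted path is read like a local directory, and the DataLoader's workers prefetch upcoming samples while the GPU works on the current batch:

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical setup: SkyPilot has mounted s3://my-bucket at /data on the VM,
# so the plain file I/O below actually streams bytes from S3 on demand.
MOUNT_DIR = "/data/shards"


class ShardDataset(Dataset):
    """Reads tensors saved as individual .pt files under the mounted bucket."""

    def __init__(self, root):
        self.paths = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Ordinary file read; no explicit download step is needed.
        return torch.load(self.paths[idx])


# num_workers > 0 lets the loader fetch upcoming samples from the mount while
# the current batch is being processed on the GPU, hiding the transfer time.
loader = DataLoader(ShardDataset(MOUNT_DIR), batch_size=32, num_workers=4)

for batch in loader:
    pass  # training step would go here
```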

Acceptable-Cress-374 t1_iwyekoq wrote

Do you have any plans to support other, smaller vendors? I keep seeing very cheap spot instances on RunPod (they seem to be recommended a lot on r/stablediffusion), but they seem to have a setup where they run Docker containers instead of actual VMs. Their spot prices seem pretty low compared to the "big 3" ($0.197/hr for an A5000).

skypilotucb OP t1_iwzzu87 wrote

Absolutely! We're planning on adding support for smaller and cheaper cloud vendors (RunPod included). If this is something you'd like to see prioritized, I would encourage you to open a GitHub issue!
