Submitted by skypilotucb t3_yxui76 in MachineLearning
Announcing SkyPilot - an open-source framework to run ML and Data Science jobs on any cloud, seamlessly and cost-effectively. I’m a developer on the project and would love to hear your feedback.
Github: https://github.com/skypilot-org/skypilot
SkyPilot is motivated by the challenges in reducing cloud spend for ML workloads
Using the cloud for ML and Data Science is plenty hard. Trying to cut your costs makes it even harder:
- Want to use spot instances? That can add weeks of work to handle preemption.
- Want to stop leaving machines up when they’re idle? You’ll need to spin them up and down repeatedly, including environment and data setup and wrap-up.
- Want to queue jobs for an overnight run? You’ll need to implement job and log management.
- Want to leverage price differences between regions and cloud providers? You’ll need to re-architect all the features above for each cloud!
SkyPilot automates the heavy lifting of running jobs on the cloud
- Reliably provision a cluster, with automatic failover to other locations if capacity or quota errors occur
- Sync user code and files (from local, or cloud buckets) to the cluster
- Manage job queueing and execution
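To make the failover idea concrete, here is a minimal Python sketch of trying locations in order until one succeeds. It is not SkyPilot's actual code or API: `CapacityError`, `provision_with_failover`, `fake_provision`, and the location strings are all hypothetical names made up for illustration.

```python
class CapacityError(Exception):
    """Hypothetical error: a location has no capacity or quota left."""


def provision_with_failover(candidates, try_provision):
    """Try each candidate location in order and return the first success."""
    errors = []
    for location in candidates:
        try:
            return try_provision(location)
        except CapacityError as exc:
            # Remember why this location failed, then move on to the next one.
            errors.append((location, str(exc)))
    raise RuntimeError(f"all candidate locations failed: {errors}")


def fake_provision(location):
    """Stand-in for a real cloud call; two locations are pretended to be full."""
    if location in {"aws:us-east-1", "gcp:us-central1"}:
        raise CapacityError(f"no capacity in {location}")
    return {"location": location, "status": "UP"}


cluster = provision_with_failover(
    ["aws:us-east-1", "gcp:us-central1", "azure:eastus"],
    fake_provision,
)
print(cluster)
```

In this toy run the first two locations raise a capacity error, so the cluster ends up in the third one. A real provisioner would also have to tell retryable errors (capacity, quota) apart from fatal ones (bad credentials), which this sketch skips.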
SkyPilot substantially reduces your cloud bills, often by over 3x
- Automatically find the cheapest zone/region/cloud that offers the requested resources (~2x cost savings)
- Managed spot provides ~3–6x cost savings by using spot instances, with automatic recovery from preemptions
- Autostop automatically cleans up idle clusters — the top contributor to avoidable cloud overspending
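As a rough illustration of where savings like these can come from, the sketch below picks the cheapest offer from a small price table. Everything in it is an assumption for illustration: the `Offer` type, the `cheapest` helper, the cloud and region names, and especially the prices, which are made up and not real cloud pricing or SkyPilot's optimizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    cloud: str
    region: str
    hourly_usd: float  # made-up price, for illustration only
    spot: bool


def cheapest(offers, allow_spot):
    """Return the lowest-priced offer, optionally excluding spot instances."""
    eligible = [o for o in offers if allow_spot or not o.spot]
    if not eligible:
        raise ValueError("no eligible offers")
    return min(eligible, key=lambda o: o.hourly_usd)


# Hypothetical price table for one accelerator type.
OFFERS = [
    Offer("cloud-a", "region-1", 3.00, spot=False),
    Offer("cloud-b", "region-2", 1.50, spot=False),
    Offer("cloud-b", "region-2", 0.50, spot=True),
]

baseline = OFFERS[0]  # pretend this is where the job runs today
best_on_demand = cheapest(OFFERS, allow_spot=False)
best_spot = cheapest(OFFERS, allow_spot=True)

print(baseline.hourly_usd / best_on_demand.hourly_usd)  # 2.0
print(baseline.hourly_usd / best_spot.hourly_usd)       # 6.0
```

With these invented numbers, switching location alone halves the hourly price, and allowing spot cuts it further. Actual ratios depend on live prices and on how often spot instances are preempted.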
Here’s an example of using SkyPilot to train BERT using spot instances, transparently handling preemptions across regions and clouds and reducing cost by 3x:
https://i.imgur.com/Ujy251r.gif
More resources:
Fast-for-a-starfish t1_iwtwxpk wrote
Very interesting, do you know how much the training of BERT (in the gif) cost?