skypilotucb
skypilotucb OP t1_iwvr4s0 wrote
Reply to comment by bmsan-gh in [P] SkyPilot: ML on any cloud with massive cost savings by skypilotucb
That's a great question! SkyPilot uses an optimizer to make cost-aware decisions on where to run tasks and when to move data. It accounts for both data egress costs and the time taken to transfer data.
To avoid long download times, SkyPilot also allows direct access to cloud object stores (S3/GCS) by mounting them as a file system on your VM.
With this mounting feature, you can read from and write to an object store as you would regular files on your machine, without having to download them to disk first. The cost of downloading files is thus amortized over the execution of your job, and our users have reported that it's usually not a bottleneck, since it can be parallelized with other steps to effectively hide the time cost of downloading data (e.g., you can prefetch the data for the next minibatch directly from S3 while the current batch runs on the GPU).
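To make the prefetching idea concrete, here is a rough, generic Python sketch of overlapping data loading with compute. It is not SkyPilot's API: `prefetch` and `load_batch` are hypothetical names made up for illustration, and `load_batch` just stands in for reading a file from a mounted bucket path.

```python
import queue
import threading


def prefetch(batches, depth=2):
    """Yield items from `batches`, loading up to `depth` items ahead
    in a background thread so loading overlaps with the consumer's work."""
    q = queue.Queue(maxsize=depth)
    sentinel = object()  # marks the end of the stream

    def worker():
        try:
            for batch in batches:
                q.put(batch)  # blocks once `depth` batches are waiting
        finally:
            q.put(sentinel)  # always signal completion, even on error

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def load_batch(i):
    # Hypothetical stand-in for reading batch `i` from a mounted bucket path.
    return [i * 10 + k for k in range(4)]


# While the loop body (the "GPU step") handles one batch,
# the worker thread is already loading the next ones.
for batch in prefetch(load_batch(i) for i in range(3)):
    total = sum(batch)  # placeholder for the training step
```

In a real training job you would more likely rely on your framework's data loader for this, but the overlap works the same way.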
skypilotucb OP t1_iwvqqgj wrote
Reply to comment by Fast-for-a-starfish in [P] SkyPilot: ML on any cloud with massive cost savings by skypilotucb
Thanks for your question! Training BERT with SkyPilot's managed spot feature cost $18.4 and took 21 hours. Running the same job with on-demand AWS instances cost $61.2 (>3x more) and took 20 hours.
Note that both jobs ran on the same GPU (V100), and the cost and time reported for SkyPilot include the data transfer costs for moving checkpoints and all overheads associated with restarting jobs.
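For anyone who wants to check the arithmetic, here is a quick Python sketch using only the four numbers quoted above (the variable names are just for illustration):

```python
# Figures quoted above for the BERT training run.
spot_cost, spot_hours = 18.4, 21.0          # SkyPilot managed spot
on_demand_cost, on_demand_hours = 61.2, 20.0  # on-demand AWS

cost_ratio = on_demand_cost / spot_cost        # how many times more on-demand costs
savings = 1 - spot_cost / on_demand_cost       # fraction of cost saved with spot
slowdown = spot_hours / on_demand_hours - 1    # extra wall-clock time with spot

print(f"on-demand costs {cost_ratio:.2f}x as much")
print(f"spot saves {savings:.0%} of the cost")
print(f"spot takes {slowdown:.0%} longer")
```

So the spot run cost roughly a third as much in exchange for about one extra hour of wall-clock time.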
skypilotucb OP t1_iwzzu87 wrote
Reply to comment by Acceptable-Cress-374 in [P] SkyPilot: ML on any cloud with massive cost savings by skypilotucb
Absolutely! We're planning on adding support for smaller and cheaper cloud vendors (RunPod included). If this is something you'd like to see prioritized, I would encourage you to open a GitHub issue!