bmsan-gh t1_iwu6v4s wrote on November 18, 2022 at 11:26 AM

Very interesting project! Thanks for sharing.

Datasets today can get really big. I can imagine cases where 20GB-100GB archives of data would need to be downloaded for training, so you might wait anywhere from tens of minutes up to a few hours for the download.

Do you factor the overhead of data transfers into your cost metrics, or have you thought about doing so? (My reasoning might not be correct, but I am assuming that you also need to pay for the time you spend downloading your data to a new provider.)

skypilotucb OP t1_iwvr4s0 wrote on November 18, 2022 at 6:43 PM

That's a great question! SkyPilot uses an optimizer to make cost-aware decisions on where to run tasks and when to move data. It accounts for both data egress costs and the time taken to transfer data.
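To make the trade-off concrete, here is a rough sketch of how egress cost and transfer time could both enter a placement decision. This is not SkyPilot's actual optimizer or API: the `Candidate` class, the function names, and every price and bandwidth figure are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical place to run the job (not a SkyPilot type)."""
    name: str
    hourly_price: float    # USD per VM hour (made-up number)
    egress_per_gb: float   # USD per GB to move the data there (made-up number)
    bandwidth_gbps: float  # sustained transfer bandwidth, gigabits per second


def transfer_hours(size_gb: float, bandwidth_gbps: float) -> float:
    # GB -> gigabits (x8), divide by gigabits/s, then seconds -> hours.
    return size_gb * 8 / bandwidth_gbps / 3600


def total_cost(c: Candidate, size_gb: float, job_hours: float) -> float:
    # The VM is billed while the data is still being transferred,
    # so transfer time adds to the billed hours on top of the egress fee.
    billed_hours = job_hours + transfer_hours(size_gb, c.bandwidth_gbps)
    return billed_hours * c.hourly_price + size_gb * c.egress_per_gb


def cheapest(candidates, size_gb: float, job_hours: float) -> Candidate:
    # Pick the candidate with the lowest combined compute + transfer cost.
    return min(candidates, key=lambda c: total_cost(c, size_gb, job_hours))


if __name__ == "__main__":
    options = [
        Candidate("same-cloud-as-data", 3.0, 0.0, 1.0),
        Candidate("cheaper-other-cloud", 2.0, 0.09, 1.0),
    ]
    best = cheapest(options, size_gb=450, job_hours=3)
    print(best.name, round(total_cost(best, 450, 3), 2))
```

With these made-up numbers, the cheaper hourly rate on the other provider does not make up for the egress fee on a 450 GB dataset, so the job stays where the data lives.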

To avoid long download times, SkyPilot also allows direct access to cloud object stores (S3/GCS) by mounting them as a file system on your VM.

With this mounting feature, you can read from and write to an object store directly, as you would access regular files on your machine, without having to download to disk first. Thus the cost of downloading files gets amortized over the execution of your job, and our users have reported it's usually not a bottleneck, since it can be parallelized with other steps to effectively hide the time cost of downloading data (e.g., you can prefetch the data for the next minibatch directly from S3 while the current batch runs on the GPU).
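The prefetching idea can be sketched in plain Python as below. This is only an illustration using the standard library, not SkyPilot code: `prefetching` is a hypothetical helper, and the loader and training step in the usage note are placeholders for reading from a mounted bucket and running a step on the GPU.

```python
import queue
import threading


def prefetching(batches, depth=2):
    """Yield items from `batches` while a background thread loads ahead.

    `depth` bounds how many batches are buffered, so reads from the
    mounted bucket overlap with compute without unbounded memory use.
    """
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream
    errors = []      # carries a loader exception over to the consumer

    def worker():
        try:
            for batch in batches:
                q.put(batch)  # blocks when the buffer is full
        except Exception as exc:
            errors.append(exc)
        finally:
            q.put(done)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()
        if item is done:
            if errors:
                raise errors[0]
            return
        yield item
```

Usage would look something like `for batch in prefetching(load_batch(p) for p in paths): train_step(batch)`, where `load_batch`, `paths`, and `train_step` are hypothetical names for your own loader, the file paths under the mount point, and your training step. While `train_step` runs on the current batch, the background thread is already reading the next one.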