jakethesnake_ t1_j0h769x wrote

To be honest, I very much doubt we'd ever let a 3rd party manage our data. We have non-sensitive data on S3, and some more sensitive data on-prem. My ideal would be a VCS which either leaves the data in place or supports a dedicated on-prem deployment. For commercial sensitivity and data governance reasons, transferring data to a 3rd party is a non-starter.

I doubt a 3rd party storing a Merkle tree of the data would be acceptable to our partners either. We work with sensitive information.

That being said, XetHub looks useful for me and my team. I particularly like the mounting feature. Our distributed computing system uses Docker images to run jobs, and I currently download the data as needed inside the image...which works but is not efficient. I'd much prefer to mount a data repo. I think this would solve some pain points in our experiments.

I'm off work for the next two weeks, but I'll probably experiment with XetHub in the new year - cool stuff!

rajatarya OP t1_j0h8m7z wrote

Great, can't wait to hear your feedback once you've gotten back to work in the new year!

We definitely can do a dedicated (single-tenant) deployment of XetHub. That way your data stays in your environment for its entirety. It also means you can scale up or down the caching/storage nodes to meet the throughput needs for your workloads.

Yes, we built mount with the data center use case in mind. We have seen distributed GPU clusters at 3-5% utilization because they sit idle while downloading data. With mount those GPUs get busy right away; we have seen 25% reductions in first-epoch training time.

Small clarification - we store the Merkle tree in the Git repo, in a Git notes database - so that lives with the repo. The only things we store outside the repo are the ~16MB data blocks that represent the files in the repo that are managed by XetHub.
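
To make the split between "tree in the repo" and "blocks outside the repo" concrete, here is a rough sketch of a Merkle tree over file blocks. It is not XetHub's actual code: the function names `chunk_hashes` and `merkle_root`, the use of SHA-256, the fixed-size blocks, and the tiny block size in the usage note are all assumptions made for illustration.

```python
import hashlib


def chunk_hashes(data: bytes, block_size: int) -> list[bytes]:
    """Hash each fixed-size block of the data.

    Only these hashes would go into the tree; the blocks themselves
    are what would be stored outside the repo (assumption for this sketch).
    """
    hashes = [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]
    # An empty file still gets one leaf so the tree always has a root.
    return hashes or [hashlib.sha256(b"").digest()]


def merkle_root(leaves: list[bytes]) -> bytes:
    """Combine hashes pairwise, level by level, until one root remains."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            # Odd number of nodes: pair the last node with itself.
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

With a small block size such as 256 bytes, changing one byte in a file changes exactly one leaf hash and the root, so a tool built this way could tell which block needs to be re-uploaded without holding the data itself.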

I would also love to hear about the data governance requirements for your company. Those can help us plan what features we need to add to our roadmap. Can you DM me your work email so I can follow up in January?

jakethesnake_ t1_j0hch5f wrote

Sounds great, I'll scout out XetHub in more detail when I'm back and DM you. Thanks for the helpful answers :)

re: data governance, we have signed very strict agreements with our clients. They specify where the data resides, who has access to it and a bunch of other stuff. I'm not involved in those types of talks with clients, but the negotiations took months. A lot of care has been taken to meet these requirements, and adding another site and an unvetted company into the mix is likely going to be tricky. This seems pretty standard for enterprise clients in my experience.