
rajatarya OP t1_j0h40s0 wrote

Tell me more about this. Are you looking to push your data to S3 and then have XetHub ingest it automatically from S3? Or would you like to keep your data in S3 and have XetHub work with it in place?

We are planning on building the first one (automatic ingestion from S3) - it is on our roadmap for 2023.

Since XetHub builds a Merkle tree over the entire repo, we don't actually store the files themselves; instead we store data blocks that are ~16MB chunks of the files. This lets us transfer data efficiently while still providing fine-grained diffs. It also means the files you store in S3 aren't represented the same way in XetHub, so we can't manage S3 files in place. Instead we need to chunk them and build the Merkle tree so we can deduplicate the repo and store it efficiently.
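To make the idea concrete, here's a very rough sketch of block-level dedup under a Merkle tree. The fixed-size blocks and SHA-256 hashing are stand-ins purely for illustration, not our actual implementation (our real chunking and block format work differently):

```python
import hashlib

BLOCK_SIZE = 16 * 1024 * 1024  # ~16MB, matching the block size mentioned above


def chunk_file(path):
    """Split a file into fixed-size blocks.

    Illustrative only: real-world systems typically use content-defined
    chunking so that insertions don't shift every later block boundary.
    """
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield block


def merkle_root(blocks):
    """Build a Merkle tree over block hashes and return the root digest.

    Identical blocks hash identically, so unchanged blocks across files or
    versions can be stored once and referenced by hash (deduplication).
    """
    level = [hashlib.sha256(b).digest() for b in blocks]
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Two versions of a repo that differ in one block share every other block,
# so only the changed block (plus updated tree nodes) needs storing/sending.
```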

Why would you want to be responsible for your own S3 buckets and files and then have XetHub manage things from there?


jakethesnake_ t1_j0h769x wrote

To be honest, I very much doubt we'd ever let a 3rd party manage our data. We have non-sensitive data on S3 and some more sensitive data on prem. My ideal would be a VCS that either leaves the data in place or runs as a dedicated on-prem deployment. For commercial sensitivity and data governance reasons, transferring data to a 3rd party is a non-starter.

I doubt a 3rd party storing a Merkle tree of the data would be acceptable to our partners either. We work with sensitive information.

That being said, XetHub looks useful for me and my team. I particularly like the mounting feature. Our distributed computing system uses Docker images to run jobs, and I currently download the data as needed inside the image, which works but is not efficient. I'd much prefer to mount a data repo. I think this would solve some pain points in our experiments.
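Roughly what I have in mind is below; the paths are made up and the loop is just a stand-in for our training loop, so treat it as a sketch rather than how our jobs actually look:

```python
import os

# Hypothetical paths for illustration only. The real mount point and dataset
# layout would depend on how the repo is mounted inside the job's container.
MOUNTED_REPO = "/mnt/xethub/my-data-repo"  # assumed read-only mount of the data repo
DOWNLOAD_DIR = "/tmp/dataset"              # what the current "download first" workflow uses


def iter_samples(root):
    """Stream samples from a directory tree without copying it first.

    Pointed at a mounted repo, data is fetched as files are read, so the job
    can start working immediately instead of waiting for a full download.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                yield name, f.read()


# Before: download everything into DOWNLOAD_DIR inside the image, then train.
# After:  point the same loader at the mount and start training right away.
if os.path.isdir(MOUNTED_REPO):
    for name, payload in iter_samples(MOUNTED_REPO):
        pass  # feed `payload` into the training pipeline
```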

I'm off work for the next two weeks, but I'll probably experiment with XetHub in the new year - cool stuff!


rajatarya OP t1_j0h8m7z wrote

Great, can't wait to hear your feedback once you've gotten back to work in the new year!

We definitely can do a dedicated (single-tenant) deployment of XetHub. That way your data stays entirely in your environment. It also means you can scale the caching/storage nodes up or down to meet the throughput needs of your workloads.

Yes, we built mount with the data center use case in mind. We have seen distributed GPU clusters sitting at 3-5% utilization because they are idle while downloading data. With mount, those GPUs get busy right away; we have seen 25% reductions in first-epoch training time.

Small clarification: we store the Merkle tree in the Git repo itself, in a Git notes database, so it lives with the repo. The only things we store outside the repo are the ~16MB data blocks that represent the files managed by XetHub.

I would also love to hear about the data governance requirements for your company. Those can help us plan what features we need to add to our roadmap. Can you DM me your work email so I can follow up in January?


jakethesnake_ t1_j0hch5f wrote

Sounds great, I'll scout out XetHub in more detail when I'm back and DM you. Thanks for the helpful answers :)

re: data governance, we have signed very strict agreements with our clients. They specify where the data resides, who has access to it, and a bunch of other things. I'm not involved in those talks with clients, but the negotiations took months. A lot of care has been taken to meet these requirements, and adding another site and an unvetted company into the mix is likely going to be tricky. That seems pretty standard for enterprise clients in my experience.
