rajatarya OP t1_j0h532a wrote on December 16, 2022 at 4:35 PM

Reply to comment by tlklk in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya

Yes, you can keep the data entirely remote. We built Xet Mount specifically for this - just mount the repo to get a virtual filesystem view over it. We stream the files in the background and on demand. Or you can clone the repo with --no-smudge and just have pointer files. Then you can choose which files to hydrate (smudge) yourself.

As for comparing to DVC, we have a handy feature comparison available here: https://xetdata.com/why-xethub. The short answer is that DVC requires you to register which files it should track, and it deduplicates at the file level by simply storing whole files in a remote location. This means that if 1MB of a 500MB file changed daily, with DVC/Git LFS all 500MB would have to be uploaded/downloaded every day. With XetHub only around 1MB would have to be uploaded/downloaded daily.
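
To make the difference concrete, here is a minimal sketch of file-level versus block-level deduplication. It is an illustration only, not XetHub's actual algorithm: the fixed-size blocks, the 1 KB block size, and all function names are assumptions made for this example, and the file is scaled down to 500 KB with one 1 KB block changed.

```python
import hashlib
import random

CHUNK = 1024  # hypothetical block size, chosen only for this illustration


def chunk_hashes(data: bytes) -> set:
    """Return the SHA-256 digests of every fixed-size block in data."""
    return {hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)}


def bytes_to_upload_file_level(old: bytes, new: bytes) -> int:
    """File-level dedup: if the file hash changed, the whole file is re-sent."""
    same = hashlib.sha256(old).digest() == hashlib.sha256(new).digest()
    return 0 if same else len(new)


def bytes_to_upload_block_level(old: bytes, new: bytes) -> int:
    """Block-level dedup: only blocks whose hash is not already stored are sent."""
    known = chunk_hashes(old)
    return sum(len(new[i:i + CHUNK])
               for i in range(0, len(new), CHUNK)
               if hashlib.sha256(new[i:i + CHUNK]).digest() not in known)


# Build a 500 KB "file" and overwrite exactly one block of it.
rng = random.Random(0)
size = 500 * CHUNK
old = rng.getrandbits(8 * size).to_bytes(size, "little")
patch = rng.getrandbits(8 * CHUNK).to_bytes(CHUNK, "little")
new = old[:10 * CHUNK] + patch + old[11 * CHUNK:]

file_level = bytes_to_upload_file_level(old, new)
block_level = bytes_to_upload_block_level(old, new)
print(f"file-level upload:  {file_level} bytes")
print(f"block-level upload: {block_level} bytes")
```

With file-level tracking the whole 512,000-byte file is re-sent; with block-level tracking only the single changed 1,024-byte block is. Fixed-size blocks only work this neatly for in-place overwrites; an insertion that shifts the rest of the file would need a smarter chunking scheme, which this sketch does not attempt.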

Are you using DVC currently? Would love to hear more about your experience using it and have you try XetHub instead.

BossOfTheGame t1_j0i2v7d wrote on December 16, 2022 at 8:18 PM

Why doesn't it scale past 1TB currently? I have DVC repos that are indexing on the order of 1-2 TB of image data. The data hardly ever changes, and if it does there isn't a big problem in just storing both copies of the image (in XetHub it would be the same probably because most of the image pixels would be different, depending on the processing level). All that we really care about is that it's content addressable, it has access controls (otherwise we would use IPFS), and you can distribute subsets of the data.

If I tried XetHub on a 2TB dataset would it simply fail?

rajatarya OP t1_j0iemk4 wrote on December 16, 2022 at 9:39 PM

There isn’t a hard limit at 1TB currently. The main thing is that the experience / performance may degrade. The size of the Merkle tree is roughly 1% of the total repo size, so at 1TB even downloading that can take some time. You can definitely use XetHub with repos past 1TB today - but your mileage may vary (in terms of perf/experience).
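
As a back-of-the-envelope sketch of what that roughly 1% means in practice (the 100 Mbit/s link speed and the helper names are hypothetical, picked only for illustration, and decimal units are assumed):

```python
TB = 10**12
GB = 10**9


def merkle_tree_bytes(repo_bytes: int) -> int:
    """Estimate Merkle tree size as roughly 1% of the repo size."""
    return repo_bytes // 100


def download_seconds(num_bytes: int, megabits_per_second: int) -> float:
    """Time to transfer num_bytes over a link of the given speed."""
    return num_bytes * 8 / (megabits_per_second * 10**6)


LINK_MBPS = 100  # hypothetical link speed, not a measured figure

for repo_tb in (1, 2):
    tree = merkle_tree_bytes(repo_tb * TB)
    minutes = download_seconds(tree, LINK_MBPS) / 60
    print(f"{repo_tb} TB repo: ~{tree // GB} GB of tree, "
          f"~{minutes:.0f} min at {LINK_MBPS} Mbit/s")
```

So a 1TB repo implies on the order of 10 GB of tree metadata and a 2TB repo about 20 GB, before any file contents are fetched.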

To avoid downloading the entire repo you can use Xet Mount today to get a read-only filesystem view of the repo. Or use the --no-smudge flag on clone to simply get pointer files. Then call git xet checkout for the files you want to hydrate.

I would love to talk more about the 2TB DVC repos you are using today. I believe they would be well served by XetHub, and it is something I would be eager to explore. DM me your email if interested and I will follow up.

Thanks for the question!