Submitted by rajatarya t3_znfgap in MachineLearning
BossOfTheGame t1_j0i2v7d wrote
Reply to comment by rajatarya in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
Why doesn't it scale past 1TB currently? I have DVC repos that index on the order of 1-2 TB of image data. The data hardly ever changes, and when it does it isn't a big problem to just store both copies of the image (in XetHub it would probably be the same, since most of the image pixels would differ, depending on the processing level). All we really care about is that it's content-addressable, it has access controls (otherwise we would use IPFS), and you can distribute subsets of the data.
If I tried XetHub on a 2TB dataset, would it simply fail?
rajatarya OP t1_j0iemk4 wrote
There isn't a hard limit at 1TB currently. The main thing is that the experience/performance may degrade. The Merkle tree is roughly 1% of total repo size, so at 1TB that's about 10GB of metadata, and even downloading that can take some time. You can definitely use XetHub on a repo past 1TB today - but your mileage may vary (in terms of performance/experience).
To avoid downloading the entire repo you can use Xet Mount today to get a read-only filesystem view of the repo. Or use the --no-smudge flag on clone to get just the pointer files, then call git xet checkout for the files you want to hydrate.
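For concreteness, here is a rough sketch of that lazy-hydration workflow as shell commands. Only the --no-smudge flag and `git xet checkout` are taken from the comment above; the exact clone/mount invocations, URL, and paths are assumptions, so check the XetHub docs for the real syntax.

```sh
# Sketch only - the clone/mount syntax, URL, and paths below are assumed, not confirmed above.

# Clone without hydrating file contents: you get lightweight pointer files.
git xet clone --no-smudge https://xethub.com/<user>/<repo>.git
cd <repo>

# Hydrate only the files you actually need.
git xet checkout data/images/batch_0001/

# Alternatively, mount a read-only view of the repo without cloning it
# (the "Xet Mount" mentioned above).
git xet mount https://xethub.com/<user>/<repo>.git ./repo-mount
```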
I would love to talk more about the 2TB DVC repos you are using today - I believe they would be well served by XetHub, and it's something I would be eager to explore. DM me your email if interested and I will follow up.
Thanks for the question!