Paper: https://www.cidrdb.org/cidr2023/papers/p43-low.pdf

Abstract:

Dataset management is one of the greatest challenges to the application of machine learning (ML) in the industry. Although scaling and performance have often been highlighted as the significant ML challenges, development teams are bogged down by the contradictory requirements of supporting fast and flexible data iteration while maintaining stability, provenance, and reproducibility. For example, blobstores are used to store datasets for maximum flexibility, but their unmanaged access patterns limit reproducibility. Many ML pipeline solutions to ensure reproducibility have been devised, but all introduce a degree of friction and reduce flexibility.

In this paper, we propose that the solution to the dataset management challenges is simple and apparent: Git. As a source control system, as well as an ecosystem of collaboration and developer tooling, Git has enabled the field of DevOps to provide both speed of iteration and reproducibility to source code. Git is not only already familiar to developers, but is also integrated into existing pipelines, which facilitates adoption. However, as we (and others) demonstrate, Git, as designed today, does not scale to the needs of ML dataset management. In this paper, we propose XetHub; a system that retains the Git user experience and ecosystem, but can scale to support large datasets. In particular, we demonstrate that XetHub can support Git repositories at the TB scale and beyond. By extending Git to support large-scale data, and building upon a DevOps ecosystem that already exists for source code, we create a new user experience that is both familiar to existing practitioners and truly addresses their needs.

https://preview.redd.it/19x4sim19nba1.png?width=1746&format=png&auto=webp&v=enabled&s=2be4a2e8059f9ac8e00bdea15d5cd97b0574b7ce

https://preview.redd.it/xsqqjjm19nba1.png?width=2422&format=png&auto=webp&v=enabled&s=270c68d290b17bee7dbf0f38588a5d29fa040fe2

Comments

PassionatePossum t1_j429c7y wrote on January 12, 2023 at 6:00 PM

Admittedly, I just skimmed the paper. But I found it weird that DVC wasn’t mentioned at all. Maybe I missed something but it seems to address the same or at least a similar use case.

I think it would deserve at least a mention in the related work section and a discussion of what is different or better in XetHub. Maybe even a performance comparison between the two would be interesting.

theDaninDanger t1_j42vf8i wrote on January 12, 2023 at 8:15 PM

They mention it a few times, but kind of hand-wave it away:

> Solutions such as Git LFS [9] and DVC [10] provide a light-weight facade for adding large files to Git repositories but do not provide sufficient integration to support the needs of industry ML datasets as described in Sec. 2.

I'm not sure what they mean by 'sufficient integration', but whatever the insufficiencies, why not address those? Considering all the authors work at XetHub, I'm pretty sure this is an advertisement disguised as a research paper.

seba07 t1_j46wuue wrote on January 13, 2023 at 4:17 PM

Isn't that the standard way research papers are written: you only compare your solution to methods that are worse than yours? ;)

MUSEy69 t1_j43g8x4 wrote on January 12, 2023 at 10:19 PM

not in the paper, but I found a table on their site: https://xetdata.com/why-xethub/

BossOfTheGame t1_j43tep8 wrote on January 12, 2023 at 11:45 PM

Biggest issue with XetHub is current lack of open source. They have a way better storage design than DVC, but open source is absolutely necessary. If I can't give someone else zero-cost instructions to reproduce my research then my research becomes far less useful.

Perhaps industry will pay for this, but non-free software is a deal breaker for someone who wants to advance science.

uhules t1_j47fkag wrote on January 13, 2023 at 6:11 PM

I'm guessing this is unintentional, but you talk like XetHub has been a thing for a while. I even went to see what I had missed, and from what I gathered it's a startup that just emerged from its stealth status like, ~~five days ago~~ (from its Twitter status it's more like three weeks, but still). They'll probably open-source the core tech as a freemium like almost everything else in the current convoluted MLOps landscape.

rajatarya OP t1_j43uhw1 wrote on January 12, 2023 at 11:52 PM

Does the Community edition of XetHub help address this? See here: https://xetdata.com/pricing/. Everyone today gets 20GB of storage for free.

BossOfTheGame t1_j446cth wrote on January 13, 2023 at 1:14 AM

You can't self host?

tamale t1_j4457xf wrote on January 13, 2023 at 1:06 AM

Did they mention Pachyderm?

ConverseHydra t1_j45p0h5 wrote on January 13, 2023 at 9:55 AM

This sort of reads like an advertisement….
