Submitted by rajatarya t3_10a4mns in MachineLearning

Paper: https://www.cidrdb.org/cidr2023/papers/p43-low.pdf

Abstract:

Dataset management is one of the greatest challenges to the application of machine learning (ML) in the industry. Although scaling and performance have often been highlighted as the significant ML challenges, development teams are bogged down by the contradictory requirements of supporting fast and flexible data iteration while maintaining stability, provenance, and reproducibility. For example, blobstores are used to store datasets for maximum flexibility, but their unmanaged access patterns limit reproducibility. Many ML pipeline solutions to ensure reproducibility have been devised, but all introduce a degree of friction and reduce flexibility.

In this paper, we propose that the solution to the dataset management challenges is simple and apparent: Git. As a source control system, as well as an ecosystem of collaboration and developer tooling, Git has enabled the field of DevOps to provide both speed of iteration and reproducibility to source code. Git is not only already familiar to developers, but is also integrated into existing pipelines, which facilitates adoption. However, as we (and others) demonstrate, Git, as designed today, does not scale to the needs of ML dataset management. In this paper, we propose XetHub; a system that retains the Git user experience and ecosystem, but can scale to support large datasets. In particular, we demonstrate that XetHub can support Git repositories at the TB scale and beyond. By extending Git to support large-scale data, and building upon a DevOps ecosystem that already exists for source code, we create a new user experience that is both familiar to existing practitioners and truly addresses their needs.

https://preview.redd.it/19x4sim19nba1.png?width=1746&format=png&auto=webp&v=enabled&s=2be4a2e8059f9ac8e00bdea15d5cd97b0574b7ce

https://preview.redd.it/xsqqjjm19nba1.png?width=2422&format=png&auto=webp&v=enabled&s=270c68d290b17bee7dbf0f38588a5d29fa040fe2

45

Comments

PassionatePossum t1_j429c7y wrote

Admittedly, I just skimmed the paper. But I found it weird that DVC wasn’t mentioned at all. Maybe I missed something but it seems to address the same or at least a similar use case.

I think it deserves at least a mention in the related work section and a discussion of what is different or better in XetHub. A performance comparison between the two would also be interesting.

35

theDaninDanger t1_j42vf8i wrote

They mention it a few times, but kind of hand-wave it away:

> Solutions such as Git LFS [9] and DVC [10] provide a light-weight facade for adding large files to Git repositories but do not provide sufficient integration to support the needs of industry ML datasets as described in Sec. 2.

I'm not sure what they mean by 'sufficient integration', but whatever the insufficiencies, why not address those? Considering all the authors work at XetHub, I'm pretty sure this is an advertisement disguised as a research paper.

27

BossOfTheGame t1_j43tep8 wrote

The biggest issue with XetHub is that it currently isn't open source. They have a way better storage design than DVC, but open source is absolutely necessary. If I can't give someone else zero-cost instructions to reproduce my research, then my research becomes far less useful.

Perhaps industry will pay for this, but non-free software is a deal breaker for someone who wants to advance science.

17

tamale t1_j4457xf wrote

Did they mention Pachyderm?

5

uhules t1_j47fkag wrote

I'm guessing this is unintentional, but you talk like XetHub has been a thing for a while. I even went to see what I had missed, and from what I gathered it's a startup that just emerged from its stealth status like five days ago (from its Twitter status it's more like three weeks, but still). They'll probably open-source the core tech as a freemium product, like almost everything else in the current convoluted MLOps landscape.

2