Submitted by rajatarya t3_10a4mns in MachineLearning
Paper: https://www.cidrdb.org/cidr2023/papers/p43-low.pdf
Abstract:
Dataset management is one of the greatest challenges to the application of machine learning (ML) in the industry. Although scaling and performance have often been highlighted as the significant ML challenges, development teams are bogged down by the contradictory requirements of supporting fast and flexible data iteration while maintaining stability, provenance, and reproducibility. For example, blobstores are used to store datasets for maximum flexibility, but their unmanaged access patterns limit reproducibility. Many ML pipeline solutions to ensure reproducibility have been devised, but all introduce a degree of friction and reduce flexibility.
In this paper, we propose that the solution to the dataset management challenges is simple and apparent: Git. As a source control system, as well as an ecosystem of collaboration and developer tooling, Git has enabled the field of DevOps to provide both speed of iteration and reproducibility to source code. Git is not only already familiar to developers, but is also integrated into existing pipelines, which facilitates adoption. However, as we (and others) demonstrate, Git, as designed today, does not scale to the needs of ML dataset management. In this paper, we propose XetHub, a system that retains the Git user experience and ecosystem, but can scale to support large datasets. In particular, we demonstrate that XetHub can support Git repositories at the TB scale and beyond. By extending Git to support large-scale data, and building upon a DevOps ecosystem that already exists for source code, we create a new user experience that is both familiar to existing practitioners and truly addresses their needs.
PassionatePossum t1_j429c7y wrote
Admittedly, I just skimmed the paper. But I found it weird that DVC wasn’t mentioned at all. Maybe I missed something, but it seems to address the same or at least a similar use case.
I think DVC deserves at least a mention in the related work section, along with a discussion of what is different or better in XetHub. Maybe even a performance comparison between the two would be interesting.