Submitted by rajatarya t3_znfgap in MachineLearning
Thanks to everyone who replied to our earlier post requesting pre-launch product feedback! We’re excited to announce that we’ve now publicly launched XetHub, a collaborative storage platform for data management.
I’ve been in the MLOps space for ~10 years, and data is still the hardest unsolved problem. Code is versioned using Git, data is stored somewhere else, and context often lives in a third place like Slack or Google Docs.
This is why we built XetHub, a platform that enables teams to treat data like code, using Git.
Unlike Git LFS, XetHub doesn’t just store the files. It uses content-defined chunking and Merkle trees to deduplicate against everything in history, so a small change to a large file is stored compactly rather than as a whole new copy. Here’s how it works: https://xethub.com/assets/docs/how-xet-deduplication-works
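For readers who want a feel for the idea, here is a minimal Python sketch of content-defined chunking with a deduplicating chunk store and a Merkle root over the chunk hashes. It is an illustration only and not XetHub's actual implementation: the gear-style rolling hash, the chunk-size parameters, and the names `chunk`, `ChunkStore`, and `merkle_root` are all assumptions made for this example.

```python
import hashlib
import random

# Hypothetical parameters for illustration; not XetHub's real settings.
MASK = (1 << 10) - 1   # cut when the low 10 hash bits are zero (~1 KiB spacing)
MIN_CHUNK = 256        # never cut before this many bytes
MAX_CHUNK = 8192       # always cut at this many bytes

# Fixed pseudo-random table for a gear-style rolling hash (assumed scheme).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes):
    """Split data at boundaries chosen from the content, not from offsets."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # The low bits of h depend only on the most recent bytes.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def merkle_root(hashes):
    """Fold a list of chunk hashes pairwise into one root hash."""
    if not hashes:
        return hashlib.sha256(b"").digest()
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


class ChunkStore:
    """Stores each distinct chunk once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}

    def add(self, data: bytes):
        """Return (list of chunk keys, number of chunks newly stored)."""
        keys, new = [], 0
        for c in chunk(data):
            k = hashlib.sha256(c).digest()
            if k not in self.blobs:
                self.blobs[k] = c
                new += 1
            keys.append(k)
        return keys, new

    def read(self, keys):
        """Reassemble a file version from its chunk keys."""
        return b"".join(self.blobs[k] for k in keys)
```

Adding a second version of a file that differs by a small edit should store only the few chunks around the edit, because the chunk boundaries are derived from the bytes themselves and line up again shortly after the change. The Merkle root gives each version a single identifier that changes whenever any chunk does.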
XetHub includes a GitHub-like web interface that provides automatic CSV summaries and allows custom visualizations using Vega. And we know how painful downloading a huge repository can get, so we built Git-Xet mount, which provides a user-mode filesystem view over the repo in seconds.
Today, XetHub works for 1 TB repositories, and we plan to scale to 100 TB in the next year. Our implementation is in Rust (client & cache + storage) and our web application is written in Go.
XetHub is available today for Linux & Mac (Windows coming soon) and we’d love for you to try it out!
More info here:
- https://xetdata.com/blog/2022/12/13/introducing-xethub
- https://xetdata.com/blog/2022/10/15/why-xetdata
- Hacker News discussion (launched on Show HN at #1): https://news.ycombinator.com/item?id=33969908
Retarded_Rhino t1_j0gn93r wrote
Wow, this is stupidly good, congrats! You should share it on r/rust as well; they will love this.