Keepclamand- t1_j0gxj2o wrote on December 16, 2022 at 3:47 PM

Agreed, data is the biggest challenge. I haven’t tried XetHub yet, but I will check it out. Some questions: do you support all data types (text, images, etc.)?

Also, can you track versioning of data?
One big need is to map which model was trained on which version of the data.

Do you have APIs?

rajatarya OP t1_j0gyotd wrote on December 16, 2022 at 3:55 PM

Great questions. Definitely check us out - within 15 minutes of getting started you'll see the answers to your questions first-hand :)

Do you support all data types?
Yes, all file types are supported. The level of deduplication we can achieve varies by file type (some file types are already compressed) but all file types are supported. We have some great example repos with images, text, and other data types.
Can you track versioning of data?
Yes. Since you are just using Git, each commit captures the version of the data (the data is just files in the repo). This gives you the full collaboration features of Git along with full reproducibility, plus confidence that the code will work with the data at each commit.
Do you have APIs?
Not today. Can you tell me what sort of APIs would be interesting to you? We built Xet Mount specifically for use cases where you don't want to download the entire repo - instead you mount it, get a filesystem view over the repo, and stream in the files you want to explore, examine, or analyze.
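To illustrate why the achievable deduplication varies by file type, here is a minimal sketch that splits data into fixed-size chunks and counts how many are unique. It is not XetHub's actual algorithm; the function name `dedup_ratio`, the 4096-byte chunk size, and the use of random bytes as a stand-in for already-compressed content are all assumptions made for illustration.

```python
import hashlib
import os


def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Return unique_chunks / total_chunks for fixed-size chunks.

    1.0 means no chunk repeats (nothing to deduplicate);
    values near 0 mean almost every chunk is a repeat.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        return 1.0
    unique = {hashlib.sha256(chunk).digest() for chunk in chunks}
    return len(unique) / len(chunks)


# Hypothetical repetitive data: one 4096-byte block repeated 100 times.
repetitive = (bytes(range(256)) * 16) * 100

# Random bytes stand in for already-compressed files, which are similarly
# high-entropy and rarely contain repeated chunks.
high_entropy = os.urandom(4096 * 100)

print(f"repetitive:   {dedup_ratio(repetitive):.2f}")
print(f"high entropy: {dedup_ratio(high_entropy):.2f}")
```

The repetitive input collapses to a single unique chunk, while the high-entropy input keeps essentially every chunk, which is the behaviour described above for already-compressed file types.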

Do check out XetHub - I would love to hear your feedback!

rajatarya OP t1_j0gzh2r wrote on December 16, 2022 at 4:00 PM

Oh, I forgot to mention - yes! Mapping a model to its training data is a key part of reproducibility. 100% agree!

Using XetHub you can _finally_ commit the data, features, models, and metadata all in one place (along with the code). Have full confidence everything is aligned & working.
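One generic, Git-based way to record which model came from which version of the data is to store the commit hash next to a checksum of the model file. The sketch below assumes a plain Git checkout and is not a XetHub feature; the helper names (`current_commit`, `model_record`, `write_record`) and the metadata field names are hypothetical.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def current_commit(repo_dir: str = ".") -> str:
    """Return the full hash of the commit checked out in repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def model_record(model_path: str, commit: str) -> dict:
    """Tie a model file (by SHA-256) to the commit its data was read from."""
    path = Path(model_path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "model_file": path.name,
        "model_sha256": digest,
        "data_commit": commit,
    }


def write_record(record: dict, out_path: str) -> None:
    """Write the record as JSON so it can be committed with the model."""
    Path(out_path).write_text(json.dumps(record, indent=2) + "\n")
```

After training inside a repository, something like `write_record(model_record("model.bin", current_commit()), "model.meta.json")` would produce a small file to commit alongside the model; the recorded hash is the commit that was checked out during training.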

Liorithiel t1_j0h19at wrote on December 16, 2022 at 4:11 PM

> finally

I've been doing this with git annex for a long time, so it's a bit of a stretch to say it wasn't possible before. Kind of a Schmidhuber moment…

Still, nice work with the Merkle tree!

rajatarya OP t1_j0h7npz wrote on December 16, 2022 at 4:52 PM

True :) I haven't used `git annex` myself so for me it felt like _finally_ when I could put all parts of the project in one place with XetHub.

How do you like using git annex? Are you working with others on your projects - does git annex help support team collaboration?

Again, appreciate the comment!

Liorithiel t1_j0hehga wrote on December 16, 2022 at 5:36 PM

> How do you like using git annex? Are you working with others on your projects - does git annex help support team collaboration?

Right now I've got one large 5 TB repository with general media and archives, and some smaller project-specific repos. It's slow with many small files (like, over 1 million), but very easy to set up. I haven't tried collaboration; I've mostly worked on projects where my collaborators were rather less technical. My main use case was working with the same dataset on different computers, and for that it was more than enough.