
davidbun t1_jdr115n wrote

Full disclosure: I'm one of the creators of the project, but this is exactly why we built Deep Lake, the Data Lake for Deep Learning. It addresses all your concerns. Specifically:

- Works with any framework (PyTorch, TensorFlow; you might also want to look into training models with MMDetection).

- Stores (and visualizes!) all your data together with your metadata.

- Outperforms Zarr (we built on top of it in v1, but it constrained us quite a bit, so we had to rebuild everything from scratch), as well as various dataloaders across a variety of use cases.

- Achieves near-full or full GPU utilization regardless of scale (battle-tested on LAION-scale image datasets, i.e., hundreds of millions of samples). This holds regardless of which cloud stores your images and where you train your model, e.g., streaming from EC2 to AWS SageMaker at full GPU utilization and half the cost (no GPU idle time, thanks to streaming).
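The streaming claim above boils down to prefetching chunks of data from remote storage in the background so the training loop never blocks on the network. Here is a minimal stdlib-only sketch of that idea; all names (`fetch_chunk`, `stream_batches`, `CHUNK_SIZE`) are hypothetical illustrations, not the Deep Lake API:

```python
# Sketch of chunk-based prefetching: overlap "network" reads with consumption
# so the consumer (e.g. a GPU training loop) is never starved.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

CHUNK_SIZE = 4  # samples per stored chunk (hypothetical)

def fetch_chunk(chunk_id):
    # Placeholder for a remote object-storage read of one compressed chunk.
    return [chunk_id * CHUNK_SIZE + i for i in range(CHUNK_SIZE)]

def stream_batches(num_chunks, prefetch=2):
    """Yield chunks in order while keeping `prefetch` requests in flight."""
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        in_flight = Queue()
        ids = iter(range(num_chunks))
        # Prime the pipeline with `prefetch` concurrent requests.
        for _ in range(prefetch):
            try:
                in_flight.put(pool.submit(fetch_chunk, next(ids)))
            except StopIteration:
                break
        # Each time a chunk is consumed, immediately request the next one.
        while not in_flight.empty():
            yield in_flight.get().result()
            try:
                in_flight.put(pool.submit(fetch_chunk, next(ids)))
            except StopIteration:
                pass

batches = list(stream_batches(num_chunks=3))
```

The design choice is the same one any streaming dataloader makes: with enough in-flight requests to cover network latency, throughput is bounded by the consumer, not the storage backend.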
