GaseousOrchid t1_j97gxhy wrote

What are some good tools for data pipelines that scale well? I'm locked into JAX/Flax for work, but would like to disconnect from TensorFlow to the greatest extent possible. I was looking at the Hugging Face dataloaders. Does anyone have experience with those?

ParanoidTire t1_j9hbd77 wrote

I have years of grievances with IO. It's really difficult to have something that is flexible, performant, and able to scale to terabytes of data with complex structure. As soon as you leave the nice CV or NLP domain you are on your own. Raw C-type arrays loaded manually from disk in a separate CUDA stream can sometimes really be your best shot.

GaseousOrchid t1_j9hhxdm wrote

yeah, this has been my experience -- i'm working with a lot of custom data, and even though some of it is CV adjacent, it doesn't fit exactly (e.g., ~40 channels instead of 3 like RGB). would be nice, especially for research purposes, to have something plug and play that just worked.
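
For anyone landing here, a minimal sketch of the "raw arrays loaded manually from disk" idea for many-channel data, using only NumPy's `np.memmap`. Everything specific is an assumption for illustration: the file layout (headerless, C-contiguous float32 samples of a fixed shape), the names `open_raw`, `iter_batches`, and `demo`, and the 8x8x40 sample shape. It does not cover the separate CUDA stream part, only the disk side.

```python
import os
import tempfile

import numpy as np


def open_raw(path, sample_shape, dtype=np.float32):
    """Memory-map a headerless file of fixed-shape samples (assumed layout)."""
    sample_bytes = np.dtype(dtype).itemsize * int(np.prod(sample_shape))
    n_samples = os.path.getsize(path) // sample_bytes
    # mode="r" maps the file read-only; nothing is read until it is indexed.
    return np.memmap(path, dtype=dtype, mode="r", shape=(n_samples, *sample_shape))


def iter_batches(data, batch_size, seed=0):
    """Yield shuffled batches as in-memory arrays, dropping the last partial batch."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    for start in range(0, len(order) - batch_size + 1, batch_size):
        # Sorting indices within a batch keeps disk reads closer to sequential.
        idx = np.sort(order[start:start + batch_size])
        # Fancy indexing copies the selected samples out of the mapping.
        yield np.asarray(data[idx])


def demo():
    """Write a tiny hypothetical 40-channel dataset and read it back in batches."""
    shape = (8, 8, 40)  # (height, width, channels), made up for this example
    samples = np.arange(10 * np.prod(shape), dtype=np.float32).reshape(10, *shape)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "samples.bin")
        samples.tofile(path)  # raw C-order bytes, no header
        data = open_raw(path, shape)
        batches = list(iter_batches(data, batch_size=4))
        del data  # release the mapping before the directory is removed
    return batches


if __name__ == "__main__":
    for batch in demo():
        print(batch.shape, batch.dtype)
```

Each yielded batch is a plain NumPy array, so it can be handed to JAX (for example with `jax.device_put`) without going through TensorFlow. Whether this keeps up at terabyte scale depends on the storage and access pattern, so treat it as a starting point rather than a benchmarked solution.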
