
numpee t1_j67dlt8 wrote

So with ImageNet, the main performance (speed) bottleneck is actually data loading, especially if your model is not that large (e.g., Res18 or Res50). ImageNet's train split is roughly 1.2M images at under 1MB each, which means you're performing ~1.2M small random reads per epoch. Modern NVMe SSDs have great sequential read speeds but are much slower at random reads, and random reads are exactly what you get when you shuffle the image order each epoch. BTW, data loading won't be the bottleneck if you're training models like ViT or even Res152, since the forward/backward compute per batch is large enough to hide the I/O.
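For context, here's a minimal sketch of the standard pipeline I mean: torchvision's ImageFolder plus a shuffled PyTorch DataLoader. The path, batch size, and worker count are placeholders, but the key point is that shuffle=True turns every epoch into ~1.2M small random reads.

```python
import torch
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# Each sample is a separate JPEG on disk, read on demand.
train_set = datasets.ImageFolder('/data/imagenet/train', transform=train_tf)

# shuffle=True reorders file accesses every epoch, so the SSD sees
# ~1.2M small random reads per epoch instead of one sequential scan.
loader = torch.utils.data.DataLoader(
    train_set, batch_size=256, shuffle=True,
    num_workers=16, pin_memory=True,
)
```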

I highly suggest you try out a dataset format such as FFCV or WebDataset. I personally use FFCV, which is extremely fast because it caches the data in RAM. There are definitely some limitations, though, such as code compatibility, or not having enough RAM to cache all the images (something you should check on the server side). You can remap ImageNet to the FFCV/WebDataset format on a local machine, then transfer the converted data to the server for training (rough sketch below).
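If you go the FFCV route, the remapping is basically a one-time offline write, then a custom Loader at train time. A minimal sketch based on FFCV's DatasetWriter/Loader API; all paths and settings (resolution, JPEG quality, batch size) are placeholder choices, not tuned values:

```python
import torch
from torchvision.datasets import ImageFolder

from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField
from ffcv.fields.decoders import RandomResizedCropRGBImageDecoder, IntDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.transforms import ToTensor, ToDevice, ToTorchImage, Squeeze

# --- One-time conversion: do this locally, then copy the .beton
# file to the training server ---
train_set = ImageFolder('/data/imagenet/train')
writer = DatasetWriter('/data/imagenet_train.beton', {
    'image': RGBImageField(max_resolution=256, jpeg_quality=90),
    'label': IntField(),
})
writer.from_indexed_dataset(train_set)

# --- Training-time loader ---
# os_cache=True lets the OS page cache keep the .beton file in RAM,
# so shuffled access no longer hits the SSD's random-read path.
# This is also why you need enough RAM for the whole dataset.
loader = Loader(
    '/data/imagenet_train.beton',
    batch_size=256,
    num_workers=8,
    order=OrderOption.RANDOM,
    os_cache=True,
    pipelines={
        'image': [
            RandomResizedCropRGBImageDecoder((224, 224)),
            ToTensor(),
            ToDevice(torch.device('cuda:0')),
            ToTorchImage(),
        ],
        'label': [IntDecoder(), ToTensor(), Squeeze()],
    },
)
```

The one-time write does cost disk space for a second copy of the dataset, but after that every epoch reads one packed file instead of 1.2M separate JPEGs, which is where the speedup comes from.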

Just for reference, one epoch of ImageNet training on 4x A6000s (each roughly 2-2.5x slower than an A100) with Res18 takes me around 3 minutes using FFCV. But using A100s won't necessarily be faster, because even with FFCV, data loading alone takes 2-3 minutes without any model forward/backward. IIRC, with ordinary data loading you'd be looking at around 10-15 minutes per epoch.

If you want more details, feel free to DM me.
