numpee t1_j67dlt8 wrote

So with ImageNet, the main performance (speed) bottleneck is actually data loading, especially if your models aren't that large (such as Res18 or Res50). ImageNet has roughly 1.2M training images at <1MB each, which means you're performing 1.2M random reads per epoch. Modern NVMe SSDs have great sequential read speeds, but random read performance still lags well behind (and random reads are exactly what you get when you shuffle the image order each epoch). BTW, data loading won't be the bottleneck if you're training heavier models like ViT or even Res152.
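A rough back-of-envelope for why the random reads dominate (the IOPS numbers below are illustrative assumptions, not benchmarks of any particular drive):

```python
# Back-of-envelope: seconds per epoch spent just issuing random reads.
# The IOPS figures are hypothetical examples, not measured values.

NUM_IMAGES = 1_200_000  # ImageNet train set size, as stated above

def epoch_read_seconds(num_reads: int, iops: float) -> float:
    """Time to issue num_reads random reads at a sustained IOPS rate."""
    return num_reads / iops

# Hypothetical low-queue-depth random-read rates for an NVMe SSD:
print(epoch_read_seconds(NUM_IMAGES, 15_000))   # 80.0 s at 15k IOPS
print(epoch_read_seconds(NUM_IMAGES, 100_000))  # 12.0 s at 100k IOPS
```

Even under optimistic assumptions, the per-epoch read cost lands in the minutes range once you add decode and transfer overhead, which lines up with the loading times discussed below.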

I highly suggest you try out a dataset format such as FFCV or WebDataset. I personally use FFCV, which is extremely fast because it caches data in RAM. There are definitely some limitations, though, such as code compatibility or not having enough RAM to cache all the images (something you should check on the server side). You can remap ImageNet to the FFCV/WebDataset format on a local machine, then transfer the data to the server for training.
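To illustrate the caching idea (this is a toy sketch of the principle, not FFCV's actual implementation — FFCV has its own on-disk format and compiled loading pipeline), here's a minimal stdlib-only wrapper that pays the disk-read cost once and serves every later (shuffled) access from RAM; the class name and file-path interface are made up for the example:

```python
class RAMCachedDataset:
    """Toy sketch: read every file once, then serve all subsequent
    (shuffled) accesses from memory. Mimics the idea behind RAM
    caching in formats like FFCV, not their real implementation."""

    def __init__(self, file_paths):
        self.file_paths = list(file_paths)
        self._cache = None  # filled lazily on the first pass

    def _build_cache(self):
        # One pass over the files: the only disk reads we ever do.
        self._cache = [open(p, "rb").read() for p in self.file_paths]

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        if self._cache is None:
            self._build_cache()
        return self._cache[idx]  # served from RAM, no disk access
```

After the first epoch, shuffling indices is free because every `__getitem__` is a RAM hit — which is exactly why there has to be enough RAM to hold the whole dataset.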

Just for reference, one epoch of ImageNet training on 4x A6000 (roughly 2~2.5x slower than A100) with Res18 takes me around 3 minutes using FFCV. But using A100s won't necessarily be faster, because even with FFCV, data loading itself takes 2~3 minutes without any model forward/backward. IIRC, with ordinary data loading you'd be looking at around 10~15 minutes per epoch.

If you want more details, feel free to DM me.

4

numpee t1_j4whr9r wrote

Hi u/timdettmers, I had a great time reading your blog post :) I just wanted to point out something that might be worth mentioning: the issue with the 4090 (and probably the 4080 as well) is that it won't fit in servers, specifically 4U rack-mounted servers. In rack-mounted servers, the PCIe slots sit at the bottom (facing upwards), so the GPUs are mounted "vertically" (PCIe connector pointing downwards). The 4090 is too tall for a 4U chassis, which makes it unusable there (and at 3.5 slots per GPU, it complicates things further).

3