Submitted by Maxerature t3_10ir1su in MachineLearning

I have 2 GPUs, an RTX 3080 and a GTX 1080Ti. Currently I am using only the 3080, and the 10 GB VRAM doesn't seem to cut it. Can I use both the 3080 and 1080 simultaneously? My motherboard has multiple PCI-E x16 slots. My OS is PopOS. Is there any way to use multiple GPUs of different types? I'm particularly looking at KoboldAI, but it would also be useful in general. I know that SLI won't work since they're different GPUs.

4

Comments


jess-plays-games t1_j5fy01j wrote

You could SLI a 1080 with a previous-generation card easily enough, but SLI doesn't pool memory; both cards just use one card's VRAM.

The 2000 series and later don't support SLI anymore and instead use NVLink, which does share VRAM between cards.

There's a handy little program called "any SLI" that I used to use.

3

romek_ziomek t1_j5hhn2h wrote

Of course you can use them both, provided you have free PCI-E slots. I use a 3060 and a 2060 Super in my setup. I'm not sure what exactly you want to do, but I can tell you that I'm working in PyTorch and it's a painless process: you can choose which GPU to use with a single variable, or use a wrapper class (DataParallel) to train on both of them simultaneously. One trick that was specific to my motherboard, and that I had to figure out by trial and error (since it wasn't in the documentation), was that my second GPU wouldn't work if I had two NVMe drives installed. Other than that it works flawlessly.
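A minimal PyTorch sketch of both approaches (the model and data are placeholder stand-ins, and it assumes two visible CUDA devices):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10)  # placeholder model

# Option 1: pick the GPU with a single variable.
device = torch.device("cuda:0")  # "cuda:1" would select the other card
net = model.to(device)
out = net(torch.randn(8, 512, device=device))

# Option 2: wrap the model in DataParallel to train on both GPUs at once.
# Each batch is split across device_ids and the outputs are gathered on cuda:0.
net_dp = nn.DataParallel(model, device_ids=[0, 1]).to("cuda:0")
out = net_dp(torch.randn(8, 512, device="cuda:0"))
```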

3

FastestLearner t1_j5if4nz wrote

Tim Dettmers wrote about this in one of his articles. AFAIK, SLI is not required for DL (it's a gaming thing, where synchronization between GPUs matters for smooth gameplay). In DL tasks, any GPU can simply wait for the others to finish. So you can use any combination of any number of Nvidia GPUs as long as you can interface with them (PCIe or Ethernet). The catch is that training/inference speed will be limited by the weakest link in the chain, i.e. the slowest GPU will bottleneck all the others. On the flip side, you should be able to fit more data owing to the increased total VRAM.

The other thing you can do is run a different experiment on each GPU simultaneously. That way, you can maximize the usage of both GPUs.

If you do want to fit more data on the 3080, look into PyTorch add-ons such as DeepSpeed, use mixed precision (FP16), or simply do two forward passes per backward pass (gradient accumulation), which doubles your effective batch size.
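A hedged sketch of the gradient-accumulation trick (model, data, and hyperparameters are made up for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10).to("cuda:0")                 # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(32, 512), torch.randint(0, 10, (32,)))
          for _ in range(4)]                            # dummy mini-batches

accum_steps = 2  # two forward/backward passes per optimizer step

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    x, y = x.to("cuda:0"), y.to("cuda:0")
    loss = loss_fn(model(x), y) / accum_steps  # scale so the summed grads average out
    loss.backward()                            # grads accumulate in .grad between steps
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # one update per accum_steps batches
        optimizer.zero_grad()
```

The effective batch size here is 64 even though only 32 samples sit in VRAM at a time.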

1

FastestLearner t1_j5iklgu wrote

If you don't engage the second GPU, it will remain dormant and shouldn't interfere with anything. For example, if you are training a network in PyTorch without using DP or DDP, it will use the first GPU by default. You can always change which GPU it uses with the environment variable CUDA_VISIBLE_DEVICES. Also, make sure the primary GPU occupies the first PCIe slot; you can verify this with nvidia-smi. When the display is hooked up to it, the primary GPU will show slightly higher memory usage (~100 MB) than the other GPUs because of display server processes like Xorg.
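For example (assuming the second card enumerates as device 1; the variable has to be set before anything initializes CUDA, which is why it's usually done on the shell command line):

```python
import os

# Hide every GPU except device 1; set this before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

print(torch.cuda.device_count())      # 1 -- only the unmasked GPU is visible
print(torch.cuda.get_device_name(0))  # the remaining GPU is renumbered to cuda:0
```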

2

ArnoF7 t1_j5lknua wrote

I have the same suspicion, that training will be bottlenecked by the slower 1080. But I'm wondering if it's possible to treat the 1080 as a pure VRAM extension?

Although it's possible that the time spent transferring between the different memories cancels out the gain of having more VRAM.

1

FastestLearner t1_j5mwz47 wrote

It is possible, but it would require you to write custom code for every memcopy operation you want to perform, i.e. tensor.to(device), which you can get away with on a smaller project but which could become prohibitively cumbersome on a large one. Also, you'd still need to do two forward passes (one with the data already on the 3080, and another with the data on the 1080 after transferring it to the 3080). Whether or not this is beneficial boils down to the difference in transfer rates between the RAM-3080 route and the RAM-1080-3080 route; I couldn't tell which is faster without benchmarking.
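A rough sketch of that manual staging, assuming cuda:0 is the 3080 and cuda:1 is the 1080 (all names are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10).to("cuda:0")          # compute stays on the 3080

batch_a = torch.randn(32, 512, device="cuda:0")  # lives in 3080 VRAM
batch_b = torch.randn(32, 512, device="cuda:1")  # parked in 1080 VRAM

out_a = model(batch_a)               # pass 1: data already local to the 3080
out_b = model(batch_b.to("cuda:0"))  # pass 2: copy 1080 -> 3080 over PCIe first
```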

DeepSpeed handles the RAM-3080 to-and-fro transfers for large batch sizes automatically.
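For reference, the usual entry point looks something like this (a sketch only, normally run under the deepspeed launcher; the config keys follow DeepSpeed's ZeRO-Offload documentation and the model is a placeholder):

```python
import deepspeed
import torch.nn as nn

model = nn.Linear(512, 10)  # placeholder model

ds_config = {
    "train_micro_batch_size_per_gpu": 32,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # keep optimizer state in system RAM
    },
}

# Returns an engine whose forward/backward/step manage the offloading.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```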

1