currentscurrents t1_jd007aa wrote on March 20, 2023 at 9:05 PM

Reply to comment by satireplusplus in [Project] Alpaca-30B: Facebook's 30b parameter LLaMa fine-tuned on the Alpaca dataset by imgonnarelph

Right. And even once you have enough VRAM, memory bandwidth limits the speed more than tensor core throughput does.

They could pack more tensor cores in there if they wanted to; they just wouldn't be able to feed them data fast enough.
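
A rough back-of-envelope sketch of why this holds for single-stream text generation: producing each token means reading roughly all of the weights out of VRAM once, so bandwidth divided by model size gives a ceiling on tokens per second, no matter how many tensor cores there are. The figures below (4-bit weights, 900 GB/s) are illustrative assumptions, not measurements of any particular card, and `max_tokens_per_sec` is a made-up helper name.

```python
def max_tokens_per_sec(n_params: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every token must stream all weights once.

    Ignores compute time, KV-cache reads, and batching, so real speeds
    will be lower than this ceiling.
    """
    model_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes


# Assumed example: 30B parameters at 4 bits (0.5 bytes) each,
# on a hypothetical GPU with 900 GB/s of memory bandwidth.
ceiling = max_tokens_per_sec(30e9, 0.5, 900.0)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```

Doubling the tensor core count changes nothing in this estimate; only more bandwidth or a smaller model (fewer parameters or fewer bits per weight) raises the ceiling.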

pointer_to_null t1_jd0bv74 wrote on March 20, 2023 at 10:23 PM

This is definitely true. Theoretically you can page stuff in/out of VRAM to run larger models, but you won't be getting much benefit over CPU compute with all that thrashing.

Enturbulated t1_jd1x9uu wrote on March 21, 2023 at 6:30 AM

You are absolutely correct. text-gen-webui offers "streaming" via paging models in and out of VRAM. With this, your CPU no longer gets bogged down running the model, but you don't see much improvement in generation speed because the GPU is constantly loading and unloading model data from main RAM. It can still be an improvement worth some effort, but it's far less drastic than when the entire model fits in VRAM.
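
To put a rough number on the thrashing, here is a minimal sketch that assumes the part of the model that doesn't fit in VRAM has to cross the PCIe bus once per generated token, with no overlap between transfer and compute. The bandwidth figures (900 GB/s for VRAM, 25 GB/s for the bus) and the helper name `paged_tokens_per_sec` are assumptions for illustration; it does not describe how text-gen-webui actually schedules its transfers.

```python
def paged_tokens_per_sec(model_gb: float, resident_gb: float,
                         vram_gb_s: float, bus_gb_s: float) -> float:
    """Estimate tokens/s when only part of the model stays in VRAM.

    Assumes the resident part is read from VRAM and the remainder is
    copied over the bus for every token, one after the other.
    """
    paged_gb = max(model_gb - resident_gb, 0.0)
    seconds_per_token = resident_gb / vram_gb_s + paged_gb / bus_gb_s
    return 1.0 / seconds_per_token


# Assumed example: a 15 GB model, first fully resident, then with 5 GB paged.
all_in_vram = paged_tokens_per_sec(15.0, 15.0, 900.0, 25.0)
partly_paged = paged_tokens_per_sec(15.0, 10.0, 900.0, 25.0)
print(f"fits in VRAM: {all_in_vram:.1f} tokens/s")
print(f"5 GB paged:   {partly_paged:.1f} tokens/s")
```

Under these assumptions the bus transfer dominates the per-token time as soon as any meaningful fraction of the model is paged, which matches the modest speedups described above.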

shafall t1_jd2380o wrote on March 21, 2023 at 7:56 AM

To give some more specifics: on modern systems, most of the time it's not the CPU that copies the data, it's the PCI DMA chip (which may be on the same die, though). The CPU just sends address ranges to the DMA chip.