pan_berbelek t1_j1cpkth wrote

I'm trying to do basically the same thing, and yes, running BLOOM does require a lot of memory. I managed to run it on:

  • an ordinary computer with no GPU and 16GB of RAM, by loading the model in parts (divided into 73 parts), reloading them for every token (a sketch of this offloading approach follows the list). But this is painfully slow: 2-3 minutes per token produced
  • a VM in Azure with no GPU but lots of RAM (600+ GB). This can generate a single token in 2-3 seconds, still far too slow for my use case
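For reference, here is a minimal sketch of the disk-offload idea using Hugging Face transformers + accelerate, rather than my hand-rolled shard loading; the model name, memory cap, and offload folder are placeholders, not my exact setup:

```python
# Requires: pip install transformers accelerate
# Sketch of disk offload: cap CPU memory below physical RAM so accelerate
# keeps most weights on disk and pages them in per forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"  # 176B-parameter sharded checkpoint on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",               # cpu + disk placement when no GPU is found
    max_memory={"cpu": "12GiB"},     # leave headroom for activations on a 16GB box
    offload_folder="bloom_offload",  # weights are paged to/from this directory
    torch_dtype=torch.bfloat16,
)

inputs = tokenizer("The capital of France is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The constant paging of weights from disk for every single token is exactly why this mode is so slow.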

Now I'm trying to run it on an Azure VM with 8 A100 GPUs, as recommended by the BLOOM authors, but this is of course significantly more expensive: the right-sized VM costs $35 per hour. From what I've read, this setup could be capable of generating a single token in less than 1 millisecond. If that's really true, then despite the high VM cost this is actually the cheapest setup for my use case, but I first need to validate that I can really achieve this speed.
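A minimal sketch of how I'd measure per-token latency on the 8x A100 box, assuming the same transformers/accelerate stack; device_map="auto" shards the layers across all visible GPUs, and the prompt and token count are arbitrary:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",           # spread layers across the 8 GPUs
    torch_dtype=torch.bfloat16,  # fits in 8x80GB; float32 would not
)

inputs = tokenizer("Deep learning is", return_tensors="pt").to("cuda:0")

# Warm-up run so CUDA init and first-pass allocation don't skew the timing
model.generate(**inputs, max_new_tokens=5)

n_tokens = 100
torch.cuda.synchronize()
start = time.perf_counter()
gen = model.generate(**inputs, max_new_tokens=n_tokens)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Divide by the number of tokens actually produced (EOS can end it early)
n_generated = gen.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed / n_generated * 1000:.1f} ms per token")
```

If the measured number is anywhere near the figure I read, the per-token cost math works out in favor of the GPU VM.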
