
thundergolfer t1_iykjcgc wrote

How do you deploy and scale the transcription? Is it on GPU, and which model variant?

I also built a Whisper transcription app: modal.com/docs/guide/whisper-transcriber. It does serverless CPU transcription on demand. You can check it out and borrow from it if it's useful. The code is open-source.

PS: Yes, this does violate rule 5 (promote on weekends). I violated the same rule when I posted my Whisper app :)


t0mkaka OP t1_iyko74k wrote

Yes, it's on GPU. I used tiny and medium. I haven't tried large because I wanted it to run fast. I spent 3-4 days trying to parallelize, inspired by your post and by one from Assembly, who demoed a parallelized version.

But unfortunately, I was not able to parallelize. Whisper works on 30-second clips, and for each clip it passes the text of the previous 30 seconds as the prompt. Since podcasts don't break cleanly at 30-second boundaries, I needed to supply that prompt in any case, so I cannot transcribe the clips independently.
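To make the dependency concrete, here is a minimal sketch of that loop in plain Python. It is not Whisper's actual code: `decode` is a hypothetical stand-in for the model call, and the fixed 30-second stride is a simplifying assumption. The point is only that window N needs the text from window N-1 before it can start, which is what blocks naive parallelization.

```python
from typing import Callable, List, Tuple

WINDOW_SECONDS = 30  # clip length mentioned above


def split_windows(duration_s: float, window_s: int = WINDOW_SECONDS) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole recording."""
    windows = []
    start = 0.0
    while start < duration_s:
        windows.append((start, min(start + window_s, duration_s)))
        start += window_s
    return windows


def transcribe_sequentially(
    duration_s: float,
    decode: Callable[[float, float, str], str],  # hypothetical model call
) -> List[str]:
    """Decode window by window, feeding each window's text to the next as prompt."""
    prompt = ""  # the first window has no previous text
    texts = []
    for start, end in split_windows(duration_s):
        text = decode(start, end, prompt)  # cannot run before the previous window finishes
        texts.append(text)
        prompt = text  # becomes the prompt for the next window
    return texts


def fake_decode(start: float, end: float, prompt: str) -> str:
    """Stub decoder for illustration only; a real one would run the model."""
    return f"text[{start:.0f}-{end:.0f}]"
```

Calling `transcribe_sequentially(75, fake_decode)` walks three windows in order. Splitting the loop across workers would mean each worker starts without the previous window's text.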

I deploy on vast.ai for cheap GPU usage by the day and run 2 models in parallel. GPU memory usage is low, around 30%, but GPU compute utilization goes to full, and speed begins to fall beyond 2 parallel models. So I run 2 inference runs per GPU. I have used only 1 GPU so far and have not scaled it, but that should not be a tough task now.
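A rough sketch of the "2 inference runs per GPU" setup, assuming whole episodes are independent of each other even though the clips inside one episode are not. `transcribe_file` is a hypothetical function standing in for a full single-episode run, and threads are used only to keep the example short. A real deployment might use separate processes, each holding its own model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

RUNS_PER_GPU = 2  # the limit found above before speed starts to fall


def worker_count(num_gpus: int, runs_per_gpu: int = RUNS_PER_GPU) -> int:
    """Total concurrent inference runs for the given number of GPUs."""
    return num_gpus * runs_per_gpu


def transcribe_all(
    files: List[str],
    transcribe_file: Callable[[str], str],  # hypothetical: one full episode per call
    num_gpus: int = 1,
) -> List[str]:
    """Run whole-episode jobs concurrently, capped at RUNS_PER_GPU per GPU."""
    with ThreadPoolExecutor(max_workers=worker_count(num_gpus)) as pool:
        # map() returns results in the same order as the input files
        return list(pool.map(transcribe_file, files))
```

With `num_gpus=1` this keeps at most two episodes in flight. Scaling to more GPUs would also need each job pinned to a device, which this sketch leaves out.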
