
blackkettle t1_j7tyq1r wrote

This is very interesting, thanks for sharing! Do you have any more detail on RTF vs accuracy curves? Also, did you run this on any other datasets? LibriSpeech, even the “other” splits, is very clean, simple data from an acoustic and linguistic standpoint.

It would be really interesting to see how well this holds up on noisy, spontaneous speech like conversations.

4

pommedeterresautee OP t1_j7u0p8z wrote

Using CG doesn't affect the output quality.

What works with Whisper will still work with CG+Whisper.

3

blackkettle t1_j7u2kd0 wrote

My question probably wasn't well formulated. I'm not questioning whether it works; I'm just curious what the RTF vs accuracy tradeoff actually looks like in practice.

You report memory usage, beam sizes, and relative speedup, but it would also be interesting to see WER, along with the actual absolute RTFs.
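For concreteness, here's a rough sketch of the kind of measurement I mean, assuming the openai-whisper and jiwer packages; the model size, file path, and reference transcript are placeholders:

```python
# Hypothetical sketch: measure absolute RTF and WER for a single Whisper run.
import time

import jiwer
import whisper
from whisper.audio import SAMPLE_RATE

model = whisper.load_model("small")           # placeholder model size
audio = whisper.load_audio("sample.wav")      # 16 kHz mono float32 array
audio_seconds = len(audio) / SAMPLE_RATE

start = time.perf_counter()
result = model.transcribe(audio, beam_size=5, language="en")
elapsed = time.perf_counter() - start

rtf = elapsed / audio_seconds                 # real-time factor: < 1.0 means faster than real time
wer = jiwer.wer("the reference transcript", result["text"])  # placeholder reference

print(f"RTF={rtf:.3f}  WER={wer:.3%}")
```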

2

whata_wonderful_day t1_j7ubutx wrote

His point is that the output is identical. They didn't use quantization or anything else that would hurt accuracy. The Whisper paper has a lot of the details you're asking for.

2

blackkettle t1_j7ud34i wrote

Are you talking about this paper:

- https://cdn.openai.com/papers/whisper.pdf

Maybe I missed it, but I can't find anywhere in that paper where they discuss the trade-offs with respect to real-time factor and decoding strategy. RTF vs accuracy curves for CPU vs GPU in STT typically differ not in absolute accuracy but in where along the RTF curve you reach a particular accuracy. That affects what kinds of tasks you can expect to use the model for, and how you can expect to scale it to real-world applications. So far this has been the weakest point of all the Whisper-related work (you're still better off with espnet, k2, speechbrain, etc.). This information would be interesting to see if they have it.
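To make that concrete, here's a rough sketch of how such a curve could be traced by sweeping beam size on CPU vs GPU, again assuming the openai-whisper and jiwer packages; the utterance list is a placeholder standing in for a real test set:

```python
# Hypothetical sketch: trace an RTF-vs-WER curve by sweeping beam size.
import time

import jiwer
import torch
import whisper
from whisper.audio import SAMPLE_RATE

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("small", device=device)   # placeholder model size

# Placeholder (audio_path, reference_transcript) pairs standing in for a test set.
utterances = [("utt1.wav", "reference one"), ("utt2.wav", "reference two")]

for beam_size in (1, 2, 5, 10):
    total_audio, total_time, refs, hyps = 0.0, 0.0, [], []
    for path, ref in utterances:
        audio = whisper.load_audio(path)
        total_audio += len(audio) / SAMPLE_RATE
        start = time.perf_counter()
        out = model.transcribe(audio, beam_size=beam_size, language="en")
        total_time += time.perf_counter() - start
        refs.append(ref)
        hyps.append(out["text"])
    print(f"device={device} beam={beam_size} "
          f"RTF={total_time / total_audio:.3f} WER={jiwer.wer(refs, hyps):.3%}")
```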

2