blackkettle t1_j7ud34i wrote

Are you talking about this paper:

- https://cdn.openai.com/papers/whisper.pdf

Maybe I missed it, but I can't find anywhere in that paper where they discuss the trade-offs between real-time factor (RTF) and decoding strategies. RTF vs. accuracy curves for CPU vs. GPU in STT typically differ not in absolute performance but in where along the RTF curve a given accuracy is reached. That determines what kinds of tasks you can expect to use the model for, and how you can expect to scale it to real-world applications. So far this has been the weakest point of all the Whisper-related work (you're still better off with espnet, k2, speechbrain, etc.). It would be interesting to see this information if they have it.
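For what it's worth, RTF here just means decode wall-clock time divided by audio duration. A minimal sketch of how you'd trace the kind of curve I mean; the `transcribe` callable and `beam_size` parameter are placeholders, not any particular toolkit's API:

```python
import time

def real_time_factor(transcribe, audio, audio_duration_s):
    """RTF = wall-clock decoding time / audio duration.
    RTF < 1.0 means the system decodes faster than real time."""
    start = time.perf_counter()
    transcribe(audio)  # any STT decode call
    return (time.perf_counter() - start) / audio_duration_s

# Sweeping a decoding knob such as beam size traces the RTF-vs-accuracy
# curve: measure WER alongside RTF at each point.
# for beam in (1, 2, 5, 10):
#     rtf = real_time_factor(lambda a: model.decode(a, beam_size=beam),
#                            audio, audio_duration_s)
```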

2

blackkettle t1_j7u2kd0 wrote

Perhaps my question wasn't well formulated. I'm simply curious what the RTF vs. accuracy trade-off looks like: not whether it works, but what the actual performance is.

You report memory usage, beam sizes, and relative speedup, but it would be interesting to also see WER and the actual absolute RTFs.
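To be concrete about the metric: WER is just word-level edit distance normalized by the reference length. A self-contained sketch, plain dynamic programming rather than any toolkit's implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference
    words, computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ~= 0.33
```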

2

blackkettle t1_j7tyq1r wrote

This is very interesting, thanks for sharing! Do you have any more detail on RTF vs. accuracy curves? Also, did you run this on any other datasets? Librispeech, even the “other” partitions, is very clean, simple data from both an acoustic and a linguistic standpoint.

It would be really interesting to see how well this holds up on noisy, spontaneous speech like conversations.

4

blackkettle t1_j6mk578 wrote

I work in R&D in this space. There is a cost associated with training and running inference on these models, with data curation, and with funding the human resources for research; but the latter is also funded in large part by the public.

The data itself is entirely produced by the collective output of humanity. In the next 5-10 years these tools will begin to eliminate white-collar professional jobs; it will happen. And as it does, dealing with that at a societal level will become a matter of great import.

Recognizing our collective contribution and actively directing these achievements toward a better shared future, sharing the benefits, will either make or break us, IMO.

My 6-year-old son will come of age in a radically different world, and I believe that we, the creators, have a responsibility to ensure that world promotes better equity for all.

6

blackkettle t1_j6itdxo wrote

How familiar are you with the existing frameworks in this space? There's a lot of active work here; I'm curious what you are focusing on, and how it addresses the shortcomings of existing frameworks:

- https://github.com/kaldi-asr/kaldi

- https://github.com/k2-fsa

- https://github.com/espnet/espnet

- https://github.com/speechbrain/speechbrain

- https://github.com/NVIDIA/NeMo

- https://github.com/microsoft/UniSpeech

- https://github.com/topics/wav2vec2 [bajillions of similar]

- https://github.com/BUTSpeechFIT/VBx

This list is of course incomplete, but there is a _lot_ of active work in this space, and a lot of it is open source. Larger and larger public datasets have also become available recently. The SOTA is really getting close to commoditization as well.

What sort of OSS intersection or area are you focusing on, and why?

19

blackkettle t1_j4p8nqv wrote

It doesn't seem to discuss the computational advantages in any detail. How interesting is this whole forward-forward (FF) idea at this point? I'd love to see a more detailed analysis.

So far it seems like an interesting alternative, but the "brain-inspired" angle is pushed in every article, and in terms of accuracy it always seems to land slightly below traditional backprop. A big computational improvement would seriously recommend it, I guess, but is there one? Or is it just too early to tell?
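For anyone who hasn't read the paper: the core of FF, as I understand it, is a purely local per-layer objective. Each layer's "goodness" (the sum of its squared activations) is pushed above a threshold on positive data and below it on negative data, with no gradient flowing between layers. A rough PyTorch sketch; the layer shape and threshold are illustrative, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def ff_layer_loss(layer, x_pos, x_neg, theta=2.0):
    """Local Forward-Forward objective for one layer: goodness (sum of
    squared activations) should exceed theta on positive samples and
    fall below theta on negative (corrupted) samples."""
    g_pos = layer(x_pos).pow(2).sum(dim=1)  # goodness on positive data
    g_neg = layer(x_neg).pow(2).sum(dim=1)  # goodness on negative data
    # softplus(x) = -log(sigmoid(-x)), so this maximizes
    # log sigmoid(g_pos - theta) + log sigmoid(theta - g_neg)
    return (F.softplus(theta - g_pos) + F.softplus(g_neg - theta)).mean()

# Each layer gets its own optimizer; there is no end-to-end backprop:
# layer = torch.nn.Sequential(torch.nn.Linear(784, 512), torch.nn.ReLU())
# loss = ff_layer_loss(layer, x_pos, x_neg)
# loss.backward()  # gradients stay local to this layer
```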

9