Submitted by jiamengial t3_10p66zc in MachineLearning

I've been working in the speech and voice space for a while and am now building out some tooling to make it easier for researchers/engineers/developers to build speech processing systems and features. I'd love to hear what people in ML struggle with when they're trying to build or work with speech processing for their projects/products (beyond speech-to-text APIs).

34

Comments


the_Wallie t1_j6igyba wrote

Two things:

  1. There is still a ton of room for valuable innovation with structured data.
  2. The cost of processing is typically astronomical, and the return is hard to quantify.

In short I see this tech as a very specific solution to a very specific set of problems.

16

tripple13 t1_j6ihcum wrote

GPUs my friend. GPUs. I pray everyday, one day, an H100 may come my way. And yet, everyday, I pray, no H100 is yet here to stay.

62

gunshoes t1_j6im7go wrote

Atm, my hard drive failed and the SSD doesn't come until Tuesday.

In actuality, I work in the space and the main limitation is hardware. Most small problems still require a ton of storage space, and Google Colab ain't giving me a terabyte for audio until I start paying tiers.

2

wintermute93 t1_j6ipwqu wrote

It's a lot harder to find/gather/create/curate a large high quality dataset of audio recordings relevant to a given task than it is for image or tabular data.

8

blackkettle t1_j6itdxo wrote

How familiar are you with the existing frameworks out there for this space? There's a lot of active work here; I'm curious what you are focusing on, and how that compares against the shortcomings of the existing frameworks:

- https://github.com/kaldi-asr/kaldi

- https://github.com/k2-fsa

- https://github.com/espnet/espnet

- https://github.com/speechbrain/speechbrain

- https://github.com/NVIDIA/NeMo

- https://github.com/microsoft/UniSpeech

- https://github.com/topics/wav2vec2 [bajillions of similar]

- https://github.com/BUTSpeechFIT/VBx

This list is of course incomplete, but there is a _lot_ of active work in this space and a lot of open source. Recently you've also got larger and larger public datasets becoming available. The SOTA is really getting close to commoditization as well.

What sort of OSS intersection or area are you focusing on, and why?

19

TikkunCreation t1_j6iwke2 wrote

The perception that it’ll be high latency and therefore annoying for users

5

psma t1_j6j51oq wrote

Streaming inference support. Deploying ML models to work in real-time with little latency is a pain.

6

jiamengial OP t1_j6j6ruc wrote

If anything this is what's motivating me; getting Kaldi (or any of these other repos) to compile and run on your own data is usually painful enough that it puts off anyone who isn't already knowledgeable in the area, and wrappers such as pykaldi and Montreal Forced Aligner try to resolve a lot of those problems but often only add to them.

I've personally had great experiences with repos like NeMo, though that was mainly through nailing myself to a specific commit on the main branch and heavily wrapping the various classes I needed to use (I still have no idea what a manifest file format should look like).
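
(For reference, the manifests NeMo and several other toolkits expect are essentially JSON Lines, one utterance per line; a rough, hypothetical sketch with made-up file paths, assuming the usual audio_filepath/duration/text fields:)

```python
import json

# Hypothetical utterances; each line of the manifest is one JSON object.
utterances = [
    {"audio_filepath": "clips/utt_0001.wav", "duration": 3.2, "text": "hello world"},
    {"audio_filepath": "clips/utt_0002.wav", "duration": 1.7, "text": "goodbye"},
]
with open("train_manifest.json", "w") as f:
    for utt in utterances:
        f.write(json.dumps(utt) + "\n")
```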

The field is still incredibly recipe-heavy in terms of setting up systems and running them; if you're someone testing the waters with speech processing (especially if you want to go beyond STT or vanilla TTS), there's little to nothing that compares to the likes of HuggingFace on the text side.

3

jiamengial OP t1_j6j8c8c wrote

To go into your question further, one area that might be really interesting is open standards or formats for speech data; like the MLF formats in HTK and Kaldi but, like, modern, so that (to the point of some others here w.r.t. data storage costs) datasets can be hosted more centrally and people don't have to reformat them to their own data storage structures (which, let's face it, is basically someone's folder structure)

1

psma t1_j6jhwdl wrote

Not sure how. If I have, e.g., a PyTorch model, how do I deploy it for streaming data without having to rewrite it in another framework (e.g. stateful convolutions, the ability to receive an arbitrary number of samples as input, etc.)? It's doable, but it mostly amounts to rewriting your model. This should be automated.
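
To make the pain concrete, here is a minimal sketch (not anyone's production code) of the kind of rewrite this implies: a causal Conv1d that caches its left context between calls, so it can consume arbitrary-length chunks and still match full-sequence inference.

```python
import torch
import torch.nn as nn

class StreamingConv1d(nn.Module):
    """Causal 1-D convolution that carries left context across chunks."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.context = (kernel_size - 1) * dilation   # samples to remember
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.register_buffer("cache", torch.zeros(1, in_ch, self.context))

    def forward(self, chunk):                         # chunk: (1, in_ch, T), any T
        x = torch.cat([self.cache, chunk], dim=-1)    # prepend remembered samples
        self.cache = x[..., -self.context:].detach()  # state for the next call
        return self.conv(x)                           # output length == T

# Feeding arbitrary-length chunks now matches one long pass over the signal.
layer = StreamingConv1d(1, 8, kernel_size=5)
outputs = [layer(chunk) for chunk in torch.randn(1, 1, 1600).split(400, dim=-1)]
```

Multiply that boilerplate by every layer in a real model and it's clear why people want this automated.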

3

Brudaks t1_j6jqizr wrote

Availability of corpora for other languages.

If you care about languages that are much less resourced than English or the big ones, then you can generally get enough text to do interesting stuff, but working with speech becomes much more difficult due to the very limited quantity of decent-quality data.

2

RedditIsDoomed-22 t1_j6ju2p3 wrote

Computing cost: storing and processing speech data is so expensive.

1

nielsrolf t1_j6jvtq0 wrote

Inference time is an issue for me at the moment. I tried OpenAI Whisper on Replicate and hosted it on banana.dev, but both take too long. I'd like to use it for a conversational bot, and 50s to transcribe 7s of audio is too long, but that's the best I've got so far.
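
If self-hosting is an option, a smaller checkpoint plus fp16 on a GPU typically gets short clips well under real time. A minimal sketch with the openai-whisper package (the audio path is a placeholder):

```python
import whisper

# "tiny"/"base"/"small" trade accuracy for latency; fp16 helps further on GPU.
model = whisper.load_model("base", device="cuda")
result = model.transcribe("utterance.wav", fp16=True, language="en")
print(result["text"])
```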

2

like_a_tensor t1_j6k1nde wrote

I feel like there's a lot of signal processing math in speech and voice that I have zero background in. Even though everything is deep learning now, speech and voice architectures seem more complex than in other fields.

2

babua t1_j6khgfr wrote

I don't think it stops there either, streaming architecture probably breaks core assumptions of some speech models. e.g. for STT, when do you "try" to infer the word? for TTS, how do you intonate the sentence correctly if you don't know the second half? You'd have to re-train your entire model for the streaming case and create new data augmentations -- plus you'll probably sacrifice some performance even in the best case because your model simply has to deal with more uncertainty.

3

daidoji70 t1_j6kz2kq wrote

It sounds really boring. (edit: I really have never had a need for speech or voice so far in my career haha. Good luck on making tooling).

1

Vegetable-Skill-9700 t1_j6l1k7r wrote

Personally I find collecting and understanding data to be really hard when it comes to speech. With images I can visualise a lot of them at once; with speech I have to listen to them one by one.

2

pronunciaai t1_j6l49ij wrote

Yeah I work in the space (mispronunciation detection) and there is no lack of frameworks (speechbrain, NeMo, and thunder-speech being the more useful ones for custom stuff imo). The barrier to entry is all the stuff you have to learn to do audio ML, and all the pain points around things like CTC. Tutorials are more needed than frameworks to get more people actively working on speech and voice, in my opinion.
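
CTC is a good example: the loss itself is one call in PyTorch, but the (time, batch, classes) layout, explicit per-example lengths, and reserved blank index trip people up. A minimal sketch with dummy tensors:

```python
import torch
import torch.nn as nn

T, N, C, S = 100, 4, 32, 20          # frames, batch, vocab (incl. blank), max target len
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)   # must be log-probabilities
targets = torch.randint(1, C, (N, S))                  # index 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, S + 1, (N,), dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```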

6

prototypist t1_j6ljszc wrote

I just barely got into text NLP when I could run notebooks with a single GPU / Colab and get interesting outputs. I've seen some great community models (such as for the Dhivehi language) made with Mozilla Common Voice data. But if I were going to collect a chunk of isiXhosa transcription data and try to run it on a single GPU, that's hours of training to an initial checkpoint which just makes some muffled noises. At the end of 2022 it became possible to fine-tune OpenAI Whisper, so if I tried again, I might start there: https://huggingface.co/blog/fine-tune-whisper
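
A rough sketch of what that fine-tuning boils down to with Hugging Face transformers/datasets (model size, language code, and split are illustrative; the linked blog post covers the full Seq2SeqTrainer recipe):

```python
import torch
from datasets import load_dataset, Audio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small",
                                             language="Hindi", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

data = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train[:100]")
data = data.cast_column("audio", Audio(sampling_rate=16000))  # Whisper expects 16 kHz

model.train()
for sample in data:
    audio = sample["audio"]
    inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"],
                       return_tensors="pt")
    labels = processor.tokenizer(sample["sentence"], return_tensors="pt").input_ids
    loss = model(input_features=inputs.input_features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```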

Also I never use Siri / OK Google / Alexa. I know it's a real industry but I never think of use cases for it.

2

GFrings t1_j6lr1tn wrote

Mel spectrograms so scary
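
(They're less scary than they look: a mel spectrogram is just an STFT with the frequency axis pooled into perceptually spaced bands. A few lines of torchaudio, with a placeholder file name:)

```python
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("clip.wav")   # placeholder file
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80)(waveform)
log_mel = torch.log(mel + 1e-6)   # log-compress; the usual model input
print(log_mel.shape)              # (channels, n_mels, frames)
```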

2

the_Wallie t1_j6n236n wrote

I still do, but the points about complexity and ROI remain the same. I get that you like this form of data and that's okay (actually, that's great!), but not everybody has to adopt it because you find it exciting.

1