Submitted by jiamengial t3_10p66zc in MachineLearning

I've been working in the speech and voice space for a while and am now building out some tooling to make it easier for researchers/engineers/developers to build speech processing systems and features. I'd love to hear what people in ML struggle with when trying to build or work with speech processing for their projects/products (beyond speech-to-text APIs).

34

Comments


tripple13 t1_j6ihcum wrote

GPUs my friend. GPUs. I pray every day, one day, an H100 may come my way. And yet, every day, I pray, no H100 is yet here to stay.

62

currentscurrents t1_j6m2ljf wrote

Is the H100 even out yet?

High hopes that it pushes down the cost of older chips like the A100.

3

5death2moderation t1_j6my8yj wrote

It is out and it's 3x more expensive than its A100 equivalent was 2 years ago. The prices are not going down for a very long time, probably not until the next generation is out.

1

currentscurrents t1_j6o8ziy wrote

Oof. Nvidia has a stranglehold on the market and they know it. I hope AMD steps up its game.

0

blackkettle t1_j6itdxo wrote

How familiar are you with the existing frameworks out there for this topic space? There's a lot of active work here; I'm curious about what you are focusing on, and how that reflects against the shortcomings of existing frameworks:

- https://github.com/kaldi-asr/kaldi

- https://github.com/k2-fsa

- https://github.com/espnet/espnet

- https://github.com/speechbrain/speechbrain

- https://github.com/NVIDIA/NeMo

- https://github.com/microsoft/UniSpeech

- https://github.com/topics/wav2vec2 [bajillions of similar]

- https://github.com/BUTSpeechFIT/VBx

This list is of course incomplete, but there is a _lot_ of active work in this space and a lot of open source. Recently you've also got larger and larger public datasets becoming available. The SOTA is really getting close to commoditization as well.

What sort of OSS intersection or area are you focusing on, and why?

19

pronunciaai t1_j6l49ij wrote

Yeah, I work in the space (mispronunciation detection) and there is not a lack of frameworks (SpeechBrain, NeMo, and thunder-speech being the more useful ones for custom stuff imo). The barrier to entry is all the stuff you have to learn to do audio ML, and all the pain points around things like CTC. Tutorials are more needed than frameworks to get more people actively working on speech and voice, in my opinion.
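On the CTC point, even just getting the loss inputs right takes some digging; a minimal PyTorch sketch of the shape and length bookkeeping it expects (toy values throughout):

```python
import torch
import torch.nn as nn

# Toy dimensions: T encoder frames, N utterances, C output symbols (including the blank)
T, N, C = 50, 4, 32
blank_id = 0

ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)

# CTCLoss wants log-probabilities shaped (T, N, C) -- time-first, not batch-first
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)

# Targets are label indices with no blanks, padded to the longest utterance
targets = torch.randint(1, C, (N, 12))
input_lengths = torch.full((N,), T, dtype=torch.long)  # valid encoder frames per utterance
target_lengths = torch.tensor([12, 9, 7, 12])          # true label lengths, ignoring padding

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```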

6

jiamengial OP t1_j6j6ruc wrote

If anything this is what's motivating me; getting Kaldi (or any of these other repos) to compile and run on your own data is usually painful enough that it puts off anyone who isn't already knowledgeable in the area, and wrappers such as PyKaldi and Montreal Forced Aligner try to resolve a lot of those problems but often only add to them.

I've personally had great experiences with repos like NeMo, though that was mainly through pinning myself to a specific commit on the main branch and heavily wrapping the various classes I needed to use (I still have no idea what a manifest file format should look like).
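For what it's worth, as far as I can tell a NeMo ASR manifest is just newline-delimited JSON with a path, duration, and transcript per utterance; a minimal sketch of writing one (file paths here are made up):

```python
import json

import soundfile as sf

utterances = [
    ("data/utt_0001.wav", "hello world"),
    ("data/utt_0002.wav", "speech processing is fun"),
]

with open("train_manifest.json", "w") as f:
    for path, text in utterances:
        duration = sf.info(path).duration  # read the duration without loading the audio
        f.write(json.dumps({
            "audio_filepath": path,
            "duration": duration,
            "text": text,
        }) + "\n")
```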

The field is still incredibly recipe-heavy in terms of setting up systems and running them; if you're someone testing the waters with speech processing (especially if you want to go beyond STT or vanilla TTS), there's little to nothing that compares to the likes of Hugging Face on the text side.

3

fasttosmile t1_j6jk30j wrote

Everyone has been moving on from kaldi so it's a little weird to bring that up now.

If you're interested in modern formats for speech data, look into Lhotse.
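Roughly, Lhotse describes a dataset as recordings plus supervision segments, which you combine into cuts; a sketch of the idea from memory (check the docs for exact signatures; the paths and text here are made up):

```python
from lhotse import CutSet, Recording, RecordingSet, SupervisionSegment, SupervisionSet

# One recording plus one transcript segment covering all of it
rec = Recording.from_file("data/utt_0001.wav")
sup = SupervisionSegment(
    id="utt_0001", recording_id=rec.id, start=0.0, duration=rec.duration, text="hello world"
)

cuts = CutSet.from_manifests(
    recordings=RecordingSet.from_recordings([rec]),
    supervisions=SupervisionSet.from_segments([sup]),
)
for cut in cuts:
    print(cut.id, cut.duration, cut.supervisions[0].text)
```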

2

uhules t1_j6juq7x wrote

Lhotse is basically part of the "Kaldi 2.0 ecosystem" (K2/Lhotse/Icefall/Sherpa), so you'll probably see people referring to the whole lot as Kaldi as well.

2

fasttosmile t1_j6jzvyw wrote

That does not make sense. You don't need kaldi to use the new libraries. And lhotse can be used totally independently of k2 or icefall.

−1

Maleficent_Cod_1055 t1_j6jkz4b wrote

Tbh if you're still doing anything like word alignment or phone alignment the first thing people bring up is still Kaldi. Will check out Lhotse!

1

jiamengial OP t1_j6j8c8c wrote

To go into your question further, one area that might be really interesting is open standards or formats for speech data; like the MLF formats in HTK and Kaldi but, like, modern, so that (to the point some others here made about data storage costs) datasets can be hosted more centrally and people don't have to reformat them into their own data storage structures (which, let's face it, is basically just someone's folder structure).

1

the_Wallie t1_j6igyba wrote

2 things.

  1. there is still a ton of room for valuable innovation with structured data
  2. the cost of processing is typically astronomical, and the return is hard to quantify.

In short I see this tech as a very specific solution to a very specific set of problems.

16

jiamengial OP t1_j6mc7vs wrote

To challenge this a little though: surely at some point people thought free-form text was unstructured data?

1

the_Wallie t1_j6n236n wrote

I still do, but the points about complexity and ROI remain the same. I get that you like this form of data and that's okay (actually, that's great!), but not everybody has to adopt it because you find it exciting.

1

wintermute93 t1_j6ipwqu wrote

It's a lot harder to find/gather/create/curate a large high quality dataset of audio recordings relevant to a given task than it is for image or tabular data.

8

psma t1_j6j51oq wrote

Streaming inference support. Deploying ML models to work in real time with low latency is a pain.

6

jiamengial OP t1_j6j95fq wrote

Presumably this would be through certain protocols like WebSockets and WebRTC? Or more like direct integration with Zoom?

1

psma t1_j6jhwdl wrote

Not sure how. If I have, e.g., a PyTorch model, how do I deploy it for streaming data without having to rewrite it in another framework (e.g. stateful convolutions, the ability to receive an arbitrary number of samples as input, etc.)? It's doable, but it mostly amounts to rewriting your model. This should be automated.
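For the stateful-convolution part specifically, the rewrite usually ends up looking something like this: a causal Conv1d that caches its left context between calls so chunk-by-chunk output matches the offline output (a minimal sketch, not production code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamingConv1d(nn.Module):
    """Causal 1-D conv that carries left context across chunks."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.context = kernel_size - 1
        self.state = None  # cached tail of the previous chunk

    def forward(self, chunk: torch.Tensor) -> torch.Tensor:
        # chunk: (batch, channels, time)
        if self.state is None:
            self.state = torch.zeros(
                chunk.shape[0], chunk.shape[1], self.context,
                dtype=chunk.dtype, device=chunk.device,
            )
        x = torch.cat([self.state, chunk], dim=-1)
        self.state = x[..., -self.context:].detach()  # keep the tail for the next call
        return self.conv(x)

# Sanity check: streaming in 40-sample chunks matches the offline (left-padded) result
torch.manual_seed(0)
layer = StreamingConv1d(channels=8, kernel_size=5)
signal = torch.randn(1, 8, 160)

offline = layer.conv(F.pad(signal, (layer.context, 0)))
streamed = torch.cat([layer(c) for c in signal.split(40, dim=-1)], dim=-1)
print(torch.allclose(offline, streamed, atol=1e-6))  # True
```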

3

babua t1_j6khgfr wrote

I don't think it stops there either; a streaming architecture probably breaks core assumptions of some speech models. E.g. for STT, when do you "try" to infer the word? For TTS, how do you intonate the sentence correctly if you don't know the second half? You'd have to re-train your entire model for the streaming case and create new data augmentations, plus you'll probably sacrifice some performance even in the best case, because your model simply has to deal with more uncertainty.

3

jiamengial OP t1_j6mbv97 wrote

That's a good point - CTC and attention mechanisms work on the basis that you've got the whole segment of audio

2

TikkunCreation t1_j6iwke2 wrote

The perception that it’ll be high latency and therefore annoying for users

5

gunshoes t1_j6im7go wrote

Atm, my hard drive failed and the new SSD doesn't come until Tuesday.

In actuality, I work in the space and the main limitation is hardware. Most small problems still require a ton of storage space, and Google Colab ain't giving me a terabyte for audio until I start paying for higher tiers.

2

MrAcurite t1_j6jlqmi wrote

That I don't want to.

2

Brudaks t1_j6jqizr wrote

Availability of corpora for other languages.

If you care about much less-resourced languages than English or the big ones, then you can generally get sufficient text to do interesting stuff, but working with speech becomes much more difficult due to the very limited quantity of decent-quality data.

2

nielsrolf t1_j6jvtq0 wrote

Inference time is an issue for me at the moment. I tried OpenAI Whisper on Replicate and hosted it on banana.dev, but both take too long. I'd like to use it for a conversational bot, so 50s to transcribe 7s of audio is too long, but that's what I've got so far.
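If you can run it yourself on a GPU, a smaller checkpoint with fp16 should transcribe a few seconds of audio in well under a second; a minimal sketch with the openai-whisper package (the file name is a placeholder, and this assumes a CUDA device is available):

```python
import whisper

# "base" or "small" trade some accuracy for a lot of latency headroom vs "large"
model = whisper.load_model("base", device="cuda")
result = model.transcribe("utterance.wav", fp16=True, language="en")
print(result["text"])
```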

2

like_a_tensor t1_j6k1nde wrote

I feel like there's a lot of signal processing math in speech and voice that I have zero background in. Even though everything is deep learning now, speech and voice architectures seem more complex than in other fields.

2

Vegetable-Skill-9700 t1_j6l1k7r wrote

Personally I find collecting and understanding data to be really hard when it comes to speech. With images I can visualise a lot of them at once; with speech I have to listen to them one by one.
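One partial workaround is to eyeball log-mel spectrograms instead of listening; a minimal sketch with torchaudio (the file name and parameters are placeholders):

```python
import torchaudio
import matplotlib.pyplot as plt

waveform, sr = torchaudio.load("utterance.wav")

# 80-bin log-mel spectrogram: a compact "image" of the utterance
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80
)(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)

plt.imshow(log_mel[0].numpy(), origin="lower", aspect="auto")
plt.xlabel("frames")
plt.ylabel("mel bins")
plt.show()
```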

2

prototypist t1_j6ljszc wrote

I just barely got into text NLP at a point when I could run notebooks with a single GPU / Colab and get interesting outputs. I've seen some great community models (such as for the Dhivehi language) made with Mozilla Common Voice data. But if I were going to collect a chunk of isiXhosa transcription data and try to run it on a single GPU, that's hours of training to get to an initial checkpoint that just makes some muffled noises. At the end of 2022 it became possible to fine-tune OpenAI Whisper, so if I tried again, I might start there: https://huggingface.co/blog/fine-tune-whisper
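A sketch of the starting point from that post, just loading the "small" checkpoint with Hugging Face transformers (the hyperparameters below are placeholders; the post itself covers the Common Voice data prep, the collator, and the Seq2SeqTrainer loop):

```python
from transformers import (
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-lowres",  # hypothetical output directory
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=4000,
    fp16=True,
)
```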

Also I never use Siri / OK Google / Alexa. I know it's a real industry but I never think of use cases for it.

2

GFrings t1_j6lr1tn wrote

Mel spectrograms so scary

2

RedditIsDoomed-22 t1_j6ju2p3 wrote

Computing cost: storing and processing speech data is so expensive.

1

daidoji70 t1_j6kz2kq wrote

It sounds really boring. (edit: I really have never had a need for speech or voice so far in my career haha. Good luck on making tooling).

1