
jiamengial OP t1_j6j6ruc wrote

If anything, this is what's motivating me: getting Kaldi (or any of these other repos) to compile and run on your own data is usually painful enough that it puts off anyone who isn't already knowledgeable in the area, while wrappers such as pykaldi and Montreal Forced Aligner try to resolve a lot of problems but only really add to them.

I've personally had great experiences with repos like NeMo, though that was mainly through nailing myself to a specific commit on the main branch and heavily wrapping the various classes I needed to use (I still have no idea what a manifest file format should look like).

The field is still incredibly recipe-heavy in terms of setting up systems and running them; if you're someone testing the waters with speech processing (especially if you want to go beyond STT or vanilla TTS), there's little to nothing that compares to the likes of HuggingFace on the text side.

3

fasttosmile t1_j6jk30j wrote

Everyone has been moving on from kaldi so it's a little weird to bring that up now.

If you're interested in a modern format for speech data, look into lhotse.

2

uhules t1_j6juq7x wrote

Lhotse is basically part of the "Kaldi 2.0 ecosystem" (K2/Lhotse/Icefall/Sherpa), so you'll probably see people referring to the whole lot as Kaldi as well.

2

fasttosmile t1_j6jzvyw wrote

That does not make sense. You don't need kaldi to use the new libraries. And lhotse can be used totally independently of k2 or icefall.

−1

Maleficent_Cod_1055 t1_j6jkz4b wrote

Tbh if you're still doing anything like word alignment or phone alignment the first thing people bring up is still Kaldi. Will check out Lhotse!

1

jiamengial OP t1_j6j8c8c wrote

To go into your question further, one area that might be really interesting is open standards or formats for speech data: something like the MLF formats in HTK and Kaldi but, like, modern. That way (to the point some others here made about data storage costs) datasets could be hosted more centrally, and people wouldn't have to reformat them into their own data storage structures (which, let's face it, is basically someone's folder structure).
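
To make the idea a bit more concrete, here is a rough sketch of what a shared, toolkit-neutral description of a dataset could look like: one JSON object per utterance, one utterance per line. The `Utterance` class, its field names, and the two helper functions are all made up for illustration; this is not an existing standard or any particular toolkit's format.

```python
import json
from dataclasses import dataclass, asdict
from typing import Iterable, List


@dataclass
class Utterance:
    # All field names are hypothetical, chosen for illustration only.
    audio_path: str      # path relative to the dataset root
    duration_s: float    # length of the clip in seconds
    text: str            # transcript
    speaker: str = ""    # optional speaker label


def dump_manifest(utterances: Iterable[Utterance]) -> str:
    """Serialise utterances as JSON lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(u), ensure_ascii=False) for u in utterances)


def load_manifest(blob: str) -> List[Utterance]:
    """Parse a JSON-lines manifest back into Utterance records, skipping blank lines."""
    return [Utterance(**json.loads(line)) for line in blob.splitlines() if line.strip()]


example = [
    Utterance("wav/utt0001.wav", 2.5, "hello world", "spk1"),
    Utterance("wav/utt0002.wav", 1.25, "good morning"),
]
blob = dump_manifest(example)

# Round trip: reading the manifest back gives the same records.
assert load_manifest(blob) == example
```

The point of a flat, line-oriented layout like this is that the manifest is independent of whatever folder structure the audio happens to live in, so a dataset could be hosted once and read by anyone without reshuffling files.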

1