nmfisher t1_j67czbb wrote

Those models do exist (just search for "ASR seq2seq"); it's just that CTC has always been a faster, more stable, and more effective method for training, since it avoids the need to learn explicit alignments between input features and output "units" (phonemes/subwords/letters/whatever).
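
For the curious, here's a rough PyTorch sketch of what I mean (toy shapes and random tensors are just my placeholders, not from any real model): CTC marginalises over all monotonic alignments between encoder frames and the label sequence, so you never supervise a frame-to-token alignment.

```python
import torch
import torch.nn as nn

# Toy dimensions (my own placeholders): frames, batch, vocab (blank = 0)
T, N, C = 100, 4, 30
S = 12  # target label length

# Stand-in for per-frame encoder log-probabilities
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # label ids, no blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC sums over every valid alignment internally; no alignment labels needed
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow without any alignment supervision
```

A seq2seq decoder, by contrast, has to learn where to attend at every output step, which is part of why it historically needed more data to train well.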

The view was that encoder/decoder models needed considerably more training data and longer training times, and usually underperformed. However, I just came across https://arxiv.org/pdf/2205.01086.pdf, which found a method for fine-tuning a pre-trained seq2seq encoder that actually outperformed CTC on small datasets, so that may no longer be the case.
