Submitted by pvp239 t3_1035jt4 in MachineLearning

Speechbox is built on the premise that Whisper is good enough to transcribe pretty much any English speech. Furthermore, Whisper was trained to predict punctuated, orthographic text.


Speechbox leverages Whisper's quality to "unnormalize" audio transcriptions (see the example below), making them more useful for downstream applications while guaranteeing that the exact same words are used.

"we are going to the san francisco beach" can have multiple meanings:

=>

  1. We are going to the San Francisco beach!
  2. We are going to the San Francisco beach?
  3. We are going to the San Francisco beach.


Speechbox will pick the correct one for you 😉
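
A rough usage sketch (this assumes the `PunctuationRestorer` class and argument names from the repo's README; check the GitHub link below for the exact API):

```python
from datasets import load_dataset
from speechbox import PunctuationRestorer

# Grab one audio sample with a normalized (unpunctuated, uncased) transcript.
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
sample = next(iter(dataset))
print(sample["text"])  # normalized transcript, no casing or punctuation

# Load a Whisper checkpoint wrapped for punctuation and casing restoration.
restorer = PunctuationRestorer.from_pretrained("openai/whisper-tiny.en")

# Restore casing and punctuation without changing any of the words.
restored_text, log_probs = restorer(
    sample["audio"]["array"],
    sample["text"],
    sampling_rate=sample["audio"]["sampling_rate"],
    num_beams=1,
)
print(restored_text)
```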


👉 GitHub: https://github.com/huggingface/speechbox

🤗 Demo: https://huggingface.co/spaces/speechbox/whisper-restore-punctuation

39

Comments


sloganking t1_j2xnk3k wrote

Have Whisper's hallucinations been improved yet? I know that before, it could sometimes derail and repeat itself nonsensically.

Its highs seem the highest, but its lows are, well... nonsensical.

7

pvp239 OP t1_j2xoukt wrote

The way it's implemented, Whisper cannot hallucinate: at each step it can only predict characters from the original normalized transcript or punctuation marks. The algorithm in speechbox therefore guarantees that the exact same words are kept (you can think of it as a very restricted beam search).
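
A greedy toy sketch of that idea (not the actual speechbox code; `score(prefix, continuation)` stands in for the Whisper decoder's log-probability of a continuation given the prefix):

```python
PUNCTUATION = ["", ".", ",", "?", "!"]  # "" means no punctuation inserted

def restore(normalized: str, score) -> str:
    restored = ""
    for ch in normalized:
        # Only the original character is allowed, optionally upper-cased
        # and/or preceded by a punctuation mark, so the words never change.
        candidates = [p + c for p in PUNCTUATION for c in (ch, ch.upper())]
        restored += max(candidates, key=lambda c: score(restored, c))
    # Finish with whichever sentence-final punctuation scores highest.
    return restored + max(".?!", key=lambda p: score(restored, p))
```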

1

Franck_Dernoncourt t1_j2y328o wrote

Thanks! How does Speechbox's punctuation restoration compare to other existing models/codebases for punctuation restoration?

2