Submitted by pvp239 t3_1035jt4 in MachineLearning

Speechbox is built on the premise that Whisper is good enough to transcribe pretty much any English speech. Furthermore, Whisper was trained to predict punctuated, orthographic text.


Speechbox leverages Whisper's quality to "unnormalize" audio transcriptions (see the example below), making them more useful for downstream applications while guaranteeing that the exact same words are used.

"we are going to the san francisco beach" can have multiple meanings:

=>

  1. We are going to the San Francisco beach!
  2. We are going to the San Francisco beach?
  3. We are going to the San Francisco beach.


Speechbox will pick the correct one for you 😉


👉 GitHub: https://github.com/huggingface/speechbox

🤗 Demo: https://huggingface.co/spaces/speechbox/whisper-restore-punctuation
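
For context, here's a minimal usage sketch assuming the `PunctuationRestorer` API shown in the repo README (the class name, checkpoint, and call signature are taken on that assumption and may differ from the released version, so check the GitHub link above):

```python
# Sketch only: the call signature follows the README at the time of posting
# and is an assumption, not a verified API reference.
from datasets import load_dataset
from speechbox import PunctuationRestorer

# Any dataset pairing raw audio with a normalized (unpunctuated) transcript
# works; LibriSpeech is used here purely as an example.
ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
sample = next(iter(ds))

restorer = PunctuationRestorer.from_pretrained("openai/whisper-tiny.en")

restored_text, log_prob = restorer(
    sample["audio"]["array"],          # raw waveform
    sample["text"],                    # normalized transcript (same words kept)
    sampling_rate=sample["audio"]["sampling_rate"],
    num_beams=1,
)
print(restored_text)
```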

Comments

sloganking t1_j2xnk3k wrote

Have Whisper's hallucinations been improved yet? I know that before, it could sometimes derail and repeat itself nonsensically.

Its highs seem the highest, but its lows are, well... nonsensical.


pvp239 OP t1_j2xoukt wrote

The way it's implemented, Whisper cannot hallucinate: at each step it can only predict letters of the original normalized transcript or punctuation, so the algorithm in speechbox guarantees that the exact same words are kept (you can think of it as a very restricted beam search).
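
Roughly, that restriction looks like the following toy sketch (greedy variant): at every step the decoder may only emit the next character of the normalized transcript, optionally upper-cased, or a punctuation mark. Here `score_fn` is a hypothetical stand-in for Whisper's decoder log-probabilities and is not part of the speechbox API:

```python
# Toy illustration of the "very restricted" search described above.
# `score_fn(prefix, continuation)` is a hypothetical stand-in for Whisper's
# decoder scores and is NOT part of the speechbox API.
from typing import Callable, List

PUNCTUATION = [".", ",", "?", "!", ":", ";"]

def restore_greedy(normalized: str, score_fn: Callable[[str, str], float]) -> str:
    """Rebuild cased/punctuated text without ever changing the words."""
    out = ""
    for ch in normalized:
        # Allowed continuations: the character as-is, or upper-cased if a letter ...
        base: List[str] = [ch, ch.upper()] if ch.isalpha() else [ch]
        # ... optionally preceded by a punctuation mark (never a new word).
        candidates = base + [p + c for p in PUNCTUATION for c in base]
        out += max(candidates, key=lambda c: score_fn(out, c))
    # The final sentence may also end with punctuation such as "." or "?".
    return out + max([""] + PUNCTUATION, key=lambda p: score_fn(out, p))
```

Because every candidate contains the original character, the output can never drop, add, or change a word; only casing and punctuation differ.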


WhoaEpic t1_j33383l wrote

If I had three MP3 files (two meetings and a phone call), would this software be able to transcribe what is said?


Finslayer t1_j2y43fj wrote

Hi,

How accurate are those corrections? Do you have any benchmarks? How fast is it?
When we were finetuning wav2vec2 models, we hit this exact same problem and finetuned a T5 model for the task: https://huggingface.co/Finnish-NLP/t5-small-nl24-casing-punctuation-correction
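
(For anyone who wants to try that model quickly, the plain `transformers` text2text pipeline should work as a rough sketch; the exact input format the checkpoint expects, e.g. a task prefix, may differ, so check the model card first.)

```python
# Hedged sketch: the input format expected by this checkpoint is an assumption.
from transformers import pipeline

corrector = pipeline(
    "text2text-generation",
    model="Finnish-NLP/t5-small-nl24-casing-punctuation-correction",
)

out = corrector("we are going to the san francisco beach", max_new_tokens=64)
print(out[0]["generated_text"])
```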
