Submitted by pvp239 t3_1035jt4 in MachineLearning

Speechbox is built on the premise that Whisper is good enough to transcribe pretty much any English speech. Furthermore, Whisper was trained to predict punctuated, orthographic text.


Speechbox leverages Whisper's quality to "unnormalize" audio transcriptions (see the example below), making them more useful for downstream applications while guaranteeing that the exact same words are used.

"we are going to the san francisco beach" can have multiple meanings:

=>

  1. We are going to the San Francisco beach!
  2. We are going to the San Francisco beach?
  3. We are going to the San Francisco beach.


Speechbox will pick the correct one for you 😉


👉 GitHub: https://github.com/huggingface/speechbox

🤗 Demo: https://huggingface.co/spaces/speechbox/whisper-restore-punctuation
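
For context, here's a minimal usage sketch assuming the `PunctuationRestorer` API shown in the repo README (the class name, checkpoint, and call signature are taken on that assumption and may differ from the released version, so check the GitHub link above):

```python
# Sketch only: the call signature follows the README at the time of posting
# and is an assumption, not a verified API reference.
from datasets import load_dataset
from speechbox import PunctuationRestorer

# Any dataset pairing raw audio with a normalized (unpunctuated) transcript
# works; LibriSpeech is used here purely as an example.
ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
sample = next(iter(ds))

restorer = PunctuationRestorer.from_pretrained("openai/whisper-tiny.en")

restored_text, log_prob = restorer(
    sample["audio"]["array"],          # raw waveform
    sample["text"],                    # normalized transcript (same words kept)
    sampling_rate=sample["audio"]["sampling_rate"],
    num_beams=1,
)
print(restored_text)
```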

Comments

sloganking t1_j2xnk3k wrote

Have Whisper's hallucinations been improved yet? I know that before, it could sometimes derail and repeat itself nonsensically.

Its highs seem the highest, but its lows are, well... nonsensical.


pvp239 OP t1_j2xoukt wrote

The way it's implemented, Whisper cannot hallucinate: at each step it can only predict letters of the original normalized transcript or punctuation, so the algorithm in speechbox guarantees that the exact same words are kept (you can think of it as a very restricted beam search).
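
Roughly, that restriction looks like the following toy sketch (greedy variant): at every step the decoder may only emit the next character of the normalized transcript, optionally upper-cased, or a punctuation mark. Here `score_fn` is a hypothetical stand-in for Whisper's decoder log-probabilities and is not part of the speechbox API:

```python
# Toy illustration of the "very restricted" search described above.
# `score_fn(prefix, continuation)` is a hypothetical stand-in for Whisper's
# decoder scores and is NOT part of the speechbox API.
from typing import Callable, List

PUNCTUATION = [".", ",", "?", "!", ":", ";"]

def restore_greedy(normalized: str, score_fn: Callable[[str, str], float]) -> str:
    """Rebuild cased/punctuated text without ever changing the words."""
    out = ""
    for ch in normalized:
        # Allowed continuations: the character as-is, or upper-cased if a letter ...
        base: List[str] = [ch, ch.upper()] if ch.isalpha() else [ch]
        # ... optionally preceded by a punctuation mark (never a new word).
        candidates = base + [p + c for p in PUNCTUATION for c in base]
        out += max(candidates, key=lambda c: score_fn(out, c))
    # The final sentence may also end with punctuation such as "." or "?".
    return out + max([""] + PUNCTUATION, key=lambda p: score_fn(out, p))
```

Because every candidate contains the original character, the output can never drop, add, or change a word; only casing and punctuation differ.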


WhoaEpic t1_j33383l wrote

If I had three MP3 files (two meetings and a phone call), would this software be able to transcribe what is said?


Finslayer t1_j2y43fj wrote

Hi,

How accurate are those corrections? Do you have any benchmarks? How fast is it?
When we were finetuning wav2vec2 models, we hit this exact same problem and finetuned a T5 model for the task: https://huggingface.co/Finnish-NLP/t5-small-nl24-casing-punctuation-correction
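
(For anyone who wants to try that model quickly, the plain `transformers` text2text pipeline should work as a rough sketch; the exact input format the checkpoint expects, e.g. a task prefix, may differ, so check the model card first.)

```python
# Hedged sketch: the input format expected by this checkpoint is an assumption.
from transformers import pipeline

corrector = pipeline(
    "text2text-generation",
    model="Finnish-NLP/t5-small-nl24-casing-punctuation-correction",
)

out = corrector("we are going to the san francisco beach", max_new_tokens=64)
print(out[0]["generated_text"])
```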
