sumane12 t1_jds5lwr wrote

That delay kills me, far too long. I'm guessing GPT-5 will have to be multimodal with sound so it can recognise words directly and doesn't need to transcribe them into text first.

69

NWCoffeenut t1_jdsgb83 wrote

I think a good part of the latency was with the TTS system. The actual text response for the most part came back reasonably quickly.
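A rough way to check where the time goes (a minimal sketch; the two callables are stand-ins for whatever the demo actually uses for chat and TTS):

```python
import time
from typing import Callable

def profile_pipeline(prompt: str,
                     get_reply: Callable[[str], str],
                     synthesize: Callable[[str], bytes]) -> None:
    """Time the LLM call and the TTS call separately to see where the delay lives."""
    t0 = time.perf_counter()
    reply = get_reply(prompt)   # text in, text out (e.g. the chat API call)
    t1 = time.perf_counter()
    synthesize(reply)           # text in, audio out (e.g. the TTS call)
    t2 = time.perf_counter()
    print(f"LLM latency: {t1 - t0:.2f}s, TTS latency: {t2 - t1:.2f}s")
```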

26

illathon t1_jdsoud8 wrote

No, most implementations of Whisper are slow.

2

itsnotlupus t1_jdt280v wrote

Whisper is the speech recognition component.
I don't think he said what he's using for TTS; it might be macOS's built-in thing.

4

eggsnomellettes t1_jdt5dxl wrote

They're using ElevenLabs, which isn't local, so every response is a slow API call.
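For reference, a remote TTS call looks roughly like this (a sketch based on the ElevenLabs REST API as I remember it; the voice ID and key are placeholders, so check the current docs):

```python
import requests

def elevenlabs_tts(text: str, voice_id: str, api_key: str) -> bytes:
    """Send text to the ElevenLabs API and get audio back. The whole round trip
    goes over the network, which is where the extra latency comes from."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"  # endpoint per their docs at the time
    resp = requests.post(
        url,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        json={"text": text},
    )
    resp.raise_for_status()
    return resp.content  # audio bytes
```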

11

tortoise888 t1_jdtp8yj wrote

If we eventually get open-source, ElevenLabs-quality models running locally, it's gonna be insane.

1

ebolathrowawayy t1_jdvfmrk wrote

There's also Tortoise TTS, which can be run locally, but idk how fast it is.

1

stupidcasey t1_jdsff4l wrote

I expect GPT-5 or 6 to be super multimodal, trained on anything and everything we have data for: audio, sure; video, of course; crossword puzzles, hell yeah; Pong, yup; car driving, why not. I think the only thing stopping us is that it takes too long, and we'll have more processing power by then.

13

pokeuser61 t1_jdskrfs wrote

If you ran this on the hardware that GPT-5 will require, it wouldn't have a delay.

7

RedditLovingSun t1_jdtn0z9 wrote

It looks from the title bar like he's using the Whisper API to transcribe his audio into a text query. That has to send an API request with the audio and wait for the text to come back over the internet. I'm sure a local audio-to-text transcriber would be considerably faster.

Edit: nvm, Whisper can be run locally, so he's probably doing that.
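Running it locally with the open-source `whisper` package is only a few lines (a minimal sketch; smaller checkpoints like `base` trade accuracy for speed):

```python
import whisper

# Load a checkpoint once at startup; "base" is small and fast,
# "large" is slower but more accurate.
model = whisper.load_model("base")

def transcribe(path: str) -> str:
    """Transcribe an audio file on the local machine, no network round trip."""
    result = model.transcribe(path)
    return result["text"]

print(transcribe("query.wav"))
```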

4

itsnotlupus t1_jdt2igm wrote

The model's text output is (or can be) a stream, so it ought to be possible to pipe that stream into a warmed-up TTS system and start getting audio before the text is fully generated.
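Something like this, buffering the streamed tokens into sentences and handing each one off as soon as it's complete (a sketch using the openai chat-completions streaming API as it existed at the time; `speak` is a hypothetical hook for whatever warmed-up TTS you have):

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

def stream_to_tts(prompt: str, speak) -> None:
    """Start speaking sentences while the model is still generating the rest."""
    stream = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk["choices"][0]["delta"].get("content", "")
        # Flush whole sentences to the TTS engine as soon as they appear.
        while any(p in buffer for p in ".!?"):
            idx = min(buffer.index(p) for p in ".!?" if p in buffer)
            sentence, buffer = buffer[: idx + 1], buffer[idx + 1:]
            speak(sentence.strip())  # hypothetical TTS call
    if buffer.strip():
        speak(buffer.strip())
```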

3

Drown_The_Gods t1_jdww8zc wrote

Use Talon Voice. The developer has their own engine that blows Whisper out of the water. Never worry about speed again. Don’t thank me, but do chuck them a few dollars if you find it useful.

2