That delay kills me, far too long. I'm guessing GPT-5 will have to be multimodal with sound, so it can recognise words directly and doesn't need to convert speech into text first.


I think a good part of the latency was with the TTS system. The actual text response for the most part came back reasonably quickly.


No, most implementations of Whisper are slow.


Whisper is the speech recognition component.
I don't think he said what he's using for TTS, might be MacOS' builtin thingy.


They're using ElevenLabs, which isn't local, hence the slow API call.


If we eventually get open source Elevenlabs quality models running locally it's gonna be insane.


There's also Tortoise TTS which can be run locally but idk how fast it is.


I expect GPT-5 or 6 to be super multimodal, where they train it on anything and everything we have data for: audio, sure; video, of course; crossword puzzles, hell yeah; Pong, yup; car driving, why not. I think the only thing stopping us is that it takes too long, and we'll have more processing power by then.


If you ran this on the hardware that gpt5 will require, it wouldn’t have a delay.


From the title bar, it looks like he's using the Whisper API to transcribe his audio into a text query. That has to send an API request with the audio and wait for the text to come back over the internet. I'm sure a local speech-to-text transcriber would be considerably faster.

Edit: nvm, Whisper can be run locally, so he's probably doing that.


The model text output is(/can be) a stream, so it ought to be possible to pipe that text stream into a warmed up TTS system and start getting audio before the text is fully generated.
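
Rough sketch of what that could look like, not anyone's actual implementation. It assumes the model hands you text in small chunks and that you have some `speak` function for your TTS engine. `speak`, `sentences_from_stream` and `speak_stream` are made-up names for illustration, not part of any real library. The idea is just to buffer chunks and hand each sentence to TTS as soon as it's complete, instead of waiting for the whole reply.

```python
import re
from typing import Callable, Iterable, Iterator

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# (Assumption: good enough for a demo; it will also split after "e.g. ".)
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they show up in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _BOUNDARY.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it in the buffer.
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()


def speak_stream(chunks: Iterable[str], speak: Callable[[str], None]) -> None:
    """Feed each finished sentence to a TTS callback (hypothetical `speak`)."""
    for sentence in sentences_from_stream(chunks):
        speak(sentence)


if __name__ == "__main__":
    # Stand-in for a streamed model reply; print stands in for a real TTS call.
    fake_stream = ["Hel", "lo there. How", " are you? I am", " fine"]
    speak_stream(fake_stream, print)
```

With a real setup you'd swap `print` for whatever call your TTS engine exposes, and probably run the audio playback on a separate thread so synthesis of the next sentence overlaps with playback of the current one.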


Use Talon Voice. The developer has their own engine that blows Whisper out of the water. Never worry about speed again. Don’t thank me, but do chuck them a few dollars if you find it useful.


I just want a screen free phone that's basically just Jarvis. Read my texts to me, look shit up for me, keep track of and make appointments for me, give me stock quotes, tell me the news, just don't suck me into an infinite scroll anymore. If I need to see something cast it to a screen in my house. Done with phone screens.


Can't wait till we get there with a better Alpaca model + local transcription and audio generation + ChatGPT-style plugins for operating apps. All possible today, we just have to wait for it to be developed.


That was my thought. No more phone. Just the smart watch.


I actually have a concept in my mind, don't have all the skills needed but will be learning things in the next few months, hopefully I'm not too late when I'm done making my idea into reality.


This is probably just my anxiety but I feel like anything we think of or try to execute is going to be eclipsed before it can be realized. We're going to go overnight from this moment to indistinguishable from human androids and FDVR. This past couple of weeks has been overwhelming in the extreme.


You're right, but I don't think that issue is relevant here. Having a locally running AI would be useful regardless of other innovations, and there's something to be said for the cyberpunkness of such a device.


With chat gpt type stuff how would it sound much different than a phone conversation? The whole idea is that the os responds to natural language, like talking to a personal assistant or secretary.


I was hoping that wearables (like a watch) could do this for me. Or at least force development in that direction.

(Seems to not be panning out… but i still have hope. I’d love to only carry a watch for most of my day. Initially I’d go through screen withdrawal but in the long run I think life would be better).


Yeah, I'd be surprised if we don't have something like that available publicly before the end of the year (if only because big tech is slow and unwieldy and things need to work their way through the proper paperwork).


It is both public and open source


I should clarify, it will be a packaged product from a big tech company.

I could do this, sure, I can putz around on computers a bit, but once you can just click an "install" button in the Microsoft store, that's it


Big tech will offer it as a service instead of a locally-running system. That will mean latency, increased data use, and other... differences 😅


Oh, there will definitely be a ton of downsides, but convenience will not be one of them.


I'm like 100% certain that Apple, Google and Meta are making a JARVIS assistant that connects to AR glasses. It would be a revolutionary product and it's actually feasible imo.


TIL there was a B programming language


And before there was B, there was APL: A Programming Language. (This is not a joke.)


This looks like a slightly better version of siri imo


It's a significantly better version of Siri.

GPT-4 can borderline pass the Turing Test and Siri can barely do... anything?


Me: Siri, set my alarm for 7am.

Siri: Here is a list of videos titled Tom Tom Solo by River Banks!


At least you're getting an answer.

Working on that. Something went wrong. Please try again.


13b parameter llama is not as good as GPT4.


Samantha >>>>>>>>>>


ChatGPT states Samantha is the most accurate representation of AI in movies


I've been trying to set something similar up.


I built something exactly like this back when GPT3 API came out. Was pretty cool but eventually got bored with it because it couldn't do anything. I tried hooking it up to external apis to get real world live data but by the end everything was so complicated and slow that I gave up.

Hopefully with the GPT4 plugins we can now make something actually useful. It's gonna be awesome.


This is nothing special, just sounds like Google assistant


Marvel is cringe. Can we use some other name to compare stuff like this to?


You mean you don't like MODOK?

How about we just name it Dan? Dan's a cool guy.