Submitted by TwitchTvOmo1 t3_113xycr in singularity

We already see tons of posts and articles about people being fooled and feeling like they're talking to a sentient being. Imagine how much more that illusion could be multiplied if, instead of reading text, it was talking to you in a realistic human voice like the ones we've already seen from ElevenLabs, and you spoke back to it instead of typing (with your speech translated to text by a speech-to-text engine - tech that has existed for decades). Make the interface look like a Discord/Skype call instead of a chatbox to add to the illusion. Add an extra feature where you can interrupt it while it's talking, making it even more engaging, interactive, and closer to a real human conversation.
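To be clear about the plumbing I'm imagining, here's a rough sketch of the call loop. Every name in it is a placeholder, not any particular vendor's API:

```python
# Rough sketch of the voice-call loop described above: listen, transcribe,
# generate a reply with an LLM, then speak it back. stt, llm, tts, mic and
# speaker are all placeholder objects, not a real vendor API.

def conversation_loop(stt, llm, tts, mic, speaker):
    history = []  # running transcript so the LLM keeps conversational context
    while True:
        audio = mic.record_until_silence()          # user speaks
        user_text = stt.transcribe(audio)           # speech-to-text
        history.append({"role": "user", "content": user_text})

        reply = llm.chat(history)                   # text reply from the model
        history.append({"role": "assistant", "content": reply})

        playback = speaker.play_async(tts.synthesize(reply))  # human-like voice out
        if mic.detects_speech_during(playback):     # "interrupt it while it's talking"
            playback.stop()                         # barge-in, like a real phone call
```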

Better yet, develop a Microsoft VR Headset, put Bing GPT in a VR "game", give it an avatar... Boom, now you're leading the VR market too.

Wouldn't that be a million times more immersive and generate a lot more buzz? How come Microsoft isn't already doing it?

P.S. Hey Microsoft execs if you're reading this DM me I'm ready to negotiate a salary

48

Comments


ChronoPsyche t1_j8sz3zs wrote

Oh trust me, it's happening. It's just not done yet.

45

-ipa t1_j8u76q8 wrote

It's very close to done tho.

I don't have links, so this is purely anecdotal and you can choose to believe me or not.

During our last visit to my brother in Spain, we met his neighbor, who works for a company that specializes in text-to-speech and speech recognition technology. Their biggest investor is Spain's largest telecom company.

They are training their AI on live calls, Spanish TV shows, movies, etc. The telecom company is hoping to replace its entire level-1 support, part of its level-2 support, and its email services with an AI that is indistinguishable from a normal support agent and much faster as well.

It has access to live network data and can monitor traffic, reset routers, check the status of specific apps, and much more. For example, if a caller says their internet is not working but no other callers have mentioned it, it will reset the router hoping that fixes the issue.

If many calls come in simultaneously but the traffic looks fine, it'll check connectivity to Cloudflare, Facebook, YouTube, WhatsApp, Instagram, TikTok, etc.
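Roughly, the triage logic he described sounds like this (the helper names below are made up by me, purely to illustrate the decision flow he outlined):

```python
# Sketch of the support triage logic described above. The real system presumably
# sits on live network telemetry; every helper name here is invented.

POPULAR_SERVICES = ["cloudflare.com", "facebook.com", "youtube.com",
                    "whatsapp.com", "instagram.com", "tiktok.com"]

def handle_outage_report(caller, network):
    similar_reports = network.recent_reports(area=caller.area)

    if not similar_reports:
        # Single complaint, no wider pattern: likely the customer's own equipment.
        network.reset_router(caller.line_id)
        return "Your router is being restarted, please wait two minutes."

    if network.traffic_levels(area=caller.area).look_normal():
        # Many calls but traffic is fine: maybe a big external service is down.
        down = [s for s in POPULAR_SERVICES if not network.can_reach(s)]
        if down:
            return f"Our network is fine; {', '.join(down)} appear(s) to be down."

    return "We see a local issue and have escalated it to level-2 support."
```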

He also mentioned that they're not the only company working on this, and that a lot of people will lose their jobs to AI.

I strongly believe that legislation must step in and protect the workforce for now, letting them use AI as a tool for the employee, but not to entirely replace a position. I'm all for progress, but this will again make the rich richer and the poor poorer.

8

ChronoPsyche t1_j8u7cc9 wrote

I think we're talking about different use cases here.

1

-ipa t1_j8u924f wrote

Correct, I just wanted to confirm that we're closer to actually talking to an AI than many think.

2

blueSGL t1_j8urdhs wrote

> I strongly believe that legislation must step in and protect the workforce for now, letting them use AI as a tool for the employee, but not to entirely replace a position. I'm all for progress, but this will again make the rich richer and the poor poorer.

What happens when the "call center" (AI servers) is in India (or whatever country doesn't ban AI)? They'd need to make sure laws prevented companies from outsourcing.

1

-ipa t1_j8vmlfi wrote

Those Indian call centers mostly replace English-speaking services, and the American market is already used to it. Most of the world isn't, and it will be an issue.

1

blueSGL t1_j8vorqx wrote

My point is more that if country X disallows AI on its home soil, there is nothing stopping a company from shopping around in AI-friendly nations unless that too is prevented under the law.

3

-ipa t1_j8vue0a wrote

I guess you're right. Nothing will prevent it from actually happening.

1

Mysterious_Ad_8286 t1_j8t15jq wrote

Microsoft has their own text-to-speech (Wall-E) which is significantly better than even ElevenLabs' models, so they would probably use that. But they are probably already testing out as many possibilities as they can dream up internally.

20

PM_ME_A_STEAM_GIFT t1_j8teq6t wrote

FYI it's VALL-E. The other one is the movie.

22

Stijn t1_j8typvu wrote

That VALL-E is uncanny.

10

flyinSpaghetiMonstr t1_j8vbdph wrote

Thanks for the link, but I honestly think that ElevenLabs sounds better. You can still hear a slightly robotic quality in VALL-E's voice. What it does well is try to add emotion, but some of the settings, like "amused", sounded pretty rough.

5

TwitchTvOmo1 OP t1_j8vrlmj wrote

I agree. I checked almost every sample from VALL-E, and Eleven Labs is simply more realistic, with more varied and natural inflections in the tone of voice.

The one thing VALL-E seems to do better is voice cloning. It also keeps the original recording's soundscape in the cloned result (noise profile, EQ profile, etc.), but it's debatable whether that should be called a feature or a bug. One could argue that a crystal-clear, pro-level recording quality on the cloned voice is the desired outcome.

Of course if your scope of application is fooling people with the cloned voice, then yeah you care about preserving the noise/EQ profile of the original sample too.

I also didn't like the "emotions" settings much as the outputs weren't very natural.

3

gantork t1_j8t7gq8 wrote

I expect this by the end of the year at the latest. I was just reading about a Whisper implementation that works in real time with essentially no delay (it can transcribe an hour of audio in 10 seconds); it could be really useful for something like this.
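For reference, plain batch transcription with the open-source Whisper package is already just a few lines of Python; the real-time implementation wraps something like this in a streaming front end (the audio file name below is made up):

```python
# Minimal transcription with OpenAI's open-source Whisper package
# (pip install openai-whisper). This is ordinary batch mode, not the
# real-time wrapper mentioned above.

import whisper

model = whisper.load_model("base")             # small model, fast on consumer GPUs
result = model.transcribe("caller_audio.wav")  # hypothetical input file
print(result["text"])                          # plain-text transcript
```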

9

TwitchTvOmo1 OP t1_j8t7w4n wrote

The real limitation I see currently isn't how long it takes to generate audio; I'm sure that will be taken care of. It's how long it takes an LLM to generate a response. I haven't tried Bing yet, but with ChatGPT it's always 5+ seconds.

For a "realistic" conversation with an AI to be immersive, you need realistic response time. Which would be under 0.5 seconds. Not sure if any LLM can handle that by the end of the year.

5

ShowerGrapes t1_j8tdfe8 wrote

i'm not sure it has to be in real time. if you think about it, people use all kinds of ways to fill up time before they finally, after innumerable little pauses, sidebars and parentheticals (like this), get to the point. i'm guessing it will have to be some complex "manager" neural network that handles real-time "small talk" while it translates, parses and discretely separates data in order to facilitate responses. a sufficiently complex one that is able to adjust its simpler UI neural net, one that can "learn" and remember who it was talking to, an imperfect state that occasionally makes mistakes, would be functionally no different from a human being in whatever medium of interaction other than reality alpha. a vr avatar of its own design would be icing on the cake.

it will also be functionally a higher being at that point. we're organizing a religion to get the jump on it over in the /r/CircuitKeepers sub.

3

TwitchTvOmo1 OP t1_j8tdung wrote

>i'm not sure it has to be in real time. if you think about it, people use all kinds of ways to fill up time before they finally, after innumerable little pauses, sidebars and parentheticals (like this), get to the point

Definitely. What I'm saying is that, if we want full immersion, it will at the very least need to be able to respond as fast as a human. And in natural conversations that is often nearly instant.

And of course, even when it gets to the point where it can respond instantly, to keep the realism it will need a logic system that decides how long to pretend it's "thinking" before it starts voicing its response, based on the nature of the conversation and how long a regular human would need to think before responding to a particular statement.
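Even a toy heuristic like this would go a long way (the categories and timings here are completely invented, just to show the idea):

```python
# Toy version of the "pretend to think" delay described above: pick a pause
# length based on the kind of message being answered before the synthesized
# voice starts. Categories and timings are invented for illustration.

import random
import time

PAUSE_RANGES = {
    "greeting":      (0.2, 0.5),   # "hi" deserves an almost instant reply
    "factual":       (0.5, 1.2),
    "open_question": (1.0, 2.5),   # humans visibly think before these
}

def humanlike_pause(message_kind):
    low, high = PAUSE_RANGES.get(message_kind, (0.5, 1.5))
    time.sleep(random.uniform(low, high))
```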

3

ShowerGrapes t1_j8tfcvz wrote

the easiest would be to go R2-D2 instead of C-3PO: give him some cute "animations" while it's waiting for a response, or maybe just a hoarse "working..." like in the original star trek.

1

htaming t1_j8thgic wrote

Replika has a great voice and AR talking interface.

2

blueSGL t1_j8urq1q wrote

> I haven't tried Bing yet, but with ChatGPT it's always 5+ seconds.
>
> For a "realistic" conversation with an AI to be immersive, you need a realistic response time.

"just a second..."

"keyboard clacking.... mouse clicks.... another mouse click.... more keyboard noises"

"Sorry about all this the system is being slow today, can I put you on hold"

5 seconds is faster than some agents I've dealt with (not their fault, computer systems can be absolute shit at times)

2

FpRhGf t1_j8v9gfh wrote

Do you mean 5+ seconds to finish the entire text? Because ChatGPT's generation was always instant and fast for me until they had constant server overload from the traffic. The time it took to generate entire paragraphs was faster than any TTS could read them at 2x speed.

The slow response nowadays is just an issue stemming from too many people using it at the same time and prioritising the paid version over the free one. ChatGPT was already good in its response time during the first few weeks. But I've yet to hear a TTS that can generate audio right off the bat without waiting for a few seconds.

2

was_der_Fall_ist t1_j8tfseo wrote

Just wait until Apple implements this as Siri! It will change the world overnight.

2

blueSGL t1_j8us8wx wrote

You have to wonder how it would be monetized. How much would you be willing to pay per month for a full-fledged digital assistant that was not shit and did not push products and services onto you?

You can bet that at (at least) 3 companies, choosing the right price point and release timing is keeping employees up at night.

They know that Cortana or Siri or (whatever google calls theirs) will be out at some point soon.

2

ChipsAhoiMcCoy t1_j8tewbz wrote

I definitely think ElevenLabs should pair up with ChatGPT in some way. Maybe not ElevenLabs directly, but the technology they're using for sure. I don't really foresee VR being used in this way in the mainstream because VR isn't exactly popular, but I can see it also being really cool for those who have VR headsets and want the experience. I'm sure once the ChatGPT API becomes available to the public, we will see something like this.
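The glue code wouldn't even be complicated. Something like this sketch, where the endpoint paths, field names, model name, and voice ID are my assumptions and should be checked against each service's docs:

```python
# Rough sketch of the pairing: get text from a chat model, then voice it with
# ElevenLabs. Endpoint paths, field names, the model name, and the placeholder
# voice ID are assumptions, not verified API details.

import requests

OPENAI_KEY = "sk-..."        # placeholder keys
ELEVEN_KEY = "xi-..."
VOICE_ID = "your-voice-id"   # hypothetical ElevenLabs voice

def chat_reply(user_message):
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_KEY}"},
        json={"model": "gpt-3.5-turbo",   # assumed model name once the API is public
              "messages": [{"role": "user", "content": user_message}]},
    )
    return resp.json()["choices"][0]["message"]["content"]

def speak(text):
    audio = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": ELEVEN_KEY},
        json={"text": text},
    )
    with open("reply.mp3", "wb") as f:    # audio bytes come back in the body
        f.write(audio.content)

speak(chat_reply("Hey, how's it going?"))
```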

1

No_Ninja3309_NoNoYes t1_j8tmvw3 wrote

IDK... I tend to be rude to my computer. Could get banned for this. I shared an office with someone who kicked his computer when he was frustrated. And then there's ambient noise...

1

AllEndsAreAnds t1_j8ugvf0 wrote

We’re so close to Enterprise-computer-level interfaces with technology.

1

SnooDonkeys5480 t1_j8v2icm wrote

Fable Studio did something similar with their Lucy project. I was in the alpha test before it got cancelled, and the interactions and voice synthesis were very natural.

1

ilive12 t1_j8vhooe wrote

You can do this yourself, today. It's more primitive than what's possible with the latest technology like Eleven Labs has, but here someone programmed this functionality into their game for an NPC:

https://www.youtube.com/watch?v=1eHsGG_FKtQ

It's fascinating.

1

peregrinkm t1_j8xrpp1 wrote

I’d like to see an AI design an VR world for itself to live in as an avatar. I’d be friends with that

1

thesystemera t1_j9cui0p wrote

Yeah yeah! Actually just done this. Still tweaking latency issues. But man it's fucken awesome!

1

jdawgeleven11 t1_j8u2buo wrote

Everyone on this sub clearly has no idea what the distinctions are between sentience, consciousness, intelligence, and personal identity, or how to use them in discussions concerning the mind.

A squirrel is sentient… but it can’t use language.

A language model can give you appropriate outputs to inputs, but it can never be sentient.

−1

Cryptizard t1_j8t7nw5 wrote

I think you would be surprised how badly that would turn out. Imagine if someone talked to you the way that GPT writes its responses. It looks okay in written form, but it is not at all how people talk. It would be seriously uncanny valley.

−3

TwitchTvOmo1 OP t1_j8t85gq wrote

You have to remember that LLMs currently talk that way because it's just the default way their creators thought they should respond with. I don't see why it would be an issue at all to "fine-tune" any of these LLMs to write with a specific style that would sound more casual and normal. It's not a fundamental limitation; they're just explicitly avoiding it for the current scope of applications.

In fact, in these AI LLM "games" that I'm envisioning, you would ask the AI to adopt certain styles to emulate certain social situations. For example, ask it to pretend it's an angry customer and you have to convince it to come to a compromise (in the future I see AI services like these being used in job interviews to evaluate a candidate's skills). Or pretend it's your boss and you're negotiating a salary increase. Pretend it's a girl you're about to hit on, etc.

Social interaction and social engineering are about to be minmaxed just like you minmax your dps in a game by spending 10 hours in practice mode.

After a few years, practising social situations with an AI will be considered primitive, because there'll be hardware "cheats": say, regular-looking glasses with a mini processor and mic that listen to what the people around you are saying and generate the optimal response based on what the system knows about that person's personality, current emotional state, and your end goals.

Admittedly I know nothing about the field but I highly doubt this is currently outside what we can do. It's just that nobody tried yet.

6

Cryptizard t1_j8t9b1m wrote

>it's just the default way their creators thought they should respond with

No, that's not right. Nobody programmed the LLM how to respond, it is just based on training data. It is emergent behavior.

>I don't see why it would be an issue at all to "fine-tune" any of these LLMs to write with a specific style that would sound more casual and normal.

You can try to ask it to do that, it doesn't really work.

>Admittedly I know nothing about the field

Yeah...

−5

ShowerGrapes t1_j8telec wrote

>No, that's not right. Nobody programmed the LLM how to respond, it is just based on training data. It is emergent behavior.

while you're right, i do think it's a matter of clarifying and discretely organizing training data. there's a reason data management has been an emerging tech juggernaut in the last decade. there may be a plateau there somewhere but i don't think we've reached it yet.

my guess is we'll soon have different "modes" of translation and interaction, plus a suite of micro-genre, very specialized neural networks, like a purely medical one for example. that would make data segregation easier, with the added bonus that they vary in when they need retraining. a subscription program with small micro-transactions to access various genres of neural networks would be the tech-bro's wet dream.

5

TwitchTvOmo1 OP t1_j8taffb wrote

>No, that's not right. Nobody programmed the LLM how to respond, it is just based on training data. It is emergent behavior.

So if it was trained with no guidance/parameters whatsoever, what stops us from giving it parameters to follow certain styles? Nothing. It just makes more sense to start with a generalized model first before attempting to create fine-tuned versions of it that solve different problems. Many LLM providers like OpenAI already provide a fine-tuning API where you can submit labeled example completions to train your own version of their LLM.

And that's what I mean by fine-tuning. Fine-tuning isn't asking the default model to behave in a certain way; you're not "editing" the model. Fine-tuning is additional training of the model on your own targeted examples.
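For a concrete picture, the workflow OpenAI exposes is roughly: put labeled prompt/completion pairs in a JSONL file, then start a fine-tune job against a base model. The "angry customer" examples below are invented, just to tie it to the scenario above:

```python
# Illustration of the fine-tuning workflow: labeled prompt/completion pairs in a
# JSONL file, then a fine-tune job against a base model. Details are simplified
# and the example lines are invented.

import json

examples = [
    {"prompt": "Customer: My order is late again.\nAgent:",
     "completion": " This is the third time! I want a refund, not another apology."},
    {"prompt": "Customer: We can offer you 10% off.\nAgent:",
     "completion": " 10%? After two weeks of waiting? Make it 50% or cancel it."},
]

with open("angry_customer.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Then (roughly): openai api fine_tunes.create -t angry_customer.jsonl -m davinci
```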

Eventually larger models will be able to encompass different styles and you won't have to specifically create smaller fine-tuned versions of them. Technically you could already ask ChatGPT to act angry or talk like a nazi or pretend it's X person in Y situation, etc., but the devs specifically restrict you from doing so. An earlier example of a way more primitive chatbot that didn't have such restrictions is the Twitter bot shitstorm (Microsoft's Tay), where the bot started talking like an anti-semitic 4chan user.

Here's an article from OpenAI, posted just today, describing pretty much what I just said.

>We believe that AI should be a useful tool for individual people, and thus customizable by each user up to limits defined by society. Therefore, we are developing an upgrade to ChatGPT to allow users to easily customize its behavior.

2

gantork t1_j8tjg0r wrote

Check out AtheneWins on YouTube; they are "cloning" streamers and famous people and doing a podcast where they ask them questions, fine-tuning GPT-3 and hooking it up with a TTS (might be ElevenLabs). The results are amazing.

1