Submitted by akshaysri0001 t3_10w79eo in MachineLearning

Hey everyone, I want to build a personal voice assistant that sounds exactly like a real person. I tried a few TTS systems such as Tortoise TTS and Coqui TTS; they did a good job, but synthesis takes too long. Is there any other realistic-sounding TTS that I can use with my own voice-cloning training dataset? I'm also impressed by the TTS from Eleven Labs, so can someone explain how I could achieve that level of real-time efficiency in a voice assistant?

11

Comments

marcus_hk t1_j7lqpav wrote

I haven't been keeping up with TTS since Tacotron 2, but it seems Eleven Labs works fundamentally the same way.

As for real-time performance, you may need to port your Python code to C++.
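
Before porting anything, it may help to measure how far from real time the current setup actually is. A common yardstick is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where anything below 1.0 is faster than real time. Below is a minimal sketch of that measurement; `dummy_synthesize` and the 22050 Hz sample rate are hypothetical stand-ins, not part of any library mentioned here, so swap in your own model call.

```python
import time


def real_time_factor(synthesize, text, sample_rate):
    """Time one call to `synthesize` and return (rtf, audio_seconds).

    `synthesize` is any callable that takes text and returns a sequence
    of audio samples. RTF < 1.0 means faster than real time.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize returned no audio")
    return elapsed / audio_seconds, audio_seconds


def dummy_synthesize(text):
    # Hypothetical stand-in for a real TTS call: returns roughly 0.05 s
    # of silence per input character at 22050 Hz.
    return [0.0] * int(len(text) * 0.05 * 22050)


rtf, seconds = real_time_factor(dummy_synthesize, "hello world", 22050)
print(f"RTF = {rtf:.4f} for {seconds:.2f} s of audio")
```

Measuring on a handful of sentences of different lengths gives a fairer picture than a single call, since many models have a fixed startup cost.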

2

gunshoes t1_j7nj5co wrote

FastSpeech 2 would be your best bet.

2

nmfisher t1_j7osgdc wrote

FS2 is fine for training a TTS model from scratch, but I haven't come across a good FS2 model for cloning (which is basically zero-shot TTS).

1

gunshoes t1_j7p91py wrote

You can throw in GSTs (global style tokens) or use a speaker embedding to influence the energy/pitch outputs. The sound is meh, but it works.
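
To make the speaker-embedding idea concrete, here is a toy sketch, not FastSpeech 2's actual code. It assumes the common multi-speaker setup where one speaker vector is added to every frame's hidden state before the pitch/energy predictors run. All names, sizes, and weights below are made up for illustration, and the "predictor" is a single linear layer rather than a trained network.

```python
def condition_on_speaker(frames, speaker_embedding):
    """Add the same speaker vector to every frame's hidden state."""
    if any(len(frame) != len(speaker_embedding) for frame in frames):
        raise ValueError("frame and speaker embedding sizes must match")
    return [
        [h + s for h, s in zip(frame, speaker_embedding)]
        for frame in frames
    ]


def predict_scalar(frames, weights, bias):
    """Toy linear predictor: one value (e.g. pitch) per frame."""
    return [
        sum(h * w for h, w in zip(frame, weights)) + bias
        for frame in frames
    ]


# Two frames of 2-dimensional hidden state (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0]]
speaker = [0.5, -0.5]   # hypothetical speaker embedding
weights = [2.0, 1.0]    # hypothetical predictor weights

conditioned = condition_on_speaker(frames, speaker)
pitch_plain = predict_scalar(frames, weights, 0.0)
pitch_spk = predict_scalar(conditioned, weights, 0.0)

print(pitch_plain)  # predictions without speaker conditioning
print(pitch_spk)    # same frames, shifted by the speaker vector
```

With a linear predictor the speaker vector just shifts every frame's prediction by the same amount; in a real model the predictor is nonlinear, so the effect on pitch and energy varies per frame.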

1

nmfisher t1_j7pawou wrote

That's why I added the qualifier "good" :)

3

theLanguageSprite t1_j7relsn wrote

You have to pay to use the API and it’s completely closed source, but resemble.ai works pretty well.

2

johnwireds t1_j7mxns2 wrote

I'd also be interested in cloning my voice and having someone speak with it in real time.

1