Out-of-copyright books only, of course.

Hi, I was wondering if I could fine-tune a GPT-3 model to take a book, likely in HTML, Markdown, or plain text, and convert it to SSML. To do that, I would need a bunch of hand-made SSML files to fine-tune the model on. I've already got some code to split the book up and do formatting (pandoc, csplit), and then I could use AWS Polly or one of the others to do really good text-to-speech.
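
For context, here is a minimal sketch of the non-ML baseline that a fine-tuned model would have to beat, assuming plain-text input with paragraphs separated by blank lines. The function names and the `max_chars=3000` limit are placeholders I made up for illustration; check the actual request limits of whichever TTS service you use.

```python
import re
from xml.sax.saxutils import escape


def split_paragraphs(text):
    """Split plain text into paragraphs on blank lines, collapsing inner whitespace."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(b.split()) for b in blocks if b.strip()]


def chunk_paragraphs(paragraphs, max_chars=3000):
    """Greedily group paragraphs so each chunk stays within max_chars.

    max_chars is an assumed limit, not a documented one. A single paragraph
    longer than max_chars still ends up in a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for p in paragraphs:
        if current and size + len(p) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(p)
        size += len(p)
    if current:
        chunks.append(current)
    return chunks


def text_to_ssml(paragraphs):
    """Wrap paragraphs in <speak>/<p>, escaping XML special characters."""
    body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    book = "Chapter 1\n\nIt was a dark & stormy night.\nRain fell.\n\nThe end."
    for chunk in chunk_paragraphs(split_paragraphs(book), max_chars=40):
        print(text_to_ssml(chunk))
```

This only adds paragraph structure. Anything expressive (emphasis, pauses, pacing) is exactly the part the hand-made SSML examples would have to teach.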

Anyone have a dataset?

Comments

geneing t1_j3e1573 wrote on January 7, 2023 at 10:16 PM

#1,304,812

I looked for one once, years ago, but couldn't find any. I don't think it's needed anymore. Current TTS systems based on neural networks are really good at producing speech with the right intonation from just the text.

Intelligent_Rough_21 OP t1_j3enes5 wrote on January 8, 2023 at 12:51 AM

#1,305,663

Replying to geneing (#1,304,812)

I don’t think they take into account language context like completion models do. They just say words with limited memory. Hopefully research will unify them somehow.

geneing t1_j3exq14 wrote on January 8, 2023 at 2:07 AM

#1,306,112

Replying to Intelligent_Rough_21 (#1,305,663)

Having trained multiple TTS models, I disagree. The prosody is impressively accurate. Moreover, even homographs are handled surprisingly well (e.g. the word "read" is pronounced with the correct tense if it can be deduced from the sentence).

Intelligent_Rough_21 OP t1_j3frivp wrote on January 8, 2023 at 6:14 AM

#1,307,311

Replying to geneing (#1,306,112)

Ok, I’ll admit to only having used neural models, not trained them. AWS Polly was incredibly monotone last I used it.

geneing t1_j3g1gwa wrote on January 8, 2023 at 8:08 AM

#1,307,646

Replying to Intelligent_Rough_21 (#1,307,311)

Most likely you are using the original Polly method, which is based on gluing together sounds of different phonemes. That produces monotone speech.

Try Google WaveNet. It's available through the Google Cloud API, just like Polly.

There's a neural version of Polly, but I never tried it.
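
For anyone who wants to compare the two Polly engines, here is a rough sketch of how the engine is selected through boto3's `synthesize_speech` call. The helper names `polly_request` and `synthesize` are made up for illustration, the `Joanna` voice is just an example choice, and running `synthesize` assumes boto3 is installed and AWS credentials are configured.

```python
def polly_request(ssml, voice_id="Joanna", engine="neural"):
    """Build keyword arguments for Polly's synthesize_speech.

    engine="standard" selects the older engine; "neural" selects the
    neural one (only available for some voices and regions).
    """
    return {
        "Engine": engine,
        "OutputFormat": "mp3",
        "Text": ssml,
        "TextType": "ssml",
        "VoiceId": voice_id,
    }


def synthesize(ssml, out_path, **kwargs):
    """Send one SSML document to Polly and save the audio (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without it

    client = boto3.client("polly")
    response = client.synthesize_speech(**polly_request(ssml, **kwargs))
    with open(out_path, "wb") as f:
        f.write(response["AudioStream"].read())


# Example (not run here, since it calls AWS):
# synthesize("<speak><p>My gosh.</p></speak>", "neural.mp3", engine="neural")
# synthesize("<speak><p>My gosh.</p></speak>", "standard.mp3", engine="standard")
```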

Intelligent_Rough_21 OP t1_j3g2vvp wrote on January 8, 2023 at 8:26 AM

#1,307,702

Replying to geneing (#1,307,646)

Yeah, I was using neural Polly, which is equivalent to WaveNet. What I discovered is that it will always say the same sentence, and usually the same word used in the same way, the same way, regardless of context clues. “My gosh.” would always render exactly the same way. It really needs paragraph- or dialogue-driven context, as well as a bit of randomization. In a book where an author has a repetitive go-to word or phrase, it’s killer.
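
The randomization half of that could be roughed out in SSML itself. This is only a sketch, assuming sentences are already split and that small changes to the `<prosody rate>` attribute are enough to break up identical renderings; the function name `vary_sentences` and the 8% jitter are arbitrary choices of mine, and it does nothing about context.

```python
import random
from xml.sax.saxutils import escape


def vary_sentences(sentences, jitter=8, seed=None):
    """Wrap each sentence in <prosody> with a slightly randomized speaking rate.

    rate="100%" means no change; jitter is the maximum deviation in
    percentage points. Pass a seed to make the output reproducible.
    """
    rng = random.Random(seed)
    parts = []
    for s in sentences:
        rate = 100 + rng.randint(-jitter, jitter)
        parts.append(f'<s><prosody rate="{rate}%">{escape(s)}</prosody></s>')
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(vary_sentences(["My gosh.", "He left.", "My gosh."], seed=1))
```

I stuck to `rate` because, as far as I know, Polly's neural voices don't support the `pitch` attribute; check the supported-tags table for your voice before relying on anything else.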

geneing t1_j3hzfpy wrote on January 8, 2023 at 6:40 PM

#1,310,376

Replying to Intelligent_Rough_21 (#1,307,702)

I think what you are looking for is called "expressive TTS". There have been a ton of papers in the last couple of years on the topic. Many provide code.

I've had some success with simply preserving the hidden state of the network from one sentence to the next.
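
To illustrate the idea only: the toy below is not a TTS model, just a tiny recurrent encoder with fixed, made-up weights, showing what "carrying the hidden state across sentences" means. Every name here (`step`, `encode`, `encode_document`, `HIDDEN`) is hypothetical; in a real system the carried state would be the recurrent state of the trained network.

```python
import math

HIDDEN = 4  # toy hidden size


def step(h, x):
    """One Elman-style update with fixed toy weights (a real model learns these)."""
    return [
        math.tanh(0.5 * h[i] + 0.3 * h[(i + 1) % HIDDEN] + (i + 1) * 0.1 * x)
        for i in range(HIDDEN)
    ]


def encode(sentence, h=None):
    """Run the toy RNN over a sentence, starting from h (zeros if None)."""
    h = [0.0] * HIDDEN if h is None else h
    for ch in sentence:
        h = step(h, (ord(ch) % 32) / 32.0)  # crude character feature
    return h


def encode_document(sentences, carry_state=True):
    """Encode sentences in order, optionally carrying state between them."""
    h, states = None, []
    for s in sentences:
        h = encode(s, h if carry_state else None)
        states.append(h)
    return states


if __name__ == "__main__":
    a = encode_document(["It was awful.", "My gosh."])
    b = encode_document(["It was lovely.", "My gosh."])
    print(a[1] != b[1])  # same sentence, different context, different state
```

With `carry_state=False` the second sentence gets the same state in both documents, which is the "always renders the same way" behavior described above.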

SSML may not be expressive enough for your application.

Intelligent_Rough_21 OP t1_j3lkkbq wrote on January 9, 2023 at 12:10 PM

#1,315,229

Replying to geneing (#1,310,376)

Thanks for the reference, I’ll look into it.