5death2moderation t1_j3qzs9j wrote

>as it has in perceptual domains, like audio

citation needed

1

benanne OP t1_j3r3stl wrote

DiffWave and WaveGrad are two nice TTS examples (see e.g. here https://andrew.gibiansky.com/diffwave-and-wavegrad-overview/), and Riffusion (https://www.riffusion.com/) is also a fun example. Advances in audio generation tend to lag a bit behind the visual domain, because audio is just inherently more unwieldy to work with (listening to 100 samples one by one takes a lot more time and patience than glancing at a 10x10 grid of images), but I'm pretty sure the takeover is happening there too.

If you're talking about text-to-audio in the vein of current text-to-image models, I'm pretty sure that's in the pipeline :)

3