benanne OP t1_j3r3stl wrote on January 10, 2023 at 2:34 PM

Reply to comment by 5death2moderation in [R] Diffusion language models by benanne

DiffWave and WaveGrad are two nice TTS examples (see e.g. here https://andrew.gibiansky.com/diffwave-and-wavegrad-overview/), Riffusion (https://www.riffusion.com/) is also a fun example. Advances in audio generation always tend to lag behind the visual domain a bit, because it's just inherently more unwieldy to work with (listening to 100 samples one by one takes a lot more time and patience than glancing at a 10x10 grid of images), but I'm pretty sure the takeover is also happening there.

If you're talking about text-to-audio in the vein of current text-to-image models, I'm pretty sure that's in the pipeline :)