Submitted by benanne t3_107g3yf in MachineLearning
benanne OP t1_j3r3stl wrote
Reply to comment by 5death2moderation in [R] Diffusion language models by benanne
DiffWave and WaveGrad are two nice TTS examples (see e.g. here https://andrew.gibiansky.com/diffwave-and-wavegrad-overview/), and Riffusion (https://www.riffusion.com/) is also a fun example. Advances in audio generation always tend to lag behind the visual domain a bit, because audio is just inherently more unwieldy to work with (listening to 100 samples one by one takes a lot more time and patience than glancing at a 10x10 grid of images), but I'm pretty sure the takeover is also happening there.
If you're talking about text-to-audio in the vein of current text-to-image models, I'm pretty sure that's in the pipeline :)