Submitted by valdanylchuk t3_xy3zfe in MachineLearning

LM-based; in contrast to other recent audio generation experiments, which worked from transcribed text or MIDI notes, AudioLM works directly on the raw audio signal, resulting in outstanding consistency and high-fidelity sound.

Google blog post from yesterday: https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html

Demo clip on Youtube: https://www.youtube.com/watch?v=_xkZwJ0H9IU

Paper: https://arxiv.org/abs/2209.03143

Abstract:

>We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
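
To make the abstract's pipeline a bit more concrete, here is a rough sketch of the structure it describes: the audio is tokenized twice (coarse "semantic" tokens from a masked-LM audio encoder, fine "acoustic" tokens from a neural codec), language models extend those token streams, and the codec decodes the result back to audio. Everything below is a placeholder stub, not Google's implementation (the paper uses w2v-BERT, SoundStream, and more generation stages than shown here); the token rates, vocabulary sizes, and two-stage split are assumptions, so treat it as an interface illustration only.

```python
import torch
import torch.nn as nn

SAMPLE_RATE = 16000  # the paper works with 16 kHz audio


class StubTokenizer:
    """Placeholder for either tokenizer: the real system derives 'semantic'
    tokens from quantized w2v-BERT activations and 'acoustic' tokens from
    SoundStream codec codes. This stub just emits random IDs of the right shape."""

    def __init__(self, vocab_size: int, tokens_per_second: int):
        self.vocab_size = vocab_size
        self.tokens_per_second = tokens_per_second

    def encode(self, waveform: torch.Tensor) -> torch.Tensor:
        n = waveform.shape[-1] * self.tokens_per_second // SAMPLE_RATE
        return torch.randint(0, self.vocab_size, (waveform.shape[0], n))


class StubTokenLM(nn.Module):
    """Placeholder decoder-only language model over discrete audio tokens."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.vocab_size = vocab_size

    @torch.no_grad()
    def generate(self, prefix: torch.Tensor, new_tokens: int) -> torch.Tensor:
        # A trained model would sample autoregressively from p(token | history);
        # this stub samples uniformly just to show the interface.
        cont = torch.randint(0, self.vocab_size, (prefix.shape[0], new_tokens))
        return torch.cat([prefix, cont], dim=-1)


def continue_audio(prompt: torch.Tensor, seconds: int = 5) -> torch.Tensor:
    """Coarse-to-fine generation: extend the semantic stream first (long-term
    structure), then the acoustic stream (fidelity), then decode to audio."""
    semantic = StubTokenizer(vocab_size=1024, tokens_per_second=25)  # assumed rate
    acoustic = StubTokenizer(vocab_size=1024, tokens_per_second=75)  # assumed rate
    semantic_lm, acoustic_lm = StubTokenLM(1024), StubTokenLM(1024)

    sem_prefix = semantic.encode(prompt)
    sem_tokens = semantic_lm.generate(sem_prefix, seconds * semantic.tokens_per_second)

    # The real acoustic stage conditions on the semantic tokens; here we only
    # use their count to decide how many acoustic tokens to generate
    # (3 per semantic token, given the assumed 25 vs. 75 tokens/s rates).
    n_new_sem = sem_tokens.shape[1] - sem_prefix.shape[1]
    aco_tokens = acoustic_lm.generate(acoustic.encode(prompt), n_new_sem * 3)

    # A real codec decoder would reconstruct a waveform from aco_tokens;
    # we return noise of the matching length to keep the sketch self-contained.
    return torch.randn(prompt.shape[0],
                       aco_tokens.shape[1] * SAMPLE_RATE // acoustic.tokens_per_second)


prompt = torch.randn(1, 3 * SAMPLE_RATE)   # a 3-second prompt waveform
print(continue_audio(prompt).shape)        # torch.Size([1, 128000]) -> 8 s of audio
```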

100

Comments


E_Snap t1_irg5x3z wrote

That is incredible. And to think I was watching musicians joke about graphic artists’ job security just a few days ago.

29

progressgang t1_irggpsb wrote

It’s not the achievements themselves that astound me about ML, it’s the rate at which they happen. The cycle between “prediction by an expert in the field” and “we created something to fulfil that prediction” gets shorter at a crazy rate. The craziest part is that the vast majority of people are entirely unaware this is happening. There is, without doubt, a massive opportunity to capitalise on that information gap.

25

aidv t1_irgma87 wrote

This year alone has felt like an explosion in ML.

13

rebleed t1_irgox34 wrote

And it is only going to get faster.

8

yaosio t1_irhc6mf wrote

When somebody makes something like Copilot that can write code by itself, that's going to be really cool.

5

PC-Bjorn t1_isnhsqd wrote

All your dream apps can become reality!

1

Flag_Red t1_irfrost wrote

This really does pass the audio-continuation Turing test.

20

hiptobecubic t1_irh16x7 wrote

I didn't even realize this was a demo until it got to the piano and I noticed the "generated" text. This thing is crazy.

1

jazmaan t1_irgo3n5 wrote

Funny thing is, when I first got into AI Art and ML, it was through a question I asked on Reddit almost two years ago. And it's still my dream.

"Would it be possible to train an AI on high quality recordings of Jimi Hendrix live in concert, and then have the AI listen to a crappy audience bootleg and make it sound like a high quality recording?"

AI Art was still in its infancy back then, but the people who offered their opinions on my question were the same ones on the cutting edge of VQGAN+CLIP. It still looks like the answer to my question is "someday, but probably not within the next five years." But hope springs eternal! Someday that crappy recording of Jimi in Phoenix (one of the best sets he ever played) may be transformed into something that sounds as good as Jimi at Woodstock!

13

PC-Bjorn t1_isnicrn wrote

Soon, we might be upscaling beyond higher bit rate, bit depth, and fidelity into multi-channel reproductions, or maybe even into individual streams for each instrument and performer on stage, plus a volumetric model of the stage layout itself, allowing us to render the experience as it would have been from any coordinate on, or around, the stage.

Pair that with a realtime, hardware-accelerated reproduction of the visual experience of being there, based on a network trained on photos from the concert, and we'll all be able to go to Woodstock in 1969.

2

BackgroundFeeling707 t1_irgnvop wrote

When can we play with such a thing?

4

yaosio t1_irhc9zo wrote

We will have to wait for somebody else to do an open-source version.

3

valdanylchuk OP t1_iri3civ wrote

…and prepare a suitable dataset, and train the model. Those are huge parts of the effort.

With big companies teasing stuff like this (AlphaZero, GPT-3, DALL-E, etc.) all the time, I wonder if it is possible for the open community to come up with some modern-day equivalent of GNU/GPL, with a non-profit GPU-time donation fund, to make practical open-source replicas of important projects.

3