Submitted by valdanylchuk t3_xy3zfe in MachineLearning

LM-based; in contrast to other recent audio generation experiments, which worked from transcribed text or MIDI notes, AudioLM works directly on the raw audio signal, resulting in outstanding consistency and high-fidelity sound.

Google blog post from yesterday: https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html

Demo clip on Youtube: https://www.youtube.com/watch?v=_xkZwJ0H9IU

Paper: https://arxiv.org/abs/2209.03143

Abstract:

>We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
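
To make the abstract's pipeline a bit more concrete, here is a rough sketch of the structure it describes: the audio is tokenized twice (coarse "semantic" tokens from a masked-LM audio encoder, fine "acoustic" tokens from a neural codec), language models extend those token streams, and the codec decodes the result back to audio. Everything below is a placeholder stub, not Google's implementation (the paper uses w2v-BERT, SoundStream, and more generation stages than shown here); the token rates, vocabulary sizes, and two-stage split are assumptions, so treat it as an interface illustration only.

```python
import torch
import torch.nn as nn

SAMPLE_RATE = 16000  # the paper works with 16 kHz audio


class StubTokenizer:
    """Placeholder for either tokenizer: the real system derives 'semantic'
    tokens from quantized w2v-BERT activations and 'acoustic' tokens from
    SoundStream codec codes. This stub just emits random IDs of the right shape."""

    def __init__(self, vocab_size: int, tokens_per_second: int):
        self.vocab_size = vocab_size
        self.tokens_per_second = tokens_per_second

    def encode(self, waveform: torch.Tensor) -> torch.Tensor:
        n = waveform.shape[-1] * self.tokens_per_second // SAMPLE_RATE
        return torch.randint(0, self.vocab_size, (waveform.shape[0], n))


class StubTokenLM(nn.Module):
    """Placeholder decoder-only language model over discrete audio tokens."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.vocab_size = vocab_size

    @torch.no_grad()
    def generate(self, prefix: torch.Tensor, new_tokens: int) -> torch.Tensor:
        # A trained model would sample autoregressively from p(token | history);
        # this stub samples uniformly just to show the interface.
        cont = torch.randint(0, self.vocab_size, (prefix.shape[0], new_tokens))
        return torch.cat([prefix, cont], dim=-1)


def continue_audio(prompt: torch.Tensor, seconds: int = 5) -> torch.Tensor:
    """Coarse-to-fine generation: extend the semantic stream first (long-term
    structure), then the acoustic stream (fidelity), then decode to audio."""
    semantic = StubTokenizer(vocab_size=1024, tokens_per_second=25)  # assumed rate
    acoustic = StubTokenizer(vocab_size=1024, tokens_per_second=75)  # assumed rate
    semantic_lm, acoustic_lm = StubTokenLM(1024), StubTokenLM(1024)

    sem_prefix = semantic.encode(prompt)
    sem_tokens = semantic_lm.generate(sem_prefix, seconds * semantic.tokens_per_second)

    # The real acoustic stage conditions on the semantic tokens; here we only
    # use their count to decide how many acoustic tokens to generate
    # (3 per semantic token, given the assumed 25 vs. 75 tokens/s rates).
    n_new_sem = sem_tokens.shape[1] - sem_prefix.shape[1]
    aco_tokens = acoustic_lm.generate(acoustic.encode(prompt), n_new_sem * 3)

    # A real codec decoder would reconstruct a waveform from aco_tokens;
    # we return noise of the matching length to keep the sketch self-contained.
    return torch.randn(prompt.shape[0],
                       aco_tokens.shape[1] * SAMPLE_RATE // acoustic.tokens_per_second)


prompt = torch.randn(1, 3 * SAMPLE_RATE)   # a 3-second prompt waveform
print(continue_audio(prompt).shape)        # torch.Size([1, 128000]) -> 8 s of audio
```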

100

Comments


E_Snap t1_irg5x3z wrote

That is incredible. And to think I was watching musicians joke about graphic artists’ job security just a few days ago.

29

progressgang t1_irggpsb wrote

It’s not the achievements themselves that astound me about ML, it’s the rate at which they happen. The cycle between “prediction by an expert in the field” and “we created something to fulfil that prediction” gets shorter at a crazy rate. The craziest part is that the vast majority of people are entirely unaware this is happening. There is, without doubt, a massive opportunity to capitalise on that information gap.

25

aidv t1_irgma87 wrote

This year alone has felt like an explosion in ML.

13

rebleed t1_irgox34 wrote

And it is only going to get faster.

8

yaosio t1_irhc6mf wrote

When somebody makes something like Copilot that can write code by itself, that's going to be really cool.

5

PC-Bjorn t1_isnhsqd wrote

All your dream apps can become reality!

1

Flag_Red t1_irfrost wrote

This really does pass the audio-continuation Turing test.

20

hiptobecubic t1_irh16x7 wrote

I didn't even realize this was a demo until it got to the piano and I noticed the "generated" text. This thing is crazy.

1

jazmaan t1_irgo3n5 wrote

Funny thing is, when I first got into AI Art and ML, it was through a question I asked on Reddit almost two years ago. And it's still my dream.

"Would it be possible to train an AI on high quality recordings of Jimi Hendrix live in concert, and then have the AI listen to a crappy audience bootleg and make it sound like a high quality recording?"

AI Art was still in its infancy back then, but the people who offered their opinions on my question were the same ones on the cutting edge of VQGAN+CLIP. It still looks like the answer to my question is "someday, but probably not within the next five years." But hope springs eternal! Someday that crappy recording of Jimi in Phoenix (one of the best sets he ever played) may be transformed into something that sounds as good as Jimi at Woodstock!

13

PC-Bjorn t1_isnicrn wrote

Soon, we might be upscaling beyond higher bit rate, bit depth, and fidelity into multi-channel reproductions, or maybe even into individual streams for each instrument and performer on stage, plus a volumetric model of the stage layout itself, allowing us to render the experience as it would have been from any coordinate on, or around, the stage.

Pair that with a realtime, hardware-accelerated reproduction of the visual experience of being there, based on a network trained on photos from the concert, and we'll all be able to go to Woodstock in 1969.

2

BackgroundFeeling707 t1_irgnvop wrote

When can we play with such a thing?

4

yaosio t1_irhc9zo wrote

We will have to wait for somebody else to do an open-source version.

3

valdanylchuk OP t1_iri3civ wrote

…and prepare a suitable dataset, and train the model. Those are huge parts of the effort.

With big companies teasing stuff like this (AlphaZero, GPT-3, DALL-E, etc.) all the time, I wonder if it is possible for the open community to come up with some modern-day equivalent of GNU/GPL, with a non-profit GPU-time donation fund, to make practical open-source replicas of important projects.

3