Comments

tahansa t1_j63f88w wrote

Incredible stuff.

Gotta get those copyright questions around the visual NNs solved before these audio models hit the mainstream.

The progress of these audio models is getting me much more stoked than that of the image models.

28

tahansa t1_j63fqca wrote

"Is it a memorization machine or can it create new songs?"

​

From the paper:
"Memorization analysis. Figure 3 reports both exact and
approximate matches when the length of the semantic token
prompt is varied between 0 and 10 seconds. We observe
that the fraction of exact matches always remains very
small (< 0.2%), even when using a 10 second prompt to
generate a continuation of 5 seconds. Figure 3 also includes results for approximate matches, using τ = 0.85.
We can see a higher number of matches detected with this
methodology, also when using only MuLan tokens as input
(prompt length T = 0) and the fraction of matching examples increases as the length of the prompt increases. We
inspect these matches more closely and observe that those
with the lowest matching score correspond to sequences
characterized by a low level of token diversity. Namely, the
average empirical entropy of a sample of 125 semantic tokens is 4.6 bits, while it drops to 1.0 bits when considering
sequences detected as approximate matches with matching
score less than 0.5. We include a sample of approximate
matches obtained with T = 0 in the accompanying material.
Note that acoustic modeling carried out by the second stage
introduces further diversity in the generated samples, also
when the semantic tokens match exactly."
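For concreteness, here is a minimal sketch of the two quantities that passage leans on, empirical token entropy and a matching score. This is not the authors' code; the per-position match score is my assumption about how sequences might be scored against the τ = 0.85 threshold.

```python
import numpy as np

def empirical_entropy_bits(tokens) -> float:
    """Empirical entropy (in bits) of a discrete token sequence; the paper
    reports ~4.6 bits for typical 125-token semantic sequences and ~1.0 bits
    for the low-diversity sequences flagged as approximate matches."""
    _, counts = np.unique(np.asarray(tokens), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def match_score(generated, reference) -> float:
    """Assumed per-position matching score: the fraction of positions where
    the generated semantic tokens equal the reference tokens. The paper's
    exact definition may differ; τ = 0.85 is its approximate-match threshold."""
    g, r = np.asarray(generated), np.asarray(reference)
    n = min(len(g), len(r))
    return float((g[:n] == r[:n]).mean())

rng = np.random.default_rng(0)
diverse = rng.integers(0, 1024, size=125)    # varied semantic tokens
repetitive = np.resize([7, 7, 7, 42], 125)   # low token diversity
print(empirical_entropy_bits(diverse), empirical_entropy_bits(repetitive))
print(match_score(diverse, diverse) >= 0.85)  # an exact self-match clears the threshold
```

The low-diversity sequence comes out well under 2 bits, which is the pattern the paper associates with its approximate matches.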

15

TFenrir t1_j63np8c wrote

ChatGPT (so take it with many grains of salt):

> The paper is discussing a machine that can create new songs or music. They are testing to see if the machine is able to memorize songs or if it can come up with new ones. They are looking at how well the machine does when given different amounts of information to work with. They found that even when given a lot of information, the machine is not able to create exact copies of songs. However, it can create similar songs. They also found that when the machine is given very little information, the songs it creates are not very diverse. They include examples of the machine's output in the accompanying material.

30

Boring_Party8508 t1_j65ti3z wrote

Has anyone found access to the code or the paper for this MusicLM?

2

MrCheeze t1_j661b7r wrote

This seems like a major increase in quality compared to past attempts. And with long-term coherence too; check out those 5-minute tracks.

And if that wasn't enough, we even get an additional mode that lets you provide a melody of your own and ask for an arrangement. Should be very useful for composition.

Assuming that these results aren't cherrypicked or otherwise misleading, I'd be very excited to try to make music with an open replication of this.

9

Screye t1_j6688sf wrote

I'm done, man. How is anyone supposed to keep up with this pace of research?

14

picardythird t1_j66kza2 wrote

Whenever I see music generation models, I immediately go to the "classical" examples (or as close to classical as are provided). The reason for this is that while some genres such as techno, drum 'n' bass, 8-bit, and hip hop are "simple" (from a music theory perspective), and other genres such as ambient, relaxing jazz, swing, and dream pop are vague enough that the model can get by just from spitting out the right general timbre, generating classical music requires understanding of structure, style, and form.

Frankly, I'm not particularly impressed. For the piano snippets, it seems to have mixed in sounds from strings, and both the "professional piano player" and "crazy fast piano player" snippets are basically just random notes with no particular structure. Meanwhile, the "opera" snippet uses piano sounds, which are non-idiomatic to opera. The "string quartet" snippets are not idiomatic to the style of a string quartet (in particular, the "camptown races" snippet completely falls apart at the end, and the "fingerstyle guitar" snippet barely even sounds like string instruments).

I'm also not especially convinced by the Painting Caption Conditioning section. I suspect that there is quite a bit of Barnum Effect going on here; the captions are primed to be accepted as corresponding to the "correct" paintings because they are presented that way, but this is just a framing device. As a self-experiment, play a track from any of the paintings, and look at any of the other paintings. Can you really say that the track could not feasibly correspond to the "other" painting? (Also, as someone who has literally written a piece of music inspired by the Caspar David Friedrich painting, I find myself unconvinced by the model's interpretation... but this is a wholly subjective critique).

This is not to say that the model is not impressive in other ways. Its ability to mimic the styles of different genres is quite good (although the "swing" example in the Long Generation section loses focus halfway through), and the style transfer elements are quite interesting as well. However, music generation models have a long way to go when it comes to idiomatic understanding of the structural elements of music.

22

Maximum-Nectarine-13 t1_j67czqw wrote

Here is a recent and similar text-to-music work; the generated music sounds better to me than MusicLM's. Check out the Waveform model at https://noise2music.github.io/

It doesn't have the full paper yet, so I'm copying its abstract here:

>We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music.
>
>We explore two options for the intermediate representation, one using a spectrogram and the other using audio with lower fidelity. We find that the generated audio is not only able to faithfully reflect key elements of the text prompt such as genre, tempo, instruments, mood and era, but goes beyond to ground fine-grained semantics of the prompt. Pretrained large language models play a key role in this story---they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.

3

knestleknox t1_j67ezg0 wrote

As someone who works a lot with both music and ML, I'm really excited to see these multi-modal approaches. The image description -> music generation was really cool to see. But it would be incredible to see a (good/large) multi-modal model that can go from audio -> image. Free album artwork and visualizations for all my songs.

3

Mysterious_Tekro t1_j67wmxf wrote

Most machine learning approaches to music are like telling a music theorist to design a circuit board. The results are hilarious. You need a synth architect.

2

ginsunuva t1_j67wxvf wrote

Who’s annotating music with these weird, non-intuitive text descriptions for training?

1

Taenk t1_j68a468 wrote

> Whenever I see music generation models, I immediately go to the "classical" examples (or as close to classical as are provided). The reason for this is that while some genres such as techno, drum 'n' bass, 8-bit, and hip hop are "simple" (from a music theory perspective), and other genres such as ambient, relaxing jazz, swing, and dream pop are vague enough that the model can get by just from spitting out the right general timbre, generating classical music requires understanding of structure, style, and form.

> Frankly, I'm not particularly impressed. […]

> […]

> This is not to say that the model is not impressive in other ways. Its ability to mimic the styles of different genres is quite good (although the "swing" example in the Long Generation section loses focus halfway through), and the style transfer elements are quite interesting as well. However, music generation models have a long way to go when it comes to idiomatic understanding of the structural elements of music.

It feels similar to earlier LLMs: by today's standards it is extremely easy to train a model that generates vaguely correct-looking text, in the sense that the words have reasonable lengths and the characters have a reasonable distribution. Only at later stages do models manage to output vaguely correct words with minor spelling mistakes, and at that point the grammar is still complete nonsense, as are the semantics. Only very recently did LLMs manage to stay coherent over larger blocks of text.

Relatedly, diffusion-based image generation has a similar thing going on: the textures are frighteningly good, but image composition and logic, not so much.

I think music-generating models are at the stage where they get the texture and the syllables right, that is, the overall sound, but not yet at the stage where the analogue of image composition and grammar is there, that is, chord progression, melody, themes, and overall composition.

4

sobo5o t1_j6a48bh wrote

That 808 on the rap song after the death metal song hits hard af.

1

sobo5o t1_j6a9f95 wrote

>we have no plans to release models at this point

Thank you for teasing, Google.

3

starstruckmon t1_j6d3lsr wrote

I can guarantee the next paper out of this Google team is going to be a diffusion model (instead of AudioLM) conditioned on MuLan embeddings.

The strength of the Google model is its text understanding, which comes from the MuLan embeddings, while the strength of the work you highlighted is the audio quality from the diffusion model.

It's the obvious next step, following the same path as DALL-E 1 -> DALL-E 2.
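To make that predicted architecture concrete, here is a toy sketch of a diffusion sampler conditioned on a MuLan-style embedding. The DDPM-style sampling loop is standard, but the denoiser and the embedding are untrained placeholders (neither MuLan nor such a decoder is publicly available), so this only illustrates where the conditioning would enter.

```python
import numpy as np

T = 50                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)      # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x, t, cond):
    """Placeholder epsilon-predictor. A real model would be a neural network
    taking the noisy audio (or latent) x, the timestep t, and the MuLan
    embedding as conditioning; this stand-in is untrained."""
    return 0.1 * x + 1e-3 * cond.mean()

def sample(cond, dim=16000, seed=0):
    """DDPM-style ancestral sampling conditioned on an embedding.
    dim=16000 would be one second of 16 kHz audio (an assumption)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)            # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, t, cond)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                       # add noise at every step except the last
            x = x + np.sqrt(betas[t]) * rng.normal(size=dim)
    return x

mulan_embedding = np.random.default_rng(1).normal(size=128)  # stand-in for a MuLan text embedding
waveform = sample(mulan_embedding)
print(waveform.shape)
```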

1

spb1 t1_j6dmxbm wrote

>Frankly, I'm not particularly impressed. For the piano snippets, it seems to have mixed in sounds from strings, and both the "professional piano player" and "crazy fast piano player" snippets are basically just random notes with no particular structure. Meanwhile, the "opera" snippet uses piano sounds, which are non-idiomatic to opera. The "string quartet" snippets are not idiomatic to the style of a string quartet (in particular, the "camptown races" snippet completely falls apart at the end, and the "fingerstyle guitar" snippet barely even sounds like string instruments).

I think we have to factor in the rate at which AI is improving. Listening to something like this for the first time shouldn't be the basis for a definitive opinion on AI music; it's rather a glimpse at the early stages and at what can generally be done. Consider where this technology will be in 5 years: it could easily be a significant game changer for music in various fields.

2