
Taenk t1_j68a468 wrote

> Whenever I see music generation models, I immediately go to the "classical" examples (or as close to classical as are provided). The reason for this is that while some genres such as techno, drum 'n' bass, 8-bit, and hip hop are "simple" (from a music theory perspective), and other genres such as ambient, relaxing jazz, swing, and dream pop are vague enough that the model can get by just from spitting out the right general timbre, generating classical music requires understanding of structure, style, and form.

> Frankly, I'm not particularly impressed. […]

> […]

> This is not to say that the model is not impressive in other ways. Its ability to mimic the styles of different genres is quite good (although the "swing" example in the Long Generation section loses focus halfway through), and the style transfer elements are quite interesting as well. However, music generation models have a long way to go when it comes to idiomatic understanding of the structural elements of music.

It feels similar to earlier LLMs: by today's standards it is extremely easy to build a model that produces vaguely correct-looking text, in the sense that the words have reasonable lengths and the characters follow a reasonable distribution. Only at later stages do models manage to output roughly correct words with minor spelling mistakes, while the grammar, let alone the semantics, is still complete nonsense. Only very recently have LLMs managed to stay coherent over larger blocks of text.

Relatedly, diffusion-based image generation shows a similar pattern: textures are frighteningly good; image composition and logic, not so much.

I think music generation models are at the stage where they get the texture and the syllables right, that is, the overall sound, but not yet at the stage where the equivalent of image composition and grammar is there, that is, chord progression, melody, themes, and overall form.
