picardythird t1_j66kza2 wrote

Whenever I see music generation models, I immediately go to the "classical" examples (or as close to classical as are provided). The reason for this is that while some genres such as techno, drum 'n' bass, 8-bit, and hip hop are "simple" (from a music theory perspective), and other genres such as ambient, relaxing jazz, swing, and dream pop are vague enough that the model can get by just from spitting out the right general timbre, generating classical music requires understanding of structure, style, and form.

Frankly, I'm not particularly impressed. For the piano snippets, it seems to have mixed in sounds from strings, and both the "professional piano player" and "crazy fast piano player" snippets are basically just random notes with no particular structure. Meanwhile, the "opera" snippet uses piano sounds, which are non-idiomatic to opera. The "string quartet" snippets are not idiomatic to the style of a string quartet (in particular, the "camptown races" snippet completely falls apart at the end, and the "fingerstyle guitar" snippet barely even sounds like string instruments).

I'm also not especially convinced by the Painting Caption Conditioning section. I suspect that there is quite a bit of Barnum Effect going on here; the captions are primed to be accepted as corresponding to the "correct" paintings because they are presented that way, but this is just a framing device. As a self-experiment, play a track from any of the paintings, and look at any of the other paintings. Can you really say that the track could not feasibly correspond to the "other" painting? (Also, as someone who has literally written a piece of music inspired by the Caspar David Friedrich painting, I find myself unconvinced by the model's interpretation... but this is a wholly subjective critique).

This is not to say that the model is not impressive in other ways. Its ability to mimic the styles of different genres is quite good (although the "swing" example in the Long Generation section loses focus halfway through), and the style transfer elements are quite interesting as well. However, music generation models have a long way to go when it comes to idiomatic understanding of the structural elements of music.

22

Taenk t1_j68a468 wrote

> Whenever I see music generation models, I immediately go to the "classical" examples (or as close to classical as are provided). The reason for this is that while some genres such as techno, drum 'n' bass, 8-bit, and hip hop are "simple" (from a music theory perspective), and other genres such as ambient, relaxing jazz, swing, and dream pop are vague enough that the model can get by just from spitting out the right general timbre, generating classical music requires understanding of structure, style, and form.

> Frankly, I'm not particularly impressed. […]

> […]

> This is not to say that the model is not impressive in other ways. Its ability to mimic the styles of different genres is quite good (although the "swing" example in the Long Generation section loses focus halfway through), and the style transfer elements are quite interesting as well. However, music generation models have a long way to go when it comes to idiomatic understanding of the structural elements of music.

It feels similar to earlier LLMs: it is, by today's standards, extremely easy to build a model that generates vaguely correct-looking text, in the sense that the words have reasonable lengths and the characters have a reasonable distribution. Only at later stages do models manage to output vaguely correct words with minor spelling mistakes; at that point the grammar is still complete nonsense, as are the semantics. Only very recently did LLMs manage to stay coherent over larger blocks of text.

Relatedly, diffusion-based image generation has a similar thing going on: textures are frighteningly good. Image composition and logic, not so much.

I think music generation models are at the stage where they get the texture and the syllables right, that is, the overall sound, but not yet the analogue of image composition and grammar, that is, chord progression, melody, themes, and overall composition.

4

spb1 t1_j6dmxbm wrote

>Frankly, I'm not particularly impressed. For the piano snippets, it seems to have mixed in sounds from strings, and both the "professional piano player" and "crazy fast piano player" snippets are basically just random notes with no particular structure. Meanwhile, the "opera" snippet uses piano sounds, which are non-idiomatic to opera. The "string quartet" snippets are not idiomatic to the style of a string quartet (in particular, the "camptown races" snippet completely falls apart at the end, and the "fingerstyle guitar" snippet barely even sounds like string instruments).

I think we have to factor in the rate at which AI is improving. A first listen to something like this shouldn't be the basis for a definitive opinion on AI music; rather, it's a glimpse at the early stages and at what can generally be done. Consider where this technology will be in five years: it could easily be a significant game changer for music in various fields.

2