Submitted by happyhammy t3_zm51z0 in MachineLearning

My theory:

  • no good datasets, as opposed to image datasets like LAION
  • music datasets are harder (or outright illegal) to get; shady methods like torrenting are usually required to assemble large ones. The only music datasets I've found are classical, and even those are very limited, since performances of classical works are still copyrighted.

Therefore, large companies like OpenAI/Google are unwilling to take the legal risk of making a good generative music AI. Startups have a better chance because they have less to lose and can more easily hide the fact that they trained their model on copyrighted material.

Other than that, I don't believe audio is more challenging to process than images, because an audio file can be reduced to its spectrogram, which is just a 2D image.
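
For example, the kind of reduction I mean is only a few lines in practice. A minimal sketch with librosa (the filename and parameters are just placeholders):

```python
import librosa
import numpy as np

y, sr = librosa.load("track.wav", sr=22050)              # waveform + sample rate
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))  # magnitude spectrogram
img = librosa.amplitude_to_db(S, ref=np.max)             # log scale; behaves like a 2D image
print(img.shape)                                         # (freq_bins, time_frames)
```

Note the magnitude array throws away phase, but the point stands: you end up with a 2D array that image-style models can ingest.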

TLDR: No good datasets

4

Comments


abriec t1_j0b6frn wrote

What is “good” music?

Certainly not the full picture, but imo one of the reasons we don't see music generation taking off the same way as image/text is that it's more difficult to evaluate, and therefore to benchmark and iterate on.

It faces similar challenges as generative modelling in other modalities, but is arguably more subjective and time-consuming, and requires more training if using human evaluation. A layperson can easily tell if an image or a piece of text is "good"; it's more complex for music once it gets above a certain minimum quality threshold.

From a business perspective it’s harder to sell too given the scope of applications (relative to language and vision), as interesting as the problem sounds to us.

Plus, echoing the other comment, I feel it's reductive to flatten music into spectrograms when there are so many interleaved elements. My intuition is it'd be better to model dependencies between individual "tracks" as well. I'm sure there's already extensive work on music generation with good results, just not quite in the spotlight yet.

3

Osemwaro t1_j0c0fp2 wrote

If by "a layperson can easily tell if an image or text is “good”", you mean a layperson can easily tell if the image depicts a physically plausible or photo-realistic scene, or if the text makes sense, then I agree that music is harder in this sense. The closest musical analogy for these quality issues is perhaps telling whether or not the instruments sound realistic, and laypeople don't spend enough time focusing on the sound of real instruments to be really good at this.

But if you're talking about judging artistic merit, then I don't think a layperson is any better at doing this with images and text than they are with music. Artistic judgement is extremely subjective across all fields of artistic expression, and experts in these fields often disagree with each other, or with the general public, about what's good and what isn't. E.g. compare the popularity of Fifty Shades of Grey to its critical reception.

There's a massive commercial demand for music in TV, films, advertising, games, theatre and online video creation too, so I don't think it would be that hard to make a business case for it, if the data was readily available.

1

fimari t1_j0ci3h6 wrote

I disagree: there is probably more usable music than usable text. But music has a time dimension, which is a hard problem (see moving images), and music is really sensitive to errors. A wrong tone destroys everything, while our visual system is quite forgiving. We like AI pictures even if they're totally incongruous, but we'd probably ditch a piece of music if a screeching sound got added. Music is quite mathematical in nature; ML can take a shot at it after it masters the five-finger problem ;)

1

Ronny_Jotten t1_j0cizez wrote

Your theories are somewhat naive. Large companies like Google have no problem getting access to all the music they want. And nobody tries to "hide the fact that they trained their model with copyrighted material". The current state of AI training seems to be that copyright is irrelevant, and it's fair use - though we'll see whether that holds up in court. Nearly everything in LAION is copyrighted images scraped from the web, and they are used without permission for training. Furthermore, anyone can use the Million Song Dataset, and get access to the actual tracks through an API.

Million-song dataset: take it, it’s free | Ars Technica
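
For instance, here's a rough sketch of poking at one of those per-track files with h5py (assuming the standard MSD HDF5 layout; the track filename is a placeholder):

```python
import h5py

with h5py.File("TRAXLZU12903D05F94.h5", "r") as f:
    meta = f["metadata"]["songs"][0]      # compound row: artist, title, release, ...
    analysis = f["analysis"]["songs"][0]  # compound row: tempo, key, duration, ...
    print(meta["artist_name"], meta["title"])
    print("tempo:", analysis["tempo"], "duration:", analysis["duration"])
```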

On the other hand, the idea of turning audio into a 2D spectrogram image, and using the same tools as image-generating AIs, is also naive. Music generation requires a very different approach. There are a multitude of AI music-generation projects, some using GANs. So far, the results have not been as astonishing as the image generators. But that's only a matter of degree, and probably a matter of time.

4

evanthebouncy t1_j0d1op6 wrote

I work a lot with human AI communication, here's my take.

The issue is our judgement (think: value function) of what's good. It has less to do with what the AI can actually do, and more with how its output is judged by people.

Random blotches of color, shaped in an interesting way on a canvas, count as modern art. They're non-intrusive and fun to look at. A painting with less-than-perfect details, such as goblin hands with six fingers (as AI-generated art often has), isn't a big deal as long as the overall painting is cool-looking.

A musical phrase with one wrong note, one missed beat, one sound out of the groove would sound like absolute garbage. We expect music to uphold this high quality all the way through, all 5 minutes; no 'mistakes' are allowed. So any detail the AI gets 'wrong' will be particularly jarring. You can mitigate some of the low-level errors by forcing the AI to produce music in a symbolic format such as MIDI, but the overall issue of cohesion will still be there.
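
To make the symbolic-format point concrete, here's a toy sketch with pretty_midi, where a hard-coded note list stands in for hypothetical model output:

```python
import pretty_midi

# hypothetical model output: (pitch, start_sec, duration_sec) triples
notes = [(60, 0.0, 0.5), (64, 0.5, 0.5), (67, 1.0, 1.0)]  # C4, E4, G4

pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
for pitch, start, dur in notes:
    piano.notes.append(
        pretty_midi.Note(velocity=90, pitch=pitch, start=start, end=start + dur))
pm.instruments.append(piano)
pm.write("generated.mid")  # every note is a valid pitch by construction
```

Every pitch is in tune by construction; what MIDI can't guarantee is that the phrase as a whole coheres.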

Overall, generative AI lacks control and finesse over the details, and lacks logical cohesion. These aren't problems for paintings as much as they are for music.

1

Single_Instruction45 t1_j0dbqyb wrote

I've been researching how to make good music with AI and the following points come up constantly.

Firstly, music generation is very different from image generation. When generating an image you have one idea or concept to generate (I'm simplifying, but bear with me), while when generating music you have many ideas generated at the same time (melody, rhythm tracks, harmony), all progressing through time.
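
A rough sketch of what I mean, using a multi-track piano-roll representation (all sizes are illustrative):

```python
import numpy as np

n_tracks, n_pitches, n_steps = 3, 128, 64       # melody, harmony, rhythm
roll = np.zeros((n_tracks, n_pitches, n_steps), dtype=np.uint8)

roll[0, 72, 0:8] = 1              # melody: C5 held for 8 steps
roll[1, [60, 64, 67], 0:16] = 1   # harmony: a C major chord underneath
roll[2, 36, ::4] = 1              # rhythm: a kick on every 4th step
# a generative model has to keep all three layers coherent at every time step
```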

Secondly, music is mostly a symbolic language that translates to sound. When you try to capture it from an audio file, most of this symbolic data is hard to retrieve in its original form. There are good algorithms to translate audio to MIDI, but we are far from perfect on that front.
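
As a sketch of why that retrieval is lossy, here's a monophonic pitch-tracking example with librosa's pYIN implementation (the filename is a placeholder, and the input is assumed to be a single melodic line):

```python
import librosa
import numpy as np

y, sr = librosa.load("melody.wav")  # placeholder file, assumed monophonic
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"))
midi = np.round(librosa.hz_to_midi(f0[voiced])).astype(int)
print(midi[:16])  # note numbers only: timing, dynamics and polyphony are gone
```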

When comparing music generation to text generation: the context of a word in a sentence only involves the elements before and after it. For music this is much more complex, as we have to consider polyphony and other instruments as well as harmony. That's a much harder problem to tackle.

Finally, as many others have indicated, judging whether music is good is a very subjective task.

Taking all that into account, I feel that research on music generation in the AI world is lacking, but also that this is a very hard problem, and tackling it might produce AI architectures even better than what we have now. That's why I believe a lot more research should be done on this subject.

1

Ronny_Jotten t1_j0hgi63 wrote

It depends what you mean by "AI", but there are already generative music systems that produce far better music than that.

Spectral analysis/resynthesis is certainly important. There have long been tools like MetaSynth that let you do image processing on spectrograms. It's interesting that the "riffusion" project works at all, and it's a valuable piece of research. I can imagine the technique being useful for musicians as a way to generate novel sounds to be incorporated into larger compositions.
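
As a sketch of that spectrogram-as-image workflow (placeholder filename; Griffin-Lim is just one common choice for phase reconstruction, and its estimated phase is one reason purely image-domain approaches can sound artifact-y):

```python
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("loop.wav")      # placeholder file
S = np.abs(librosa.stft(y))           # keep magnitude, throw away phase
S[200:400, :] *= 0.1                  # "image edit": attenuate a band of frequencies
y_hat = librosa.griffinlim(S)         # estimate phase, invert back to audio
sf.write("edited_loop.wav", y_hat, sr)
```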

But it's difficult to see how it can be used successfully on entire, already-mixed-down pieces, to generate a complete piece of music in that way. Although it can produce some interesting and strange loops, it's hard to call the output that riffusion produces "music" in the sense of an overall composition, and I'm skeptical that this basic technique can be tweaked to do so. I could be wrong, but I still think it's a naive approach, and any actually listenable music-generation system will be based on rather different principles.

3

Eriane t1_j19q775 wrote

Google has something called Perceiver AR, but I don't know much about it. Honestly, it's hard to find any information about it other than the GitHub repo, and I have no idea how to get it installed on a Windows machine.

I have seen some really neat commercial applications, but I have no idea if they actually work. Here's a video of TwoSetViolin listening to some amazing classical pieces made by AI: https://www.youtube.com/watch?v=R69JYEfCSeI Which AI generated them? I have no idea.

Andrew Southworth talks about various apps you can use to help with music and voices https://www.youtube.com/watch?v=A656GMQt5d0

Music isn't complicated; it's all the same. That's why everything can be written on sheet music and played the same anywhere by anyone who can read and play. Pop songs all sound like pop songs; country songs all sound like country. They all use the same beats, instruments, and so on. In a sense, I can imagine music composition being far, far easier than drawing, but at the same time the complexities of machine learning could make it harder.

I don't know exactly why music generation is lagging behind, but perhaps it's like a language: GPT-3's model is something like 800 GB+ in size, and it's still not quite there. Maybe music generation is hard in the same way generating fingers in an image is, lol.

1