tahansa

tahansa t1_j63fqca wrote

"Is it a memorization machine or can it create new songs?"

From the paper:
"Memorization analysis. Figure 3 reports both exact and approximate matches when the length of the semantic token prompt is varied between 0 and 10 seconds. We observe that the fraction of exact matches always remains very small (< 0.2%), even when using a 10 second prompt to generate a continuation of 5 seconds. Figure 3 also includes results for approximate matches, using τ = 0.85. We can see a higher number of matches detected with this methodology, also when using only MuLan tokens as input (prompt length T = 0) and the fraction of matching examples increases as the length of the prompt increases. We inspect these matches more closely and observe that those with the lowest matching score correspond to sequences characterized by a low level of token diversity. Namely, the average empirical entropy of a sample of 125 semantic tokens is 4.6 bits, while it drops to 1.0 bits when considering sequences detected as approximate matches with matching score less than 0.5. We include a sample of approximate matches obtained with T = 0 in the accompanying material. Note that acoustic modeling carried out by the second stage introduces further diversity in the generated samples, also when the semantic tokens match exactly."
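For anyone wondering what "empirical entropy" of a token sequence means in practice, here is a rough sketch. It assumes the plain plug-in estimate (Shannon entropy of the token frequencies within one sequence); the quoted passage does not say exactly which estimator the authors used, and the token IDs below are made up for illustration.

```python
import math
from collections import Counter


def empirical_entropy_bits(tokens):
    """Shannon entropy, in bits, of the token frequencies in one sequence.

    Assumption: a plug-in estimate over the observed counts. The paper may
    compute its number differently.
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical 125-token sequences (not real semantic tokens).
diverse = list(range(25)) * 5        # 25 distinct tokens, evenly spread
repetitive = [7] * 100 + [3] * 25    # dominated by a single token

print(f"diverse:    {empirical_entropy_bits(diverse):.2f} bits")
print(f"repetitive: {empirical_entropy_bits(repetitive):.2f} bits")
```

The evenly spread sequence comes out at log2(25) bits, while the repetitive one lands below 1 bit. That is the sense in which the low-entropy approximate matches look more like repetitive, low-diversity sequences than like memorized songs.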

15

tahansa t1_j63f88w wrote

Incredible stuff.

Gotta get them copyright things solved with those visual NNs before these audio models hit the mainstream.

The progress of these audio models is getting me much more stoked than that of the image models.

28