
HGFlyGirl t1_j2aen6d wrote

I trained a model to find duplicate music files in my brother's huge collection of digital music. He was frustrated by so many duplicates that still had different file names, file sizes and tags. We couldn't find any existing software that could do it, because it all just looked for matches on those parameters. The model ended up working quite well.

5

iantimmis t1_j2bb6x0 wrote

How did you set it up?

1

HGFlyGirl t1_j2cpl6v wrote

For each pair of files, I took the length of each filename, the Levenshtein distance between the two filenames, their sizes in bytes and their durations in Ticks.
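To make the feature list concrete, here is a rough Python sketch of how one row per pair could be built. It is not the original code (which was presumably .NET): the `Track` record, the `pair_features` helper and the feature order are made up for illustration, and the duration is assumed to be in .NET-style ticks of 100 ns each.

```python
from dataclasses import dataclass


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


@dataclass
class Track:
    # Hypothetical record; size and duration are assumed to be read elsewhere.
    filename: str
    size_bytes: int
    duration_ticks: int  # assumption: .NET-style ticks, 100 ns each


def pair_features(x: Track, y: Track) -> list[float]:
    """One feature row for a candidate pair, in an arbitrary illustrative order."""
    return [
        float(len(x.filename)),
        float(len(y.filename)),
        float(levenshtein(x.filename, y.filename)),
        float(x.size_bytes),
        float(y.size_bytes),
        float(x.duration_ticks),
        float(y.duration_ticks),
    ]
```

Each row would then be paired with a hand-assigned duplicate / not-duplicate label before training.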

I used the ML.NET AutoML API to train a binary classifier.
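For anyone who wants to see what "binary classifier" means here without ML.NET, below is a minimal stand-in in plain Python. It is a hand-rolled logistic regression, not AutoML and not the ML.NET API, and AutoML may well have picked a different kind of model. The toy rows are invented, and they use rescaled gaps (values between 0 and 1) rather than raw bytes and ticks, which is an assumption made so that simple gradient descent behaves.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=5000, lr=1.0):
    """Fit weights and a bias with plain batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for k, v in enumerate(x):
                grad_w[k] += error * v
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x) -> bool:
    """True means the pair is predicted to be a duplicate."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5


# Invented toy rows (not real data):
# [name distance / longer name length, relative size gap, relative duration gap]
TOY_ROWS = [
    [0.9, 0.02, 0.00],  # labelled duplicate despite very different names
    [0.8, 0.05, 0.01],  # labelled duplicate
    [0.1, 0.01, 0.00],  # labelled duplicate
    [0.9, 0.60, 0.40],  # labelled not a duplicate
    [0.2, 0.50, 0.55],  # labelled not a duplicate
    [0.7, 0.45, 0.30],  # labelled not a duplicate
]
TOY_LABELS = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(TOY_ROWS, TOY_LABELS)
```

Calling `predict(weights, bias, row)` on a new row then gives a duplicate / not-duplicate guess for that pair.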

1

Apprehensive_Maize_4 t1_j2du7wc wrote

>duplicates

fdupes or any of the programs here didn't work for you?

https://www.tecmint.com/find-and-delete-duplicate-files-in-linux/

1

HGFlyGirl t1_j2ewwry wrote

Tried a few of these things. The problem was that a lot of the songs had been ripped from CDs using different software. So some would be called things like track01.mp3, while the duplicate had a completely different file name. These could also have different byte lengths and durations. Then there are the ones that come from the original recording, the live version and/or the compilation album, which often differ a bit in all the parameters.

1