Viewing a single comment thread. View all comments

Oceanboi t1_j8l8tst wrote

I’m guessing your company won’t have the resources or data to train a CNN to convergence from scratch, so read up on some common CNNs that people use for audio transfer learning (EfficientNet has worked well for me, as did ResNet50, albeit less so). Once you can implement one pre trained model, you can implement most of them fairly easily to see which one suits your task best. Also read up on Sharan et al 2019 and 2021 as he benchmarks numerous image representations, model architectures, and network fusion techniques. While results may very, empirically it is a great starting point although I was not able to achieve his results given his model architecture. Pay less attention to the actual architecture he talks about because you’ll most likely be doing transfer learning where you’ll be importing a model and it’s weights. For preprocessing look into either MATLAB for their Auditory Modeling toolbox and if you’re using python look into librosa, torchaudio, and brian2hears for more complex filterbank models.

1