I see a lot of image models (ImageNet, ResNet, etc) that are being used for transfer learning in the audio classification domain. I only see one audio specific model that many people use for audio: YAMNet.

I would think that taking a network trained on a specific visual domain and repurposing its classifier head to solve an audio problem using cochleagrams or spectrograms would be inappropriate, given that the edges and shapes found in, say, a flower mean nothing when comparing patterns across spectral visual representations of audio.

I would also think that taking ResNet and training the entire model (all parameters in the convolutional base AND the classifier head) would simply be starting from a nonsensical point in terms of saved weights, and you may be better off starting from scratch.

Am I missing something about transfer learning here? Or am I spot on in thinking it's a bit inappropriate, given that the problem domains are different?

My project is to compare different cochlear models (filters such as DRNL, Gammachirp, Gammatone, etc.) in Brian2Hears (a Python library) as inputs to a CNN. I need to identify a good model architecture, or set of architectures, that I can use as my baseline for comparing performance. Unfortunately, YAMNet takes raw audio as input and converts it to a spectrogram as part of the model itself (I think), so it would not be usable as-is for my experiment.
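
For concreteness, here is a rough sketch of the kind of 2-D input such an experiment would feed to a CNN. It is plain NumPy and does not use the Brian2Hears API; the function names (`erb`, `gammatone_ir`, `cochleagram`), the hop size, and the channel spacing are all illustrative assumptions. It uses the standard fourth-order gammatone impulse response with the Glasberg and Moore ERB bandwidth.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, sr, dur=0.05, order=4):
    """Unit-energy gammatone impulse response centred on fc (hypothetical helper)."""
    t = np.arange(int(sr * dur)) / sr
    b = 1.019 * erb(fc)  # bandwidth parameter commonly used for order 4
    ir = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return ir / np.sqrt(np.sum(ir ** 2))

def cochleagram(wave, sr, centre_freqs, hop=80):
    """Log frame energy per channel, shape (n_channels, n_frames)."""
    rows = []
    for fc in centre_freqs:
        band = np.convolve(wave, gammatone_ir(fc, sr), mode="same")
        n_frames = band.size // hop
        frames = band[: n_frames * hop].reshape(n_frames, hop)
        rows.append(np.log((frames ** 2).mean(axis=1) + 1e-10))
    return np.stack(rows)

# Toy input: one second of a 1 kHz tone (assumed sample rate of 8 kHz).
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
centre_freqs = np.geomspace(100.0, 3500.0, 32)  # assumed channel spacing
cgram = cochleagram(tone, sr, centre_freqs)
best_cf = centre_freqs[cgram.mean(axis=1).argmax()]
```

Whatever cochlear model produces it, the result is a fixed-shape (channels, frames) array, so the same CNN architecture can be reused across representations.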

Comments

asdfzzz2 t1_ixjh3ph wrote on November 23, 2022 at 10:10 PM

> Am I missing something about transfer learning here?

Theoretical answer: Both images and spectrograms have continuous curved lines as a signal, and therefore some transfer should happen.

Practical answer: If it works, it works.

Oceanboi OP t1_ixkfcoh wrote on November 24, 2022 at 2:38 AM

So all we really know is that if a model has been trained on some previous task, there’s some arbitrary probability that it can be used for another problem, regardless of image contents or problem domain?

Think_Olive_1000 t1_ixmm3a2 wrote on November 24, 2022 at 4:28 PM

Yes, but how well it works will be limited by whether you can find and exploit a similarity between the tasks.

Tangentially related: when OpenAI was training its speech recognition model, Whisper, they found that training the model to perform translation also inexplicably improved the model's performance in plain English transcription.

Ok_Construction470 t1_ixlephi wrote on November 24, 2022 at 8:55 AM

Note that spectrograms are NOT images, though; their element values can be negative, whereas image pixel values can't.

Having said that, I work in the audio domain and have applied a computer vision transformer, the Swin (shifted window) one, to audio, specifically to spectrograms extracted from the raw waveform.

This was the OG paper https://arxiv.org/abs/2202.00874

They used the pretrained model too

hadaev t1_ixltjo0 wrote on November 24, 2022 at 12:24 PM

Just rescale it to [-1, 1], like people do for images.
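
A minimal sketch of that rescaling, assuming a log-scaled spectrogram stored as a 2-D NumPy array. The helper names and the channel-repeat step (to match the 3-channel input most pretrained image models expect) are illustrative assumptions, not something prescribed in this thread.

```python
import numpy as np

def rescale_to_unit_range(spec, eps=1e-8):
    """Min-max rescale a 2-D spectrogram to the range [-1, 1]."""
    spec = np.asarray(spec, dtype=np.float32)
    lo, hi = spec.min(), spec.max()
    # eps guards against division by zero for a constant input
    return 2.0 * (spec - lo) / (hi - lo + eps) - 1.0

def to_three_channels(spec):
    """Repeat a (freq, time) array to (3, freq, time) for RGB-style models."""
    return np.repeat(spec[np.newaxis, :, :], 3, axis=0)

# Toy log-spectrogram with negative values (e.g. dB relative to full scale).
rng = np.random.default_rng(0)
log_spec = rng.normal(loc=-40.0, scale=15.0, size=(64, 100))

scaled = rescale_to_unit_range(log_spec)
image_like = to_three_channels(scaled)
```

Per-example min-max scaling is only one option; a fixed dB floor and ceiling shared across the dataset would keep relative loudness between clips.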

GeneralBh t1_ixkahfh wrote on November 24, 2022 at 1:58 AM

Could you please point to some papers for "I see a lot of image models (ImageNet, ResNet, etc) that are being used for transfer learning in the audio classification domain"?

I thought YAMNet was trained on Audioset from scratch. Could you please point to the paper which uses a pretrained image model to train the YAMNet?

Oceanboi OP t1_ixkc2dn wrote on November 24, 2022 at 2:11 AM

It is trained on AudioSet. I listed YAMNet to highlight the lack of large audio models when compared to image models. And highlight the problem that it limits your data input due to its architecture.

Also, I mainly see transfer learning for CNNs in Kaggle notebooks, and I could find a few papers where an image network is one of the models used.

https://arxiv.org/pdf/2007.07966

https://research.google/pubs/pub45611/

These are just a few but it seems decently common.

GeneralBh t1_ixkimkd wrote on November 24, 2022 at 3:05 AM

Thank you! There are a few works on huge audio models, e.g. https://arxiv.org/pdf/2109.13226.pdf, that might be interesting.

hadaev t1_ixlti0s wrote on November 24, 2022 at 12:24 PM

> and you may be better off starting from scratch.

Basically, you are comparing random weights with well-trained weights. Why should the latter be worse?

Oceanboi OP t1_ixlzu9h wrote on November 24, 2022 at 1:31 PM

Maybe not always, but couldn't you argue that good trained weights for one task may not carry over well to another?

asdfzzz2 t1_ixm6knh wrote on November 24, 2022 at 2:30 PM

Is there any reason for them to be worse than random weights? Because if not, then you have no reason not to use pretrained models.

hadaev t1_ixmajnt wrote on November 24, 2022 at 3:02 PM

To add to that, non-random weights might be worse for tiny/simple models.

But modern vision models should be fine with it.

For example, BERT text weights are a good starting point for image classification.

NLP_doofus t1_ixmjp7q wrote on November 24, 2022 at 4:10 PM

Dunno much about audio event classification specifically, but why not use an architecture built for speech, like APC, Mockingjay, wav2vec, HuBERT, etc.? These would also convert the raw waveform to mel-spectrograms. But I don't see why preprocessing the raw waveform is a big deal, as most tasks seem to benefit from it. I'd be surprised if large image nets performed better than any of these models, but I'd love to hear about it if they do. I'd imagine at a minimum you'd need to fine-tune on your data regardless.

Oceanboi OP t1_ixpeeaf wrote on November 25, 2022 at 6:36 AM

The problem is that I am testing different data representations of audio, so the preprocessing is what I want to experiment with.

NLP_doofus t1_ixpms9z wrote on November 25, 2022 at 8:31 AM

Ah, I missed that point, sorry. So you want to start from scratch for all models? Otherwise it seems like there would be confounding variables in what you're testing (e.g., pretraining dataset size). I've worked with some of the models I mentioned, and I think if you're just changing input/output shapes it shouldn't matter when starting from scratch. Unless there is something fundamental about losing the time axis of speech in your data representations, because these models use autoregressive or masked-modeling approaches for representation learning.

Oceanboi OP t1_ixtiolt wrote on November 26, 2022 at 5:44 AM

Not from scratch for all, I simply want to take a base model or a set of base model architectures and compare how different audio representations (cochleagram, and other cochlear models) perform in terms of accuracy/model performance. That’s what got me to look into transfer learning and hence the question! I need some constant set of models to use for my comparisons.
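
One way to set up such a comparison is to hold the model, data split, and seed fixed and vary only the front end. The sketch below is a toy version under loud assumptions: a nearest-centroid classifier stands in for the CNN, two log-spectrum front ends with different frame lengths stand in for the cochlear models, and the data is synthetic. Every name here is hypothetical.

```python
import numpy as np

def make_toy_dataset(n_per_class=40, sr=8000, dur=0.25, seed=0):
    """Two classes of noisy tones (440 Hz vs 880 Hz); synthetic stand-in data."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sr * dur)) / sr
    X, y = [], []
    for label, freq in enumerate((440.0, 880.0)):
        for _ in range(n_per_class):
            X.append(np.sin(2 * np.pi * freq * t) + 0.5 * rng.standard_normal(t.size))
            y.append(label)
    return np.array(X), np.array(y)

def log_spectrum(frame_len):
    """Build a front end: mean log-magnitude spectrum over non-overlapping frames."""
    def front_end(wave):
        n_frames = wave.size // frame_len
        frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
        mag = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
        return np.log(mag + 1e-6).mean(axis=0)
    return front_end

def evaluate(front_end, X, y, seed=0):
    """Same split and same classifier for every front end; returns test accuracy."""
    feats = np.stack([front_end(w) for w in X])
    order = np.random.default_rng(seed).permutation(len(y))
    split = len(y) // 2
    train, test = order[:split], order[split:]
    # Nearest-centroid classifier: a placeholder for the fixed baseline model.
    centroids = np.stack([feats[train][y[train] == c].mean(axis=0) for c in np.unique(y)])
    dists = np.linalg.norm(feats[test][:, None, :] - centroids[None, :, :], axis=2)
    return float((dists.argmin(axis=1) == y[test]).mean())

X, y = make_toy_dataset()
front_ends = {"frames_128": log_spectrum(128), "frames_512": log_spectrum(512)}
results = {name: evaluate(fe, X, y) for name, fe in front_ends.items()}
```

Because the split and classifier are identical across runs, any difference in `results` is attributable to the representation. With pretrained weights in the loop, the pretraining data becomes an extra variable to account for.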