NLP_doofus t1_ixpms9z wrote

Ah, I missed that point, sorry. So you want to train all of the models from scratch? Otherwise there would be confounding variables in your comparison (e.g., pretraining dataset size). I've worked with some of the models I mentioned, and if you're only changing the input/output shapes, it shouldn't matter much when training from scratch. The exception would be if your data representation loses something fundamental about the time axis of speech, since these models are autoregressive or masked-modeling approaches to representation learning.
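
To make that concrete, here's a minimal from-scratch sketch in PyTorch (the dims and class count are made up): only the input projection and output head change shape with the data representation, so the rest of the comparison stays apples-to-apples.

```python
import torch
import torch.nn as nn

class SmallEncoder(nn.Module):
    def __init__(self, feat_dim: int, n_classes: int, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)  # input shape changes here
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_classes)  # output shape changes here

    def forward(self, x):  # x: (batch, time, feat_dim)
        h = self.encoder(self.proj(x))
        return self.head(h.mean(dim=1))  # pool over the time axis

# e.g., 80 mel bins vs. some other frame-level representation
model_a = SmallEncoder(feat_dim=80, n_classes=10)
model_b = SmallEncoder(feat_dim=128, n_classes=10)
```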


NLP_doofus t1_ixmjp7q wrote

I dunno much about audio event classification specifically, but why not use a speech architecture like APC, MockingJay, wav2vec, or HuBERT? Some of these take mel-spectrograms as input, while others (wav2vec, HuBERT) operate on the raw waveform directly. Either way, I don't see why preprocessing the raw waveform is a big deal, since most tasks seem to benefit from it. I'd be surprised if large image nets performed better than any of these models, but I'd love to hear about it if they do. At a minimum, I'd imagine you'd need to finetune on your data regardless.
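
For reference, the preprocessing is only a couple of lines, and finetuning a pretrained checkpoint isn't much more. A rough sketch with torchaudio and HF transformers (the file path, checkpoint name, and label count are just examples):

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Mel-spectrogram preprocessing (hypothetical file and parameters)
waveform, sample_rate = torchaudio.load("clip.wav")
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
)(waveform)                        # (channels, n_mels, frames)
log_mel = torch.log(mel + 1e-6)    # log compression is standard

# Finetuning a pretrained speech model on your labels
extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/hubert-base-ls960", num_labels=10
)
inputs = extractor(waveform.squeeze(0).numpy(), sampling_rate=sample_rate,
                   return_tensors="pt")
logits = model(**inputs).logits    # train this head on your labeled events
```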
