Eresbonitaguey t1_j6qic3n wrote on February 1, 2023 at 4:14 AM

This may not be the ideal solution, but I would suggest taking sections of the spectrogram as images (perhaps with overlap) and feeding them into a multi-label classifier. If you’re after a bounding box, the upper and lower bounds should be apparent from where your classes sit within the spectrogram, i.e. the sound energy for a given class tends to occur in a similar frequency range. If you’re transfer learning from a general image model, I would advise against using false colour to generate the three channels and would instead generate a different type of spectrogram for each channel (reassignment method, multitaper, etc.). Due to the nature of spectrograms you don’t really want scale invariance, since position and extent along the frequency axis carry meaning, so segmentation models that use feature pyramids can be problematic. I had decent success with Compact Convolutional Transformers, but that may not be what you need for your task.
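
To make the windowing idea concrete, here is a minimal NumPy sketch, not a definitive implementation. It assumes the spectrogram is a 2-D array shaped (frequency bins, time frames) and that sections are cut along the time axis. The names `sliding_windows` and `frequency_bounds`, and the `width`, `hop` and `energy_fraction` parameters, are hypothetical. The energy-based bounds are just one rough heuristic I'm assuming for illustration; a fixed frequency band per class would work too.

```python
import numpy as np


def sliding_windows(spec, width, hop):
    """Cut a (freq_bins, time_frames) spectrogram into overlapping windows.

    Windows overlap whenever hop < width. Returns an array shaped
    (n_windows, freq_bins, width) and the start frame of each window.
    """
    n_freq, n_time = spec.shape
    if n_time < width:
        raise ValueError("spectrogram is shorter than one window")
    starts = np.arange(0, n_time - width + 1, hop)
    windows = np.stack([spec[:, s:s + width] for s in starts])
    return windows, starts


def frequency_bounds(window, energy_fraction=0.9):
    """Rough (low_bin, high_bin) band holding most of the window's energy.

    Assumed heuristic: trim an equal share of energy from the bottom and
    top of the frequency axis. Returns None for an all-zero window.
    """
    profile = window.sum(axis=1)      # total energy per frequency bin
    total = profile.sum()
    if total == 0:
        return None
    cdf = np.cumsum(profile) / total  # cumulative energy from low to high
    tail = (1.0 - energy_fraction) / 2.0
    low = int(np.searchsorted(cdf, tail))
    high = int(np.searchsorted(cdf, 1.0 - tail))
    return low, high
```

Each window would go to the multi-label classifier, and for windows where a class is predicted, the window's start frame plus the frequency band gives you an approximate box. With `hop = width // 2` you get 50% overlap, which is a common starting point but still something to tune for your data.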