Can someone suggest a machine learning model that can segment an audio spectrogram into multiple classes? I have labeled heartbeat data with four classes: S1, S2, systole, and diastole. How do I train a segmentation model?

Comments

Eresbonitaguey t1_j6qic3n wrote on February 1, 2023 at 4:14 AM

#1,694,247

Possibly not the ideal solution, but I would suggest taking sections of the spectrogram as images (perhaps with overlap) and feeding them into a multi-label classifier. If you're after a bounding box, the upper and lower bounds should be apparent from where your classes sit within the spectrogram, i.e. the sound intensity for a given class occurs in a similar frequency range. If you are transfer learning from a general image model, I would advise against using false colour to generate the three channels; instead, generate different types of spectrograms (reassignment method, multi-taper, etc.). Due to the nature of spectrograms you don't really want scale invariance, so segmentation models that use feature pyramids can be problematic. I had decent success with Compact Convolutional Transformers, but that may not be what you need for your task.
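
A minimal sketch of the overlapping-window idea, assuming the spectrogram is a NumPy array of shape (n_freq, n_frames). The function name, window length, and hop size are illustrative choices, not anything from the thread, and the random array only stands in for real data.

```python
import numpy as np

def sliding_windows(spec, win_frames, hop_frames):
    """Cut a (n_freq, n_frames) spectrogram into overlapping windows along time.

    Returns the windows, shape (n_windows, n_freq, win_frames), and the
    start frame of each window.
    """
    n_freq, n_frames = spec.shape
    if n_frames < win_frames:
        raise ValueError("spectrogram is shorter than one window")
    starts = np.arange(0, n_frames - win_frames + 1, hop_frames)
    windows = np.stack([spec[:, s:s + win_frames] for s in starts])
    return windows, starts

# Placeholder data standing in for a real spectrogram (64 bins, 100 frames).
spec = np.random.default_rng(0).random((64, 100))

# 20-frame windows with a 10-frame hop, i.e. 50% overlap.
windows, starts = sliding_windows(spec, win_frames=20, hop_frames=10)
print(windows.shape)  # (9, 64, 20)
```

Each window can then be treated as one image for the classifier, with `starts` mapping predictions back to positions in the recording.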

1bir t1_j6s0ito wrote on February 1, 2023 at 2:16 PM

#1,695,278

Possible solution:

1. Train MiniRocket/Hydra, which were designed for time series classification, on the labelled dataset (probably as four one-vs-rest problems, e.g. S1 vs the rest, S2 vs the rest, etc.).
2. You'll get sets of 1D convolutional kernels; these can be convolved with time series of any length.
3. Only one of these sets should 'fire' strongly for each heartbeat phase, so you should get a univariate signal for each phase.
4. Convolve these kernel sets with your unsegmented data.
5. Segment the data based on the strongest signal corresponding to the relevant phase of the heartbeat.

You may need to apply some transformations to the signals to get this to work well though (e.g. softmax and/or smoothing, or some kind of changepoint detection, which I don't know much about).
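
A rough sketch of the last step only, assuming you already have one score signal per phase stacked into an array of shape (4, n_frames). How those scores are produced is left open here; this does not implement MiniRocket or Hydra. The phase names, smoothing width, and toy scores are illustrative assumptions.

```python
import numpy as np
from itertools import groupby

PHASES = ["S1", "systole", "S2", "diastole"]  # assumed row order of the scores

def smooth(scores, width):
    """Moving-average smoothing of each per-phase score signal."""
    kernel = np.ones(width) / width
    return np.stack([np.convolve(row, kernel, mode="same") for row in scores])

def segments_from_scores(scores, width=5):
    """Pick the strongest smoothed signal per frame, then merge runs of
    identical labels into (phase, start_frame, end_frame) segments."""
    labels = smooth(scores, width).argmax(axis=0)
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((PHASES[label], start, start + length))
        start += length
    return segments

# Toy scores: four clean 10-frame phases plus a one-frame spurious spike.
scores = np.zeros((4, 40))
for phase_idx in range(4):
    scores[phase_idx, phase_idx * 10:(phase_idx + 1) * 10] = 1.0
scores[3, 15] = 1.2  # a stray "diastole" spike in the middle of systole

segments = segments_from_scores(scores)
print(segments)
```

Without smoothing, the spike at frame 15 would split the systole segment in two; the moving average suppresses it.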

jiamengial t1_j6sj3l2 wrote on February 1, 2023 at 4:19 PM

#1,695,930

Using something like a CTC loss might be a good shout - you could basically say you're doing "speech recognition", but instead of recognising (sub)words you're recognising classes
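
To illustrate what a CTC-style model outputs, here is a small sketch of the standard CTC collapse rule (merge repeated labels, then drop blanks) applied to made-up frame predictions. The label strings and the blank symbol are placeholders; no training or loss computation is shown.

```python
from itertools import groupby

BLANK = "_"  # placeholder symbol for the CTC blank

def ctc_collapse(frame_labels):
    """Merge consecutive repeats, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != BLANK]

# Made-up per-frame predictions for one heartbeat.
frames = ["S1", "S1", "_", "sys", "sys", "sys", "_", "_", "S2", "S2", "dia", "dia"]
collapsed = ctc_collapse(frames)
print(collapsed)  # ['S1', 'sys', 'S2', 'dia']
```

The collapsed output is the sequence of classes, not their boundaries, so timing information has to be recovered separately if it is needed.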

uhules t1_j6sp5wk wrote on February 1, 2023 at 4:57 PM

#1,696,153

Replying to jiamengial (#1,695,930)

CTC is better suited for unaligned sequences. If OP has precise timings for the sound events, plain frame-wise classification should work better.
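
For frame-wise classification, the annotated timings first have to become one label per spectrogram frame. A minimal sketch, assuming annotations are (start_s, end_s, name) tuples and that frame i covers the time from i * hop_s to (i + 1) * hop_s; the function name, the hop size, and the example timings are made up for illustration.

```python
def frame_labels(intervals, n_frames, hop_s, default="unlabeled"):
    """Convert timed annotations into one label per frame."""
    labels = [default] * n_frames
    for start_s, end_s, name in intervals:
        first = max(0, round(start_s / hop_s))
        last = min(n_frames, round(end_s / hop_s))
        for i in range(first, last):
            labels[i] = name
    return labels

# Hypothetical annotation of one cardiac cycle, 10 ms hop between frames.
intervals = [
    (0.00, 0.10, "S1"),
    (0.10, 0.30, "systole"),
    (0.30, 0.40, "S2"),
    (0.40, 0.80, "diastole"),
]
labels = frame_labels(intervals, n_frames=80, hop_s=0.01)
print(labels[9], labels[10], labels[30], labels[79])
```

These per-frame labels can serve as targets for a per-frame cross-entropy loss.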

uhules t1_j6spu3f wrote on February 1, 2023 at 5:01 PM

#1,696,189

What kind of model would work in this case is heavily dependent on data availability and the quality of your annotation. Check these datasets from Papers With Code and see whether any one of those is similar enough to your setting, and pick models or code from their leaderboards.

jiamengial t1_j6t854s wrote on February 1, 2023 at 6:52 PM

#1,696,929

Replying to uhules (#1,696,153)

That's true. I was thinking that flat frame-wise predictions could lead to incorrect mid-segment predictions, which might be an annoying model error to get.
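
One common way to reduce isolated mid-segment errors in frame-wise output is a sliding majority vote over neighbouring frames. This is a generic post-processing sketch, not something proposed in the thread; the window width and the example labels are assumptions.

```python
from collections import Counter

def mode_filter(labels, width=5):
    """Replace each frame label with the most common label in a window
    centred on it (the window is truncated at the sequence edges)."""
    half = width // 2
    out = []
    for i in range(len(labels)):
        window = labels[max(0, i - half): i + half + 1]
        out.append(Counter(window).most_common(1)[0][0])
    return out

# Made-up predictions with a single wrong "S2" frame inside S1.
predicted = ["S1"] * 6 + ["S2"] + ["S1"] * 3 + ["systole"] * 10
cleaned = mode_filter(predicted)
print(cleaned.count("S2"))  # 0
```

A wider window removes longer glitches but also blurs short genuine segments, so the width should stay well below the shortest expected S1 or S2 duration.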

True-Measurement-358 t1_j6tur8x wrote on February 1, 2023 at 9:11 PM

#1,697,800

Depending on the requirements of your use case, you could also consider using a statistical model for change point detection, like this example: https://centre-borelli.github.io/ruptures-docs/examples/music-segmentation/
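
To show the basic idea behind change point detection, here is a deliberately simplified NumPy stand-in that finds a single change in the mean of a 1D signal by least squares. It is not the ruptures API and not the method used in the linked example; the function name and the toy signal are illustrative.

```python
import numpy as np

def best_split(signal):
    """Return the index that splits a 1D signal into two parts with the
    smallest total squared deviation from each part's own mean."""
    best_idx, best_cost = None, np.inf
    for k in range(1, len(signal)):
        left, right = signal[:k], signal[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_idx, best_cost = k, cost
    return best_idx

# Toy signal: the level jumps from 0 to 1 at frame 30.
signal = np.concatenate([np.zeros(30), np.ones(20)])
change_point = best_split(signal)
print(change_point)  # 30
```

Real heart sound recordings contain many change points and noisier features, which is where a dedicated library with multiple-change-point search and penalty selection becomes useful.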