jiamengial t1_j6sj3l2 wrote on February 1, 2023 at 4:19 PM

Using something like a CTC loss might be a good shout - you could basically say you're doing "speech recognition", but instead of recognising (sub)words you're recognising sound-event classes.
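
To make the analogy concrete, here is a rough sketch of the decoding side of a CTC-style model: take the best class per frame, merge consecutive repeats, and drop the blank symbol, so what comes out is an ordered list of classes with no timing attached. The class names, the scores, and the choice of index 0 for the blank are all made up for illustration, and this is plain greedy decoding rather than anything specific to OP's setup.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Collapse per-frame class scores into a sequence of class ids.

    frame_scores: list of per-frame score lists, one score per class,
    where index `blank` is the CTC blank symbol.
    """
    # Pick the best-scoring class for every frame.
    best = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    # Merge consecutive repeats, then drop blanks.
    return [cls for cls, _ in groupby(best) if cls != blank]


# Hypothetical classes: 0 = blank, 1 = "dog bark", 2 = "siren"
scores = [
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # dog bark
    [0.20, 0.70, 0.10],  # dog bark again (merged with the previous frame)
    [0.80, 0.10, 0.10],  # blank
    [0.10, 0.10, 0.80],  # siren
]
decoded = ctc_greedy_decode(scores)
print(decoded)  # [1, 2]
```

Note that the output says "bark, then siren" but not when each one happened, which is the trade-off with this approach.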

uhules t1_j6sp5wk wrote on February 1, 2023 at 4:57 PM

CTC is better suited to unaligned sequences. If OP has precise timings for the sound events, plain frame-wise classification should work better.
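
As a minimal sketch of what the frame-wise setup needs, assuming the annotations come as (start, end, class) tuples in seconds: each event is painted onto a fixed frame grid, giving one target class per frame that an ordinary per-frame cross-entropy loss could be trained on. The function name, the hop size, and the use of class 0 as "background" are assumptions for illustration only.

```python
def events_to_frame_labels(events, num_frames, hop_seconds, background=0):
    """Turn timed events into one class id per frame.

    events: list of (start_seconds, end_seconds, class_id) tuples.
    Event boundaries are rounded to the nearest frame index; later
    events overwrite earlier ones where they overlap.
    """
    labels = [background] * num_frames
    for start, end, class_id in events:
        first = max(0, round(start / hop_seconds))
        last = min(num_frames, round(end / hop_seconds))
        for i in range(first, last):
            labels[i] = class_id
    return labels


# Hypothetical annotations: class 1 from 0.2 s to 0.5 s, class 2 from 0.7 s to 0.9 s,
# on a grid of 10 frames with a 0.1 s hop.
labels = events_to_frame_labels([(0.2, 0.5, 1), (0.7, 0.9, 2)], num_frames=10, hop_seconds=0.1)
print(labels)  # [0, 0, 1, 1, 1, 0, 0, 2, 2, 0]
```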

jiamengial t1_j6t854s wrote on February 1, 2023 at 6:52 PM

That's true. I was thinking that flat frame-wise predictions could flip to the wrong class partway through a segment, which might be an annoying model error to get.
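
One common way to soften that kind of error, not something proposed earlier in this thread, is to smooth the per-frame predictions after the fact. Below is a rough sketch using a sliding majority vote; the window size and the example labels are arbitrary, and whether this is good enough depends on how short the real events are.

```python
from collections import Counter


def majority_smooth(labels, window=3):
    """Replace each frame label with the most common label in a centred window.

    Windows are truncated at the sequence edges. On a tie, the label seen
    first in the window wins (Counter.most_common ordering).
    """
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighbourhood = labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighbourhood).most_common(1)[0][0])
    return smoothed


# A segment of class 1 with a single stray class-2 frame in the middle.
noisy = [1, 1, 1, 2, 1, 1, 1, 0, 0, 0, 0]
smoothed = majority_smooth(noisy, window=3)
print(smoothed)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

A larger window removes longer glitches but will also erase genuine events shorter than about half the window.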