shadow_fax1024 t1_j1oc69w wrote on December 26, 2022 at 1:50 AM

#1,064,889

Option 1 has worked for me in the past. You may need to generate more samples using augmentation techniques, such as adding a small amount of noise to the audio before taking its spectrogram, or cut-mixing a random portion of the spectrograms.
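A rough sketch of the two augmentations mentioned above, assuming NumPy arrays for the waveform and for spectrograms shaped (n_mels, n_frames). The function names `add_noise` and `cut_mix`, the SNR default, and the `max_frac` cap are illustrative assumptions, not something the poster specified:

```python
import numpy as np

def add_noise(wave, snr_db=20.0, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def cut_mix(spec_a, spec_b, rng=None, max_frac=0.5):
    """Replace a random time span of spec_a with the same span of spec_b.

    Both spectrograms must have the same shape (n_mels, n_frames).
    Returns the mixed spectrogram and lam, the fraction of frames that
    still come from spec_a (usable for mixing the two labels).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = spec_a.shape[1]
    # Width of the pasted span: between 1 frame and max_frac of the frames.
    width = int(rng.integers(1, max(2, int(n_frames * max_frac) + 1)))
    start = int(rng.integers(0, n_frames - width + 1))
    mixed = spec_a.copy()
    mixed[:, start:start + width] = spec_b[:, start:start + width]
    lam = 1.0 - width / n_frames
    return mixed, lam
```

The noise would be applied to the raw audio before computing the spectrogram; how the labels of the two mixed samples are combined (for example weighting them by `lam`) is a separate design choice.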

Helveticus99 OP t1_j1pu570 wrote on December 26, 2022 at 12:48 PM

#1,071,204

Replying to shadow_fax1024 (#1,064,889)

Thank you u/shadow_fax1024. Did you use an RNN or a plain CNN? Did you also have such long audio files (40-60 min)? I'm not sure how such long audio files can be used in an RNN.

shadow_fax1024 t1_j1pxmnn wrote on December 26, 2022 at 1:29 PM

#1,071,698

I used a plain CNN, with and without attention. I had to handle long audio files in training as well as inference.

sigmoid_amidst_relus t1_j1qbo3x wrote on December 26, 2022 at 3:35 PM

#1,073,620

MFCCs are more prone to noise than melspectrograms.

They're better for classical methods (primarily due to their dimensionality in my opinion) but most recent papers don't use MFCCs: they either go raw waveform or melspectrograms.

You have a 400-600 hour dataset, and utterance level labels. For the task at hand option 2 is the best, and there's a lot of variations of it you can try.

You can experiment with a frame-wise setup: repeat the utterance level label for every frame computed from the utterance, and train on one frame and its corresponding label at a time. Or take a sequence of frames and their label. Or take crops of different sizes from each utterance. There are a lot of options. At test time, just aggregate the predicted labels for a single utterance.
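To make the crop-based variant concrete, here is a minimal sketch, assuming a mel spectrogram stored as a NumPy array of shape (n_mels, n_frames) and a classifier that outputs per-class probabilities for each crop. The names `random_crops` and `aggregate_predictions`, and averaging as the aggregation rule, are my assumptions; majority voting would be another reasonable choice:

```python
import numpy as np

def random_crops(spec, label, n_crops, crop_frames, rng=None):
    """Cut fixed-size random crops from one utterance and repeat its label.

    spec has shape (n_mels, n_frames) with n_frames >= crop_frames.
    Returns crops of shape (n_crops, n_mels, crop_frames) and n_crops labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = spec.shape[1]
    starts = rng.integers(0, n_frames - crop_frames + 1, size=n_crops)
    crops = np.stack([spec[:, s:s + crop_frames] for s in starts])
    labels = np.full(n_crops, label)
    return crops, labels

def aggregate_predictions(crop_probs):
    """Average per-crop class probabilities, then pick the top class.

    crop_probs has shape (n_crops, n_classes).
    """
    mean_probs = np.mean(crop_probs, axis=0)
    return int(np.argmax(mean_probs)), mean_probs
```

During training each crop is treated as an independent sample carrying the utterance label; at test time the model is run over the crops of one utterance and the outputs are combined into a single prediction.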

Use whatever DNN you want to use; I suggest starting simple with CNNs.

Either way, you'll have to experiment and see how this affects performance.

It's possible that an utterance level "Mental state" classifier might end up leveraging semantic information because your dataset doesn't have enough speakers. It's even easier to do that if your dataset is an "acted" dataset. You'll end up doing well on the benchmark but your model is not learning jackshit. Therein lies a big problem with the speech emotion classification domain. If the model doesn't do this, it will be aggregating actual emotional state information over time anyway, so why not be more efficient.

Mental state is not reflected just in the entire utterance. If a person is, say, sad, it's reflected throughout their voice on a sub-speech segment level.

Edit

Also, apart from all this jazz, there's also the option of using features from large pretrained models and training downstream classifiers on them.

Also, finally, read papers (esp if you're in the industry). I've basically regurgitated current state of the field. All this information is there. I'm not being passive aggressive or cheeky here; if you're in the industry, I know where you're coming from, but you'll have to roll your sleeves up and read papers. If you're in research, again, I know where you're coming from, but it's the way you gotta do

Helveticus99 OP t1_j1qjdxl wrote on December 26, 2022 at 4:34 PM

#1,074,654

Replying to shadow_fax1024 (#1,071,698)

Thank you u/shadow_fax1024. How did you handle audio files of different lengths? And how exactly did you handle the long audio files? I think creating a Mel-Spectrogram over a long audio file won't work.

Helveticus99 OP t1_j1qmmtg wrote on December 26, 2022 at 4:58 PM

#1,075,083

Replying to sigmoid_amidst_relus (#1,073,620)

Thank you so much for your input u/sigmoid_amidst_relus. I will consider Mel-Spectrograms instead of MFCCs. Do you know what the maximum size of a Mel-Spectrogram is in terms of seconds it covers?

With mental state I'm not referring to emotions that change fast but to a more long-term state that is reflected in the whole 1 hour recording. Thus, I think repeating the label for every frame might not work well. I might have to extract features over the full recording. That's also why I think an autoencoder can be problematic.

I could divide the recording into frames and stack the Mel-Spectrograms of the frames (using a 3D CNN). The problem is that I will end up with a huge number of frames. The same problem applies to an RNN: I will end up with a huge time series.

Using features from a large pretrained model is interesting. Can you recommend a pretrained model that is suitable for feature extraction from long recordings?

shadow_fax1024 t1_j1scqqd wrote on December 27, 2022 at 12:43 AM

#1,083,273

Replying to Helveticus99 (#1,074,654)

You could split the file into chunks of n seconds. You need to find the n that fits your dataset; for mine, 4-second chunks were good enough. You could also run a peak detector first and then chunk the file n/2 seconds either side of each peak, with some overlap between windows so that you won't lose information.
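A minimal sketch of both chunking ideas, assuming a mono waveform as a 1-D NumPy array. The function names are hypothetical, the 50% overlap default is an assumption, and the "peak detector" here is just the sample with the largest absolute amplitude, a stand-in for whatever detector was actually used:

```python
import numpy as np

def split_into_chunks(wave, sr, chunk_s=4.0, overlap=0.5):
    """Split a waveform into fixed-length chunks with fractional overlap.

    Returns an array of shape (n_chunks, chunk_samples). A trailing
    remainder shorter than one hop is dropped; inputs shorter than one
    chunk are zero-padded.
    """
    size = int(chunk_s * sr)
    hop = max(1, int(size * (1 - overlap)))
    if len(wave) < size:
        wave = np.pad(wave, (0, size - len(wave)))
    starts = range(0, len(wave) - size + 1, hop)
    return np.stack([wave[s:s + size] for s in starts])

def chunk_around_peak(wave, sr, chunk_s=4.0):
    """Take one chunk of chunk_s seconds centred on the loudest sample."""
    size = int(chunk_s * sr)
    if len(wave) < size:
        wave = np.pad(wave, (0, size - len(wave)))
    peak = int(np.argmax(np.abs(wave)))
    # Centre the window on the peak, clamped to the valid range.
    start = int(np.clip(peak - size // 2, 0, len(wave) - size))
    return wave[start:start + size]
```

Each chunk can then be turned into its own spectrogram, which keeps the model input at a fixed size regardless of how long the recording is.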