
sigmoid_amidst_relus t1_j1qbo3x wrote

MFCCs are more sensitive to noise than mel spectrograms.

They're better suited to classical methods (primarily because of their lower dimensionality, in my opinion), but most recent papers don't use MFCCs: they go with either raw waveforms or mel spectrograms.
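For reference, a minimal sketch of computing a log-mel spectrogram with torchaudio (the file path is a placeholder and the parameter values are just common defaults, not recommendations):

```python
import torchaudio

# Load an utterance (placeholder path); resample to 16 kHz if needed
waveform, sr = torchaudio.load("utterance.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

# 64-bin mel spectrogram with a 25 ms window / 10 ms hop at 16 kHz
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=64
)(waveform)

# Log compression (dB), which is what most models consume
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)  # (channels, n_mels, n_frames)
```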

You have a 400-600 hour dataset and utterance-level labels. For the task at hand, option 2 is the best, and there are a lot of variations of it you can try.

You can experiment with a frame-wise setup: repeat the utterance-level label for every frame computed from the utterance and train on one frame and its corresponding label at a time. Or take a sequence of frames and their label. Or take crops of different sizes from each utterance (see the sketch below). There are a lot of options. At test time, just aggregate the predictions for a single utterance.
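Here's a rough illustration of the crop-based variant (class and function names, and the crop length, are just placeholders): slice each utterance's precomputed log-mel spectrogram into fixed-size crops, give every crop the utterance-level label, and average the per-crop predictions at test time.

```python
import torch
from torch.utils.data import Dataset

class CropDataset(Dataset):
    """Fixed-size crops from precomputed log-mel spectrograms; each crop
    inherits the label of its source utterance."""
    def __init__(self, log_mels, labels, crop_frames=300):  # ~3 s at a 10 ms hop
        self.items = []
        for mel, label in zip(log_mels, labels):             # mel: (n_mels, n_frames)
            for start in range(0, mel.shape[1] - crop_frames + 1, crop_frames):
                self.items.append((mel[:, start:start + crop_frames], label))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        mel, label = self.items[idx]
        return mel.unsqueeze(0), label                       # add channel dim for a 2D CNN

def predict_utterance(model, mel, crop_frames=300):
    """Test-time aggregation: average crop-level logits over one utterance."""
    crops = [mel[:, s:s + crop_frames]
             for s in range(0, mel.shape[1] - crop_frames + 1, crop_frames)]
    batch = torch.stack(crops).unsqueeze(1)                  # (n_crops, 1, n_mels, crop_frames)
    with torch.no_grad():
        return model(batch).mean(dim=0)                      # aggregated class logits
```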

Use whatever DNN you want; I suggest starting simple with CNNs.
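A bare-bones example of what "starting simple" could look like, a small 2D CNN over log-mel crops (layer sizes are arbitrary, not tuned for anything):

```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Small 2D CNN over (1, n_mels, n_frames) log-mel crops."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> fixed-size vector
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```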

Either way, you'll have to experiment and see how this affects performance.

It's possible that an utterance-level "mental state" classifier ends up leveraging semantic information because your dataset doesn't have enough speakers. That's even easier if your dataset is an "acted" dataset. You'll do well on the benchmark, but your model isn't learning jackshit. Therein lies a big problem with the speech emotion classification domain. And if the model doesn't take that shortcut, it's aggregating actual emotional-state information over time anyway, so why not do that more efficiently.

Mental state is not reflected only at the whole-utterance level; if a person is, say, sad, it shows throughout their voice at the sub-utterance segment level.

Edit

Apart from all this jazz, there's also the option of using features from large pretrained models and training downstream classifiers on them.
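One possible sketch of that route (assuming a wav2vec 2.0 checkpoint via HuggingFace transformers and scikit-learn for the downstream classifier; swap in whatever pretrained model you actually pick): mean-pool the frame-level features into one vector per utterance and fit a lightweight classifier on top.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def utterance_embedding(waveform_16khz):
    """Mean-pooled wav2vec 2.0 features for one utterance (1D array at 16 kHz)."""
    inputs = extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, n_frames, hidden_dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Downstream classifier on utterance-level embeddings (train_waveforms / train_labels
# are placeholders for your own data):
# X = [utterance_embedding(w) for w in train_waveforms]
# clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
```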

Also, finally, read papers (especially if you're in industry). I've basically regurgitated the current state of the field; all this information is out there. I'm not being passive-aggressive or cheeky here: if you're in industry, I know where you're coming from, but you'll have to roll your sleeves up and read papers. If you're in research, again, I know where you're coming from, but that's the way you've got to do it.

2

Helveticus99 OP t1_j1qmmtg wrote

Thank you so much for your input u/sigmoid_amidst_relus. I will consider mel spectrograms instead of MFCCs. Do you know what the maximum size of a mel spectrogram is, in terms of the number of seconds it can cover?

By mental state I'm not referring to emotions that change quickly but to a more long-term state that is reflected across the whole 1-hour recording. Thus, I think repeating the label for every frame might not work well; I might have to extract features over the full recording. That's also why I think an autoencoder could be problematic.

I could divide the recording into frames and stack the mel spectrograms of the frames (using a 3D CNN). The problem is that I would end up with a huge number of frames. It's the same problem with an RNN: I would end up with a huge time series.

Using features from a large pretrained model is interesting. Can you recommend a pretrained model that is suitable for feature extraction from long recordings?

1