Comments

shadow_fax1024 t1_j1oc69w wrote

Option 1 has worked for me in the past. You may need to generate more samples with augmentation, for example adding a small amount of noise to the audio before taking its spectrogram, or cut-mixing a random portion of the spectrograms.
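
A rough sketch of those two augmentations, using NumPy only. The function names `add_noise` and `cutmix_time`, the SNR parameter, and the `max_frac` limit are assumptions made for illustration, not something from the original setup; in practice you would also mix the labels according to the returned fraction.

```python
import numpy as np

def add_noise(wave, snr_db, rng):
    """Add white Gaussian noise at roughly the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def cutmix_time(spec_a, spec_b, rng, max_frac=0.5):
    """Replace a random time span of spec_a with the same span of spec_b.

    Both inputs are (n_mels, n_frames) arrays of equal shape.
    Returns the mixed spectrogram and the fraction of frames kept from spec_a,
    which can be used to weight the two labels.
    """
    n_frames = spec_a.shape[1]
    # Width of the pasted span, between 1 frame and max_frac of the clip.
    width = int(rng.integers(1, max(2, int(n_frames * max_frac) + 1)))
    start = int(rng.integers(0, n_frames - width + 1))
    mixed = spec_a.copy()
    mixed[:, start:start + width] = spec_b[:, start:start + width]
    return mixed, 1.0 - width / n_frames

rng = np.random.default_rng(0)
wave = np.sin(np.linspace(0, 100, 16000))
noisy = add_noise(wave, snr_db=20, rng=rng)
mixed, keep_frac = cutmix_time(np.zeros((64, 100)), np.ones((64, 100)), rng)
```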

1

shadow_fax1024 t1_j1pxmnn wrote

I used a plain CNN, with and without attention. I had to handle long audio files in training as well as at inference.

1

sigmoid_amidst_relus t1_j1qbo3x wrote

MFCCs are more prone to noise than melspectrograms.

They're better for classical methods (primarily due to their dimensionality in my opinion) but most recent papers don't use MFCCs: they either go raw waveform or melspectrograms.

You have a 400-600 hour dataset, and utterance level labels. For the task at hand option 2 is the best, and there's a lot of variations of it you can try.

You can experiment with a frame-wise setup: repeat the utterance-level label for every frame computed from the utterance, and train on one frame and its corresponding label at a time. Or take a sequence of frames and their label. Or take crops of different sizes from each utterance. There are a lot of options. At test time, just aggregate the predicted labels over a single utterance.
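
A minimal sketch of the crop variant, assuming the melspectrogram is a NumPy array of shape `(n_mels, n_frames)` and the model outputs per-crop class probabilities. The helper names, crop sizes, and mean-probability aggregation are illustrative choices; majority voting would be another reasonable option.

```python
import numpy as np

def crops_with_labels(spec, label, crop_frames, hop_frames):
    """Cut fixed-size crops from one utterance; every crop inherits the utterance label.

    Assumes the utterance has at least crop_frames frames.
    """
    n_frames = spec.shape[1]
    starts = range(0, n_frames - crop_frames + 1, hop_frames)
    crops = np.stack([spec[:, s:s + crop_frames] for s in starts])
    labels = np.full(len(crops), label)
    return crops, labels

def aggregate_utterance(crop_probs):
    """Average per-crop class probabilities and return the utterance-level class index."""
    return int(np.argmax(crop_probs.mean(axis=0)))

# Training side: one utterance with label 1 becomes several labeled crops.
spec = np.random.default_rng(0).random((64, 500))
crops, labels = crops_with_labels(spec, label=1, crop_frames=100, hop_frames=50)

# Test side: made-up per-crop probabilities standing in for model output.
probs = np.array([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
prediction = aggregate_utterance(probs)
```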

Use whatever DNN you want to use, I suggest starting simple with CNNs.

Either way, you'll have to experiment and see how this affects performance.

It's possible that an utterance level "Mental state" classifier might end up leveraging semantic information because your dataset doesn't have enough speakers. It's even easier to do that if your dataset is an "acted" dataset. You'll end up doing well on the benchmark but your model is not learning jackshit. Therein lies a big problem with the speech emotion classification domain. If the model doesn't do this, it will be aggregating actual emotional state information over time anyway, so why not be more efficient.

Mental state is not reflected only in the utterance as a whole: if a person is, say, sad, it's reflected throughout their voice at a sub-speech-segment level.

Edit

Also, apart from all this jazz, there's also the option of using features from large pretrained models and training downstream classifiers on them.

Also, finally, read papers (esp if you're in the industry). I've basically regurgitated current state of the field. All this information is there. I'm not being passive aggressive or cheeky here; if you're in the industry, I know where you're coming from, but you'll have to roll your sleeves up and read papers. If you're in research, again, I know where you're coming from, but it's the way you gotta do

2

Helveticus99 OP t1_j1qmmtg wrote

Thank you so much for your input u/sigmoid_amidst_relus. I will consider Mel-Spectrograms instead of MFCCs. Do you know what the maximum size of a Mel-Spectrogram is in terms of seconds it covers?

With mental state I'm not referring to emotions that change fast but to a more long-term state that is reflected in the whole 1-hour recording. Thus, I think repeating the label for every frame might not work well. I might have to extract features over the full recording. That's also why I think an autoencoder can be problematic.

I could divide the recording into frames and stack the Mel-Spectrograms of the frames (using a 3D CNN). The problem is that I will end up with a huge number of frames. I have the same problem when considering an RNN: I will end up with a huge time series.

Using features from a large pretrained model is interesting. Can you recommend a pretrained model that is suitable for feature extraction from long recordings?

1

shadow_fax1024 t1_j1scqqd wrote

You could split the file into chunks of n seconds. You need to find the n that fits your dataset; for mine, 4-second chunks were good enough. You could also run a peak detector first and then take a chunk extending n/2 seconds either side of each peak, with some overlap between windows so that you don't lose information.
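
Something like the following, as a sketch only. The function names and the 1-second overlap are assumptions, and the "peak detector" here is just the sample with the largest absolute amplitude, standing in for whatever detector suits your data.

```python
import numpy as np

def overlapping_chunks(wave, sr, chunk_s=4.0, overlap_s=1.0):
    """Split a waveform into fixed-length chunks that overlap by overlap_s seconds.

    A clip shorter than one chunk is zero-padded; trailing samples that do not
    fill a whole chunk are dropped.
    """
    size = int(chunk_s * sr)
    hop = size - int(overlap_s * sr)
    if hop <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    chunks = []
    for start in range(0, max(len(wave) - size, 0) + 1, hop):
        chunk = wave[start:start + size]
        if len(chunk) < size:
            chunk = np.pad(chunk, (0, size - len(chunk)))
        chunks.append(chunk)
    return np.stack(chunks)

def chunk_around_peak(wave, sr, chunk_s=4.0):
    """Take one chunk centred on the loudest sample, shifted to stay inside the clip."""
    size = int(chunk_s * sr)
    peak = int(np.argmax(np.abs(wave)))
    start = min(max(peak - size // 2, 0), max(len(wave) - size, 0))
    chunk = wave[start:start + size]
    return np.pad(chunk, (0, size - len(chunk)))

sr = 100                      # toy sample rate to keep the example small
wave = np.zeros(1000)         # 10 seconds of silence
wave[700] = 1.0               # a single peak at 7 seconds
chunks = overlapping_chunks(wave, sr, chunk_s=4.0, overlap_s=1.0)
centred = chunk_around_peak(wave, sr, chunk_s=4.0)
```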

1

shadow_fax1024 t1_j1sdf0y wrote

You could also look into the approaches taken by participants in the Kaggle BirdCLEF competition. The problem there is somewhat similar to yours.

1