bklawa t1_ir0wlyp wrote on October 4, 2022 at 3:11 PM

Some ideas:

Downsample the audio to a lower sample rate (if it is 48 kHz, perhaps try 8 kHz). Whether this is safe really depends on the task (music, speech, other general audio recordings...), since at 8 kHz you only keep content below 4 kHz.
You don't need to feed the whole 30-minute spectrogram to the model for classification. An alternative would be to reduce the time axis by taking the mean or max over time, for example, so that you end up with a very small vector. Otherwise you can do the same over 1-minute segments to try to keep more information. Either way, this will definitely help reduce the model size.
You can trim out the portions of the audio track that are "silent" (under a certain energy threshold) before applying the steps above.
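
To make the last two ideas concrete, here is a rough NumPy sketch, not a definitive implementation. The function names (`trim_silence`, `pool_time`), the 1024-sample frame length and the -40 dB threshold are all placeholders I made up for illustration, and the spectrogram is assumed to be an array of shape (n_freq, n_time). Resampling is left out on purpose, since it is better done with a proper resampler from an audio library than by hand.

```python
import numpy as np

def trim_silence(signal, frame_len=1024, threshold_db=-40.0):
    """Drop frames whose RMS is more than |threshold_db| below the loudest frame.

    frame_len and threshold_db are illustrative defaults, not tuned values.
    Samples past the last full frame are discarded.
    """
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return signal[:0]
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    ref = rms.max()
    if ref == 0:
        return signal[:0]  # the whole track is silent
    # Energy of each frame in dB relative to the loudest frame.
    db = 20 * np.log10(np.maximum(rms, 1e-12) / ref)
    return frames[db >= threshold_db].reshape(-1)

def pool_time(spec, segment_frames=None):
    """Reduce the time axis of a (n_freq, n_time) spectrogram with mean and max.

    With segment_frames=None the whole track collapses to one vector of
    length 2 * n_freq. Otherwise each block of segment_frames columns is
    pooled separately, giving shape (2 * n_freq, n_segments); trailing
    columns that do not fill a segment are dropped.
    """
    if segment_frames is None:
        return np.concatenate([spec.mean(axis=1), spec.max(axis=1)])
    n_seg = spec.shape[1] // segment_frames
    segs = spec[:, : n_seg * segment_frames].reshape(
        spec.shape[0], n_seg, segment_frames
    )
    return np.concatenate([segs.mean(axis=2), segs.max(axis=2)], axis=0)
```

For 1-minute segments you would set `segment_frames` to however many spectrogram columns cover 60 seconds, which depends on your sample rate and hop length.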

Hope this helps