DBCon t1_iqyonbz wrote

Without knowing much about the subject, my immediate thought goes to spectral analysis.

Start by creating a spectrogram of the waveform. Essentially, get the spectral components of the audio over time by running an FFT over successive time windows (a short-time Fourier transform). Then identify the fundamental frequency of the speech, which is probably close to the dominant frequency in the signal. A speaker's fundamental frequency will likely stay within a small bandwidth, maybe 50 Hz. If you have two similar speakers, you will probably have to look at the secondary and tertiary dominant frequencies as well. There may even be an advantage to reducing the features with PCA first. You could additionally build a matched spectral filter that is sensitive to a specific speaker.
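A minimal sketch of the spectrogram step, assuming a mono numpy signal `x` sampled at `fs` Hz; the 50–400 Hz band and the "loudest bin ≈ fundamental" shortcut are my assumptions, not a real pitch tracker:

```python
import numpy as np
from scipy.signal import stft

def dominant_frequency_track(x, fs, frame_len=2048):
    """Return the dominant frequency in each STFT frame."""
    f, t, Zxx = stft(x, fs=fs, nperseg=frame_len)
    mag = np.abs(Zxx)
    # Restrict to a typical speech F0 range (~50-400 Hz) so the
    # strongest bin is more likely the fundamental, not a formant.
    band = (f >= 50) & (f <= 400)
    peak_bins = np.argmax(mag[band], axis=0)
    return t, f[band][peak_bins]
```

Plotting `f0` from `t, f0 = dominant_frequency_track(x, 16000)` should show each speaker's pitch hovering in its own narrow band, which is what makes the small-bandwidth assumption usable.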

You will need some logic to tell when a speaker has finished speaking, or whether multiple speakers are talking over each other. An ML model can help with this and reduce processing overhead.
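A rough sketch of the "is someone speaking" logic: frame-level RMS energy thresholding, a crude voice-activity detector. The frame length and threshold factor are assumptions to tune per recording, and energy alone cannot detect overlapping speakers; that needs the spectral cues above:

```python
import numpy as np

def speech_frames(x, fs, frame_ms=25, thresh_factor=0.1):
    """Flag frames whose RMS energy exceeds a fraction of the peak RMS."""
    n = int(fs * frame_ms / 1000)
    # Trim the tail so the signal splits evenly into fixed-size frames.
    frames = x[: len(x) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > thresh_factor * rms.max()
```

Runs of `False` mark pauses; the transitions between runs are candidate points where one speaker stops and another may start.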

A quick Google search shows that unsupervised ML models for speaker detection have been studied for a while. While spectral and Fourier analysis have been optimized for decades, emerging ML methods may be more reliable in highly complex auditory environments.
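A toy version of the unsupervised angle, assuming scikit-learn is available and exactly two speakers: cluster per-frame log-magnitude spectra with PCA plus k-means and read the cluster labels as speaker identities. Real diarization systems use far richer embeddings, so this is only an illustration of the idea:

```python
import numpy as np
from scipy.signal import stft
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_speakers(x, fs, n_speakers=2, frame_len=2048):
    """Assign each STFT frame to one of n_speakers clusters."""
    f, t, Zxx = stft(x, fs=fs, nperseg=frame_len)
    # One log-magnitude spectrum per frame as the raw feature vector.
    feats = np.log1p(np.abs(Zxx)).T
    # PCA first, as suggested above, to strip redundant dimensions.
    reduced = PCA(n_components=20).fit_transform(feats)
    # k-means labels are read as "which speaker owns this frame".
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(reduced)
    return t, labels
```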
