Submitted by Ok-Air4027 t3_xuog93 in MachineLearning

I am working on a speech-to-text project and I want to recognise different voices so I can tell which person said what and write the conversation down as text with the speakers' names. I have not found any parameter that actually distinguishes human voices mathematically. Is there a way to do so? There can be any number of people in the conversation.


Comments


gulab__jamun t1_iqxugpd wrote

You can use the pyannote Python library. It will identify the different speakers in an audio recording and create small audio files for each speaker's turns.
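Once a diarization tool like pyannote has produced speaker turns, the "conversation to text with names" step from the question is mostly bookkeeping. Below is a minimal sketch in plain Python (no pyannote required): it assumes you already have turns in the `(start, end, speaker)` form that diarization pipelines typically emit, plus word timestamps from an STT engine. `label_words` and the sample data are hypothetical illustrations, not pyannote's API.

```python
def label_words(turns, words):
    """turns: [(start_s, end_s, speaker)]; words: [(start_s, end_s, text)].

    Assigns each word to the speaker turn containing its midpoint and
    merges consecutive words from the same speaker into one line.
    """
    lines = []
    for w_start, w_end, text in words:
        mid = (w_start + w_end) / 2
        speaker = next((s for t0, t1, s in turns if t0 <= mid < t1), "unknown")
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append([speaker, [text]])
    return [f"{s}: {' '.join(ws)}" for s, ws in lines]

# Hypothetical diarization output and STT word timestamps.
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.0, "SPEAKER_01")]
words = [(0.1, 0.4, "hello"), (0.5, 0.9, "there"), (2.1, 2.5, "hi")]
print(label_words(turns, words))
# → ['SPEAKER_00: hello there', 'SPEAKER_01: hi']
```

Using the word midpoint rather than exact boundaries makes the alignment robust to small timestamp disagreements between the STT engine and the diarizer.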


DBCon t1_iqyonbz wrote

Without knowing much about the subject, my immediate thought goes to spectral analysis.

Start with creating a spectrogram of the waveform: get the spectral components of the audio over time, much like running an FFT at different time steps. Then identify the fundamental frequency of the speech, which is probably close to the dominant frequency in the signal. A speaker's fundamental frequency will likely stay within a small bandwidth, maybe 50 Hz. If you have two similar speakers, you will probably have to look at the secondary and tertiary dominant frequencies as well. There may even be an advantage to breaking the signals down with PCA first. You can additionally build a matched spectral filter that is sensitive to a specific speaker.
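The FFT-per-frame and dominant-frequency steps above can be sketched with NumPy. The 120 Hz sine stands in for one voiced frame of speech; the frame length, window, and sample rate are illustrative assumptions, and on real speech the spectral peak is often a harmonic rather than the fundamental.

```python
import numpy as np

def dominant_frequency(frame, sample_rate):
    # Window the frame to reduce spectral leakage, then take the
    # magnitude spectrum and return the frequency of the largest bin.
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

sample_rate = 16000
t = np.arange(0, 0.064, 1.0 / sample_rate)   # one 64 ms frame (1024 samples)
signal = np.sin(2 * np.pi * 120.0 * t)       # synthetic 120 Hz "voiced" frame
print(dominant_frequency(signal, sample_rate))  # close to 120 Hz, within one FFT bin
```

A full spectrogram is just this estimate repeated over overlapping frames; the FFT bin spacing here (16000/1024 ≈ 15.6 Hz) already shows why the estimate only lands near, not exactly on, the true fundamental.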

You will need some logic to tell when a speaker has finished speaking, or when multiple speakers are talking over each other. An ML model can help with this and reduce processing overhead.
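The simplest version of that "done speaking" logic is an energy-based endpoint detector. The sketch below, with an assumed frame size and threshold, marks frames whose RMS energy exceeds a threshold and collapses runs of them into spans; real speech needs smoothing/hangover logic, and this says nothing about overlapping speakers.

```python
import numpy as np

def speech_regions(signal, sample_rate, frame_ms=20, threshold=0.01):
    """Return (start_s, end_s) spans whose frame RMS energy exceeds threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_s = frame_ms / 1000.0
    n_frames = len(signal) // frame_len
    regions, start = [], None
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        active = np.sqrt(np.mean(frame ** 2)) > threshold
        if active and start is None:
            start = i                      # a speech run begins
        elif not active and start is not None:
            regions.append((start * hop_s, i * hop_s))
            start = None                   # the run ended
    if start is not None:
        regions.append((start * hop_s, n_frames * hop_s))
    return regions

sample_rate = 16000
t = np.arange(sample_rate // 2) / sample_rate        # 0.5 s of samples
tone = 0.5 * np.sin(2 * np.pi * 200.0 * t)           # stand-in for speech
signal = np.concatenate([np.zeros(len(t)), tone, np.zeros(len(t))])
print(speech_regions(signal, sample_rate))           # one region near (0.5, 1.0)
```

This is where an ML voice-activity detector earns its keep: a fixed energy threshold breaks down with background noise or quiet speakers, while a trained model generalizes.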

A quick Google search shows that unsupervised ML models for speaker detection have been studied for a while. While spectral and Fourier analysis have been optimized for decades, emerging ML methods may be more reliable in highly complex auditory environments.
