suflaj t1_ivuam4k wrote on November 10, 2022 at 5:39 PM

You mean automatic speech recognition? Yeah, there are models for that. Google probably has the best proprietary one, but from what I understand it is still a work in progress, despite e.g. Whisper being released recently.

Prestigious_Boat_386 t1_ivuhxf7 wrote on November 10, 2022 at 6:26 PM

Think they want the combination of that and splitting up when different people talk and assigning what's said to the person saying it.

Which isn't THAT hard when you can already recognize who's who; sometimes you could even just use main pitch and formants plus silent segments. It's just quite a niche application.
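
To make the pitch-plus-silence idea concrete, here is a minimal sketch in plain Python. It is only an illustration under strong assumptions: the audio is clean, speakers never overlap, each speaker has a known reference pitch, and pitch is estimated crudely from the zero-crossing rate (formants are not used at all). All function names, the threshold, and the reference pitches are made up for the example.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def voiced_segments(samples, frame_len, threshold):
    """Return (start, end) sample ranges that are separated by silent frames."""
    segments, start = [], None
    energies = frame_energies(samples, frame_len)
    for idx, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = idx * frame_len                      # speech begins
        elif energy < threshold and start is not None:
            segments.append((start, idx * frame_len))    # speech ends
            start = None
    if start is not None:
        segments.append((start, len(energies) * frame_len))
    return segments

def estimate_pitch(samples, sample_rate):
    """Crude pitch estimate from the zero-crossing rate (assumption: clean, tone-like signal)."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return crossings * sample_rate / (2 * len(samples))

def label_segments(samples, sample_rate, reference_pitches, frame_len=160, threshold=0.01):
    """Assign each voiced segment to the speaker whose reference pitch is closest."""
    labelled = []
    for start, end in voiced_segments(samples, frame_len, threshold):
        pitch = estimate_pitch(samples[start:end], sample_rate)
        speaker = min(reference_pitches, key=lambda name: abs(reference_pitches[name] - pitch))
        labelled.append((speaker, start / sample_rate, end / sample_rate))
    return labelled

def tone(freq, seconds, sample_rate):
    """Synthetic stand-in for a voice: a pure sine at the given frequency."""
    return [0.5 * math.sin(2 * math.pi * freq * n / sample_rate)
            for n in range(int(seconds * sample_rate))]

SAMPLE_RATE = 8000
# Hypothetical toy signal: a low "voice", a pause, then a higher "voice".
audio = tone(120, 0.5, SAMPLE_RATE) + [0.0] * 2000 + tone(220, 0.5, SAMPLE_RATE)
result = label_segments(audio, SAMPLE_RATE, {"Teacher": 120.0, "Student 1": 220.0})
for speaker, start_sec, end_sec in result:
    print(f"{speaker}: {start_sec:.2f}s to {end_sec:.2f}s")
```

On this toy signal the first segment is assigned to "Teacher" and the second to "Student 1". Real speech would need a proper pitch tracker and likely speaker embeddings, so treat this as a demonstration of the idea rather than a usable diarizer.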

Snickersman6 t1_ivuitev wrote on November 10, 2022 at 6:32 PM

Right, it's that secondary splitting of the text that I don't know is possible.

suflaj t1_ivuj62a wrote on November 10, 2022 at 6:34 PM

Yeah, as said previously, Google is a master of it, e.g. look at Pixel 7 ASR.

I believe it's still called ASR.

Snickersman6 t1_ivum0nx wrote on November 10, 2022 at 6:52 PM

You mentioned automatic speech recognition, which is not what I was really asking about; I was asking about speaker diarization. The link below goes over the differences. It may be a part of ASR, but I don't know if it does that on its own as part of the speech recognition.

https://deepgram.com/blog/what-is-speaker-diarization/

suflaj t1_ivumz3h wrote on November 10, 2022 at 6:58 PM

It has not been marketed as such because it's built on top of ASR. Hence, you search for ASR and then look for its features. The same way you look for object detection, and if you need segmentation, you look if it has a detector that does segmentation. A layman looking for a solution does not search for specific terms and marketers know this.

Be that as it may, the answer remains the same: Google offers the most advanced and performant solution, and it markets it as ASR, or as they call it, speech to text, with this so-called diarization being one feature of it.

Garbage-Shoddy t1_ivvrgu4 wrote on November 10, 2022 at 11:36 PM

I don’t think machine transcription is such a niche application

atlvet t1_ivwrghj wrote on November 11, 2022 at 4:24 AM

Not sure how they’re doing it, but software like Chorus.ai can log in to Zoom meetings and transcribe them. I don’t know if they’re doing it by somehow identifying which attendee’s feed is speaking, or if they just get a straight video/audio feed and can pick out different speakers.

Would it be possible to train something that processes a video and outputs a text script like the following?
Teacher: That is the topic we will be covering today.
Student 1: What about the part of the lesson we didn't go over yesterday?
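
For what the final step could look like, here is a minimal sketch that merges two inputs into that kind of script. It assumes you already have word-level timestamps from some ASR model and speaker turns from some diarization model; the `build_script` function, the tuple layouts, and the sample data are all hypothetical. Diarization usually gives anonymous speaker labels, so mapping them to names like "Teacher" is assumed to have been done separately.

```python
def build_script(words, turns):
    """Merge ASR words and diarization turns into 'Speaker: text' lines.

    words: list of (start_sec, end_sec, text) from an ASR system (assumed format)
    turns: list of (start_sec, end_sec, speaker) from a diarizer (assumed format)
    """
    def speaker_for(word_start, word_end):
        # Pick the turn with the largest time overlap with this word.
        best, best_overlap = "Unknown", 0.0
        for turn_start, turn_end, speaker in turns:
            overlap = min(word_end, turn_end) - max(word_start, turn_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        return best

    lines = []
    for start, end, text in sorted(words):
        speaker = speaker_for(start, end)
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)          # same speaker keeps talking
        else:
            lines.append((speaker, [text]))    # speaker change starts a new line
    return "\n".join(f"{speaker}: {' '.join(tokens)}" for speaker, tokens in lines)

# Hypothetical sample data standing in for real model output.
words = [
    (0.0, 0.4, "That"), (0.4, 0.6, "is"), (0.6, 0.8, "the"), (0.8, 1.2, "topic."),
    (1.5, 1.8, "What"), (1.8, 2.0, "about"), (2.0, 2.4, "yesterday?"),
]
turns = [(0.0, 1.3, "Teacher"), (1.4, 2.5, "Student 1")]

script = build_script(words, turns)
print(script)
```

Assigning each word to the turn it overlaps most is a simple design choice; words that fall in gaps between turns end up under "Unknown" in this sketch.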
