Submitted by CeFurkan t3_10r9biu in MachineLearning

Currently I am using the free edition of DaVinci Resolve to manually cut out the parts with no speech, or the parts where I take a breath.

It is extremely time-consuming.

I am pretty sure this can be done with AI.

For example, Whisper is able to detect where we use filler words such as umh, um, uh, etc.

It would be awesome to remove these parts from a video automatically.

Just point me to where to look. Thank you!
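
Whichever method finds the unwanted spans, the cut itself can be scripted rather than done by hand in Resolve. A minimal sketch, assuming ffmpeg is installed and you already have a sorted, non-overlapping list of (start, end) intervals to keep, in seconds: it builds a filtergraph from ffmpeg's `select`/`aselect` filters with `between(t,start,end)` expressions, then rewrites timestamps with `setpts`/`asetpts` so the kept pieces play back to back. The function name `ffmpeg_keep_filter` is hypothetical.

```python
def ffmpeg_keep_filter(keep):
    """Build an ffmpeg -filter_complex string that keeps only the given
    (start, end) intervals, in seconds, of the first video and audio stream.
    Intervals are assumed sorted and non-overlapping."""
    if not keep:
        raise ValueError("need at least one interval to keep")
    # between(t,a,b) is 1 while t is inside [a, b]; summing them selects
    # a frame if it falls in any kept interval.
    expr = "+".join(f"between(t,{s:.3f},{e:.3f})" for s, e in keep)
    # setpts/asetpts rebuild timestamps so the kept pieces play back to back.
    video = f"[0:v]select='{expr}',setpts=N/FRAME_RATE/TB[v]"
    audio = f"[0:a]aselect='{expr}',asetpts=N/SR/TB[a]"
    return f"{video};{audio}"


if __name__ == "__main__":
    graph = ffmpeg_keep_filter([(0.0, 4.2), (5.1, 9.8)])
    # Print a ready-to-run command line (file names are placeholders).
    print(f'ffmpeg -i input.mp4 -filter_complex "{graph}" -map "[v]" -map "[a]" output.mp4')
```

Because the cut goes through filters, ffmpeg re-encodes the whole file rather than copying streams.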

Comments

I_Am_The_Sevit t1_j6vf5jg wrote

There's a Carykh video about something similar. The GitHub repo is linked in the description: https://youtu.be/DQ8orIurGxw

Miguel33Angel t1_j6x9bnf wrote

Yeah, you would just need to add something to remove filler words as well.

Doing that with Whisper, given a list of filler words, would be easy enough, I think.
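
A minimal sketch of the filler-word part, assuming you already have word-level timestamps: recent versions of the openai-whisper package return them under each segment's "words" key when `transcribe()` is called with `word_timestamps=True` (check your installed version). The filler list and the name `filler_spans` are hypothetical. Whisper does not always transcribe fillers, so any it leaves out of the transcript cannot be matched this way.

```python
import re

# Hypothetical filler list; adjust it to your own speech habits.
FILLERS = {"um", "umm", "umh", "uh", "uhm", "erm", "er", "ah", "hmm"}


def filler_spans(words, fillers=FILLERS):
    """Return (start, end) times of words whose text is a filler.

    `words` is a list of dicts with "word", "start" and "end" keys
    (seconds), e.g. collected from every segment's "words" list."""
    spans = []
    for w in words:
        # Whisper words carry leading spaces and punctuation, e.g. " Um,".
        token = re.sub(r"[^a-z]", "", w["word"].lower())
        if token in fillers:
            spans.append((w["start"], w["end"]))
    return spans


# Usage with openai-whisper (assumed installed; not run here):
# import whisper
# result = whisper.load_model("base").transcribe("talk.wav", word_timestamps=True)
# words = [w for seg in result["segments"] for w in seg["words"]]
# print(filler_spans(words))
```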

SnooWords6686 t1_j6ulih8 wrote

Why do you want a video without speech?

CeFurkan OP t1_j6ulm3r wrote

No, just remove filler words such as um, uh, etc., and also the parts where I take a breath.

Agreeable_Dog6536 t1_j6uuq7r wrote

He's asking the opposite: remove the bits with no speech.

I used to do more or less the same thing manually, years ago, for a corporate vlog in which people drove around all day fixing pipe leaks and occasionally commented on what they'd done; they wanted the clips where they commented edited together.

I basically just looked at the audio waveform and figured out where I should probably cut, and then listened to it to narrow it down.

If someone hasn't already trained an AI for this, they should.
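
Reading the waveform for quiet stretches, as described above, can be automated with a simple energy threshold. A minimal sketch, assuming mono samples already decoded to floats in [-1, 1] (for example with the standard `wave` module, scaling 16-bit values by 1/32768): it computes the RMS level of fixed-size frames in dBFS and reports runs of quiet frames long enough to cut. The function name `silent_spans`, the 20 ms frame, the -40 dBFS threshold and the 0.3 s minimum are hypothetical starting values to tune per recording; a plain energy threshold will not reliably separate breaths from quiet speech.

```python
import math


def silent_spans(samples, rate, frame_ms=20, threshold_db=-40.0, min_silence_s=0.3):
    """Return (start, end) times in seconds of runs of quiet frames.

    samples: mono floats in [-1, 1]; rate: samples per second.
    A frame is quiet when its RMS level is below threshold_db (dBFS)."""
    frame = max(1, int(rate * frame_ms / 1000))
    n_frames = len(samples) // frame  # a trailing partial frame is ignored
    spans, run_start = [], None
    for i in range(n_frames):
        chunk = samples[i * frame:(i + 1) * frame]
        rms = math.sqrt(sum(x * x for x in chunk) / frame)
        db = 20 * math.log10(rms) if rms > 0 else -math.inf
        t = i * frame / rate  # start time of this frame
        if db < threshold_db:
            if run_start is None:
                run_start = t  # a quiet run begins
        elif run_start is not None:
            # A quiet run just ended; keep it only if it is long enough.
            if t - run_start >= min_silence_s:
                spans.append((run_start, t))
            run_start = None
    if run_start is not None:  # audio ends while still quiet
        end = n_frames * frame / rate
        if end - run_start >= min_silence_s:
            spans.append((run_start, end))
    return spans
```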

SnooWords6686 t1_j6um1ru wrote

Good. Hope you can solve it 🙂

Ok_Dependent1131 t1_j6uv7o5 wrote

The company that makes Snagit has software that does this, but it's not free.

txhwind t1_j6v41wn wrote

Try a speech recognition model that outputs word-level timestamp alignments, then cut the parts that are not aligned to any word or are aligned to filler words.
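
The alignment approach in this comment could look like this: skip filler words, pad each remaining word slightly so cuts don't clip its onset or tail, and merge words separated by short pauses into one interval to keep; everything outside the intervals is cut. A minimal sketch, assuming word dicts with "word", "start" and "end" in seconds, in time order (the shape Whisper's word timestamps take). The name `keep_intervals` and the `pad`/`max_gap` defaults are hypothetical and need tuning.

```python
def keep_intervals(words, fillers, pad=0.1, max_gap=0.5):
    """Turn word alignments into (start, end) intervals to keep, in seconds.

    words:   dicts with "word", "start", "end" (seconds), in time order.
    fillers: set of lowercase filler tokens to cut, e.g. {"um", "uh"}.
    pad:     margin added on both sides of each kept word.
    max_gap: consecutive kept words whose padded spans are at most this far
             apart share one interval, so short natural pauses survive.
    Anything outside the returned intervals (silence, breaths, fillers) is cut."""
    kept = []
    after_filler = None  # end time of a filler just skipped, if any
    for w in words:
        token = w["word"].strip().lower().strip(".,!?")
        if token in fillers:
            if kept:
                # Don't let the previous word's padding reach into the filler.
                kept[-1] = (kept[-1][0], min(kept[-1][1], w["start"]))
            after_filler = w["end"]
            continue
        start, end = max(0.0, w["start"] - pad), w["end"] + pad
        if after_filler is not None:
            # Force a cut at the filler and keep padding out of it.
            start = max(start, after_filler)
        if kept and after_filler is None and start - kept[-1][1] <= max_gap:
            # Short pause between kept words: extend the previous interval.
            kept[-1] = (kept[-1][0], max(kept[-1][1], end))
        else:
            kept.append((start, end))
        after_filler = None
    return kept
```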

doctorjuice t1_j6vdwpb wrote

There are ways to robustly remove all silences and breaths, but removing filler words is less reliable. Would you still find that useful?
