wjldw12138

wjldw12138 t1_j9ni2gq wrote

Hi everyone, I am looking for something like CLIP for the speech domain: a model that can measure the distance between text and speech, where the speech is given as a Mel spectrogram.

I came across SpeechCLIP earlier, but unfortunately its speech input is the raw waveform rather than a Mel spectrogram (the same is true of HuBERT). I would really appreciate any pointers!
