Extract one voice from audio clip #1

Open
prasannapattam opened this issue Jan 26, 2024 · 4 comments

Comments

@prasannapattam

I am trying to extract the voice of the lead actor from an audio clip that also contains other voices and background music.

Can your project do this?

I tried the following:

  • Prepared a dataset of 50 audio clips (10 s each, no background music) of the target actor using slicer.py (a rough equivalent is sketched after this list)
  • The Stardew_valley folder contains background music (20 audio clips, 30 s each)
  • Trained for 20 epochs using train.py
  • Used infer.py to extract this actor's voice from a 60 s audio clip
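
For reference, here's a minimal sketch of what fixed-length slicing amounts to. This is not the repo's slicer.py, just an illustration that assumes the soundfile package and simply cuts a recording into 10 s chunks:

```python
# Minimal fixed-length slicer sketch -- NOT the repo's slicer.py.
# Assumes the `soundfile` package; cuts one recording into 10 s clips.
import soundfile as sf

def slice_fixed(path, out_prefix, clip_sec=10.0):
    audio, sr = sf.read(path)           # samples (np.ndarray) and sample rate
    clip_len = int(clip_sec * sr)
    n_clips = len(audio) // clip_len    # drop the trailing partial clip
    for i in range(n_clips):
        sf.write(f"{out_prefix}_{i:03d}.wav",
                 audio[i * clip_len:(i + 1) * clip_len], sr)
    return n_clips
```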

The extracted output contained all the voices, not just the voice I trained on.

Did I miss anything?

@med1844
Member

med1844 commented Jan 27, 2024

This is still a WIP project (and I currently have little time to work on it), so yes, I would say this is normal behavior.

For comparison, I have not yet succeeded on 5.5 hours of target speaker + ~10 hours of noise. The noise is not removed very well, especially tonal noise. The average SI-SNR is only -5.9 dB; if you listen to the output, it sounds as if it were mixed with some kind of bitcrush effect.
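
For anyone unfamiliar with the metric: SI-SNR compares the estimate against its projection onto the reference, so a negative value means the residual error carries more energy than the recovered target. A minimal NumPy sketch of the standard definition:

```python
# SI-SNR (scale-invariant signal-to-noise ratio) as commonly defined
# in source separation papers; both signals are zero-meaned first.
import numpy as np

def si_snr(estimate, reference):
    est = estimate - estimate.mean()
    ref = reference - reference.mean()
    # Project the estimate onto the reference: the "target" component.
    s_target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / np.dot(e_noise, e_noise))
```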

The current hypothesis is that the model needs to see far more data before we can fine-tune it to reach high SI-SNR with little data. Training from scratch on a small amount of data doesn't seem to teach the model much about the target, and I don't have the computational resources required to train a good pretrained model.

For now I would recommend using the MVSEP MDX23C model in UVR5 for general-purpose speech/vocal extraction, then manually filtering out the target speaker's segments. Their SDR improvement is insane.

@prasannapattam
Author

I am looking for an automated way to extract the lead actor's voice. How well does DPTNet work for extracting a target voice? What are the alternatives to DPTNet?

@med1844
Member

med1844 commented Jan 28, 2024

I chose DPTNet because of its high performance on TSE (target speaker extraction) tasks with a relatively low parameter count. But on the hybrid task this repository investigates, which combines speech enhancement and TSE with very little data available, the performance is poor.

Regarding lead actor extraction: if you want to automatically detect the lead actor, AFAIK there's no out-of-the-box solution. Most TSE research has been done on 8 kHz and 16 kHz datasets, which are impractical to use in production.

However, if the lead actor's voice is never mixed with other speakers, you may try this model (sorry, I couldn't find an English version of either the model or the website). I have never used it, but according to the description you should be able to find a speaker ID or something similar for each sentence in the inference result. You could then pick the speaker with the longest total speech duration as the lead actor, if that's what "lead" means, and use the start and end timestamps reported by the model to extract that speaker's voice.
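
As a sketch of that idea (the `(start_sec, end_sec, speaker_id)` segment format is an assumption, as is the function name; adapt both to whatever the model actually reports):

```python
# Hypothetical post-processing of a diarization result. The
# (start_sec, end_sec, speaker_id) segment format is an assumption.
from collections import defaultdict
import numpy as np
import soundfile as sf

def extract_lead(audio_path, segments):
    # The "lead" is taken to be the speaker with the longest total duration.
    totals = defaultdict(float)
    for start, end, spk in segments:
        totals[spk] += end - start
    lead = max(totals, key=totals.get)

    audio, sr = sf.read(audio_path)
    # Concatenate every segment attributed to the lead speaker.
    pieces = [audio[int(s * sr):int(e * sr)]
              for s, e, spk in segments if spk == lead]
    sf.write("lead_speaker.wav", np.concatenate(pieces), sr)
```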

If the lead actor's voice is mixed with others, and you don't mind the output being resampled to 8 kHz (only information below 4 kHz is retained), you can take a look at this Conv-TasNet implementation, which comes with a pretrained model you can use.
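
Note that you would resample yourself before inference; assuming librosa and soundfile are installed, the downsampling step would look roughly like this:

```python
# Downsample to 8 kHz before feeding an 8 kHz pretrained model.
# Assumes librosa and soundfile are installed.
import librosa
import soundfile as sf

audio, sr = librosa.load("mixture.wav", sr=None)  # keep the original rate
audio_8k = librosa.resample(audio, orig_sr=sr, target_sr=8000)
sf.write("mixture_8k.wav", audio_8k, 8000)
```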

I'm not a professional researcher in this area, so please take my suggestions with a grain of salt.

@prasannapattam
Author

Thanks for your suggestions. I will try these two models.
