Extract one voice from audio clip #1

Open
prasannapattam opened this issue Jan 26, 2024 · 4 comments

Comments

@prasannapattam

I am trying to extract the voice of the lead actor from an audio clip that also contains other voices and background music.

Can your project do this?

I tried the following:

  • Prepared a dataset of 50 audio clips (10 s each, no background music) of the target actor using slicer.py (a rough equivalent is sketched after this list)
  • The Stardew_valley folder contains background music (20 audio clips, 30 s each)
  • Trained for 20 epochs using train.py
  • Used infer.py to extract this actor's voice from a 60 s audio clip
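
For reference, here's a minimal sketch of what fixed-length slicing amounts to. This is not the repo's slicer.py, just an illustration that assumes the soundfile package and simply cuts a recording into 10 s chunks:

```python
# Minimal fixed-length slicer sketch -- NOT the repo's slicer.py.
# Assumes the `soundfile` package; cuts one recording into 10 s clips.
import soundfile as sf

def slice_fixed(path, out_prefix, clip_sec=10.0):
    audio, sr = sf.read(path)           # samples (np.ndarray) and sample rate
    clip_len = int(clip_sec * sr)
    n_clips = len(audio) // clip_len    # drop the trailing partial clip
    for i in range(n_clips):
        sf.write(f"{out_prefix}_{i:03d}.wav",
                 audio[i * clip_len:(i + 1) * clip_len], sr)
    return n_clips
```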

The extracted output contained all the voices, not just the voice I trained on.

Did I miss anything?

@med1844
Member

med1844 commented Jan 27, 2024

This is still a WIP project (and I currently have little time to work on it), so yes, I would say this is normal behavior.

For comparison, I have not yet succeeded on 5.5 hours of target speaker + ~10 hours of noise. The noise is not removed very well, especially tonal noise. The average SI-SNR is only -5.9 dB; if you listen to the output, it sounds as if it were mixed with some kind of bitcrush effect.
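
For anyone unfamiliar with the metric: SI-SNR compares the estimate against its projection onto the reference, so a negative value means the residual error carries more energy than the recovered target. A minimal NumPy sketch of the standard definition:

```python
# SI-SNR (scale-invariant signal-to-noise ratio) as commonly defined
# in source separation papers; both signals are zero-meaned first.
import numpy as np

def si_snr(estimate, reference):
    est = estimate - estimate.mean()
    ref = reference - reference.mean()
    # Project the estimate onto the reference: the "target" component.
    s_target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / np.dot(e_noise, e_noise))
```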

The current hypothesis is that the model needs to see far more data before we can fine-tune it to reach high SI-SNR with little data. Training from scratch on a small amount of data doesn't seem to teach the model much about the target, and I don't have the computational resources required to train a good pretrained model.

For now I would recommend using the MVSEP MDX23C model in UVR5 for general-purpose speech/vocal extraction, then manually filtering out the target speaker's segments. Their SDR improvement is insane.

@prasannapattam
Author

I am looking for an automated way to extract the lead actor's voice. How well does DPTNet work for extracting a target voice? What are the alternatives to DPTNet?

@med1844
Member

med1844 commented Jan 28, 2024

I chose DPTNet because of its high performance on TSE (target speaker extraction) tasks with a relatively low parameter count. But on the hybrid task this repository investigates, which combines speech enhancement and TSE with very little data available, the performance is poor.

Regarding lead actor extraction: if you want to automatically detect the lead actor, AFAIK there's no out-of-the-box solution. Most TSE research has been done on 8 kHz and 16 kHz datasets, which are impractical to use in production.

However, if the lead actor's voice is never mixed with other speakers, you may try this model (sorry, I couldn't find an English version of either the model or the website). I have never used it, but according to the description you should be able to find a speaker ID or something similar for each sentence in the inference result. You could then pick the speaker with the longest total speech duration as the lead actor, if that's what "lead" means, and use the start and end timestamps reported by the model to extract that speaker's voice.
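
As a sketch of that idea (the `(start_sec, end_sec, speaker_id)` segment format is an assumption, as is the function name; adapt both to whatever the model actually reports):

```python
# Hypothetical post-processing of a diarization result. The
# (start_sec, end_sec, speaker_id) segment format is an assumption.
from collections import defaultdict
import numpy as np
import soundfile as sf

def extract_lead(audio_path, segments):
    # The "lead" is taken to be the speaker with the longest total duration.
    totals = defaultdict(float)
    for start, end, spk in segments:
        totals[spk] += end - start
    lead = max(totals, key=totals.get)

    audio, sr = sf.read(audio_path)
    # Concatenate every segment attributed to the lead speaker.
    pieces = [audio[int(s * sr):int(e * sr)]
              for s, e, spk in segments if spk == lead]
    sf.write("lead_speaker.wav", np.concatenate(pieces), sr)
```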

If the lead actor's voice is mixed with others, and you don't mind the output being resampled to 8 kHz (only information below 4 kHz is retained), you can take a look at this Conv-TasNet implementation, which comes with a pretrained model you can use.
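
Note that you would resample yourself before inference; assuming librosa and soundfile are installed, the downsampling step would look roughly like this:

```python
# Downsample to 8 kHz before feeding an 8 kHz pretrained model.
# Assumes librosa and soundfile are installed.
import librosa
import soundfile as sf

audio, sr = librosa.load("mixture.wav", sr=None)  # keep the original rate
audio_8k = librosa.resample(audio, orig_sr=sr, target_sr=8000)
sf.write("mixture_8k.wav", audio_8k, 8000)
```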

I'm not a professional researcher in this area, so please take my suggestions with a grain of salt.

@prasannapattam
Author

Thanks for your suggestions. I will try these two models.
