Experimental model based on DPTNet, aiming to extract a specific speaker's voice from noisy audio with high quality at 48kHz.

Target Speaker Speech Enhancement

Caution

This project is not ready for production use. Please consult other solutions.

This is a model that aims to extract ONLY ONE speaker's speech from all kinds of interference, including other speakers' voices, game SFX, background noise, etc.

It is designed to serve as an infrastructure model for downstream tasks such as ASR, SVC, SVS, TTS, etc.

Goals

  • Works with limited clean speech data
  • Removes overlapping speech from other speakers
  • Consistent across different recording devices
  • Low computational cost

How to use

WARNING: The project is still in its early stages. All information provided here is for documentation purposes only and is subject to change without notice.

Data preparation

You need clean data and dirty data to train this model. Here's an example file layout:

datasets
├── clean
│   └── spk0
│       └── bootstrap
│           ├── src_0_0.wav
│           ├── src_0_1.wav
│           ├── src_0_10.wav
│           ├── src_0_100.wav
│           ├── ...
│           └── src_5_99.wav
└── dirty
    ├── other
    │   └── stardew_valley
    │       ├── src_0.wav
    │       └── src_1.wav
    └── spk0
        └── stardew_valley
            ├── src_0.wav
            ├── src_1.wav
            └── src_2.wav

Here we use spk0's clean data (bootstrap) and other's stardew_valley noise to build an extractor specifically targeting spk0's speech mixed with Stardew Valley music, SFX, and ambient noise.

Make sure that all files in clean/spk0 are between 5 and 15 seconds long, otherwise you may get a CUDA OOM. You can use slicer.py to slice long, clean audio that contains only the target speaker's voice into segments that meet the length requirement. You can find more information about the slicer in the Appendix I section.
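If you want to sanity-check the lengths before training, something along these lines works. This is a minimal sketch using the soundfile package; the paths follow the example layout above, and everything else is illustrative rather than part of this project:

import glob
import soundfile as sf

# Check that every clean bootstrap segment is 5-15 s long.
for path in glob.glob("datasets/clean/spk0/bootstrap/*.wav"):
    info = sf.info(path)
    duration = info.frames / info.samplerate  # length in seconds
    if not 5.0 <= duration <= 15.0:
        print(f"{path}: {duration:.2f}s is outside the 5-15s range")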

Training

python train.py

Unfortunately we don't support command-line arguments yet, as we are still prototyping. You must edit the TrainArgs object passed to the train function in train.py.
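For reference, the edit inside train.py looks roughly like this. The field names below are hypothetical placeholders, not the actual TrainArgs fields; check the TrainArgs definition in train.py for the real ones:

# Inside train.py -- hypothetical sketch only; the real TrainArgs fields are defined there.
args = TrainArgs(
    clean_dir="datasets/clean/spk0",                             # placeholder field name
    dirty_dirs=["datasets/dirty/spk0", "datasets/dirty/other"],  # placeholder field name
    batch_size=1,                                                # placeholder field name
)
train(args)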

We implemented both Noise2Clean and Noise2Noise training; see dptnet_modules.N2NDPTNetModule for implementation details.

Inference

Pass the checkpoint path after the script name, like this:

python infer.py ./exp/test_b1_w16_d2_train/checkpoints/model-epoch=09.ckpt

If you want to run inference on a specific file, you need to manually edit infer.py: remove the glob.glob() call and specify the path to the file directly. You will also need to specify the checkpoint from the training phase when initializing InferArgs.
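Concretely, the edit inside infer.py looks roughly like this. The variable and field names are placeholders; the real ones are defined in infer.py:

# Inside infer.py -- hypothetical sketch only.
# before: files = glob.glob("<some input pattern>.wav")
files = ["path/to/recording_to_enhance.wav"]  # point directly at the file to process

args = InferArgs(
    ckpt_path="./exp/test_b1_w16_d2_train/checkpoints/model-epoch=09.ckpt",  # placeholder field name
)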

Data preparation

See dataset.py for implementation details.

Here's what happens inside MixedAudioDataset (a minimal sketch of the mixing math follows the list):

  • Load a clean wav $c$, then apply a random offset to shift it left or right, keeping its length by zero-padding on the left or right side.
  • Randomly pick a segment with the same length as the clean wav from a random file among all provided dirty folders. We denote it as $d$.
  • Calculate the maximum $w$ such that $\forall i \in [0, \lvert c \rvert),\ c_i + w d_i \in [-1, 1]$.
  • Return $c + v w_{\max} d$, where $v = 1 - r^2$ and $r \sim \mathcal{U}_{[0, 1]}$.
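The sketch below transcribes those four steps in numpy; it is not the actual dataset.py code, the offset range is illustrative, and it assumes $c$ and $d$ are equal-length arrays with $c$ already inside $[-1, 1]$:

import numpy as np

def mix(c: np.ndarray, d: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # 1. Random offset: shift c left or right, zero-padding to keep its length.
    offset = int(rng.integers(-len(c) // 4, len(c) // 4 + 1))  # illustrative offset range
    c = np.roll(c, offset)
    if offset > 0:
        c[:offset] = 0.0
    elif offset < 0:
        c[offset:] = 0.0

    # 2. d is assumed to already be a same-length segment from a dirty file.

    # 3. Largest w such that every sample of c + w * d stays inside [-1, 1].
    nz = d != 0
    w_max = np.min((1.0 - np.sign(d[nz]) * c[nz]) / np.abs(d[nz])) if nz.any() else 1.0

    # 4. Scale the noise by v * w_max with v = 1 - r^2, r ~ U[0, 1].
    r = rng.uniform(0.0, 1.0)
    return c + (1.0 - r ** 2) * w_max * d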

Here's what MixedAudioDataset produces:

[Preview plot of a mixed sample produced by MixedAudioDataset]

This plot is generated by calling python dataset.py. Feel free to modify the __main__ part to plot other things.

Other data augmentation schemes are under research.

Structure

We use DPTNet as the speech extractor for now.

Todo

  • Try frequency-domain solutions (e.g. diffusion-based approach)
  • Separate model-specific args from training args and infer args
  • Look into Noise2Noise
    • Implement n2n training scheme
  • Look into more noise2clean models & research papers
  • Look into WavLM
  • Models to look into
    • Conv-TasNet
    • Diffusion-based speech enhancement
  • Low-priorities
  • And most importantly, read more papers...

References & Acknowledgements

Appendix I: Slicer

First, we compute the sliding RMS of the given input wav $y$. Then, given a threshold $t$, we get a set of audio clips $\mathcal{C}$ such that $\forall c \in \mathcal{C}$, $\mathrm{rms}[l_c, r_c] > t$, where $\mathrm{rms}$ is the RMS curve and $l_c$, $r_c$ are the clip's left and right boundaries respectively.
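A frame-level sketch of that idea follows; the real slicer.py may frame and threshold differently, and the function name here is only illustrative:

import numpy as np

def rms_clips(y: np.ndarray, frame: int, threshold: float):
    # Sliding RMS: square root of the moving average of y^2 over `frame` samples.
    rms = np.sqrt(np.convolve(y ** 2, np.ones(frame) / frame, mode="same"))

    # Contiguous regions where the RMS curve exceeds the threshold form the clip set C.
    mask = rms > threshold
    edges = np.flatnonzero(np.diff(mask.astype(np.int8))) + 1
    bounds = np.concatenate(([0], edges, [len(y)]))
    return [(l, r) for l, r in zip(bounds[:-1], bounds[1:]) if mask[l]]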

Given $\mathcal{C}$ (with $n = \lvert \mathcal{C} \rvert$), we can compute the complementary set of silence segments $\mathcal{S}$ such that:

$$ \begin{aligned} \lvert \mathcal{S} \rvert &= n + 1 \\ l_{s_i} &= r_{c_{i - 1}} \\ r_{s_i} &= l_{c_{i}} \\ l_{s_0} &= 0 \\ r_{s_n} &= \lvert y\rvert \end{aligned} $$

The problem can now be converted to selecting some silence segments as split points such that each split is 5-15s away from the previous one. We use dynamic programming (or memoized search with an optimized search order) to solve this problem.

Let $\mathrm{dp}[i]$ be the maximum number of clips covered in set $\mathcal{C}_{:i-1}$ when $s_i$ is chosen as the split point. The transition equation is:

$$ \mathrm{dp}[i] = \max\left(\max_{l_{s_i} - r_{s_j} \in [\textrm{lower}, \textrm{upper})} (\mathrm{dp}[j] + i - j), \max_{l_{s_i} - r_{s_j} \notin [\textrm{lower}, \textrm{upper})}(\mathrm{dp}[j])\right) $$

See slicer.py:calc_dp for more implementation details, e.g. how to build equivalent but much faster conditions for each of the subexpressions in the outermost max function.
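For reference, an unoptimized O(n²) transcription of the recurrence looks like this (the real calc_dp replaces the two inner maxima with faster equivalent conditions, as noted above; boundaries and the lower/upper bounds are in samples, and the function name is illustrative):

def calc_dp_naive(silences, lower, upper):
    # silences: list of (l, r) boundaries, s_0 first and s_n last.
    # dp[i] = maximum number of clips covered when s_i is chosen as a split point.
    dp = [0] * len(silences)
    for i in range(1, len(silences)):
        for j in range(i):
            seg_len = silences[i][0] - silences[j][1]  # l_{s_i} - r_{s_j}
            if lower <= seg_len < upper:
                dp[i] = max(dp[i], dp[j] + i - j)  # the i - j clips in between are covered
            else:
                dp[i] = max(dp[i], dp[j])          # segment length invalid; those clips are dropped
    return dp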
