Caution
This project is not ready for production use. Please consider other solutions instead.
This is a model that aims to extract ONLY ONE speaker's speech from all kinds of interference, including other people's voices, game SFX, background noise, etc.
It is designed to serve as an infrastructural model for downstream tasks such as ASR, SVC, SVS, TTS, etc.
- Works with limited clean speech data
- Removes overlapping speech from other speakers
- Consistent across different recording devices
- Low computational cost
WARNING: The project is still in its early stages. All information provided here is for documentation purposes only and is subject to change without notice.
You need clean data and dirty data to train this model. Here's an example file layout:
datasets
├── clean
│   └── spk0
│       └── bootstrap
│           ├── src_0_0.wav
│           ├── src_0_1.wav
│           ├── src_0_10.wav
│           ├── src_0_100.wav
│           ├── ...
│           └── src_5_99.wav
└── dirty
    ├── other
    │   └── stardew_valley
    │       ├── src_0.wav
    │       └── src_1.wav
    └── spk0
        └── stardew_valley
            ├── src_0.wav
            ├── src_1.wav
            └── src_2.wav
Here we use spk0's clean data (bootstrap) and other's stardew_valley noise to build an extractor specifically targeted at extracting spk0's speech from mixtures containing Stardew Valley music, SFX, and ambient noise.
Make sure that all files in clean/spk0 are 5-15 s long, otherwise you may get a CUDA OOM. You can use slicer.py to slice long, clean audio that contains only the target speaker's voice into segments that meet the length requirement. You can find more information about the slicer in the Appendix I section.
python train.py
Unfortunately we don't support command-line arguments yet, as we are still prototyping things. You must edit the TrainArgs object passed into the train function in train.py.
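For example, editing that object might look roughly like the sketch below. Every TrainArgs field shown here is hypothetical; check the actual definition in train.py for the real names.

```python
# Inside train.py -- a hypothetical sketch; the real TrainArgs fields may differ.
args = TrainArgs(
    clean_dir="datasets/clean/spk0",                             # assumed field name
    dirty_dirs=["datasets/dirty/spk0", "datasets/dirty/other"],  # assumed field name
    max_epochs=10,                                               # assumed field name
)
train(args)
```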
We implemented both Noise2Clean and Noise2Noise training; see dptnet_modules.N2NDPTNetModule for implementation details.
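For background, the core idea of Noise2Noise training is to regress one noisy mix onto a second, independently-noised mix of the same utterance instead of onto the clean signal; with an L2 loss and zero-mean noise, the optimum still converges to the clean speech. The snippet below only illustrates this general idea and is not taken from N2NDPTNetModule.

```python
import numpy as np

def n2n_pair(clean: np.ndarray, noise_a: np.ndarray, noise_b: np.ndarray, w: float):
    """Build a Noise2Noise training pair: two independent noisy mixes of the same clean wav."""
    net_input = clean + w * noise_a  # what the network sees
    target = clean + w * noise_b     # regression target -- still noisy, never the clean wav
    return net_input, target
```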
Pass the checkpoint path after the script name, like this:
python infer.py ./exp/test_b1_w16_d2_train/checkpoints/model-epoch=09.ckpt
If you want to run inference on a single file, you need to manually edit infer.py: remove the glob.glob() call and specify the path to the file directly. You will also need to specify the checkpoint from the training phase when initializing InferArgs.
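As a rough illustration of that edit (the surrounding code and variable names in infer.py will differ), it might look like this:

```python
# Inside infer.py -- a hypothetical sketch; the actual variable names may differ.
# files = glob.glob("./recordings/*.wav")   # original: run inference on every match
files = ["./recordings/my_recording.wav"]   # manual edit: run inference on one file

args = InferArgs(
    checkpoint="./exp/test_b1_w16_d2_train/checkpoints/model-epoch=09.ckpt",  # assumed field name
)
```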
See dataset.py for implementation details.
Here's what happens inside MixedAudioDataset:
- Load a clean wav $c$, then apply a random offset to shift it left or right, keeping its length by zero-padding on the left or right side.
- Randomly pick a segment with the same length as the clean wav from a random file in all provided dirty folders. We denote it as $d$.
- Calculate the maximum $w$ such that $\forall i \in [0, \lvert c \rvert),\; c_i + w d_i \in [-1, 1]$.
- Return $c + v w_{\max} d$, where $v = 1 - r^2$ and $r \sim \mathcal{U}_{[0, 1]}$.
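A minimal sketch of that mixing step, assuming $c$ and $d$ are equal-length float waveforms normalized to $[-1, 1]$ (the actual implementation lives in MixedAudioDataset in dataset.py and may differ in details):

```python
import numpy as np

def mix_clean_dirty(c: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Mix a clean wav c with an equally long dirty segment d, as described above."""
    # Per-sample upper bound on w so that c_i + w * d_i stays inside [-1, 1].
    with np.errstate(divide="ignore", invalid="ignore"):
        bounds = np.where(d != 0, (np.sign(d) - c) / d, np.inf)
    w_max = float(np.min(bounds))
    if not np.isfinite(w_max):   # d is all zeros -> nothing to mix in
        return c.copy()
    # Noise gain v = 1 - r^2 with r ~ U[0, 1], which is biased toward louder noise.
    r = np.random.uniform(0.0, 1.0)
    v = 1.0 - r ** 2
    return c + v * w_max * d
```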
Here's an example of what MixedAudioDataset produces:
This plot is generated by calling python dataset.py. Feel free to modify the __main__ part to plot other things.
Other data augmentation schemes are under investigation.
We use DPTNet as the speech extractor for now.
- Try frequency-domain solutions (e.g. diffusion-based approaches)
- Separate model-specific args from training args and infer args
- Look into Noise2Noise
  - Implement the N2N training scheme
- Look into more noise2clean models & research papers
  - Look into WavLM
- Models to look into
  - Conv-TasNet
  - Diffusion-based speech enhancement
- Low priorities
  - Apply sigma-reparam to the transformers
  - Substitute the transformer block with RWKV
- And most importantly, read more papers...
First, we compute the sliding RMS of the given input wav.
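A minimal sketch of such a sliding RMS, assuming a mono float waveform and frame/hop sizes in samples (the window and hop actually used by slicer.py may differ):

```python
import numpy as np

def sliding_rms(wav: np.ndarray, frame: int = 2048, hop: int = 512) -> np.ndarray:
    """RMS of each frame-sized window, moved hop samples at a time."""
    n_frames = 1 + max(0, len(wav) - frame) // hop
    rms = np.empty(n_frames)
    for i in range(n_frames):
        seg = wav[i * hop : i * hop + frame]
        rms[i] = np.sqrt(np.mean(seg ** 2))
    return rms
```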
Given the sliding RMS, we can identify silence segments (regions where the RMS stays low). The problem can now be converted to selecting some of these silence segments as split points such that each split is 5-15 s away from the previous one. We use dynamic programming (or memoized searching with an optimized search order) to solve this problem.
Let the DP state record, for each candidate split point, whether a valid sequence of splits ending there exists; the transition takes a max over the admissible earlier split points. See slicer.py:calc_dp for more implementation details, e.g. how to build equivalent but much faster conditions for each of the subexpressions in the outermost max function.
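As an illustration only (the real recurrence and its optimized condition checks live in slicer.py:calc_dp), a memoized search over candidate split points could look like the sketch below; choose_splits and its parameters are hypothetical names.

```python
from functools import lru_cache
from typing import List, Optional

def choose_splits(silences: List[float], total_len: float,
                  min_seg: float = 5.0, max_seg: float = 15.0) -> Optional[List[float]]:
    """Pick split points (in seconds) from candidate silence positions so that every
    resulting segment is between min_seg and max_seg long. Returns None if impossible."""
    points = [0.0] + sorted(silences) + [total_len]  # the end of the file is the final split
    last = len(points) - 1

    @lru_cache(maxsize=None)
    def search(i: int):
        # The previous split sits at points[i]; try to reach the end of the file.
        if i == last:
            return ()
        for j in range(i + 1, last + 1):
            gap = points[j] - points[i]
            if gap < min_seg:
                continue
            if gap > max_seg:
                break                       # points are sorted, later gaps only grow
            rest = search(j)
            if rest is not None:
                return (points[j],) + rest  # found a valid continuation
        return None                          # no admissible next split from here

    result = search(0)
    return None if result is None else list(result)
```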