A collection of data importers for various audio sources. A loose manual data pipeline.
```sh
pip install dataimporters
```
The audio sources have to be provided manually (for now).
The scripts expect a data directory containing the audio folders:
```
root
|- data/
   |- original/      (where you have to place the soundbanks)
   |- intermediate/  (generated)
   |- dataset/       (generated)
```
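If you are starting from scratch, the layout can be created in a couple of lines. This is a minimal sketch; the package does not require this exact snippet, and the relative `data/` path is an assumption:

```python
import os

# Create the expected data layout: `original/` is where you place the
# soundbanks, while `intermediate/` and `dataset/` are generated by the pipeline.
for folder in ["original", "intermediate", "dataset"]:
    os.makedirs(os.path.join("data", folder), exist_ok=True)
```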
We use nbdev, which compiles all notebooks into a package. The source is in the `nbs/` folder.
To create a new dataset package, we simply:

- define and process all sources,
- import `Dataset`,
- give it the sources we'd like to include and the path to our data,
- call `Dataset.compile`.

This will process all sources and build a final `dataset.zip` file.
The library is flexible, but here's the simplest and most common action we perform. For annotations, see the `nbs/12_review.ipynb` notebook.
```python
#hide_output
from DataImporters.core import load_version

DATA_DIR = "data/"
VERSION = load_version()
VERSION
```

```
18
```
```python
from DataImporters.sources.core import process
from DataImporters.sources.space_divers_mini import SpaceDiversMini
from DataImporters.sources.footsteps_one_ppsfx import FootstepsOnePpsfx
from DataImporters.sources.footsteps_two_ppsfx import FootstepsTwoPpsfx
from DataImporters.sources.edward import Edward
from DataImporters.sources.barefoot_metal_sonniss import BarefootMetalSonniss
from DataImporters.sources.custom_fsd import CustomFsd

all_sources = [
    SpaceDiversMini(),
    FootstepsOnePpsfx(),
    FootstepsTwoPpsfx(),
    Edward(),
    BarefootMetalSonniss(),
    CustomFsd()
]

for source in all_sources:
    process(source, DATA_DIR, VERSION)
```
Below are two examples: one creates a large dataset with the automatic processors; the other creates a smaller, balanced dataset from manual annotations. Choose one to run, then jump to Verify Output.
```python
from DataImporters.dataset import Dataset, DatasetPaths

DATASET_NAME = "large"

# Same as `all_sources`, excluding SpaceDiversMini
sources = [
    FootstepsOnePpsfx(),
    FootstepsTwoPpsfx(),
    Edward(),
    BarefootMetalSonniss(),
    CustomFsd()
]

paths = DatasetPaths(DATA_DIR, DATASET_NAME)
metadata = Dataset(sources, paths).compile()
metadata.shape[0]
```
```
Warning: 206 duplicate rows found. Some rows were dropped (all files copied).

1646
```
```python
import os

from DataImporters.dataset import Dataset, DatasetPaths

DATASET_NAME = "small_balanced"
ANNOTATION_PATH = os.path.join(DATA_DIR, "annotations", DATASET_NAME + ".csv")

sources = [
    CustomFsd()
]

paths = DatasetPaths(DATA_DIR, DATASET_NAME, ANNOTATION_PATH)
metadata = Dataset(sources, paths).compile()
metadata.shape[0]
```
```
Warning: 207 duplicate rows found. Some rows were dropped (all files copied).

284
```
`Dataset.compile` returns the newly created metadata (which has already been saved to `DATA_PATH`). We can use it to confirm we did indeed copy all files: since the metadata aggregates all the source metadata, a missing file will still appear in it. Conversely, it also tells us when a file has been deleted from the source but still exists in the dataset folder.
```python
import os

assert len(os.listdir(paths.audio_output_path)) == len(metadata)
```
Everything looks good, so we should bump the version.

```python
#hide_output
from DataImporters.core import bump_version

bump_version()
```
If the assertion fails, this could be due to:

- A genuine failure to copy
- Some files in the target folder needing deletion
  - Please delete them manually; there is no code for this yet
- A hash conflict (same content from different sources)
  - In this case, we must debug the sources and make sure there are no duplicates
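Since the library has no clean-up code yet, stale files have to be removed by hand. Below is a minimal sketch, assuming the `paths` and `metadata` objects from the cells above and that the `filename` column matches the names on disk:

```python
import os

# Delete files that are present in the audio output folder but are no longer
# listed in the compiled metadata.
expected = set(metadata["filename"])
on_disk = set(os.listdir(paths.audio_output_path))

for orphan in sorted(on_disk - expected):
    os.remove(os.path.join(paths.audio_output_path, orphan))
    print(f"Deleted orphan file: {orphan}")
```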
The compiled dataset has the following structure:

```
dataset/
|- README.md
|- metadata.csv
|- audio/
   |- (long list of audio files; filenames are the xxhash64 of the content)
```
`metadata.csv` contains a list of all the files in the dataset and their labels.

| Column | Description |
|---|---|
| filename | File name; all files are assumed to be inside the `audio` folder |
| category | Single major category name |
| label | Escaped (`""`), comma-separated list of labels, in snake_case |
| extra | Extra text/details available for this row (unstructured) |
| source | Name of the original sound library, in snake_case |
| version | Version of the last change (limited to the last change only) |

`version` is a simple incremental integer. If you need to check whether a file was changed or added, simply check whether the row's `version` is higher than the last `version` you ran. Deletes are not supported yet.
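For example, an incremental consumer could filter the metadata on the `version` column. This is only a sketch: the metadata path and the previously processed version are assumptions.

```python
import pandas as pd

LAST_SEEN_VERSION = 17  # assumed: the version you processed on the previous run

# The path is an assumption; point it at the compiled dataset's metadata.csv.
metadata = pd.read_csv("data/dataset/large/metadata.csv")

# Rows added or changed since the last run.
changed = metadata[metadata["version"] > LAST_SEEN_VERSION]
print(f"{len(changed)} files added or changed since version {LAST_SEEN_VERSION}")
```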
Here's an example from the sample code run earlier:
|   | filename | category | label | extra | source | version |
|---|---|---|---|---|---|---|
| 0 | 7620671de38cc6d1.wav | Wood | Creaky,Door,Close,Wooden,Squeaky,Squeaking,Woo... | NaN | custom_fsd | 18 |
| 3 | 12193cedf99e9427.wav | Wood | Knock,Wood,Knocking,Knock | NaN | custom_fsd | 18 |
```mermaid
flowchart TD
    sa[(Source A)] --> pa([Normalise data and create CSV]);
    pa --> ia[(Intermediate A)];
    sb[(Source B)] --> pb([Normalise data and create CSV]);
    pb --> ib[(Intermediate B)];
    ia & ib & a(WIP: Manual annotations by hash) --> c([Compile])
    c -- Some rows can be rejected at this stage --> d[(Dataset)];
```
Each loader outputs:

- a CSV, which is then compiled into a single `metadata.csv`,
- the audio files, copied into an intermediate folder (a sketch of this step is shown below).

The process above is done so that:

- each source is independent,
- we can easily compile a final dataset from different combinations of sources,
- it is easier to keep the split consistent across runs.
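To make the loader contract concrete, here is an illustrative, stand-alone sketch of the normalise step. It does not use the library's actual source API; the function name, folder layout, and column handling are assumptions that mirror the description above:

```python
import csv
import os
import shutil

import xxhash  # files in the final dataset are named by the xxhash64 of their content

COLUMNS = ["filename", "category", "label", "extra", "source", "version"]

def normalise_source(source_dir, intermediate_dir, source_name, version):
    """Copy audio into an intermediate folder and write its metadata CSV."""
    os.makedirs(intermediate_dir, exist_ok=True)
    rows = []
    for name in sorted(os.listdir(source_dir)):
        if not name.lower().endswith(".wav"):
            continue
        src_path = os.path.join(source_dir, name)
        with open(src_path, "rb") as f:
            digest = xxhash.xxh64(f.read()).hexdigest()
        new_name = digest + ".wav"
        shutil.copy(src_path, os.path.join(intermediate_dir, new_name))
        rows.append({
            "filename": new_name,
            "category": "",   # source-specific logic would fill these in
            "label": "",      # comma-separated, snake_case labels
            "extra": "",
            "source": source_name,
            "version": version,
        })
    with open(os.path.join(intermediate_dir, "metadata.csv"), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```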