Seamless - Speech to Speech and Speech to Text Metadata

This document describes the metadata needed to reconstruct the dataset we used for training our models.

Format

The metadata format is similar to the NLLB bitext format, with some small differences.

The metadata files are tab-separated, gzip-compressed files. Each file corresponds to one alignment direction.

File naming convention:

  • for text, we use 3 letters: e.g. fra, eng, tur
  • for audio, we use 2 letters and an 'A': e.g. frA, enA, trA

For example, the direction eng-trA corresponds to information for reconstructing English text with Turkish speech alignments. Similarly, enA-jpn corresponds to "English speech with Japanese text", and enA-frA corresponds to "English speech with French speech".
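As a minimal sketch of the naming convention above, the following splits a direction into its two sides and labels each as text or audio. The helper names `parse_side` and `parse_direction` are hypothetical and not part of any released tooling; the rule they apply (three lowercase letters for text, two lowercase letters plus a capital `A` for audio) is taken directly from the convention described here.

```python
def parse_side(code):
    """Classify one side of a direction as 'text' or 'audio' from its 3-character code."""
    if len(code) != 3:
        raise ValueError(f"expected a 3-character code, got {code!r}")
    # Audio codes: two lowercase letters followed by a capital 'A' (e.g. enA, trA).
    if code[:2].islower() and code[2] == "A":
        return code, "audio"
    # Text codes: three lowercase letters (e.g. eng, tur).
    if code.islower():
        return code, "text"
    raise ValueError(f"unrecognised code {code!r}")


def parse_direction(direction):
    """Split a direction such as 'eng-trA' into two (code, modality) pairs."""
    left, right = direction.split("-")
    return parse_side(left), parse_side(right)


if __name__ == "__main__":
    for name in ("eng-trA", "enA-jpn", "enA-frA"):
        print(name, parse_direction(name))
```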

Each line has 11 columns.

For audio, the columns correspond to:

- `cc_warc`: The WARC file reference containing the public audio URL
- `cc_sha`: not used
- `audio_speeh_segment_url`: space-separated audio reference; see below
- `cc_lineno`: not used
- `paragraph_digest`: expected duration of the whole audio file (without start/end frame trimming)
- `sentence_digest`: not used
- `text_lid_score`: not used
- `laser_score`: score of the alignment
- `direction`: direction, e.g. `enA-jpn`
- `side`: side, e.g. `enA` or `jpn`
- `line_no`: alignment number

`audio_speeh_segment_url` is a space-separated audio reference with the format `<url> <start_frame> <end_frame>`. `start_frame` and `end_frame` delimit the segment to extract from the audio file referenced at `<url>`, after that file has been resampled to 16000 Hz.
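The audio row layout can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's own loader: `parse_audio_row` and `AudioSegment` are hypothetical names, the sample row is made up, and the duration calculation assumes that `start_frame` and `end_frame` are sample indices in the 16000 Hz resampled audio.

```python
from typing import NamedTuple

SAMPLE_RATE_HZ = 16_000  # segments are indexed on audio resampled to 16000 Hz

# Column names in the order listed for audio rows.
AUDIO_COLUMNS = [
    "cc_warc", "cc_sha", "audio_speeh_segment_url", "cc_lineno",
    "paragraph_digest", "sentence_digest", "text_lid_score",
    "laser_score", "direction", "side", "line_no",
]


class AudioSegment(NamedTuple):
    url: str
    start_frame: int
    end_frame: int

    @property
    def duration_seconds(self) -> float:
        # Assumption: frames are sample indices at SAMPLE_RATE_HZ.
        return (self.end_frame - self.start_frame) / SAMPLE_RATE_HZ


def parse_audio_row(line):
    """Split one tab-separated audio row and decode its segment reference."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(AUDIO_COLUMNS):
        raise ValueError(f"expected {len(AUDIO_COLUMNS)} columns, got {len(values)}")
    row = dict(zip(AUDIO_COLUMNS, values))
    parts = row["audio_speeh_segment_url"].split(" ")
    if len(parts) != 3:
        raise ValueError("expected '<url> <start_frame> <end_frame>'")
    url, start, end = parts
    return row, AudioSegment(url, int(start), int(end))


if __name__ == "__main__":
    # Made-up row for illustration only; real rows come from the metadata files.
    sample = "\t".join([
        "crawl-data/example.warc.gz", "", "https://example.org/a.mp3 16000 48000",
        "", "120.5", "", "", "1.25", "enA-jpn", "enA", "7",
    ])
    row, segment = parse_audio_row(sample)
    print(segment.url, segment.duration_seconds, row["laser_score"])
```

Unused columns are kept as raw strings, so the same row can be written back unchanged.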

For text, the columns are similar to the NLLB format (except that they are tab-separated here):

  • If the metadata comes from Common Crawl:

    • cc_warc: the reference to the Common Crawl WET file
    • cc_sha: the document sha1 in the WET file
    • cc_document_url: the url of the document referenced in the WET file
    • cc_lineno: the line number in the document referenced in the WET file
    • paragraph_digest: xxhash.xxh3_64_intdigest of the paragraph
    • sentence_digest: xxhash.xxh3_64_intdigest of the sentence
    • text_lid_score: language identification score, when available
    • laser_score: score of the alignment
    • direction: direction, e.g. enA-jpn
    • side: side, e.g. enA or jpn
    • line_no: alignment number
  • If the metadata comes from another corpus:

    • corpus: corpus name
    • cc_sha: not used
    • cc_document_url: not used
    • lineno: line number in the document
    • paragraph_digest: xxhash.xxh3_64_intdigest of the paragraph
    • sentence_digest: xxhash.xxh3_64_intdigest of the sentence
    • text_lid_score: language identification score, when available
    • laser_score: score of the alignment
    • direction: direction, e.g. enA-jpn
    • side: side, e.g. enA or jpn
    • line_no: alignment number
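The two text layouts above can be told apart when reading a row. The sketch below is illustrative only: `parse_text_row` is a hypothetical helper, and it assumes that Common Crawl rows are exactly those whose first column starts with `crawl-data`, which is the filter used by the `egrep ^crawl-data` example in this document.

```python
# Column names for text rows, in the order listed above.
CC_TEXT_COLUMNS = [
    "cc_warc", "cc_sha", "cc_document_url", "cc_lineno",
    "paragraph_digest", "sentence_digest", "text_lid_score",
    "laser_score", "direction", "side", "line_no",
]
OTHER_TEXT_COLUMNS = [
    "corpus", "cc_sha", "cc_document_url", "lineno",
    "paragraph_digest", "sentence_digest", "text_lid_score",
    "laser_score", "direction", "side", "line_no",
]


def parse_text_row(line):
    """Return (source, row) where source is 'common_crawl' or 'other_corpus'."""
    values = line.rstrip("\n").split("\t")
    if len(values) != 11:
        raise ValueError(f"expected 11 columns, got {len(values)}")
    # Assumption: Common Crawl references start with 'crawl-data'.
    if values[0].startswith("crawl-data"):
        return "common_crawl", dict(zip(CC_TEXT_COLUMNS, values))
    return "other_corpus", dict(zip(OTHER_TEXT_COLUMNS, values))
```

The digests are left as strings here; to check a retrieved paragraph or sentence against them, one would compare with `xxhash.xxh3_64_intdigest` of the text, which requires the third-party `xxhash` package.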

Data

Update: 25 Sep 2023

We are publishing updated metadata with the expected duration of the original audio file in the column paragraph_digest (originally not used for audio).

arb-enA ben-enA cat-enA dan-enA enA-est enA-fin enA-jpn enA-mlt enA-nld enA-pol enA-por enA-ron enA-slk enA-swe enA-swh enA-tur enA-ukr enA-urd enA-vie arA-enA arA-eng beA-enA caA-enA caA-eng csA-enA csA-eng cyA-enA cyA-eng daA-enA daA-eng deA-enA deA-eng enA-esA enA-fiA enA-frA enA-hiA enA-idA enA-itA enA-knA enA-koA enA-mtA enA-nlA enA-plA enA-ptA enA-rnA enA-ruA enA-skA enA-svA enA-swA enA-taA enA-teA enA-tgA enA-thA enA-trA enA-ukA enA-urA enA-uzA enA-viA enA-zhA eng-esA eng-fiA eng-frA eng-hiA eng-idA eng-itA eng-knA eng-koA eng-mtA eng-nlA eng-plA eng-ptA eng-rnA eng-ruA eng-skA eng-swA eng-taA eng-teA eng-tgA eng-thA eng-trA eng-ukA eng-urA eng-uzA eng-viA eng-zhA

You can find the legacy metadata (without duration information) here:

Legacy Data

arb-enA ben-enA cat-enA dan-enA enA-est enA-fin enA-jpn enA-mlt enA-nld enA-pol enA-por enA-ron enA-slk enA-swe enA-swh enA-tur enA-ukr enA-urd enA-vie arA-enA arA-eng beA-enA caA-enA caA-eng csA-enA csA-eng cyA-enA cyA-eng daA-enA daA-eng deA-enA deA-eng enA-esA enA-fiA enA-frA enA-hiA enA-idA enA-itA enA-knA enA-koA enA-mtA enA-nlA enA-plA enA-ptA enA-rnA enA-ruA enA-skA enA-svA enA-swA enA-taA enA-teA enA-tgA enA-thA enA-trA enA-ukA enA-urA enA-uzA enA-viA enA-zhA eng-esA eng-fiA eng-frA eng-hiA eng-idA eng-itA eng-knA eng-koA eng-mtA eng-nlA eng-plA eng-ptA eng-rnA eng-ruA eng-skA eng-swA eng-taA eng-teA eng-tgA eng-thA eng-trA eng-ukA eng-urA eng-uzA eng-viA eng-zhA

Download script

You can use the wet_lines script to download and gather aligned text information from the metadata. This script can be found here.

Example usage:

zcat seamless.dataset.metadata.public.enA-swA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | wet_lines

Based on metadata information it receives from stdin, wet_lines will download the corpora, find the paragraph and print the input with an additional column which corresponds to the text of the paragraph.
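For readers who prefer to prepare the input in Python, the preprocessing half of the pipeline above (`zcat ... | egrep ^crawl-data | tr '\t' ' '`) can be approximated with the standard library. This is a sketch, not a replacement for `wet_lines`: the function name `iter_crawl_lines` is hypothetical, and the output would still need to be piped to `wet_lines` on stdin.

```python
import gzip
import sys


def iter_crawl_lines(path):
    """Yield Common Crawl rows from a gzip metadata file, with tabs replaced by spaces."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Same filter as `egrep ^crawl-data`.
            if line.startswith("crawl-data"):
                # Same transformation as `tr '\t' ' '`.
                yield line.rstrip("\n").replace("\t", " ")


if __name__ == "__main__":
    # Usage: python this_script.py <metadata.tsv.gz> | wet_lines
    for out_line in iter_crawl_lines(sys.argv[1]):
        print(out_line)
```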

In order to retrieve the sentences from these paragraphs, one can use the sentence splitter available here. It will print the input (metadata + paragraph) with an additional column which corresponds to the text of the sentence.

Reconstructing sentences from metadata:

xzcat metadatafile.xz | egrep ^crawl-data | wet_lines | python -c "from sentence_cleaner_splitter.cleaner_splitter import *; split_clean()"