Merge pull request #1 from athena-team/master
merge from athena
shuaijiang authored Jan 26, 2021
2 parents 90175ba + a2f51db commit 0f2a263
Showing 11 changed files with 494 additions and 58 deletions.
40 changes: 29 additions & 11 deletions README.md
@@ -1,6 +1,3 @@



# Athena

*Athena* is an open-source implementation of an end-to-end speech processing engine. Our vision is to empower both industrial applications and academic research on end-to-end models for speech processing. To make speech processing available to everyone, we also release example implementations and recipes on several open-source datasets for various tasks (Automatic Speech Recognition, Speech Synthesis, Voice Conversion, Speaker Recognition, etc.).
@@ -32,9 +29,12 @@ All of our models are implemented in Tensorflow>=2.0.1. For ease of use, we prov
- [5.1) WFST graph creation](#51-wfst-graph-creation)
- [5.2) WFST decoding](#52-wfst-decoding)
- [6) Deployment](#6-deployment)
- [7) Results](#7-results)
- [7.1) ASR](#71-asr)
- [8) Directory Structure](#8-directory-structure)
- [7) Self-supervised speech representation learning](#7-self-supervised-speech-representation-learning)
- [7.1) MPC](#71-mpc)
- [7.2) Speech SimCLR](#72-speech-simclr)
- [8) Results](#8-results)
- [8.1) ASR](#81-asr)
- [9) Directory Structure](#9-directory-structure)

## 2) Key Features

@@ -111,7 +111,7 @@ python athena/main.py examples/translate/spa-eng-example/transformer.json
```bash
source tools/env.sh
python examples/translate/spa-eng-example/prepare_data.py examples/translate/spa-eng-example/data/train.csv
horovodrun -np 4 -H localhost:4 athena/horovod_main.py examples/translate/spa-eng-example/transformer.json
horovodrun -np 4 -H localhost:4 python athena/horovod_main.py examples/translate/spa-eng-example/transformer.json
```

### Notes
@@ -320,21 +320,39 @@ $ ./asr
Detailed implementation is described [here](deploy/README.md).
## 7) Results
## 7) Self-supervised speech representation learning
### 7.1) MPC
Masked Predictive Coding (MPC) uses a masked reconstruction objective to perform predictive coding with Transformer-based models. It has achieved significant improvements on various speech recognition datasets. For more information, please refer to the following papers.
[Improving Transformer-based Speech Recognition Using Unsupervised Pre-training](https://arxiv.org/abs/1910.09932.pdf)
[A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition](https://arxiv.org/pdf/2005.09862.pdf)
MPC models can be trained by running ```python athena/main.py examples/asr/*/configs/mpc.json```. To use a pretrained MPC model in ASR training, set the "pretrained_model" field in the ASR JSON config to the checkpoint directory of the MPC model and proceed with training.
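As a rough illustration of the masked-reconstruction idea (not Athena's implementation; the function name, `chunk_size`, `mask_ratio` and the generic `encoder` callable are assumptions made for this sketch, and the actual training is driven entirely by the JSON configs above):

```python
import tensorflow as tf

def masked_reconstruction_loss(encoder, feats, chunk_size=7, mask_ratio=0.15):
    """MPC-style masked reconstruction objective (sketch).

    feats: [batch, time, dim] acoustic features; encoder maps that shape to itself.
    """
    batch = tf.shape(feats)[0]
    time = tf.shape(feats)[1]
    dim = tf.cast(tf.shape(feats)[2], tf.float32)
    # pick random chunk starts and expand each into a contiguous block of frames
    num_chunks = tf.maximum(
        tf.cast(tf.cast(time, tf.float32) * mask_ratio, tf.int32) // chunk_size, 1)
    starts = tf.random.uniform([batch, num_chunks], maxval=time, dtype=tf.int32)
    positions = tf.clip_by_value(
        starts[:, :, tf.newaxis] + tf.range(chunk_size)[tf.newaxis, tf.newaxis, :],
        0, time - 1)
    mask = tf.reduce_max(tf.one_hot(tf.reshape(positions, [batch, -1]), depth=time), axis=1)
    mask = mask[:, :, tf.newaxis]                 # [batch, time, 1], 1.0 at masked frames
    predictions = encoder(feats * (1.0 - mask))   # the encoder only sees the unmasked frames
    # L1 reconstruction loss, counted on the masked frames only
    return tf.reduce_sum(tf.abs(predictions - feats) * mask) / (tf.reduce_sum(mask) * dim)
```

Only the masked frames contribute to the loss here, which is what makes the objective predictive rather than plain autoencoding.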
### 7.2) Speech SimCLR
Speech SimCLR is a new self-supervised objective for speech representation learning. During training, Speech SimCLR applies augmentation to the raw speech and to its spectrogram. Its objective combines a contrastive loss, which maximizes agreement between differently augmented samples in the latent space, with a reconstruction loss on the input representation. For more information, please refer to the following paper.
[Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning](https://arxiv.org/abs/2010.13991.pdf)
For now, pre-training with Speech SimCLR is only supported for LibriSpeech. You can run it with ```python athena/main.py examples/asr/librispeech/configs/speech_simclr.json```. For feature extraction, run ```python athena/inference.py examples/asr/librispeech/configs/speech_simclr.json```. The pre-trained Speech SimCLR models can be found [here](https://drive.google.com/file/d/1YYFmtB1RHRuw8s7lPWLxjihye9ssI5ax/view?usp=sharing).
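For intuition, the contrastive term is a standard NT-Xent loss over two augmented views of the same batch of utterances. The sketch below is illustrative only (the function name and default temperature are assumptions), and it omits the reconstruction term and the Horovod cross-replica concat that the `ContrastiveLoss` added in `athena/loss.py` uses for multi-GPU training:

```python
import tensorflow as tf

def nt_xent_loss(hidden_a, hidden_b, temperature=0.5):
    """NT-Xent contrastive loss over two augmented views.

    hidden_a, hidden_b: [batch, dim] projections of the two augmentations.
    """
    hidden_a = tf.math.l2_normalize(hidden_a, axis=-1)
    hidden_b = tf.math.l2_normalize(hidden_b, axis=-1)
    batch = tf.shape(hidden_a)[0]
    labels = tf.one_hot(tf.range(batch), batch * 2)   # positive = counterpart in the other view
    masks = tf.one_hot(tf.range(batch), batch)        # used to exclude self-similarity
    logits_aa = tf.matmul(hidden_a, hidden_a, transpose_b=True) / temperature - masks * 1e9
    logits_bb = tf.matmul(hidden_b, hidden_b, transpose_b=True) / temperature - masks * 1e9
    logits_ab = tf.matmul(hidden_a, hidden_b, transpose_b=True) / temperature
    logits_ba = tf.matmul(hidden_b, hidden_a, transpose_b=True) / temperature
    loss_a = tf.nn.softmax_cross_entropy_with_logits(
        labels, tf.concat([logits_ab, logits_aa], axis=1))
    loss_b = tf.nn.softmax_cross_entropy_with_logits(
        labels, tf.concat([logits_ba, logits_bb], axis=1))
    return tf.reduce_mean(loss_a + loss_b)
```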
## 8) Results
### 7.1) ASR
### 8.1) ASR
Language | Model Name | Training Data | Hours of Speech | Error Rate
:-----------: | :------------: | :----------: | -------: | -------:
English | Transformer | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h | 3.1%(WER)
English | Transformer | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h | 3.1% (WER)
English | Transformer | [Switchboard Dataset](https://catalog.ldc.upenn.edu/LDC97S62) | 260h | 8.6% (WER) |
English | Transformer | [TIMIT Dataset](https://catalog.ldc.upenn.edu/LDC93S1) | 3 h | 16.8% (PER) |
Mandarin | Transformer | HKUST Dataset | 151 h | 22.75% (CER)
Mandarin | Transformer | [AISHELL Dataset](http://www.openslr.org/33/) | 178 h | 6.6% (CER)
To compare with other published results, see [wer_are_we.md](docs/tutorials/wer_are_we.md).
## 8) Directory Structure
## 9) Directory Structure
Below is the basic directory structure for Athena
4 changes: 4 additions & 0 deletions athena/data/datasets/speech_synthesis.py
@@ -115,6 +115,10 @@ def __getitem__(self, index):
duration_index = []
for index, duration in enumerate(durations):
duration_index.extend(list([index]) * int(duration))
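# clip the frame-level index map to the number of acoustic frames and, if it is
# shorter, pad it by repeating the last phoneme index so every frame keeps an alignment entry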
duration_index = duration_index[: audio_feat_length]
if 0 < len(duration_index) < audio_feat_length:
expanded_index = list([duration_index[-1]]) * int(audio_feat_length - len(duration_index))
duration_index.extend(expanded_index)
return {
"input": text,
"input_length": text_length,
18 changes: 8 additions & 10 deletions athena/data/text_featurizer.py
@@ -20,8 +20,8 @@
import re
import warnings
from collections import defaultdict
import sentencepiece as spm
import tensorflow as tf
import tensorflow_text as text
from ..utils.hparam import register_and_parse_hparams


@@ -119,33 +119,31 @@ def encode(self, sentence):
return [self.stoi[token] for token in sentence.strip().split(' ')]

class SentencePieceFeaturizer:
"""SentencePieceFeaturizer
"""SentencePieceFeaturizer using tensorflow-text api
"""

def __init__(self, spm_file):
self.unk_index = 0
self.sp = spm.SentencePieceProcessor()
if spm_file is not None:
self.sp.Load(spm_file)
self.model = open(spm_file, "rb").read()
self.sp = text.SentencepieceTokenizer(model=self.model)

def load_model(self, model_file):
"""load sentence piece model
"""
self.sp.Load(model_file)
self.sp = text.SentencepieceTokenizer(model=open(model_file, "rb").read())

def __len__(self):
return self.sp.GetPieceSize()
return self.sp.vocab_size()

def encode(self, sentence):
"""convert a sentence to a list of ids by sentence piece model
"""
sentence = sentence.upper()
return self.sp.EncodeAsIds(sentence)
return self.sp.tokenize(sentence)

def decode(self, ids):
"""convert a list of ids to a sentence
"""
return self.sp.DecodeIds(ids)
return self.sp.detokenize(ids)

class TextTokenizer:
"""TextTokenizer
23 changes: 23 additions & 0 deletions athena/loss.py
@@ -668,3 +668,26 @@ def ClassifyLoss(target_label_reshaped, domain_out_real):
return domain_real_loss



class ContrastiveLoss(tf.keras.losses.Loss):
"""
Contrastive Loss for SimCLR Model
"""
def __init__(self, temperature=1.0, normalization=True, name="ContrastiveLoss", ps=None):
super().__init__(name=name)

self.temperature = temperature
self.norm = normalization
self.cross_entropy = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
self.ps = ps

def gpu_cross_replica_concat(self, tensor):
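# gather this replica's tensor from all Horovod workers: scatter the local tensor into
# the slot given by hvd.rank() in a [num_replicas, batch, dim] buffer, sum-allreduce so
# every worker holds every slot, then flatten back to [num_replicas * batch, dim]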
num_replicas = self.ps.size()
ext_tensor = tf.scatter_nd(
indices=[[hvd.rank()]],
updates=[tensor],
shape=[num_replicas, tf.shape(tensor)[0], tf.shape(tensor)[1]])

ext_tensor = hvd.allreduce(ext_tensor, average=False)
return tf.reshape(ext_tensor, [-1, tf.shape(ext_tensor)[2]])

84 changes: 52 additions & 32 deletions athena/models/fastspeech.py
@@ -249,6 +249,8 @@ def call(self, samples, training: bool = None):
_, input_mask = create_multihead_mask(None, None, samples['input'], reverse=True)
encoder_output = self.encoder(x0, input_mask, training=training) # [batch, x_steps, d_model]
teacher_outs, duration_indexes, duration_sequences = self.duration_calculator(samples)
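# when the duration calculator returns no teacher outputs (i.e. ground-truth durations
# are used), fall back to a zero tensor with the same shape as the target features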
if teacher_outs is None:
teacher_outs = tf.zeros_like(samples['output'])
pred_duration_sequences = self.duration_predictor(encoder_output, training=training)
output_length = samples['output_length']
before_outs, after_outs = self._feedforward_decoder(encoder_output, duration_indexes,
@@ -292,36 +294,55 @@ def inference(self, phoneme_sequences, duration_sequences, alpha=1.0):
duration_sequences = tf.cast(
tf.math.round(tf.cast(duration_sequences, dtype=tf.float32) * alpha),
dtype=tf.int32)
batch = tf.shape(phoneme_sequences)[0]
x_steps = tf.shape(phoneme_sequences)[1]
d_model = tf.shape(phoneme_sequences)[2]
batch = tf.shape(phoneme_sequences)[0]
# each phoneme has a duration of at least one frame
total_durations = tf.reduce_sum(duration_sequences, axis=1) # [batch]
max_duration = tf.reduce_max(total_durations)

def expand_phoneme(batch_i):
phoneme_sequence = phoneme_sequences[batch_i]
duration_sequence = duration_sequences[batch_i]
total_duration = total_durations[batch_i]
expanded_array = tf.TensorArray(tf.float32, size=0, dynamic_size=True)
expanded_array = expanded_array.write(0, tf.zeros([d_model]))
for step_i in tf.range(x_steps):
duration = duration_sequence[step_i]
if duration == 0:
continue
phoneme = phoneme_sequence[step_i]
expanded_phoneme = tf.expand_dims(phoneme, axis=0)
expanded_phoneme = tf.tile(expanded_phoneme, [duration, 1]) # [duration, d_model]
expanded_array = expanded_array.unstack(tf.concat([expanded_array.stack(),
expanded_phoneme], axis=0))
expanded_array = tf.concat([expanded_array.stack(),
tf.zeros([max_duration - total_duration, d_model])],
axis=0)[1:] # [max_duration, d_model]
return expanded_array

# max_duration_i: the maximum phoneme duration across the whole batch
max_duration_i = tf.reduce_max(duration_sequences)
background = tf.ones([batch, x_steps, max_duration_i], dtype=tf.int32) * -1

# durations represents frame size for each index, shape: [batch, x_steps, max_duration_i]
durations = tf.tile(duration_sequences[:, :, tf.newaxis], [1, 1, max_duration_i])
duration_order_array = tf.range(max_duration_i)
duration_order_array = tf.tile(duration_order_array[tf.newaxis, tf.newaxis, :],
[batch, x_steps, 1])
step_order_array = tf.range(x_steps)
step_order_array = tf.tile(step_order_array[tf.newaxis, :, tf.newaxis],
[batch, 1, max_duration_i])
# duration_indexes example: ignoring the batch dimension, for a duration
# sequence of [2, 3, 1] with max_duration_i = 4, duration_indexes looks like
# [[0, 0, -1, -1],
#  [1, 1, 1, -1],
#  [2, -1, -1, -1]]
duration_indexes = tf.where(duration_order_array < durations, step_order_array, background)
duration_indexes = tf.reshape(duration_indexes, [batch, -1]) # [batch, x_steps*max_duration_i]

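# mark the slots that hold a real phoneme index, count them per utterance, and take
# the longest count as the padded output length y_steps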
valid_durations = tf.cast((duration_indexes != -1), dtype=tf.int32)
total_duration_batches = tf.reduce_sum(valid_durations, axis=-1) # [batch]
y_steps = tf.reduce_max(total_duration_batches)

def validate_duration_sequences(batch_i):
duration_index_per_batch = duration_indexes[batch_i] # [x_step * max_duration_i]
# it is used to discard the index with the value of -1
valid_duration_index_per_batch = valid_durations[batch_i] # [x_step * max_duration_i]
valid_duration_index_per_batch = tf.cast(valid_duration_index_per_batch, dtype=tf.bool)
duration_index_per_batch = duration_index_per_batch[valid_duration_index_per_batch]
total_duration_per_batch = total_duration_batches[batch_i]
padding = tf.ones([y_steps - total_duration_per_batch], dtype=tf.int32) * -1
valid_duration_seqs = tf.concat([duration_index_per_batch, padding], axis=0) # [y_steps]
batch_padding = tf.ones([y_steps, 1], dtype=tf.int32) * batch_i # [y_steps]
valid_duration_seqs = tf.concat([batch_padding, valid_duration_seqs[:, tf.newaxis]],
axis=-1)
return valid_duration_seqs

batches = tf.range(batch)
expanded_phone_list = tf.map_fn(expand_phoneme, batches, dtype=tf.float32,
parallel_iterations=128)
# duration_indexes, shape: [batch, y_steps, 2]
batch_duration_indexes = tf.map_fn(validate_duration_sequences, batches, parallel_iterations=128)
# phoneme_sequences, shape: [batch, x_steps, d_model]
# expanded_phone_list, shape: [batch, y_steps, d_model]
expanded_phone_list = tf.gather_nd(phoneme_sequences, batch_duration_indexes)
return expanded_phone_list

def call(self, phoneme_sequences, duration_indexes, output_length):
@@ -375,14 +396,13 @@ def call(self, samples):
"""
y_steps = tf.reduce_max(samples['output_length'])
x_steps = tf.reduce_max(samples['input_length'])
batch = tf.shape(samples['input_length'])[0]
teacher_outs = None
if self.teacher_type is None:
duration_index = samples['duration']
weights_argmax = duration_index[:, :y_steps] # [batch, y_steps]
if tf.shape(weights_argmax)[1] < y_steps:
padding = tf.zeros([tf.shape(weights_argmax)[0], y_steps - tf.shape(weights_argmax)[1]],
dtype=tf.int32)
weights_argmax = tf.concat([weights_argmax, padding], axis=1)
weights_argmax = samples['duration']
if tf.shape(weights_argmax)[1] == 0:
# for initialization
weights_argmax = tf.ones([batch, y_steps], dtype=tf.int32)
elif self.teacher_type == 'tts_transformer':
teacher_outs, attn_weights = self._calculate_transformer_attentions(samples)
weights_argmax = tf.cast(tf.argmax(attn_weights, axis=-1), dtype=tf.int32)