Merge pull request #1 from athena-team/master
merge from athena
shuaijiang authored Jan 26, 2021
2 parents 90175ba + a2f51db commit 0f2a263
Showing 11 changed files with 494 additions and 58 deletions.
40 changes: 29 additions & 11 deletions README.md
@@ -1,6 +1,3 @@



# Athena

*Athena* is an open-source implementation of an end-to-end speech processing engine. Our vision is to empower both industrial applications and academic research on end-to-end models for speech processing. To make speech processing available to everyone, we also release example implementations and recipes on several open-source datasets for various tasks (Automatic Speech Recognition, Speech Synthesis, Voice Conversion, Speaker Recognition, etc.).
@@ -32,9 +29,12 @@ All of our models are implemented in Tensorflow>=2.0.1. For ease of use, we prov
- [5.1) WFST graph creation](#51-wfst-graph-creation)
- [5.2) WFST decoding](#52-wfst-decoding)
- [6) Deployment](#6-deployment)
- [7) Results](#7-results)
- [7.1) ASR](#71-asr)
- [8) Directory Structure](#8-directory-structure)
- [7) Self-supervised speech representation learning](#7-self-supervised-speech-representation-learning)
- [7.1) MPC](#71-mpc)
- [7.2) Speech SimCLR](#72-speech-simclr)
- [8) Results](#8-results)
- [8.1) ASR](#81-asr)
- [9) Directory Structure](#9-directory-structure)

## 2) Key Features

@@ -111,7 +111,7 @@ python athena/main.py examples/translate/spa-eng-example/transformer.json
```bash
source tools/env.sh
python examples/translate/spa-eng-example/prepare_data.py examples/translate/spa-eng-example/data/train.csv
horovodrun -np 4 -H localhost:4 athena/horovod_main.py examples/translate/spa-eng-example/transformer.json
horovodrun -np 4 -H localhost:4 python athena/horovod_main.py examples/translate/spa-eng-example/transformer.json
```

### Notes
@@ -320,21 +320,39 @@ $ ./asr
Detailed implementation is described [here](deploy/README.md).
## 7) Results
## 7) Self-supervised speech representation learning
### 7.1) MPC
Masked Predictive Coding (MPC) uses a masked reconstruction objective to perform predictive coding with Transformer-based models. It has achieved significant improvements on various speech recognition datasets. For more information, please refer to the following papers.
[Improving Transformer-based Speech Recognition Using Unsupervised Pre-training](https://arxiv.org/abs/1910.09932.pdf)
[A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition](https://arxiv.org/pdf/2005.09862.pdf)
MPC models can be trained by running ```python athena/main.py examples/asr/*/configs/mpc.json```. To use a pretrained MPC model in ASR training, set the "pretrained_model" field in the ASR JSON config to the checkpoint directory of the MPC model and proceed with training.
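As a rough illustration of the masked-reconstruction idea (not Athena's implementation; the function name, `chunk_size`, `mask_ratio` and the generic `encoder` callable are assumptions made for this sketch, and the actual training is driven entirely by the JSON configs above):

```python
import tensorflow as tf

def masked_reconstruction_loss(encoder, feats, chunk_size=7, mask_ratio=0.15):
    """MPC-style masked reconstruction objective (sketch).

    feats: [batch, time, dim] acoustic features; encoder maps that shape to itself.
    """
    batch = tf.shape(feats)[0]
    time = tf.shape(feats)[1]
    dim = tf.cast(tf.shape(feats)[2], tf.float32)
    # pick random chunk starts and expand each into a contiguous block of frames
    num_chunks = tf.maximum(
        tf.cast(tf.cast(time, tf.float32) * mask_ratio, tf.int32) // chunk_size, 1)
    starts = tf.random.uniform([batch, num_chunks], maxval=time, dtype=tf.int32)
    positions = tf.clip_by_value(
        starts[:, :, tf.newaxis] + tf.range(chunk_size)[tf.newaxis, tf.newaxis, :],
        0, time - 1)
    mask = tf.reduce_max(tf.one_hot(tf.reshape(positions, [batch, -1]), depth=time), axis=1)
    mask = mask[:, :, tf.newaxis]                 # [batch, time, 1], 1.0 at masked frames
    predictions = encoder(feats * (1.0 - mask))   # the encoder only sees the unmasked frames
    # L1 reconstruction loss, counted on the masked frames only
    return tf.reduce_sum(tf.abs(predictions - feats) * mask) / (tf.reduce_sum(mask) * dim)
```

Only the masked frames contribute to the loss here, which is what makes the objective predictive rather than plain autoencoding.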
### 7.2) Speech SimCLR
Speech SimCLR is a new self-supervised objective for speech representation learning. During training, Speech SimCLR applies augmentation to the raw speech and to its spectrogram. Its objective combines a contrastive loss, which maximizes agreement between differently augmented samples in the latent space, with a reconstruction loss on the input representation. For more information, please refer to the following paper.
[Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning](https://arxiv.org/abs/2010.13991.pdf)
For now, pre-training with Speech SimCLR is only supported for LibriSpeech. You can run it with ```python athena/main.py examples/asr/librispeech/configs/speech_simclr.json```. For feature extraction, run ```python athena/inference.py examples/asr/librispeech/configs/speech_simclr.json```. The pre-trained Speech SimCLR models can be found [here](https://drive.google.com/file/d/1YYFmtB1RHRuw8s7lPWLxjihye9ssI5ax/view?usp=sharing).
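For intuition, the contrastive term is a standard NT-Xent loss over two augmented views of the same batch of utterances. The sketch below is illustrative only (the function name and default temperature are assumptions), and it omits the reconstruction term and the Horovod cross-replica concat that the `ContrastiveLoss` added in `athena/loss.py` uses for multi-GPU training:

```python
import tensorflow as tf

def nt_xent_loss(hidden_a, hidden_b, temperature=0.5):
    """NT-Xent contrastive loss over two augmented views.

    hidden_a, hidden_b: [batch, dim] projections of the two augmentations.
    """
    hidden_a = tf.math.l2_normalize(hidden_a, axis=-1)
    hidden_b = tf.math.l2_normalize(hidden_b, axis=-1)
    batch = tf.shape(hidden_a)[0]
    labels = tf.one_hot(tf.range(batch), batch * 2)   # positive = counterpart in the other view
    masks = tf.one_hot(tf.range(batch), batch)        # used to exclude self-similarity
    logits_aa = tf.matmul(hidden_a, hidden_a, transpose_b=True) / temperature - masks * 1e9
    logits_bb = tf.matmul(hidden_b, hidden_b, transpose_b=True) / temperature - masks * 1e9
    logits_ab = tf.matmul(hidden_a, hidden_b, transpose_b=True) / temperature
    logits_ba = tf.matmul(hidden_b, hidden_a, transpose_b=True) / temperature
    loss_a = tf.nn.softmax_cross_entropy_with_logits(
        labels, tf.concat([logits_ab, logits_aa], axis=1))
    loss_b = tf.nn.softmax_cross_entropy_with_logits(
        labels, tf.concat([logits_ba, logits_bb], axis=1))
    return tf.reduce_mean(loss_a + loss_b)
```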
## 8) Results
### 7.1) ASR
### 8.1) ASR
Language | Model Name | Training Data | Hours of Speech | Error Rate
:-----------: | :------------: | :----------: | -------: | -------:
English | Transformer | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h | 3.1%(WER)
English | Transformer | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h | 3.1% (WER)
English | Transformer | [Switchboard Dataset](https://catalog.ldc.upenn.edu/LDC97S62) | 260h | 8.6% (WER) |
English | Transformer | [TIMIT Dataset](https://catalog.ldc.upenn.edu/LDC93S1) | 3 h | 16.8% (PER) |
Mandarin | Transformer | HKUST Dataset | 151 h | 22.75% (CER)
Mandarin | Transformer | [AISHELL Dataset](http://www.openslr.org/33/) | 178 h | 6.6% (CER)
To compare with other published results, see [wer_are_we.md](docs/tutorials/wer_are_we.md).
## 8) Directory Structure
## 9) Directory Structure
Below is the basic directory structure for Athena
4 changes: 4 additions & 0 deletions athena/data/datasets/speech_synthesis.py
@@ -115,6 +115,10 @@ def __getitem__(self, index):
duration_index = []
for index, duration in enumerate(durations):
duration_index.extend(list([index]) * int(duration))
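# clip the frame-level index map to the number of acoustic frames and, if it is
# shorter, pad it by repeating the last phoneme index so every frame keeps an alignment entry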
duration_index = duration_index[: audio_feat_length]
if 0 < len(duration_index) < audio_feat_length:
expanded_index = list([duration_index[-1]]) * int(audio_feat_length - len(duration_index))
duration_index.extend(expanded_index)
return {
"input": text,
"input_length": text_length,
18 changes: 8 additions & 10 deletions athena/data/text_featurizer.py
@@ -20,8 +20,8 @@
import re
import warnings
from collections import defaultdict
import sentencepiece as spm
import tensorflow as tf
import tensorflow_text as text
from ..utils.hparam import register_and_parse_hparams


@@ -119,33 +119,31 @@ def encode(self, sentence):
return [self.stoi[token] for token in sentence.strip().split(' ')]

class SentencePieceFeaturizer:
"""SentencePieceFeaturizer
"""SentencePieceFeaturizer using tensorflow-text api
"""

def __init__(self, spm_file):
self.unk_index = 0
self.sp = spm.SentencePieceProcessor()
if spm_file is not None:
self.sp.Load(spm_file)
self.model = open(spm_file, "rb").read()
self.sp = text.SentencepieceTokenizer(model=self.model)

def load_model(self, model_file):
"""load sentence piece model
"""
self.sp.Load(model_file)
self.sp = text.SentencepieceTokenizer(model=open(model_file, "rb").read())

def __len__(self):
return self.sp.GetPieceSize()
return self.sp.vocab_size()

def encode(self, sentence):
"""convert a sentence to a list of ids by sentence piece model
"""
sentence = sentence.upper()
return self.sp.EncodeAsIds(sentence)
return self.sp.tokenize(sentence)

def decode(self, ids):
"""convert a list of ids to a sentence
"""
return self.sp.DecodeIds(ids)
return self.sp.detokenize(ids)

class TextTokenizer:
"""TextTokenizer
23 changes: 23 additions & 0 deletions athena/loss.py
@@ -668,3 +668,26 @@ def ClassifyLoss(target_label_reshaped, domain_out_real):
return domain_real_loss



class ContrastiveLoss(tf.keras.losses.Loss):
"""
Contrastive Loss for SimCLR Model
"""
def __init__(self, temperature=1.0, normalization=True, name="ContrastiveLoss", ps=None):
super().__init__(name=name)

self.temperature = temperature
self.norm = normalization
self.cross_entropy = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
self.ps = ps

def gpu_cross_replica_concat(self, tensor):
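# gather this replica's tensor from all Horovod workers: scatter the local tensor into
# the slot given by hvd.rank() in a [num_replicas, batch, dim] buffer, sum-allreduce so
# every worker holds every slot, then flatten back to [num_replicas * batch, dim]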
num_replicas = self.ps.size()
ext_tensor = tf.scatter_nd(
indices=[[hvd.rank()]],
updates=[tensor],
shape=[num_replicas, tf.shape(tensor)[0], tf.shape(tensor)[1]])

ext_tensor = hvd.allreduce(ext_tensor, average=False)
return tf.reshape(ext_tensor, [-1, tf.shape(ext_tensor)[2]])

84 changes: 52 additions & 32 deletions athena/models/fastspeech.py
@@ -249,6 +249,8 @@ def call(self, samples, training: bool = None):
_, input_mask = create_multihead_mask(None, None, samples['input'], reverse=True)
encoder_output = self.encoder(x0, input_mask, training=training) # [batch, x_steps, d_model]
teacher_outs, duration_indexes, duration_sequences = self.duration_calculator(samples)
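# when the duration calculator returns no teacher outputs (i.e. ground-truth durations
# are used), fall back to a zero tensor with the same shape as the target features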
if teacher_outs is None:
teacher_outs = tf.zeros_like(samples['output'])
pred_duration_sequences = self.duration_predictor(encoder_output, training=training)
output_length = samples['output_length']
before_outs, after_outs = self._feedforward_decoder(encoder_output, duration_indexes,
@@ -292,36 +294,55 @@ def inference(self, phoneme_sequences, duration_sequences, alpha=1.0):
duration_sequences = tf.cast(
tf.math.round(tf.cast(duration_sequences, dtype=tf.float32) * alpha),
dtype=tf.int32)
batch = tf.shape(phoneme_sequences)[0]
x_steps = tf.shape(phoneme_sequences)[1]
d_model = tf.shape(phoneme_sequences)[2]
batch = tf.shape(phoneme_sequences)[0]
# each phoneme has a duration of at least one frame
total_durations = tf.reduce_sum(duration_sequences, axis=1) # [batch]
max_duration = tf.reduce_max(total_durations)

def expand_phoneme(batch_i):
phoneme_sequence = phoneme_sequences[batch_i]
duration_sequence = duration_sequences[batch_i]
total_duration = total_durations[batch_i]
expanded_array = tf.TensorArray(tf.float32, size=0, dynamic_size=True)
expanded_array = expanded_array.write(0, tf.zeros([d_model]))
for step_i in tf.range(x_steps):
duration = duration_sequence[step_i]
if duration == 0:
continue
phoneme = phoneme_sequence[step_i]
expanded_phoneme = tf.expand_dims(phoneme, axis=0)
expanded_phoneme = tf.tile(expanded_phoneme, [duration, 1]) # [duration, d_model]
expanded_array = expanded_array.unstack(tf.concat([expanded_array.stack(),
expanded_phoneme], axis=0))
expanded_array = tf.concat([expanded_array.stack(),
tf.zeros([max_duration - total_duration, d_model])],
axis=0)[1:] # [max_duration, d_model]
return expanded_array

# max_duration_i: the maximum phoneme duration across the whole batch
max_duration_i = tf.reduce_max(duration_sequences)
background = tf.ones([batch, x_steps, max_duration_i], dtype=tf.int32) * -1

# durations represents frame size for each index, shape: [batch, x_steps, max_duration_i]
durations = tf.tile(duration_sequences[:, :, tf.newaxis], [1, 1, max_duration_i])
duration_order_array = tf.range(max_duration_i)
duration_order_array = tf.tile(duration_order_array[tf.newaxis, tf.newaxis, :],
[batch, x_steps, 1])
step_order_array = tf.range(x_steps)
step_order_array = tf.tile(step_order_array[tf.newaxis, :, tf.newaxis],
[batch, 1, max_duration_i])
# duration_indexes example: ignoring the batch dimension, for a duration
# sequence of [2, 3, 1] with max_duration_i = 4, duration_indexes looks like
# [[0, 0, -1, -1],
#  [1, 1, 1, -1],
#  [2, -1, -1, -1]]
duration_indexes = tf.where(duration_order_array < durations, step_order_array, background)
duration_indexes = tf.reshape(duration_indexes, [batch, -1]) # [batch, x_steps*max_duration_i]

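# mark the slots that hold a real phoneme index, count them per utterance, and take
# the longest count as the padded output length y_steps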
valid_durations = tf.cast((duration_indexes != -1), dtype=tf.int32)
total_duration_batches = tf.reduce_sum(valid_durations, axis=-1) # [batch]
y_steps = tf.reduce_max(total_duration_batches)

def validate_duration_sequences(batch_i):
duration_index_per_batch = duration_indexes[batch_i] # [x_step * max_duration_i]
# it is used to discard the index with the value of -1
valid_duration_index_per_batch = valid_durations[batch_i] # [x_step * max_duration_i]
valid_duration_index_per_batch = tf.cast(valid_duration_index_per_batch, dtype=tf.bool)
duration_index_per_batch = duration_index_per_batch[valid_duration_index_per_batch]
total_duration_per_batch = total_duration_batches[batch_i]
padding = tf.ones([y_steps - total_duration_per_batch], dtype=tf.int32) * -1
valid_duration_seqs = tf.concat([duration_index_per_batch, padding], axis=0) # [y_steps]
batch_padding = tf.ones([y_steps, 1], dtype=tf.int32) * batch_i # [y_steps]
valid_duration_seqs = tf.concat([batch_padding, valid_duration_seqs[:, tf.newaxis]],
axis=-1)
return valid_duration_seqs

batches = tf.range(batch)
expanded_phone_list = tf.map_fn(expand_phoneme, batches, dtype=tf.float32,
parallel_iterations=128)
# duration_indexes, shape: [batch, y_steps, 2]
batch_duration_indexes = tf.map_fn(validate_duration_sequences, batches, parallel_iterations=128)
# phoneme_sequences, shape: [batch, x_steps, d_model]
# expanded_phone_list, shape: [batch, y_steps, d_model]
expanded_phone_list = tf.gather_nd(phoneme_sequences, batch_duration_indexes)
return expanded_phone_list

def call(self, phoneme_sequences, duration_indexes, output_length):
@@ -375,14 +396,13 @@ def call(self, samples):
"""
y_steps = tf.reduce_max(samples['output_length'])
x_steps = tf.reduce_max(samples['input_length'])
batch = tf.shape(samples['input_length'])[0]
teacher_outs = None
if self.teacher_type is None:
duration_index = samples['duration']
weights_argmax = duration_index[:, :y_steps] # [batch, y_steps]
if tf.shape(weights_argmax)[1] < y_steps:
padding = tf.zeros([tf.shape(weights_argmax)[0], y_steps - tf.shape(weights_argmax)[1]],
dtype=tf.int32)
weights_argmax = tf.concat([weights_argmax, padding], axis=1)
weights_argmax = samples['duration']
if tf.shape(weights_argmax)[1] == 0:
# for initialization
weights_argmax = tf.ones([batch, y_steps], dtype=tf.int32)
elif self.teacher_type == 'tts_transformer':
teacher_outs, attn_weights = self._calculate_transformer_attentions(samples)
weights_argmax = tf.cast(tf.argmax(attn_weights, axis=-1), dtype=tf.int32)