Models, code, and papers for "Shinji Watanabe":

Vectorization of hypotheses and speech for faster beam search in encoder decoder-based speech recognition

Nov 12, 2018
Hiroshi Seki, Takaaki Hori, Shinji Watanabe

An attention-based encoder-decoder network uses a left-to-right beam search algorithm in the inference step. The current beam search expands hypotheses and traverses the expanded hypotheses at the next time step. This traversal is generally implemented with a for-loop, which slows down the recognition process. In this paper, we propose a parallelization technique for beam search that accelerates the search by vectorizing multiple hypotheses, eliminating the for-loop. We also propose a technique to batch multiple speech utterances for offline recognition, which removes the for-loop over utterances. Unlike during training, this extension is not trivial during beam search because of the pruning and thresholding techniques used for efficient decoding. In addition, our method can combine the scores of external modules, an RNNLM and CTC, in a batch as shallow fusion. We achieved a 3.7x speedup over the original beam search algorithm by vectorizing hypotheses, and a 10.5x speedup by further moving the processing to a GPU.
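
The core vectorization step can be illustrated with a short, hypothetical sketch (not the paper's ESPnet implementation): all beam hypotheses are expanded in one batched top-k over a (beam x vocab) score matrix instead of a Python for-loop over hypotheses.

```python
# Illustrative sketch, not the paper's ESPnet code: one vectorized beam-search
# expansion step over all hypotheses at once.
import torch

def beam_step(cum_scores, next_log_probs, beam_size):
    """cum_scores: (beam,) cumulative log-probs of current hypotheses.
    next_log_probs: (beam, vocab) next-token log-probs from the decoder.
    Returns new cumulative scores, source-hypothesis indices, and token ids."""
    vocab = next_log_probs.size(1)
    # Broadcast-add and flatten so one top-k covers every (hypothesis, token) pair.
    cand = (cum_scores.unsqueeze(1) + next_log_probs).view(-1)   # (beam * vocab,)
    new_scores, flat_idx = torch.topk(cand, beam_size)
    return new_scores, flat_idx // vocab, flat_idx % vocab       # src hyp, token

# Toy usage: a beam of 3 hypotheses over a 5-token vocabulary.
cum_scores = torch.log(torch.tensor([0.5, 0.3, 0.2]))
next_log_probs = torch.log_softmax(torch.randn(3, 5), dim=-1)
print(beam_step(cum_scores, next_log_probs, beam_size=3))
```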


Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling

Nov 10, 2018
Hainan Xu, Shuoyang Ding, Shinji Watanabe

In recent years, end-to-end models have become popular for application in automatic speech recognition. Compared to hybrid approaches, which perform the phone-sequence to word conversion based on a lexicon, an end-to-end system models text directly, usually as a sequence of characters or sub-word features. We propose a sub-word modeling method that leverages the pronunciation information of a word. Experiments show that the proposed method can greatly improve upon the character-based baseline, and also outperform commonly used byte-pair encoding methods.


End-to-end Speech Recognition with Word-based RNN Language Models

Aug 08, 2018
Takaaki Hori, Jaejin Cho, Shinji Watanabe

This paper investigates the impact of word-based RNN language models (RNN-LMs) on the performance of end-to-end automatic speech recognition (ASR). In our prior work, we proposed a multi-level LM, in which character-based and word-based RNN-LMs are combined in hybrid CTC/attention-based ASR. Although this multi-level approach achieves significant error reduction on the Wall Street Journal (WSJ) task, two different LMs need to be trained and used for decoding, which increases the computational cost and memory usage. In this paper, we further propose a novel word-based RNN-LM that allows us to decode with only the word-based LM: it provides look-ahead word probabilities to predict the next characters instead of relying on a character-based LM, yielding competitive accuracy with less computation than the multi-level LM. We demonstrate the efficacy of the word-based RNN-LMs on a larger corpus, LibriSpeech, in addition to the WSJ corpus used in our prior work. Furthermore, we show that the proposed model achieves 5.1% WER on the WSJ Eval'92 test set when the vocabulary size is increased, which is the best WER reported for end-to-end ASR systems on this benchmark.
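
The look-ahead idea can be illustrated with a toy sketch: a fixed unigram table stands in for the word-based RNN-LM (an assumption for brevity), and the probability of the next character within a word is derived from the probability mass of vocabulary words consistent with the current prefix.

```python
# Toy illustration of look-ahead word probabilities; a fixed unigram table
# stands in for the word-based RNN-LM, and the function names are hypothetical.
word_probs = {"speech": 0.4, "speed": 0.3, "spell": 0.2, "cat": 0.1}

def char_lookahead(prefix, next_char):
    """P(next_char | within-word prefix), derived from word-level probabilities:
    mass of words consistent with prefix+next_char over mass consistent with prefix."""
    mass = lambda p: sum(v for w, v in word_probs.items() if w.startswith(p))
    denom = mass(prefix)
    return mass(prefix + next_char) / denom if denom > 0 else 0.0

print(char_lookahead("spe", "e"))  # {"speech", "speed"} vs. {"speech", "speed", "spell"}
print(char_lookahead("spe", "l"))  # {"spell"} vs. {"speech", "speed", "spell"}
```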


Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning

Jan 31, 2017
Suyoun Kim, Takaaki Hori, Shinji Watanabe

Recently, there has been increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework, which learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to improve performance over another end-to-end approach, Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target characters without any conditional independence assumptions. However, we observed that the attention model performs poorly in noisy conditions and is hard to train in the initial stage with long input sequences, because it is too flexible to predict proper alignments in such cases without the left-to-right constraints used in CTC. This paper presents a novel method for end-to-end speech recognition that improves robustness and achieves fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue. Experiments on the WSJ and CHiME-4 tasks demonstrate its advantages over both the CTC and attention-based encoder-decoder baselines, showing 5.4-14.6% relative improvements in Character Error Rate (CER).
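
A minimal sketch of the multi-task objective, L = lambda * L_CTC + (1 - lambda) * L_att, assuming a shared encoder that feeds both a CTC head and an attention decoder (tensor shapes and the weight value below are illustrative):

```python
# Minimal sketch of joint CTC-attention multi-task training; shapes and the
# interpolation weight are illustrative, not the paper's configuration.
import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, enc_lens, targets, target_lens,
                             att_logits, mtl_weight=0.2):
    """ctc_log_probs: (T, B, V) from the shared encoder; att_logits: (B, L, V)
    from the attention decoder; targets: (B, L) label ids (blank id = 0)."""
    loss_ctc = F.ctc_loss(ctc_log_probs, targets, enc_lens, target_lens)
    loss_att = F.cross_entropy(att_logits.transpose(1, 2), targets)
    # Interpolate the two objectives: L = lambda * L_ctc + (1 - lambda) * L_att.
    return mtl_weight * loss_ctc + (1.0 - mtl_weight) * loss_att

# Toy usage with random tensors.
T, B, V, L = 50, 2, 30, 10
targets = torch.randint(1, V, (B, L))
loss = joint_ctc_attention_loss(
    torch.randn(T, B, V).log_softmax(-1),
    torch.full((B,), T, dtype=torch.long), targets,
    torch.full((B,), L, dtype=torch.long),
    torch.randn(B, L, V))
print(loss)
```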


Multi-Head Decoder for End-to-End Speech Recognition

Jul 28, 2018
Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda

This paper presents a new network architecture, called a multi-head decoder, for end-to-end speech recognition as an extension of the multi-head attention model. In the multi-head attention model, multiple attentions are calculated and then integrated into a single attention. In contrast, instead of integrating at the attention level, our proposed method uses a separate decoder for each attention head and integrates their outputs to generate the final output. Furthermore, to make each head capture different modalities, a different attention function is used for each head, improving recognition performance through an ensemble effect. To evaluate the effectiveness of our proposed method, we conduct an experimental evaluation using the Corpus of Spontaneous Japanese. Experimental results demonstrate that our proposed method outperforms conventional methods such as location-based and multi-head attention models, and that it can capture different speech/linguistic contexts within the attention-based encoder-decoder framework.
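
A simplified sketch of the layout (class, sizes, and the use of identical attention types per head are assumptions; the paper combines different attention functions per head): each head owns its attention and decoder cell, and the per-head logits are integrated, here by averaging.

```python
# Simplified sketch of a multi-head decoder: each head has its own attention
# and its own decoder cell, and the per-head logits are averaged. Identical
# single-head attentions are used here only to keep the example short.
import torch
import torch.nn as nn

class MultiHeadDecoderSketch(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=256, vocab=100, heads=4):
        super().__init__()
        self.attn = nn.ModuleList(nn.MultiheadAttention(enc_dim, 1, batch_first=True)
                                  for _ in range(heads))
        self.dec = nn.ModuleList(nn.LSTMCell(enc_dim, dec_dim) for _ in range(heads))
        self.out = nn.ModuleList(nn.Linear(dec_dim, vocab) for _ in range(heads))

    def step(self, enc, states):
        """enc: (B, T, enc_dim); states: list of per-head (h, c) tuples."""
        logits, new_states = [], []
        for attn, dec, out, (h, c) in zip(self.attn, self.dec, self.out, states):
            ctx, _ = attn(h.unsqueeze(1), enc, enc)   # per-head attention context
            h, c = dec(ctx.squeeze(1), (h, c))        # per-head decoder update
            logits.append(out(h))
            new_states.append((h, c))
        return torch.stack(logits).mean(0), new_states  # integrate head outputs

model = MultiHeadDecoderSketch()
enc = torch.randn(2, 30, 256)
states = [(torch.zeros(2, 256), torch.zeros(2, 256)) for _ in range(4)]
logits, states = model.step(enc, states)
print(logits.shape)  # torch.Size([2, 100])
```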


Non-Autoregressive Transformer Automatic Speech Recognition

Nov 10, 2019
Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak

Recently, very deep Transformers have begun to outperform traditional bidirectional long short-term memory networks by a large margin. However, for production use, inference cost and latency remain serious concerns in real scenarios. In this paper, we study a non-autoregressive Transformer structure for speech recognition, originally introduced in machine translation. During training, input tokens fed to the decoder are randomly replaced by a special mask token, and the network is required to predict those masked tokens by taking both the context and the input speech into consideration. During inference, we start from all mask tokens and the network gradually predicts all tokens based on partial results. We show that this framework can support different decoding strategies, including traditional left-to-right decoding. As an example, we propose a new decoding strategy that starts from the easiest predictions and moves to more difficult ones. Preliminary results on the AISHELL and CSJ benchmarks show that it is possible to train such a non-autoregressive network for ASR. On AISHELL in particular, the proposed method outperforms the Kaldi nnet3 and chain-model setups and comes close to the performance of the state-of-the-art end-to-end model.
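
A hedged sketch of the inference procedure described above (the function, model interface, and round schedule are hypothetical): start from an all-mask target and, over a few rounds, commit the most confident predictions first.

```python
# Hedged sketch of easiest-first mask-predict decoding; `model` is a stand-in
# for a decoder returning per-position log-probs given speech and partial tokens.
import torch

def easy_first_decode(model, speech, length, mask_id, n_rounds=4):
    """Fill an all-<mask> target over several rounds, committing the most
    confident (easiest) predictions first."""
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    per_round = max(1, length // n_rounds)
    for _ in range(n_rounds):
        masked = (tokens == mask_id).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break
        log_probs = model(speech, tokens)          # (length, vocab)
        log_probs[:, mask_id] = float("-inf")      # never predict <mask> itself
        conf, pred = log_probs[masked].max(dim=-1)
        keep = conf.topk(min(per_round, masked.numel())).indices
        tokens[masked[keep]] = pred[keep]          # commit the easiest positions
    return tokens

# Toy usage with a dummy "model" returning random log-probs over a 10-token vocab.
dummy = lambda speech, tokens: torch.randn(tokens.numel(), 10).log_softmax(-1)
print(easy_first_decode(dummy, speech=None, length=8, mask_id=9))
```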


Multilingual End-to-End Speech Translation

Oct 31, 2019
Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

In this paper, we propose a simple yet effective framework for multilingual end-to-end speech translation (ST), in which speech utterances in source languages are directly translated to the desired target languages with a universal sequence-to-sequence architecture. While multilingual models have been shown to be useful for automatic speech recognition (ASR) and machine translation (MT), this is the first time they are applied to the end-to-end ST problem. We show the effectiveness of multilingual end-to-end ST in two scenarios: one-to-many and many-to-many translation with publicly available data. We experimentally confirm that multilingual end-to-end ST models significantly outperform bilingual ones in both scenarios. The generalization of multilingual training is also evaluated in a transfer learning scenario involving a very low-resource language pair. All of our code and the database are publicly available to encourage further research in this emergent multilingual ST topic.

* Accepted to ASRU 2019 

Towards Online End-to-end Transformer Automatic Speech Recognition

Oct 25, 2019
Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in that the entire input sequence is required to compute self-attention. We previously proposed a block processing method for the Transformer encoder by introducing a context-aware inheritance mechanism, in which an additional context embedding vector handed over from the previously processed block helps to encode not only local acoustic information but also global linguistic, channel, and speaker attributes. In this paper, we extend it towards an entirely online E2E ASR system by introducing an online decoding process inspired by monotonic chunkwise attention (MoChA) into the Transformer decoder. Our novel MoChA training and inference algorithms exploit the unique properties of the Transformer, whose attention is not always monotonic or peaky and which has multiple heads and residual connections across decoder layers. Evaluations on the Wall Street Journal (WSJ) and AISHELL-1 datasets show that our proposed online Transformer decoder outperforms conventional chunkwise approaches.
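
A very simplified sketch of MoChA-style inference (hard monotonic selection of a stopping frame, then soft attention over a small chunk ending there); the differentiable expected form used during training and the Transformer-specific adaptations are not shown, and the names and threshold are illustrative.

```python
# Simplified MoChA-style inference step: scan monotonically for a stopping
# frame, then attend softly over a small chunk ending at that frame.
import torch

def mocha_step(query, keys, values, start, chunk=4, threshold=0.5):
    """query: (D,); keys, values: (T, D). Returns (context, selected frame)."""
    T = keys.size(0)
    for t in range(start, T):
        p_select = torch.sigmoid(keys[t] @ query)        # monotonic stop probability
        if p_select >= threshold:
            lo = max(0, t - chunk + 1)
            w = torch.softmax(keys[lo:t + 1] @ query, dim=0)   # chunkwise attention
            return w @ values[lo:t + 1], t
    return values[-1], T - 1                             # fall back to the last frame

keys, values = torch.randn(20, 64), torch.randn(20, 64)
ctx, pos = mocha_step(torch.randn(64), keys, values, start=0)
print(ctx.shape, pos)  # the next decoder step would resume scanning from `pos`
```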

* arXiv admin note: text overlap with arXiv:1910.07204 

Transformer ASR with Contextual Block Processing

Oct 16, 2019
Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks (RNNs) in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in that the entire input sequence is required to compute self-attention. In this paper, we propose a new block processing method for the Transformer encoder by introducing a context-aware inheritance mechanism. An additional context embedding vector handed over from the previously processed block helps to encode not only local acoustic information but also global linguistic, channel, and speaker attributes. We introduce a novel mask technique to implement the context inheritance so that the model can be trained efficiently. Evaluations on the Wall Street Journal (WSJ), LibriSpeech, VoxForge Italian, and AISHELL-1 Mandarin speech recognition datasets show that our proposed contextual block processing method consistently outperforms naive block processing. Furthermore, the attention weight tendencies of each layer are analyzed to clarify how the added contextual inheritance mechanism models global information.
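
An illustrative sketch of block processing with a context-inheritance vector (the class name, block size, and single-layer setup are assumptions, and the paper's mask-based training trick is not reproduced): each block is encoded together with a context embedding handed over from the previous block, so the encoder never needs the full utterance at once.

```python
# Illustrative sketch of contextual block processing with an inherited context
# embedding; names and sizes are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class BlockEncoderSketch(nn.Module):
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.ctx0 = nn.Parameter(torch.zeros(1, 1, d_model))  # initial context

    def forward(self, x, block=16):
        """x: (B, T, d_model) -> (B, T, d_model), processed block by block."""
        ctx = self.ctx0.expand(x.size(0), -1, -1)
        outs = []
        for t in range(0, x.size(1), block):
            blk = torch.cat([ctx, x[:, t:t + block]], dim=1)  # prepend the context
            enc = self.layer(blk)
            ctx = enc[:, :1]                                  # inherited context
            outs.append(enc[:, 1:])
        return torch.cat(outs, dim=1)

enc = BlockEncoderSketch()
print(enc(torch.randn(2, 50, 256)).shape)  # torch.Size([2, 50, 256])
```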

* Accepted for ASRU 2019 

Massively Multilingual Adversarial Speech Recognition

Apr 03, 2019
Oliver Adams, Matthew Wiesner, Shinji Watanabe, David Yarowsky

We report on adaptation of multilingual end-to-end speech recognition models trained on as many as 100 languages. Our findings shed light on the relative importance of similarity between the target and pretraining languages along the dimensions of phonetics, phonology, language family, geographical location, and orthography. In this context, experiments demonstrate the effectiveness of two additional pretraining objectives in encouraging language-independent encoder representations: a context-independent phoneme objective paired with a language-adversarial classification objective.
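
A minimal sketch of the language-adversarial part of such a setup, using a gradient-reversal layer (a common realization; the exact formulation here is an assumption): the language classifier is trained normally, while the reversed gradient pushes the shared encoder toward language-independent representations. The context-independent phoneme objective would simply be another classification head on the same encoder.

```python
# Minimal sketch of a language-adversarial objective via gradient reversal.
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return -grad                      # flip the gradient sign for the encoder

def adversarial_language_loss(encoder_out, lang_labels, lang_classifier):
    """encoder_out: (B, D) pooled encoder states; lang_labels: (B,) language ids."""
    logits = lang_classifier(GradReverse.apply(encoder_out))
    return F.cross_entropy(logits, lang_labels)

# Toy usage: 8 utterances, 100 pretraining languages.
clf = torch.nn.Linear(256, 100)
enc_out = torch.randn(8, 256, requires_grad=True)
loss = adversarial_language_loss(enc_out, torch.randint(0, 100, (8,)), clf)
loss.backward()
print(loss)
```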

* Accepted at NAACL-HLT 2019 

End-to-End Monaural Multi-speaker ASR System without Pretraining

Nov 05, 2018
Xuankai Chang, Yanmin Qian, Kai Yu, Shinji Watanabe

Recently, end-to-end models have become a popular alternative to traditional hybrid models in automatic speech recognition (ASR). Multi-speaker speech separation and recognition is a central task in the cocktail party problem. In this paper, we present a state-of-the-art monaural multi-speaker end-to-end automatic speech recognition model. In contrast to previous studies on monaural multi-speaker speech recognition, this end-to-end framework is trained to recognize multiple label sequences completely from scratch. The system only requires the speech mixture and the corresponding label sequences, without needing any additional supervision derived from non-mixture speech or its labels/alignments. Moreover, we use an individual attention module for each separated speaker and scheduled sampling to further improve performance. Finally, we evaluate the proposed model on 2-speaker mixed speech generated from the WSJ corpus and on the wsj0-2mix dataset, a speech separation and recognition benchmark. The experiments demonstrate that the proposed methods improve the ability of the end-to-end model to separate overlapping speech and recognize the separated streams, yielding approximately 10.0% relative performance gains in terms of both CER and WER.
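
Training such a model from scratch typically relies on a permutation-free (permutation-invariant) criterion over the output branches; the sketch below illustrates that idea under assumed names, keeping the cheapest assignment of output branches to reference transcripts.

```python
# Sketch of a permutation-free training criterion over output branches; the
# names and the pairwise-loss matrix below are illustrative.
import itertools
import torch

def permutation_free_loss(pairwise_losses):
    """pairwise_losses: (S, S) tensor, entry [i, j] = loss of output branch i
    scored against reference j. Returns the minimum total loss over permutations."""
    S = pairwise_losses.size(0)
    totals = [sum(pairwise_losses[i, p[i]] for i in range(S))
              for p in itertools.permutations(range(S))]
    return torch.stack(totals).min()

# Toy 2-speaker case: branch 0 matches reference 0, branch 1 matches reference 1.
pairwise = torch.tensor([[1.0, 5.0],
                         [6.0, 2.0]])
print(permutation_free_loss(pairwise))  # tensor(3.)
```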

* submitted to ICASSP2019 

Weakly Supervised Deep Recurrent Neural Networks for Basic Dance Step Generation

Jul 03, 2018
Nelson Yalta, Shinji Watanabe, Kazuhiro Nakadai, Tetsuya Ogata

A deep recurrent neural network with audio input is applied to model basic dance steps. The proposed model employs multilayered Long Short-Term Memory (LSTM) layers and convolutional layers to process the audio power spectrum; another deep LSTM layer then decodes the target dance sequence. This end-to-end approach uses an auto-conditioned decoding configuration that reduces the accumulation of feedback error. Experimental results demonstrate that, after training on a small dataset, the model generates basic dance steps with low cross-entropy and maintains a motion beat F-measure score similar to that of a baseline dancer. In addition, we investigate the use of a contrastive cost function for music-motion regulation, which targets motion direction and maps similarities between music frames. Experimental results demonstrate that this cost function improves the motion beat F-measure score.
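
A hedged sketch of an auto-conditioned decoding loop (the schedule lengths, dimensions, and helper names are assumptions, not the paper's settings): during training the decoder periodically consumes its own previous output instead of the ground-truth frame, so it learns to recover from its own feedback errors.

```python
# Hedged sketch of auto-conditioned decoding for sequence generation.
import torch
import torch.nn as nn

def auto_conditioned_rollout(cell, proj, ground_truth, gt_len=5, cond_len=5):
    """cell: nn.LSTMCell; proj: hidden -> frame; ground_truth: (T, D) motion frames."""
    h = torch.zeros(1, cell.hidden_size)
    c = torch.zeros(1, cell.hidden_size)
    prev, outputs = ground_truth[0:1], []
    for t in range(ground_truth.size(0)):
        # Alternate: `gt_len` steps fed with ground truth, then `cond_len` steps
        # fed with the model's own previous output.
        use_gt = (t % (gt_len + cond_len)) < gt_len
        inp = ground_truth[t:t + 1] if use_gt else prev
        h, c = cell(inp, (h, c))
        prev = proj(h)
        outputs.append(prev)
    return torch.cat(outputs, dim=0)      # (T, D) predicted frames

D = 63  # e.g. flattened joint coordinates
cell, proj = nn.LSTMCell(D, 128), nn.Linear(128, D)
print(auto_conditioned_rollout(cell, proj, torch.randn(40, D)).shape)
```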

* 5 pages, 5 figures 

Multi-Modal Data Augmentation for End-to-End ASR

Jun 18, 2018
Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner, Shinji Watanabe

We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using symbolic input in addition to the traditional acoustic input. This architecture utilizes two separate encoders: one for acoustic input and another for symbolic input, both sharing the attention and decoder parameters. We call this architecture a multi-modal data augmentation network (MMDA), as it can support multi-modal (acoustic and symbolic) input and enables seamless mixing of large text datasets with significantly smaller transcribed speech corpora during training. We study different ways of transforming large text corpora into a symbolic form suitable for training our MMDA network. Our best MMDA setup obtains small improvements in character error rate (CER), and as much as 7-10% relative word error rate (WER) improvement over a baseline, both with and without an external language model.
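
A simplified sketch of the two-encoder layout (module choices and sizes are assumptions, not the authors' code): both encoders map into the same hidden dimensionality so a single shared attention decoder can consume either speech batches or text-only batches.

```python
# Simplified sketch of a dual-encoder model with a shared attention decoder.
import torch
import torch.nn as nn

class MMDASketch(nn.Module):
    def __init__(self, feat_dim=80, sym_vocab=500, hid=256, out_vocab=100):
        super().__init__()
        self.acoustic_enc = nn.LSTM(feat_dim, hid, batch_first=True)
        self.symbolic_enc = nn.Sequential(nn.Embedding(sym_vocab, hid),
                                          nn.LSTM(hid, hid, batch_first=True))
        self.emb = nn.Embedding(out_vocab, hid)
        self.decoder = nn.TransformerDecoderLayer(hid, 4, batch_first=True)
        self.out = nn.Linear(hid, out_vocab)

    def forward(self, src, tgt, symbolic=False):
        # Pick the encoder matching the input modality; attention/decoder are shared.
        enc = (self.symbolic_enc(src)[0] if symbolic else self.acoustic_enc(src)[0])
        return self.out(self.decoder(self.emb(tgt), enc))

m = MMDASketch()
tgt = torch.randint(0, 100, (2, 12))
print(m(torch.randn(2, 120, 80), tgt).shape)                       # speech batch
print(m(torch.randint(0, 500, (2, 20)), tgt, symbolic=True).shape)  # text batch
```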

* 5 Pages, 1 Figure, accepted at INTERSPEECH 2018 

The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines

Mar 28, 2018
Jon Barker, Shinji Watanabe, Emmanuel Vincent, Jan Trmal

The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing, and machine learning. This paper introduces the 5th CHiME Challenge, which considers the task of distant multi-microphone conversational ASR in real home environments. Speech material was elicited using a dinner party scenario, with efforts taken to capture data representative of natural conversational speech, recorded by 6 Kinect microphone arrays and 4 binaural microphone pairs. The challenge features a single-array track and a multiple-array track and, for each track, distinct rankings will be produced for systems focusing on robustness with respect to distant-microphone capture versus systems attempting to address all aspects of the task, including conversational language modeling. We discuss the rationale for the challenge and provide a detailed description of the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR.


Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

Jun 08, 2017
Takaaki Hori, Shinji Watanabe, Yu Zhang, William Chan

We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions, and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
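
The combined decoding score can be sketched as a simple log-linear fusion of the three components; the weights and interface below are illustrative, not the paper's tuned values.

```python
# Illustrative fusion of attention, CTC, and LM scores during beam search.
import torch

def combined_score(log_p_att, log_p_ctc, log_p_lm, ctc_weight=0.3, lm_weight=0.5):
    """All inputs: (vocab,) log-probabilities for the next token of a hypothesis."""
    return ((1.0 - ctc_weight) * log_p_att
            + ctc_weight * log_p_ctc
            + lm_weight * log_p_lm)

vocab = 50
scores = combined_score(torch.randn(vocab).log_softmax(-1),   # attention decoder
                        torch.randn(vocab).log_softmax(-1),   # CTC prefix score
                        torch.randn(vocab).log_softmax(-1))   # LSTM language model
print(scores.topk(5).indices)  # candidate tokens for expanding the hypothesis
```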

* Accepted for INTERSPEECH 2017 

Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

Nov 21, 2017
Zhong Meng, Shinji Watanabe, John R. Hershey, Hakan Erdogan

Far-field speech recognition in noisy and reverberant conditions remains a challenging problem despite recent deep learning breakthroughs. This problem is commonly addressed by acquiring a speech signal from multiple microphones and performing beamforming over them. In this paper, we propose to use a recurrent neural network with long short-term memory (LSTM) architecture to adaptively estimate real-time beamforming filter coefficients, in order to cope with non-stationary environmental noise and the dynamic nature of source and microphone positions, which results in a set of time-varying room impulse responses. The LSTM adaptive beamformer is jointly trained with a deep LSTM acoustic model to predict senone labels. Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients. The proposed system achieves a 7.97% absolute gain over baseline systems with no beamforming on the CHiME-3 real evaluation set.
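
An illustrative sketch of the general idea (the real model predicts filter coefficients rather than the simple per-channel softmax weights used here, and all dimensions and names are assumptions): an LSTM maps the multichannel input to time-varying weights that are applied filter-and-sum style to produce a single enhanced stream for the acoustic model.

```python
# Illustrative sketch of LSTM-driven adaptive filter-and-sum beamforming.
import torch
import torch.nn as nn

class AdaptiveBeamformerSketch(nn.Module):
    def __init__(self, n_channels=6, feat_dim=257, hid=128):
        super().__init__()
        self.lstm = nn.LSTM(n_channels * feat_dim, hid, batch_first=True)
        self.to_weights = nn.Linear(hid, n_channels)

    def forward(self, x):
        """x: (B, T, C, F) multichannel magnitude features -> (B, T, F)."""
        B, T, C, F = x.shape
        h, _ = self.lstm(x.reshape(B, T, C * F))
        w = torch.softmax(self.to_weights(h), dim=-1)   # per-frame channel weights
        return (w.unsqueeze(-1) * x).sum(dim=2)         # filter-and-sum combination

bf = AdaptiveBeamformerSketch()
print(bf(torch.randn(2, 100, 6, 257)).shape)  # torch.Size([2, 100, 257])
```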

* 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 271-275 

Multichannel End-to-end Speech Recognition

Mar 14, 2017
Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey

The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.


A practical two-stage training strategy for multi-stream end-to-end speech recognition

Oct 23, 2019
Ruizhi Li, Gregory Sell, Xiaofei Wang, Shinji Watanabe, Hynek Hermansky

The multi-stream paradigm of audio processing, in which several sources are considered simultaneously, has been an active research area for information fusion. Our previous study offered a promising direction within end-to-end automatic speech recognition, where parallel encoders capture diverse information and a stream-level fusion based on attention mechanisms combines the different views. However, as the number of streams grows, so does the number of encoders, and the previous approach can require substantial memory and massive amounts of parallel data for joint training. In this work, we propose a practical two-stage training scheme. Stage-1 trains a Universal Feature Extractor (UFE), whose encoder outputs are produced from a single-stream model trained with all data. Stage-2 formulates a multi-stream scheme that trains only the attention fusion module, using the UFE features and pretrained components from Stage-1. Experiments have been conducted on two datasets, DIRHA and AMI, in a multi-stream scenario. Compared with our previous method, this strategy achieves relative word error rate reductions of 8.2-32.4%, while consistently outperforming several conventional combination methods.
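
A rough sketch of the stage-2 fusion idea (the scoring network and shapes are assumptions): pre-extracted UFE features from each stream are combined by a trainable stream-level attention while the stage-1 components stay fixed.

```python
# Rough sketch of trainable stream-level attention fusion over UFE features.
import torch
import torch.nn as nn

class StreamFusionSketch(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)        # scores each stream per frame

    def forward(self, streams):
        """streams: (S, B, T, D) frame-synchronous UFE features from S streams."""
        logits = self.score(streams).squeeze(-1)             # (S, B, T)
        alpha = torch.softmax(logits, dim=0).unsqueeze(-1)   # stream weights
        return (alpha * streams).sum(dim=0)                  # (B, T, D) fused features

fusion = StreamFusionSketch()
print(fusion(torch.randn(2, 4, 50, 256)).shape)  # torch.Size([4, 50, 256])
```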

* submitted to ICASSP 2019 

End-to-End Neural Speaker Diarization with Permutation-Free Objectives

Sep 12, 2019
Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe

In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Unlike most existing methods, our proposed method does not have separate modules for the extraction and clustering of speaker representations. Instead, our model has a single neural network that directly outputs speaker diarization results. To realize such a model, we formulate the speaker diarization problem as a multi-label classification problem and introduce a permutation-free objective function to directly minimize diarization errors without suffering from the speaker-label permutation problem. Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference. Because of this benefit, our model can easily be trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels. We evaluated the proposed method on simulated speech mixtures. The proposed method achieved a diarization error rate of 12.28%, while a conventional clustering-based system produced a diarization error rate of 28.77%. Furthermore, domain adaptation with real-recorded speech provided a 25.6% relative improvement on the CALLHOME dataset. Our source code is available online at https://github.com/hitachi-speech/EEND.
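
A minimal sketch of the permutation-free objective (shapes and names are illustrative): frame-level speaker-activity predictions are scored against every permutation of the reference speakers, and the permutation with the lowest binary cross-entropy is used.

```python
# Minimal sketch of a permutation-free diarization loss over speaker activities.
import itertools
import torch
import torch.nn.functional as F

def permutation_free_bce(pred, label):
    """pred: (T, S) speaker-activity logits; label: (T, S) 0/1 references."""
    S = pred.size(1)
    losses = [F.binary_cross_entropy_with_logits(pred, label[:, list(p)])
              for p in itertools.permutations(range(S))]
    return torch.stack(losses).min()

# Toy usage: 100 frames, 2 speakers, overlapping speech allowed.
pred = torch.randn(100, 2)
label = (torch.rand(100, 2) > 0.5).float()
print(permutation_free_bce(pred, label))
```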

* Accepted to INTERSPEECH 2019 
