Automatic Speech Recognition and Understanding Workshop
December 8-12, 2013 | Olomouc, Czech Republic
LM_01: K-COMPONENT RECURRENT NEURAL NETWORK LANGUAGE MODELS USING CURRICULUM LEARNING
Conventional n-gram language models are known for their insufficient ability to capture long-distance dependencies and their brittleness with respect to within-domain variations. In this paper, we propose a k-component recurrent neural network language model trained with curriculum learning (CL-KRNNLM) to address these issues. Based on a Dutch-language corpus, we investigate three methods of curriculum learning that exploit dedicated component models for specific sub-domains. In the oracle situation, in which context information is available during testing, the experimental results verify our three hypotheses. The first is that the domain-dedicated models perform better than the general models on their specific domains. The second is that curriculum learning can be used to train recurrent neural network language models (RNNLMs) from general patterns to specific patterns. The third is that curriculum learning, as an implicit method of adjusting the weights for general and specific patterns, performs better than linear interpolation. When context information is unknown during testing, CL-KRNNLM still improves over the conventional RNNLM by 13% relative in terms of word prediction accuracy. The proposed CL-KRNNLM is additionally tested by rescoring N-best lists on a standard speech recognition benchmark data set. In the rescoring task, the context domains in the training data are clustered by combining Latent Dirichlet Allocation with k-means clustering.
LM_02: LEARNING A SUBWORD VOCABULARY BASED ON UNIGRAM LIKELIHOOD
Using words as vocabulary units for tasks like speech recognition is infeasible for many morphologically rich languages, including Finnish, so subword units are commonly used for language modeling. This work presents a novel algorithm for creating a subword vocabulary based on the unigram likelihood of a text corpus. The method is evaluated with an entropy measure and on a Finnish LVCSR task. The unigram entropy of the text corpus is shown to be a good indicator of the quality of higher-order n-gram models, and the resulting vocabularies also yield high speech recognition accuracy.
LM_03: EFFECTIVE PSEUDO-RELEVANCE FEEDBACK FOR LANGUAGE MODELING IN SPEECH RECOGNITION
Language modeling (LM) is part and parcel of any automatic speech recognition (ASR) system, helping to constrain the acoustic analysis, guide the search through multiple candidate word strings, and quantify the acceptability of the final output hypothesis given an input utterance. Although the n-gram model remains predominant, a number of novel and ingenious LM methods have been developed to complement it or to be used in its place. A more recent line of research leverages information cues gleaned from pseudo-relevance feedback (PRF) to derive an utterance-regularized language model that complements the n-gram model. This paper continues this general line of research, and its main contribution is twofold. First, we explore an alternative and more efficient formulation for constructing such an utterance-regularized language model for ASR. Second, the utilities of various utterance-regularized language models are analyzed and compared extensively. Empirical experiments on a large vocabulary continuous speech recognition (LVCSR) task demonstrate that our proposed language models offer substantial improvements over the baseline n-gram system and achieve performance competitive with, or better than, some state-of-the-art language models.
LM_04: LEARNING BETTER LEXICAL PROPERTIES FOR RECURRENT OOV WORDS
Out-of-vocabulary (OOV) words can appear more than once in a conversation or over a period of time. Such multiple instances of the same OOV word provide valuable information for learning its lexical properties. We therefore investigated how to estimate better pronunciations, spellings and part-of-speech (POS) labels for recurrent OOV words. We first identified recurrent OOV words in the output of a hybrid decoder by applying a bottom-up clustering approach. Then, multiple instances of the same OOV word were used simultaneously to learn the word's properties. The experimental results showed that the bottom-up clustering approach is very effective at detecting the recurrence of OOV words. Furthermore, by using evidence from multiple instances of the same word, the pronunciation accuracy, recovery rate and POS label accuracy of recurrent OOV words can be substantially improved.
LM_05: JOINT TRAINING OF INTERPOLATED EXPONENTIAL N-GRAM MODELS
For many speech recognition tasks, the best language model performance is achieved by collecting text from multiple sources or domains, and interpolating language models built separately on each individual corpus. When multiple corpora are available, it has also been shown that when using a domain adaptation technique such as EzAdapt, the performance on each individual domain can be improved by training a joint model across all of the corpora. In this paper, we explore whether improving each domain model via joint training also improves performance when interpolating the models together. We show that the diversity of the individual models is an important consideration, and propose a method for adjusting diversity to optimize overall performance. We present results using word n-gram models and Model M, a class-based n-gram model, and demonstrate improvements in both perplexity and word-error rate relative to state-of-the-art results on a Broadcast News transcription task.
LM_06: MIXTURE OF MIXTURE N-GRAM LANGUAGE MODELS
This paper presents a language model adaptation technique that builds a single static language model from a set of language models, each trained on a separate text corpus, while aiming to maximize the likelihood of an adaptation data set given as a development set of sentences. The proposed model can be considered a mixture of mixture language models. The mixture model at the top level is a sentence-level mixture model, where each sentence is assumed to be drawn from one of a discrete set of topic or task clusters. After selecting a cluster, each n-gram is assumed to be drawn from one of the given n-gram language models. We estimate the cluster mixture weights and the n-gram language model mixture weights for each cluster using the expectation-maximization (EM) algorithm, seeking the parameter estimates that maximize the likelihood of the development sentences. This mixture of mixture models can be represented efficiently as a static n-gram language model using the previously proposed Bayesian language model interpolation technique. We show a significant improvement with this technique, in both perplexity and WER, compared to the standard one-level interpolation scheme.
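The inner level of the model is linear interpolation of component n-gram LMs with weights estimated by EM on the development sentences. A minimal sketch of that EM step, assuming the per-word component probabilities have already been computed (the array layout and iteration count are illustrative, not taken from the paper):

```python
import numpy as np

def em_interpolation_weights(comp_probs, n_iters=50):
    """Estimate linear-interpolation weights for component LMs by EM.

    comp_probs: array of shape (n_words, n_components) holding each
    component LM's probability for every word of the development data.
    Returns the weights maximizing the interpolated likelihood.
    """
    n_words, n_comp = comp_probs.shape
    w = np.full(n_comp, 1.0 / n_comp)          # uniform initialization
    for _ in range(n_iters):
        # E-step: posterior responsibility of each component per word
        joint = comp_probs * w                  # (n_words, n_components)
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weights are the average responsibilities
        w = resp.mean(axis=0)
    return w

# toy usage: two components, three development-set words
probs = np.array([[0.2, 0.01], [0.1, 0.05], [0.01, 0.3]])
print(em_interpolation_weights(probs))
```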
AM_01: COMPACT ACOUSTIC MODELING BASED ON ACOUSTIC MANIFOLD USING A MIXTURE OF FACTOR ANALYZERS
A compact acoustic model for speech recognition is proposed based on nonlinear manifold modeling of the acoustic feature space. The acoustic features of the speech signal are assumed to form a low-dimensional manifold, which is modeled by a mixture of factor analyzers. Each factor analyzer describes a local area of the manifold using a low-dimensional linear model. For an HMM-based speech recognition system, observations of a particular state are constrained to lie on part of the manifold, which may cover several factor analyzers. For each tied state, a sparse weight vector is obtained through an iterative shrinkage algorithm, in which the degree of sparseness is determined automatically by the training data. For each nonzero component of the weight vector, a low-dimensional factor is estimated for the corresponding factor model according to the maximum a posteriori (MAP) criterion, resulting in a compact state model. Experimental results show that, compared with the conventional HMM-GMM system and the SGMM system, the new method not only contains fewer parameters but also yields better recognition results.
AM_02: A GENERALIZED DISCRIMINATIVE TRAINING FRAMEWORK FOR SYSTEM COMBINATION
This paper proposes a generalized discriminative training framework for system combination, which encompasses acoustic modeling (Gaussian mixture models and deep neural networks) and discriminative feature transformation. To improve performance by combining base systems with complementary systems, the complementary systems should have reasonably good performance while tending to produce outputs different from the base system. Although it is difficult to balance these two somewhat opposing targets with conventional heuristic combination approaches, our framework provides a new objective function that allows the balance to be adjusted within a sequential discriminative training criterion. We also show a close relationship between the proposed method and boosting methods. Experiments on a highly noisy medium-vocabulary speech recognition task (2nd CHiME Challenge, track 2) and an LVCSR task (Corpus of Spontaneous Japanese) show the effectiveness of the proposed method compared with a conventional system combination approach.
AM_03: ACOUSTIC MODELING USING TRANSFORM-BASED PHONE-CLUSTER ADAPTIVE TRAINING
In this paper, we propose a new acoustic modeling technique called Phone-Cluster Adaptive Training. In this approach, the parameters of context-dependent states are obtained by linear interpolation of several monophone cluster models, which are themselves obtained by adaptation, using linear transformations of a canonical Gaussian Mixture Model (GMM). This approach is inspired by Cluster Adaptive Training (CAT) for speaker adaptation and by the Subspace Gaussian Mixture Model (SGMM). The parameters of the model are updated in an adaptive training framework. The interpolation vectors implicitly capture phonetic context information. The proposed approach shows substantial improvement over the Continuous Density Hidden Markov Model (CDHMM) and performance similar to that of the SGMM, while using significantly fewer parameters than both the CDHMM and the SGMM.
AM_04: SPEAKER ADAPTATION OF NEURAL NETWORK ACOUSTIC MODELS USING I-VECTORS
We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network, in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across speakers. Experimental results on a 300-hour Switchboard corpus show that DNNs trained on speaker-independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker-independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and FMLLR), with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after Hessian-free sequence training over networks trained on speaker-adapted features only.
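The input construction described above is straightforward to sketch: every acoustic frame of a speaker is augmented with that speaker's i-vector. The dimensions and random data below are placeholders, not the paper's setup.

```python
import numpy as np

def append_ivector(frames, ivector):
    """Concatenate a fixed speaker i-vector to every acoustic frame.

    frames:  (n_frames, feat_dim) speaker-independent features
    ivector: (ivec_dim,) identity vector for the same speaker
    returns: (n_frames, feat_dim + ivec_dim) DNN input features
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.hstack([frames, tiled])

frames = np.random.randn(300, 40)     # e.g. 300 frames of 40-dim features
ivec = np.random.randn(100)           # e.g. a 100-dim i-vector
print(append_ivector(frames, ivec).shape)   # (300, 140)
```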
AM_05: NEIGHBOUR SELECTION AND ADAPTATION FOR RAPID SPEAKER-DEPENDENT ASR
Speaker-dependent (SD) ASR systems have significantly lower word error rates (WER) than speaker-independent (SI) systems. However, SD systems require sufficient training data from the target speaker, which is impractical to collect in a short time. We present a technique for training SD models using just a few minutes of a speaker's data. We compensate for the lack of adequate speaker-specific data by selecting neighbours from a database of existing speakers who are acoustically close to the target speaker. We evaluate various neighbour selection algorithms on a large-scale medical transcription task and report significant reductions in WER using only 5 minutes of speaker-specific data. We conduct a detailed analysis of factors such as gender and accent in the neighbour selection. Finally, we study neighbour selection and adaptation in the context of discriminative objective functions.
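One simple way to pick acoustically close neighbours is sketched below, under the assumption that each speaker is summarized by a fixed-length vector (e.g. an i-vector or mean supervector); the paper's actual selection criteria may differ.

```python
import numpy as np

def select_neighbours(target_vec, speaker_vecs, k=20):
    """Return indices of the k speakers closest to the target speaker
    by cosine similarity between fixed-length speaker representations."""
    t = target_vec / np.linalg.norm(target_vec)
    s = speaker_vecs / np.linalg.norm(speaker_vecs, axis=1, keepdims=True)
    sims = s @ t
    return np.argsort(-sims)[:k]

pool = np.random.randn(1000, 100)     # hypothetical 1000-speaker database
target = np.random.randn(100)
print(select_neighbours(target, pool, k=5))
```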
Dec_01: EFFICIENT NEARLY ERROR-LESS LVCSR DECODING BASED ON INCREMENTAL FORWARD AND BACKWARD PASSES
We show that most search errors can be identified by aligning the results of symmetric forward and backward decoding passes. Based on this observation, we introduce an efficient high-level decoding architecture that yields virtually no search errors and requires virtually no manual tuning. We perform initial forward and backward decodings with tight initial beams, identify search errors, and then recursively increase the beam sizes and run new forward and backward decodings for the erroneous intervals until no more search errors are detected. Consequently, each utterance, and even each single word, is decoded with the smallest beam size required to decode it correctly. On all tested systems we achieve an error rate equal or very close to that of classical decoding with an ideally tuned beam size, but in an unsupervised manner, without specific tuning, and at roughly twice the speed. An additional factor-of-two speedup can be achieved by running the forward and backward passes in separate threads.
SLU_01: QUERY UNDERSTANDING ENHANCED BY HIERARCHICAL PARSING STRUCTURES
Query understanding has been well studied in the areas of information retrieval and spoken language understanding (SLU). There are generally three layers of query understanding: domain classification, user intent detection, and semantic tagging. Classifiers can be applied to domain and intent detection in real systems, and semantic tagging (or slot filling) is commonly defined as a sequence-labeling task -- mapping a sequence of words to a sequence of labels. Various statistical features (e.g., n-grams) can be extracted from annotated queries for learning label prediction models; however, linguistic characteristics of queries, such as hierarchical structures and semantic relationships, are usually neglected in the feature extraction process. In this work, we propose an approach that leverages linguistic knowledge encoded in hierarchical parse trees for query understanding. Specifically, for natural language queries, we extract a set of syntactic structural features and semantic dependency features from query parse trees to enhance inference model learning. Experiments on real natural language queries show that augmenting sequence labeling models with linguistic knowledge can improve query understanding performance in various domains.
SLU_02: CONVOLUTIONAL NEURAL NETWORK BASED TRIANGULAR CRF FOR JOINT INTENT DETECTION AND SLOT FILLING
We describe a joint model for intent detection and slot filling based on convolutional neural networks (CNN). The proposed architecture can be perceived as a neural network (NN) version of the triangular CRF model (TriCRF), in which the intent label and the slot sequence are modeled jointly and their dependencies are exploited. Our slot filling component is a globally normalized CRF style model, as opposed to left-to-right models in recent NN based slot taggers. Its features are automatically extracted through CNN layers and shared by the intent model. We show that our slot model component generates state-of-the-art results, outperforming CRF significantly. Our joint model outperforms the standard TriCRF by 1% absolute for both intent and slot. On a number of other domains, our joint model achieves 0.7 - 1%, and 0.9 - 2.1% absolute gains over the independent modeling approach for intent and slot respectively.
SLU_03: SEMANTIC ENTITY DETECTION FROM MULTIPLE ASR HYPOTHESES WITHIN THE WFST FRAMEWORK
The paper presents a novel approach to named entity detection from ASR lattices. Since the described method not only detects the named entities but also assigns a detailed semantic interpretation to them, we call our approach semantic entity detection. All the algorithms are designed to use automata operations defined within the framework of weighted finite state transducers (WFSTs), as ASR lattices are nowadays frequently represented as weighted acceptors. The expert knowledge about the semantics of the task at hand can first be expressed in the form of a context-free grammar and then converted to FST form. We use WFST optimization to obtain a compact representation of the ASR lattice. The WFST framework also allows the use of word confusion networks as another representation of multiple ASR hypotheses. In this way we can use the full power of the composition and optimization operations implemented in the OpenFST toolkit for our semantic entity detection algorithm. The devised method also employs the concept of a factor automaton; this allows us to overcome the need for a filler model and consequently makes the method more general. The paper includes an experimental evaluation of the proposed algorithm and compares the performance obtained using the one-best word hypothesis, optimized lattices and word confusion networks.
SLU_04: ON-LINE ADAPTATION OF SEMANTIC MODELS FOR SPOKEN LANGUAGE UNDERSTANDING
Spoken language understanding (SLU) systems extract semantic information from speech signals, which is usually mapped onto concept sequences. The distribution of concepts in dialogues is usually sparse, so general models may fail to model the concept distribution for a dialogue, and semantic models can benefit from adaptation. In this paper, we present an instance-based approach for on-line adaptation of semantic models. We show that we can improve the performance of an SLU system on an utterance by retrieving relevant instances from the training data and using them to adapt the semantic models on-line. The instance-based adaptation scheme uses two similarity metrics, edit distance and n-gram match score, on three different tokenizations: word-concept pairs, words, and concepts. We achieve a significant improvement (6% relative) in understanding performance by conducting re-scoring experiments on the n-best lists output by the SLU. We also apply a two-level adaptation scheme, where adaptation is first applied to the automatic speech recognizer (ASR) and then to the SLU.
SLU_05: DYSFLUENT SPEECH DETECTION BY IMAGE FORENSICS TECHNIQUES
As speech recognition has become popular, the importance of dysfluency detection has increased considerably. Once a dysfluent event in spontaneous speech is identified, speech recognition performance can be enhanced by eliminating its negative effect. Most existing techniques to detect such dysfluent events are based on statistical models. The sparsity of dysfluent events and the complexity of describing such events within a speech recognition system make their recognition difficult. We address these problems with an algorithm inspired by image forensics. This paper presents the algorithm we developed to extract novel features of complex dysfluencies. Standard classifier-design steps were used to statistically evaluate the proposed features of complex dysfluencies in the spectral and cepstral domains. Support vector machines perform an objective assessment of MFCC features, MFCC-based derived features, PCA-based derived features and kernel-PCA-based derived features of complex dysfluencies; our derived features increase performance by 46% compared to MFCCs.
SLU_06: BARGE-IN EFFECTS IN BAYESIAN DIALOGUE ACT RECOGNITION AND SIMULATION
Dialogue act recognition and simulation are traditionally considered separate processes. Here, we argue that both can be fruitfully treated as interleaved processes within the same probabilistic model, leading to a synchronous improvement of performance in both. To demonstrate this, we train multiple Bayes Nets that predict the timing and content of the next user utterance. A specific focus is on providing support for barge-ins. We describe experiments using the Let's Go data that show an improvement in classification accuracy (+5%) in Bayesian dialogue act recognition involving barge-ins using partial context compared to using full context. Our results also indicate that simulated dialogues with user barge-in are more realistic than simulations without barge-in events.
Dial_01: EXPERT-BASED REWARD SHAPING AND EXPLORATION SCHEME FOR BOOSTING POLICY LEARNING OF DIALOGUE MANAGEMENT
This paper investigates the conditions under which expert knowledge can be used to accelerate the policy optimization of a learning agent. Recent work on reinforcement learning for dialogue management has produced sophisticated methods for value estimation that jointly address the exploration/exploitation dilemma, sample efficiency and non-stationary environments. In this paper, a reward shaping method and an exploration scheme, both based on intuitive hand-coded expert advice, are combined with an efficient temporal-difference-based learning procedure. The key objective is to boost the initial training stage, when the system is not yet reliable enough to interact with real users (e.g. clients). Our claims are illustrated by simulation-based experiments carried out with a state-of-the-art goal-oriented dialogue management framework, the Hidden Information State (HIS).
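One standard way to inject such expert knowledge is potential-based reward shaping; the sketch below illustrates that general idea and is not necessarily the paper's exact shaping function, and the slot-counting potential is purely hypothetical.

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based reward shaping: add gamma*phi(s') - phi(s) to the
    environment reward. `potential` encodes hand-coded expert knowledge,
    e.g. how many slots of the dialogue state are already filled."""
    return r + gamma * potential(s_next) - potential(s)

# toy usage: the potential is the number of filled slots in a dialogue state
state, next_state = {"from": "A", "to": None}, {"from": "A", "to": "B"}
phi = lambda s: sum(v is not None for v in s.values())
print(shaped_reward(0.0, state, next_state, phi))   # 0.99*2 - 1 = 0.98
```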
Dial_02: DIALOGUE MANAGEMENT FOR LEADING THE CONVERSATION IN PERSUASIVE DIALOGUE SYSTEMS
In this research, we propose a probabilistic dialogue modeling method for persuasive dialogue systems that interact with the user toward a specific goal and lead the user, among the candidate actions satisfying the user's needs, to take the actions the system intends. As a baseline, we develop a dialogue model that assumes the user makes decisions based on preference. We then improve the model by introducing methods to guide the user from topic to topic. We evaluate the system knowledge and dialogue manager in a task that tests the system's persuasive power, and find that the proposed method is effective in this respect.
Dial_03: UNSUPERVISED INDUCTION AND FILLING OF SEMANTIC SLOTS FOR SPOKEN DIALOGUE SYSTEMS USING FRAME-SEMANTIC PARSING
Spoken dialogue systems typically use predefined semantic slots to parse users' natural language inputs into unified semantic representations. Defining the slots often requires domain experts and professional annotators, and the cost can be high. In this paper, we ask the following question: given a collection of unlabeled raw audio, can we use frame semantics theory to automatically induce and fill the semantic slots in an unsupervised fashion? To do this, we propose the use of a state-of-the-art frame-semantic parser and a spectral-clustering-based slot ranking model that adapts the generic output of the parser to the target semantic space. Empirical experiments on a real-world spoken dialogue dataset show that the automatically induced semantic slots are in line with the reference slots created by domain experts: we observe a mean average precision of 69.36% using ASR-transcribed data. Our slot filling evaluations also indicate the promise of the proposed approach.
Multi_01: CROSS-LINGUAL CONTEXT SHARING AND PARAMETER-TYING FOR MULTI-LINGUAL SPEECH RECOGNITION
This paper is concerned with the problem of building acoustic models for automatic speech recognition (ASR) using speech data from multiple languages. Techniques for multi-lingual ASR are developed in the context of the subspace Gaussian mixture model (SGMM) [2, 3]. Multi-lingual SGMM-based ASR systems have been configured with shared subspace parameters trained from multiple languages but with distinct language-dependent phonetic contexts and states [11, 12]. First, an approach for sharing state-level target-language and foreign-language SGMM parameters is described. Second, semi-tied covariance transformations are applied as an alternative to full-covariance Gaussians to make acoustic model training less sensitive to insufficient training data. These techniques are applied to Hindi and Marathi language data obtained for an agricultural commodities dialog task in multiple Indian languages.
Multi_02: IMPROVED PUNCTUATION RECOVERY THROUGH COMBINATION OF MULTIPLE SPEECH STREAMS
In this paper, we present a technique to use the information in multiple parallel speech streams, which are approximate translations of each other, in order to improve performance in a punctuation recovery task. We first build a phrase-level alignment of these multiple streams, using phrase tables to link the phrase pairs together. The information so collected is then used to make it more likely that sentence units are equivalent across streams. We applied this technique to a number of simultaneously interpreted speeches of the European Parliament Committees, for the recovery of the full stop, in four different languages (English, Italian, Portuguese and Spanish). We observed an average improvement in SER of 37% when compared to an existing baseline, in Portuguese and English.
Multi_03: INVESTIGATION OF MULTILINGUAL DEEP NEURAL NETWORKS FOR SPOKEN TERM DETECTION
The development of high-performance speech processing systems for low-resource languages is a challenging area. One approach to address the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to use bottleneck features, or hybrid systems, trained on multilingual data for speech-to-text (STT) systems. This paper presents an investigation into the application of these multilingual approaches to spoken term detection. Experiments were run using the IARPA Babel limited language pack corpora (approx. 10 hours/language) with 4 languages for initial multilingual system development and an additional held-out target language. STT gains achieved through using multilingual bottleneck features in a Tandem configuration are shown to also apply to keyword search (KWS). Further improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training set languages. Adapted hybrid systems performed slightly worse on average than the adapted Tandem systems. A language-independent acoustic model test on the target language showed that at least minimal retraining or adaptation of the acoustic models to the target language is currently needed to achieve reasonable performance.
Multi_04: LANGUAGE STYLE AND DOMAIN ADAPTATION FOR CROSS-LANGUAGE SLU PORTING
Automatic cross-language Spoken Language Understanding (SLU) porting is plagued by two limitations. First, SLU models are usually trained on limited-domain corpora. Second, language-pair resources (e.g. aligned corpora) are scarce or unmatched in style (e.g. news vs. conversation). We present experiments on automatic style adaptation of the input to the translation systems and of their output for SLU. We approach the problem of scarce aligned data by adapting the available parallel data to the target domain using a limited in-domain corpus and larger web-crawled close-to-domain corpora. SLU performance is optimized by re-ranking its output with a Recurrent Neural Network-based joint language model. We evaluate end-to-end SLU porting on close and distant language pairs, Spanish - Italian and Turkish - Italian, and achieve significant improvements in both translation quality and SLU performance.
Robust_01: AUTOMATIC MODEL COMPLEXITY CONTROL FOR GENERALIZED VARIABLE PARAMETER HMMS
This paper investigates a novel model complexity control method to improve the generalization and computational efficiency of conventional generalized variable parameter HMMs (GVP-HMMs). The optimal polynomial degrees of Gaussian mean, variance and model-space linear transform trajectories are automatically determined at a local level. Significant relative error rate reductions of 20% and 28% were obtained over multi-style training baseline systems on Aurora 2 and a medium-vocabulary Mandarin Chinese speech recognition task, respectively. Consistent performance improvements and a relative model size compression of 57% were also obtained over baseline GVP-HMM systems using a uniformly assigned polynomial degree.
Robust_02: IMPROVED CEPSTRAL MEAN AND VARIANCE NORMALIZATION USING BAYESIAN FRAMEWORK
Cepstral Mean and Variance Normalization (CMVN) is a computationally efficient normalization technique for noise-robust speech recognition. The performance of CMVN is known to degrade for short utterances due to insufficient data for parameter estimation and the loss of discriminable information, as all utterances are forced to have zero mean and unit variance. In this work, we propose to use posterior estimates of the mean and variance in CMVN instead of the maximum likelihood estimates. This Bayesian approach, in addition to providing robust parameter estimates, is also shown to preserve discriminable information without an increase in computational cost, making it particularly relevant for Interactive Voice Response (IVR)-based applications. The relative WER reductions of this approach with respect to Cepstral Mean Normalization, CMVN and Histogram Equalization are (i) 40.1%, 27% and 4.3% on the Aurora2 database for all utterances, (ii) 25.7%, 38.6% and 30.4% on the Aurora2 database for short utterances, and (iii) 18.7%, 12.6% and 2.5% on the Aurora4 database.
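A minimal sketch of the general idea of replacing maximum likelihood CMVN statistics with posterior (MAP-style) estimates smoothed towards global priors; the interpolation form and the value of tau are assumptions, not the paper's exact formulas.

```python
import numpy as np

def map_cmvn(feats, prior_mean, prior_var, tau=50.0):
    """Simplified MAP-style CMVN: per-utterance mean/variance estimates are
    smoothed towards global priors, which matters most for short utterances.
    feats: (n_frames, dim); prior_mean/prior_var: (dim,) global statistics."""
    n = feats.shape[0]
    mean_ml = feats.mean(axis=0)
    var_ml = feats.var(axis=0)
    mean_map = (n * mean_ml + tau * prior_mean) / (n + tau)
    var_map = (n * var_ml + tau * prior_var) / (n + tau)
    return (feats - mean_map) / np.sqrt(var_map)

utt = np.random.randn(30, 13)                 # a short 30-frame utterance
norm = map_cmvn(utt, prior_mean=np.zeros(13), prior_var=np.ones(13))
print(norm.shape)
```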
Robust_03: VECTOR TAYLOR SERIES BASED HMM ADAPTATION FOR GENERALIZED CEPSTRUM IN NOISY ENVIRONMENT
This paper derives a vector Taylor series (VTS) based hidden Markov model (HMM) adaptation algorithm for cepstral coefficients extracted using a power mapping function in noisy environments. It is well known that HMM adaptation using the VTS approximation for mel-frequency cepstral coefficients (MFCCs) significantly improves automatic speech recognition (ASR) performance in noisy environments. In addition, power normalized cepstral coefficients (PNCCs), which replace the logarithmic mapping function with a power mapping function, have been proposed for robust ASR systems. Since the logarithmic mapping function is highly sensitive to additive components, replacing the spectral mapping function is effective for ASR systems in noisy environments. In this paper, we extend the VTS-based model adaptation approach to cepstral coefficients extracted using a power mapping function, and we prove that the proposed adaptation formula is a generalization of the conventional one. Experimental results indicate that HMM adaptation in the cepstrum obtained using a power mapping function improves ASR performance compared with the conventional VTS-based approach for MFCCs.
Robust_04: THE SECOND CHIME SPEECH SEPARATION AND RECOGNITION CHALLENGE: AN OVERVIEW OF CHALLENGE SYSTEMS AND OUTCOMES
Distant-microphone automatic speech recognition (ASR) remains a challenging goal in everyday environments involving multiple background sources and reverberation. This paper reports on the results of the 2nd ‘CHiME’ Challenge, an initiative designed to analyse and evaluate the performance of ASR systems in a real-world domestic environment. We discuss the rationale for the challenge and provide a summary of the datasets, tasks and baseline systems. The paper overviews the systems that were entered for the two challenge tracks: small-vocabulary with moving talker and medium-vocabulary with stationary talker. We present a summary of the challenge findings including novel results produced by challenge system combination. Possible directions for future challenges are discussed.
Robust_05: LEARNING STATE LABELS FOR SPARSE CLASSIFICATION OF SPEECH WITH MATRIX DECONVOLUTION
Non-negative spectral factorisation with long temporal context has been successfully used for noise robust recognition of speech in multi-source environments. Sparse classification from activations of speech atoms can be employed instead of conventional GMMs to determine speech state likelihoods. For accurate classification, correct linguistic state labels must be assigned to speech atoms. We propose using non-negative matrix deconvolution for learning the labels with algorithms closely matching a framework that separates speech from additive noises. Experiments on the 1st CHiME Challenge corpus show improvement in recognition accuracy over labels acquired from original atom sources or previously used least squares regression. The new approach also circumvents numerical issues encountered in previous learning methods, and opens up possibilities for new speech basis generation algorithms.
Robust_06: MODIFIED SPLICE AND ITS EXTENSION TO NON-STEREO DATA FOR NOISE ROBUST SPEECH RECOGNITION
In this paper, a modification to the training process of the popular SPLICE algorithm is proposed for noise-robust speech recognition. The modification is based on feature correlations and enables this stereo-based algorithm to improve performance in all noise conditions, especially in unseen cases. Further, the modified framework is extended to work for non-stereo datasets, where clean and noisy training utterances, but not stereo counterparts, are required. Finally, an MLLR-based, computationally efficient run-time noise adaptation method in the SPLICE framework is proposed. The modified SPLICE shows 8.6% absolute improvement over SPLICE on Test C of the Aurora-2 database, and 2.93% overall. The non-stereo method shows 10.37% and 6.93% absolute improvements over the Aurora-2 and Aurora-4 baseline models, respectively. Run-time adaptation shows 9.89% absolute improvement in the modified framework compared to SPLICE for Test C, and 4.96% overall with respect to standard MLLR adaptation on HMMs.
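For reference, the standard stereo-based SPLICE estimate that this paper modifies can be sketched as follows (bias-only variant; the mixture posteriors are assumed to come from a GMM trained on the noisy channel, and the random stereo data is purely illustrative):

```python
import numpy as np

def splice_biases(noisy, clean, posteriors):
    """Estimate SPLICE bias vectors from stereo data.
    noisy, clean: (n_frames, dim) parallel features;
    posteriors:   (n_frames, K) mixture posteriors p(k | noisy frame)."""
    diff = clean - noisy                                   # (n, dim)
    num = posteriors.T @ diff                              # (K, dim)
    den = posteriors.sum(axis=0)[:, None]                  # (K, 1)
    return num / den

def splice_enhance(noisy, posteriors, biases):
    """Enhance noisy features: x_hat = y + sum_k p(k|y) * b_k."""
    return noisy + posteriors @ biases

# toy usage with random stereo data and soft assignments to K=4 classes
n, d, K = 200, 13, 4
noisy, clean = np.random.randn(n, d), np.random.randn(n, d)
post = np.random.dirichlet(np.ones(K), size=n)
b = splice_biases(noisy, clean, post)
print(splice_enhance(noisy, post, b).shape)
```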
Robust_07: A PROPAGATION APPROACH TO MODELLING THE JOINT DISTRIBUTIONS OF CLEAN AND CORRUPTED SPEECH IN THE MEL-CEPSTRAL DOMAIN
This paper presents a closed-form solution relating the joint distributions of corrupted and clean speech in the short-time Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficient (MFCC) domains. This makes possible a tighter integration of STFT-domain speech enhancement with feature- and model-compensation techniques for robust automatic speech recognition. The approach directly utilizes the conventional speech distortion model for STFT speech enhancement, allowing for low-cost, single-pass, causal implementations. Compared to similar uncertainty propagation approaches, it provides the full joint distribution, rather than just the posterior distribution, which offers additional model compensation possibilities. The method is exemplified by deriving an MMSE-MFCC estimator from the propagated joint distribution. It is shown that performance similar to that of STFT uncertainty propagation (STFT-UP) can be obtained on AURORA4, while deriving the full joint distribution.
SDRKWS_01: THE TAO OF ATWV: PROBING THE MYSTERIES OF KEYWORD SEARCH PERFORMANCE
In this paper we apply diagnostic analysis to gain a deeper understanding of the performance of the keyword search system that we have developed for conversational telephone speech in the IARPA Babel program. We summarize the Babel task, its primary performance metric, "actual term weighted value" (ATWV), and our recognition and keyword search systems. Our analysis uses two new oracle ATWV measures and a bootstrap-based ATWV confidence interval, and includes a study of the underpinnings of the large ATWV gains due to system combination. This analysis quantifies the potential ATWV gains from improving the number of true hits and the overall quality of the detection scores in our system's posting lists. It also shows that system combination improves our systems' ATWV via a small increase in the number of true hits in the posting lists.
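ATWV is the term-weighted value evaluated at the system's actual decision threshold. A compact sketch of the underlying TWV computation, following the usual NIST/Babel convention of one trial per second of speech (the toy counts below are illustrative):

```python
def term_weighted_value(detections, ref_counts, speech_hours, beta=999.9):
    """Compute TWV = 1 - mean_k [ Pmiss(k) + beta * Pfa(k) ].

    detections: {keyword: (n_correct, n_false_alarm)} at a given threshold
    ref_counts: {keyword: n_true_occurrences} in the reference
    speech_hours: total amount of evaluated speech, in hours
    """
    trials = speech_hours * 3600.0        # one trial per second of speech
    twv_terms = []
    for kw, n_true in ref_counts.items():
        n_corr, n_fa = detections.get(kw, (0, 0))
        p_miss = 1.0 - n_corr / n_true
        p_fa = n_fa / (trials - n_true)
        twv_terms.append(1.0 - (p_miss + beta * p_fa))
    return sum(twv_terms) / len(twv_terms)

# toy example: two keywords over 10 hours of speech
print(term_weighted_value({"babel": (3, 1), "program": (5, 0)},
                          {"babel": 4, "program": 6}, speech_hours=10.0))
```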
SDRKWS_02: TOWARDS UNSUPERVISED SEMANTIC RETRIEVAL OF SPOKEN CONTENT WITH QUERY EXPANSION BASED ON AUTOMATICALLY DISCOVERED ACOUSTIC PATTERNS
This paper presents an initial effort to retrieve semantically related spoken content in a completely unsupervised way. Unsupervised approaches to spoken content retrieval are attractive because they bypass the need for annotated data reasonably matched to the spoken content for training acoustic and language models. However, almost all such unsupervised approaches focus on spoken term detection, i.e., returning the spoken segments containing the query, using either template matching techniques such as dynamic time warping (DTW) or model-based approaches. In practice, users usually prefer to retrieve all objects semantically related to the query, not only those that include the query terms. This paper proposes a different approach. We transcribe the spoken segments in the archive to be retrieved into sequences of acoustic patterns discovered automatically in an unsupervised manner. For an input query in spoken form, the top-N spoken segments from the archive obtained in a first-pass retrieval with DTW are taken as pseudo-relevant. The acoustic patterns frequently occurring in these segments are then considered query-related and used for query expansion. Preliminary experiments performed on Mandarin broadcast news gave very encouraging results.
SDRKWS_03: SCORE NORMALIZATION AND SYSTEM COMBINATION FOR IMPROVED KEYWORD SPOTTING
We present two techniques that are shown to yield improved Keyword Spotting (KWS) performance under the ATWV/MTWV performance measures: (i) score normalization, where the scores of different keywords become commensurate with each other and correspond more closely to the probability of being correct than raw posteriors; and (ii) system combination, where the detections of multiple systems are merged and their scores are interpolated with weights optimized using MTWV as the maximization criterion. Both score normalization and system combination yield significant gains in ATWV/MTWV, sometimes on the order of 8-10 points (absolute), in five different languages. A variant of these methods produced the highest performance in the official surprise language evaluation for the IARPA-funded Babel project in April 2013.
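One widely used keyword-specific normalization, shown here only as an illustration and not necessarily the exact scheme of this paper, rescales each keyword's detection scores so that they sum to one, making scores of rare and frequent keywords more comparable under the ATWV decision rule:

```python
from collections import defaultdict

def sum_to_one_normalize(detections):
    """detections: list of (keyword, time, score) tuples with raw posteriors.
    Returns the same detections with scores normalized per keyword so that
    all scores for a given keyword sum to one."""
    totals = defaultdict(float)
    for kw, _, score in detections:
        totals[kw] += score
    return [(kw, t, score / totals[kw]) for kw, t, score in detections]

dets = [("asru", 1.2, 0.9), ("asru", 7.5, 0.3), ("olomouc", 3.1, 0.05)]
print(sum_to_one_normalize(dets))
```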
SDRKWS_04: THE IBM KEYWORD SEARCH SYSTEM FOR THE DARPA RATS PROGRAM
The paper describes a state-of-the-art keyword search (KWS) system in which significant improvements are obtained by using Convolutional Neural Network acoustic models, a two-step speech segmentation approach and a simplified ASR architecture optimized for KWS. The system described in this paper had the best performance in the 2013 DARPA RATS evaluation for both Levantine and Farsi.
NewApp_01: EMOTION RECOGNITION FROM SPONTANEOUS SPEECH USING HIDDEN MARKOV MODELS WITH DEEP BELIEF NETWORKS
Research in emotion recognition seeks to develop insights into the temporal properties of emotion. However, automatic emotion recognition from spontaneous speech is challenging due to non-ideal recording conditions and highly ambiguous ground truth labels. Further, emotion recognition systems typically work with noisy high-dimensional data, rendering it difficult to find representative features and train an effective classifier. We tackle this problem by using Deep Belief Networks, which can model complex and non-linear high-level relationships between low-level features. We propose and evaluate a suite of hybrid classifiers based on Hidden Markov Models and Deep Belief Networks. We achieve state-of-the-art results on FAU Aibo, a benchmark dataset in emotion recognition. Our work provides insights into important similarities and differences between speech and emotion.
NewApp_02: AUTOMATIC PRONUNCIATION CLUSTERING USING A WORLD ENGLISH ARCHIVE AND PRONUNCIATION STRUCTURE ANALYSIS
English is the only language available for global communication. Due to the influence of speakers' mother tongues, however, speakers from different regions inevitably have different accents in their pronunciation of English. The ultimate goal of our project is to create a global pronunciation map of World Englishes on an individual basis, which speakers can use to locate similar English pronunciations. If the speaker is a learner, he can also see how his pronunciation compares to other varieties. Creating the map mathematically requires a matrix of pronunciation distances among all the speakers considered. This paper investigates invariant pronunciation structure analysis and Support Vector Regression (SVR) to predict the inter-speaker pronunciation distances. In the experiments, the Speech Accent Archive (SAA), which contains speech data of worldwide accented English, is used for training and testing samples. IPA narrow transcriptions in the archive are used to prepare reference pronunciation distances, which are then predicted based on structural analysis and SVR without using the IPA transcriptions. The correlation between the reference distances and the predicted distances is calculated. Experimental results are very promising, and our proposed method far outperforms a baseline system developed using an HMM-based phoneme recognizer.
NewApp_03: PHONETIC AND ANTHROPOMETRIC CONDITIONING OF MSA-KST COGNITIVE IMPAIRMENT CHARACTERIZATION SYSTEM
We explore the impact of speech- and speaker-specific modeling on the Modulation Spectrum Analysis -- Kolmogorov-Smirnov feature Testing (MSA-KST) characterization method in the task of automated prediction of cognitive impairment diagnoses, namely dysphasia and pervasive developmental disorder. Phoneme-synchronous capturing of speech dynamics is a reasonable choice for a segmental speech characterization system, as it allows comparing speech dynamics in similar phonetic contexts. Speaker-specific modeling aims at reducing the "within-the-class" variability of the characterized speech or speaker population by removing the effect of speaker properties that should have no relation to the characterization. Specifically, the vocal tract length of a speaker has nothing to do with the diagnosis, and thus the feature set should be normalized accordingly. The resulting system compares favorably to the baseline system of the Interspeech 2013 Computational Paralinguistics Challenge.
NewApp_04: ASR FOR ELECTRO-LARYNGEAL SPEECH
The electro-larynx (EL) device offers the possibility of regaining speech when the larynx is removed after a total laryngectomy. Speech produced with an EL suffers from inadequate sound quality, so there is a strong need to enhance EL speech. When disordered speech is fed to Automatic Speech Recognition (ASR) systems, performance decreases significantly. ASR systems are increasingly part of daily life, and therefore the word accuracy rate on disordered speech should be reasonably high in order to make ASR technologies accessible to patients suffering from speech disorders. Moreover, ASR is a method of obtaining an objective rating of the intelligibility of disordered speech. In this paper we apply disordered speech, namely speech produced by an EL, to an ASR system designed for normal, healthy speech and evaluate its performance with different types of adaptation. Furthermore, we show that two approaches for reducing the directly radiated EL (DREL) noise from the device itself are able to increase the word accuracy rate compared to the unprocessed EL speech.
NewApp_05: AUTOMATIC SENTIMENT EXTRACTION FROM YOUTUBE VIDEOS
Extracting speaker sentiment from natural audio streams such as YouTube is a very challenging task. A number of factors contribute to the task's difficulty, namely Automatic Speech Recognition (ASR) of spontaneous speech, unknown background environments, variable source and channel characteristics, accents, diverse topics, etc. In this study, we propose a system to detect speaker sentiment in YouTube videos. The proposed method combines ASR with text-based sentiment extraction. Specifically, the text-based sentiment extraction uses Part-of-Speech (POS) tagged text and Maximum Entropy (ME) modeling to develop sentiment models. We analyze how data from different user domains and POS tag combinations affect sentiment classification. Furthermore, an iterative feature reduction technique for ME sentiment model development that does not sacrifice classification accuracy is proposed. The YouTube data is decoded using our customized KALDI-based ASR system to produce transcripts, and the ME sentiment models are applied to the decoded transcripts, thereby estimating the sentiment of each YouTube video. Our experimental evaluation shows promising results for classifying sentiment using the ME-based method. We also evaluate the effect of various NLP and ASR model parameters on accuracy, present a text-based method for simulated Word Error Rate (WER) calculation, and report DET curves for ASR performance together with ASR accuracy. A detailed analysis of how the Natural Language Processing (NLP) and ASR components affect sentiment detection, individually and in cascade, is presented.
SPFea_01: ACOUSTIC CHARACTERISTICS RELATED TO THE PERCEPTUAL PITCH IN WHISPERED VOWELS
The characteristics of whispered speech are not well known. The most remarkable difference from ordinary speech is the pitch (the perceived height of speech), since whispered speech has no fundamental frequency. In this study, we have tried to reveal the mechanism of producing pitch in whispered speech through an experiment in which a male and a female subject uttered Japanese whispered vowels while trying to tune their pitch to guidance tones at five to nine different frequencies. We applied multivariate analyses such as principal component analysis to the data in order to clarify which part of the spectrum contributes most to the change of pitch. We corroborate previous observations, i.e., that the shift of formants is dominant, with more detailed numerical evidence. In addition, we obtained some insights into the pitch mechanism of whispered speech. The main result is that two or three formants below 5 kHz are shifted upward and that the energy is increased in the high-frequency region above 5 kHz.
SPFea_02: AN SVD-BASED SCHEME FOR MFCC COMPRESSION IN DISTRIBUTED SPEECH RECOGNITION SYSTEM
This paper proposes a new scheme for low bit-rate source coding of Mel Frequency Cepstral Coefficients (MFCCs) in a Distributed Speech Recognition (DSR) system. The method factorizes the compressed ETSI Advanced Front-End (ETSI-AFE) features into SVD components. Exploiting the correlation between successive MFCC frames, the odd frames are encoded using ETSI-AFE, while for the even frames only the singular values and the index of the nearest left singular vector are encoded and transmitted. At the server side, the non-transmitted MFCCs are reconstructed from their quantized singular values and the nearest left singular vectors. The system provides a compression bit-rate of 2.7 kbps. Recognition experiments were carried out on the Aurora-2 database for clean and multi-condition training modes. The simulation results show good recognition performance, without significant degradation with respect to the ETSI-AFE encoder.
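The basic building block of such a codec, a truncated SVD of a block of consecutive MFCC frames, can be sketched as below; the actual bit allocation and the nearest-left-singular-vector indexing of the paper are not reproduced here.

```python
import numpy as np

def svd_compress(frames, rank):
    """Low-rank SVD approximation of a block of MFCC frames."""
    U, s, Vt = np.linalg.svd(frames, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank]

def svd_reconstruct(U, s, Vt):
    """Rebuild the (approximate) MFCC block from its truncated SVD."""
    return (U * s) @ Vt

block = np.random.randn(24, 13)               # 24 consecutive 13-dim MFCCs
U, s, Vt = svd_compress(block, rank=4)
approx = svd_reconstruct(U, s, Vt)
print(np.linalg.norm(block - approx) / np.linalg.norm(block))
```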
SPFea_03: MODELS OF TONE FOR TONAL AND NON-TONAL LANGUAGES
Conventional wisdom in automatic speech recognition asserts that pitch information is not helpful in building speech recognizers for non-tonal languages and contributes only modestly to performance in speech recognizers for tonal languages. To maintain consistency between different systems, pitch is therefore often ignored, trading the slight performance benefits for greater system uniformity/simplicity. In this paper, we report results that challenge this conventional approach. We present new models of tone that deliver consistent performance improvements for tonal languages (Cantonese, Vietnamese) and even modest improvements for non-tonal languages. Using neural networks for feature integration and fusion, these models achieve significant gains throughout, and provide us with system uniformity and standardization across all languages, tonal and non-tonal.
SPFea_04: A STUDY OF SUPERVISED INTRINSIC SPECTRAL ANALYSIS FOR TIMIT PHONE CLASSIFICATION
Intrinsic Spectral Analysis (ISA) has been formulated within a manifold learning setting, allowing natural extensions to out-of-sample data together with feature reduction in a learning framework. In this paper, we propose two approaches to improve the performance of supervised ISA, and we then examine the effect of applying a Linear Discriminant technique in the intrinsic subspace compared with the extrinsic one. In the interest of reducing complexity, we propose a preprocessing operation that finds a small subset of data points which is well representative of the manifold structure; this is accomplished by maximizing the quadratic Renyi entropy. Furthermore, we use class-based graphs, which not only simplify our problem but can also be helpful in a classification task. Experimental results for the phone classification task on the TIMIT dataset show that ISA features improve performance compared with traditional features, and that supervised discriminant techniques perform better in the ISA subspace than in conventional feature spaces.
NN_01: SEMI-SUPERVISED TRAINING OF DEEP NEURAL NETWORKS
In this paper we search for an optimal strategy for semi-supervised Deep Neural Network (DNN) training. We assume that a small part of the data is transcribed, while the majority is untranscribed. We explore self-training strategies with data selection based on both utterance-level and frame-level confidences. Further, we study the interactions between semi-supervised frame-discriminative training and sequence-discriminative sMBR training. We found it beneficial to reduce the disproportion in the amounts of transcribed and untranscribed data by including the transcribed data several times, as well as to perform frame selection based on per-frame confidences derived from confusion in a lattice. For the experiments, we used the limited language pack condition for the surprise language task (Vietnamese) from the IARPA Babel program. The absolute Word Error Rate (WER) improvement for frame cross-entropy training is 2.2%, which corresponds to a WER recovery of 36% when compared to the identical system where the DNN is built on the fully transcribed data.
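A minimal sketch of the two data-selection ideas mentioned above, confidence-based frame selection and repetition of the transcribed portion; the threshold, repeat count and random data are illustrative assumptions.

```python
import numpy as np

def select_confident_frames(feats, labels, confidences, threshold=0.7):
    """Keep only untranscribed frames whose per-frame confidence (e.g.
    derived from lattice posteriors) exceeds a threshold."""
    keep = confidences >= threshold
    return feats[keep], labels[keep]

def balance_transcribed(trans_feats, untrans_feats, repeat=3):
    """Reduce the transcribed/untranscribed disproportion by repeating the
    transcribed portion several times."""
    return np.concatenate([trans_feats] * repeat + [untrans_feats])

# toy usage
feats = np.random.randn(1000, 40)
labels = np.random.randint(0, 3000, size=1000)
conf = np.random.rand(1000)
sel_feats, sel_labels = select_confident_frames(feats, labels, conf)
print(sel_feats.shape,
      balance_transcribed(np.random.randn(200, 40), sel_feats).shape)
```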
NN_02: HYBRID SPEECH RECOGNITION WITH DEEP BIDIRECTIONAL LSTM
Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large vocabulary speech recognition systems. This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. We find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy. We conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation needs to be conducted to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.
NN_03: IMPROVING ROBUSTNESS OF DEEP NEURAL NETWORKS VIA SPECTRAL MASKING FOR AUTOMATIC SPEECH RECOGNITION
The performance of human listeners degrades rather slowly compared to machines in noisy environments. This has been attributed to the ability to perform auditory scene analysis, which separates the speech prior to recognition. In this work, we investigate two mask estimation approaches, namely state-dependent and deep neural network (DNN) based estimation, to separate speech from noise and thereby improve the noise robustness of DNN acoustic models. The second approach is experimentally shown to outperform the first. Due to the stereo-data-based training and ill-defined masks for speech with channel distortions, neither method generalizes well to unseen conditions, and both fail to beat the performance of the multi-style trained baseline system. However, the model trained on masked features is strongly complementary to the baseline model. A simple average of the two systems' posteriors yields word error rates of 4.4% on Aurora2 and 12.3% on Aurora4.
NN_04: HYBRID ACOUSTIC MODELS FOR DISTANT AND MULTICHANNEL LARGE VOCABULARY SPEECH RECOGNITION
We investigate the application of deep neural network (DNN) - hidden Markov model (HMM) hybrid acoustic models for far-field speech recognition of meetings recorded using microphone arrays. We show that the hybrid models achieve significantly better accuracy than conventional systems based on Gaussian mixture models (GMMs). We observe up to 8% absolute word error rate (WER) reduction from a discriminatively trained GMM baseline when using a single distant microphone, and between 4-6% absolute WER reduction when using beamforming on various combinations of array channels. By training the networks on audio from multiple channels, we find that the networks can recover a significant part of the accuracy difference between the single distant microphone and beamformed configurations. Finally, we show that the accuracy of a network recognising speech from a single distant microphone can approach that of a multi-microphone setup by training with data from other microphones.
NN_05: DEEP MAXOUT NEURAL NETWORKS FOR SPEECH RECOGNITION
A recently introduced type of neural network called maxout has worked well in many domains. In this paper, we propose to apply maxout to acoustic models for speech recognition. The maxout neuron picks the maximum value within a group of linear pieces as its activation. This nonlinearity is a generalization of the rectified linear nonlinearity and has the ability to approximate any form of activation function. We apply maxout networks to the Switchboard phone-call transcription task and evaluate performance under both a 24-hour low-resource condition and a 300-hour core condition. Experimental results demonstrate that maxout networks converge faster, generalize better and are easier to optimize than rectified linear networks and sigmoid networks. Furthermore, experiments show that maxout networks reduce underfitting and are able to achieve good results without dropout training. Under both conditions, maxout networks yield relative improvements of 1.1-5.1% over rectified linear networks and 2.6-14.5% over sigmoid networks on the benchmark test sets.
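The maxout activation itself is easy to state precisely; a minimal sketch (the group size and layer width below are illustrative):

```python
import numpy as np

def maxout(z, group_size):
    """Maxout activation: split the linear pre-activations of a layer into
    groups and keep the maximum of each group.
    z: (batch, n_units), where n_units must be divisible by group_size."""
    batch, n_units = z.shape
    return z.reshape(batch, n_units // group_size, group_size).max(axis=2)

z = np.random.randn(8, 2048)            # pre-activations of a hidden layer
h = maxout(z, group_size=4)             # 512 maxout neurons
print(h.shape)                          # (8, 512)
```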
NN_06: LEARNING FILTER BANKS WITHIN A DEEP NEURAL NETWORK FRAMEWORK
Mel-filter banks are commonly used in speech recognition, as they are motivated by theory related to speech production and perception. While features derived from mel-filter banks are quite popular, we argue that this filter bank is not really an appropriate choice, as it is not learned for the objective at hand, i.e. speech recognition. In this paper, we explore replacing the filter bank with a filter bank layer that is learned jointly with the rest of a deep neural network. Thus, the filter bank is learned to minimize cross-entropy, which is more closely tied to the speech recognition objective. On a 50-hour English Broadcast News task, we show that we can achieve a 5% relative improvement in word error rate (WER) using the filter bank learning approach, compared to having a fixed set of filters.
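A rough sketch of the forward pass of such a learnable filter bank layer; the exponentiation used here to keep the filters non-negative and the toy initialization are assumptions for illustration, and training would backpropagate the cross-entropy loss through this layer like any other.

```python
import numpy as np

def filterbank_layer_forward(power_spec, weights):
    """Forward pass of a learnable filter bank layer: non-negative filters
    (kept positive here via exp) applied to the power spectrum, then log.
    power_spec: (batch, n_fft_bins); weights: (n_fft_bins, n_filters),
    which could be initialized from mel filter shapes and then trained
    jointly with the rest of the network."""
    fbank = power_spec @ np.exp(weights)          # positivity constraint
    return np.log(fbank + 1e-10)

spec = np.abs(np.random.randn(16, 257)) ** 2      # toy power spectra
w = np.log(np.random.rand(257, 40) + 1e-3)        # toy initial filters
print(filterbank_layer_forward(spec, w).shape)    # (16, 40)
```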
NN_07: ACCELERATING HESSIAN-FREE OPTIMIZATION FOR DEEP NEURAL NETWORKS BY IMPLICIT PRECONDITIONING AND SAMPLING
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. First, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the use of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data used for gradient and Krylov subspace iteration calculations. On a 50-hour English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas on a 300-hour Switchboard task they provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-ups can be expected as problem scale and complexity grow.
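The geometric sampling idea can be sketched as a simple schedule of growing sample sizes; the starting fraction and growth factor below are illustrative, not the paper's values.

```python
def geometric_sample_sizes(n_total, first_fraction=0.01, growth=1.5):
    """Sampling schedule that geometrically grows the amount of data used
    for gradient / Krylov-subspace iterations until all data is covered.
    Returns the list of sample sizes."""
    sizes, size = [], max(1, int(first_fraction * n_total))
    while size < n_total:
        sizes.append(size)
        size = int(size * growth)
    sizes.append(n_total)
    return sizes

print(geometric_sample_sizes(1_000_000))
```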
NN_08: IMPROVEMENTS TO DEEP CONVOLUTIONAL NEURAL NETWORKS FOR LVCSR
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) of 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
NN_09: ELASTIC SPECTRAL DISTORTION FOR LOW RESOURCE SPEECH RECOGNITION WITH DEEP NEURAL NETWORKS
An acoustic model based on hidden Markov models with deep neural networks (DNN-HMM) has recently been proposed and achieved high recognition accuracy. In this paper, we investigated an elastic spectral distortion method to artificially augment training samples to help DNN-HMMs acquire enough robustness even when there are a limited number of training samples. We investigated three distortion methods―vocal tract length distortion, speech rate distortion, and frequency-axis random distortion―and evaluated those methods with Japanese lecture recordings. In a large vocabulary continuous speech recognition task with only 10 hours of training samples, a DNN-HMM trained with the elastic spectral distortion method achieved a 10.1% relative word error reduction compared with a normally trained DNN-HMM.
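A toy sketch of a frequency-axis random distortion, in the spirit of the augmentation described above, is shown below; the anchor-point warping, the maximum shift, and the interpolation scheme are assumptions made for illustration rather than the paper's exact recipe.

```python
import numpy as np

def random_frequency_warp(frame, n_anchors=4, max_shift=2.0, rng=None):
    """Apply a smooth, random displacement along the frequency axis of one frame."""
    rng = rng or np.random.default_rng()
    n = len(frame)
    bins = np.arange(n, dtype=float)
    anchors = np.linspace(0, n - 1, n_anchors)
    offsets = rng.uniform(-max_shift, max_shift, size=n_anchors)
    displacement = np.interp(bins, anchors, offsets)   # bin-dependent shift
    warped = np.clip(bins + displacement, 0, n - 1)
    return np.interp(warped, bins, frame)

frame = np.random.default_rng(0).random(40)   # e.g. one spectral frame
augmented = random_frequency_warp(frame)
```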
NN_10: COMBINING STOCHASTIC AVERAGE GRADIENT AND HESSIAN-FREE OPTIMIZATION FOR SEQUENCE TRAINING OF DEEP NEURAL NETWORKS
Minimum phone error (MPE) training of deep neural networks (DNN) is an effective technique for reducing the word error rate of automatic speech recognition tasks. This training is often carried out using a Hessian-free (HF) quasi-Newton approach, although other methods such as stochastic gradient descent have also been applied successfully. In this paper we present a novel stochastic approach to HF sequence training inspired by the recently proposed stochastic average gradient (SAG) method. SAG reuses gradient information from past updates, and consequently simulates the presence of more training data than is really observed for each model update. We extend SAG by dynamically weighting the contribution of previous gradients, and by combining it with a stochastic HF optimization. We term the resulting procedure DSAG-HF. Experimental results for training DNNs on 1500h of audio data show that compared to baseline HF training, DSAG-HF leads to better held-out MPE loss after each model parameter update, and converges to an overall better loss value. Furthermore, since each update in DSAG-HF takes place over a smaller amount of data, this procedure converges in about half the time of baseline HF sequence training.
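The following toy update illustrates the core SAG intuition referred to above: past mini-batch gradients are reused (here via a decayed running sum) so that each parameter update reflects more data than the current batch alone. This is a simplified sketch under assumed names and a simple exponential-decay weighting, not the DSAG-HF procedure itself.

```python
import numpy as np

def decayed_sag_update(theta, grad, state, lr=0.01, decay=0.9):
    """Keep a decayed sum of past mini-batch gradients and step along their average."""
    state["g_sum"] = decay * state.get("g_sum", np.zeros_like(grad)) + grad
    state["w_sum"] = decay * state.get("w_sum", 0.0) + 1.0
    return theta - lr * state["g_sum"] / state["w_sum"]

# usage sketch: state = {}; theta = decayed_sag_update(theta, batch_gradient, state)
```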
NN_11: ACCELERATING RECURRENT NEURAL NETWORK TRAINING VIA TWO STAGE CLASSES AND PARALLELIZATION
Recurrent neural network (RNN) language models have proven successful in lowering the perplexity and word error rate in automatic speech recognition (ASR). However, one challenge in adopting RNN language models is their heavy computational cost in training. In this paper, we propose two techniques to accelerate RNN training: 1) two stage class RNNs and 2) parallel RNN training. In experiments on a Microsoft internal short message dictation (SMD) data set, two stage class RNNs and parallel RNNs not only result in equal or lower WERs compared to original RNNs but also accelerate training by 2 and 10 times, respectively. It is worth noting that the two stage class RNN speedup can also be applied at test time, which is essential for reducing latency in real-time ASR applications.
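To make the class-based speedup concrete, the sketch below factors the output probability of a word into P(class | h) * P(word | class, h), which is the single-stage version of the idea; the two-stage variant described above adds a further level of classes on top. Every name, shape, and vocabulary entry here is an illustrative assumption.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def word_prob(hidden, word, word2class, class_W, word_W):
    """P(word | hidden) = P(class | hidden) * P(word | class, hidden)."""
    c = word2class[word]
    p_class = softmax(class_W @ hidden)[c]
    members = [w for w, cls in word2class.items() if cls == c]
    scores = np.array([word_W[w] @ hidden for w in members])
    return p_class * softmax(scores)[members.index(word)]

rng = np.random.default_rng(0)
hidden = rng.standard_normal(16)
word2class = {"play": 0, "stop": 0, "music": 1, "video": 1}
class_W = rng.standard_normal((2, 16))                      # one row per class
word_W = {w: rng.standard_normal(16) for w in word2class}   # one vector per word
print(word_prob(hidden, "music", word2class, class_W, word_W))
```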
NN_12: IMPACT OF DEEP MLP ARCHITECTURE ON DIFFERENT ACOUSTIC MODELING TECHNIQUES FOR UNDER-RESOURCED SPEECH RECOGNITION
Posterior based acoustic modeling techniques such as Kullback-Leibler divergence based HMM (KL-HMM) and Tandem are able to exploit out-of-language data through posterior features, estimated by a Multi-Layer Perceptron (MLP). In this paper, we investigate the performance of posterior based approaches in the context of under-resourced speech recognition when a standard three-layer MLP is replaced by a deeper five-layer MLP. The deeper MLP architecture yields similar gains of about 15% (relative) for Tandem, KL-HMM as well as for a hybrid HMM/MLP system that directly uses the posterior estimates as emission probabilities. The best performing system, a bilingual KL-HMM based on a deep MLP, jointly trained on Afrikaans and Dutch data, performs 13% better than a hybrid system using the same bilingual MLP and 26% better than a subspace Gaussian mixture system only trained on Afrikaans data.
NN_13: CONTEXT-DEPENDENT MODELLING OF DEEP NEURAL NETWORK USING LOGISTIC REGRESSION
The data sparsity problem of context-dependent acoustic modelling in automatic speech recognition is addressed by using the decision tree state clusters as the training targets in the standard context-dependent (CD) deep neural network (DNN) systems. As a result, the CD states within a cluster cannot be distinguished during decoding. This problem, referred to as the clustering problem, is not explicitly addressed in the current literature. In this paper, we formulate the CD DNN as an instance of the canonical state modelling technique based on a set of broad phone classes to address both the data sparsity and the clustering problems. The triphone is clustered into multiple sets of shorter biphones using broad phone contexts to address the data sparsity issue. A DNN is trained to discriminate the biphones within each set. The canonical states are represented by the concatenated log posteriors of all the broad phone DNNs. Logistic regression is used to transform the canonical states into the triphone state output probability. Clustering of the regression parameters is used to reduce model complexity while still achieving unique acoustic scores for all possible triphones. The experimental results on a broadcast news transcription task reveal that the proposed regression-based CD DNN significantly outperforms the standard CD DNN. The best system provides a 2.7% absolute WER reduction compared to the best standard CD DNN system.
NN_14: DNN ACOUSTIC MODELING WITH MODULAR MULTI-LINGUAL FEATURE EXTRACTION NETWORKS
In this work, we propose several deep neural network architectures that are able to leverage data from multiple languages. Modularity is achieved by training networks for extracting high-level features and for estimating phoneme state posteriors separately, and then combining them for decoding in a hybrid DNN/HMM setup. This approach has been shown to achieve superior performance for single-language systems, and here we demonstrate that feature extractors benefit significantly from being trained as multi-lingual networks with shared hidden representations. We also show that existing mono-lingual networks can be re-used in a modular fashion to achieve a similar level of performance without having to train new networks on multi-lingual data. Furthermore, we investigate extending these architectures to make use of language-specific acoustic features. Evaluations are performed on a low-resource conversational telephone speech transcription task in Vietnamese, while additional data for acoustic model training is provided in Pashto, Tagalog, Turkish, and Cantonese. Improvements of up to 17.4% and 13.8% over mono-lingual GMMs and DNNs, respectively, are obtained.
NN_15: DISCRIMINATIVE PIECEWISE LINEAR TRANSFORMATION BASED ON DEEP LEARNING FOR NOISE ROBUST AUTOMATIC SPEECH RECOGNITION
In this paper, we propose the use of deep neural networks to expand conventional methods of statistical feature enhancement based on piecewise linear transformation. Stereo-based piecewise linear compensation for environments (SPLICE), which is a powerful statistical approach for feature enhancement, models the probabilistic distribution of input noisy features as a mixture of Gaussians. However, the soft assignment of an input vector to the divided regions is sometimes inadequate, so that the vector undergoes an inappropriate conversion. Especially when the conversion has to be linear, performance is easily degraded. Feature enhancement using neural networks is another powerful approach which can directly model a non-linear relationship between noisy and clean feature spaces. In this case, however, it tends to suffer from over-fitting problems. In this paper, we attempt to mitigate this problem by reducing the number of model parameters to estimate. Our neural network is trained so that its output layer is associated with the states in the clean feature space, not in the noisy feature space. This strategy makes the size of the output layer independent of the given noisy environment. First, we characterize the distribution of clean features as a Gaussian mixture model; then, using a deep neural network, we discriminatively estimate the state in the clean space to which an input noisy feature corresponds. Experimental evaluations using the Aurora 2 dataset demonstrate that our proposed method outperforms conventional methods.
NN_16: PORTING CONCEPTS FROM DNNS BACK TO GMMS
Deep neural networks (DNNs) have been shown to outperform Gaussian Mixture Models (GMMs) on a variety of speech recognition benchmarks. In this paper we analyze the differences between the DNN and GMM modeling techniques and port the best ideas from DNN-based modeling to a GMM-based system. By going both deep (multiple layers) and wide (multiple parallel sub-models) and by sharing model parameters, we are able to close the gap between the two modeling techniques on the TIMIT database. Since the 'deep' GMMs retain the maximum-likelihood trained Gaussians as the first layer, advanced techniques such as speaker adaptation and model-based noise robustness can be readily incorporated. Despite their similarities, the DNNs and the deep GMMs still show a sufficient amount of complementarity to allow effective system combination.
NN_17: LARGE SCALE DEEP NEURAL NETWORK ACOUSTIC MODELING WITH SEMI-SUPERVISED TRAINING DATA FOR YOUTUBE VIDEO TRANSCRIPTION
YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Improving accessibility to these videos for the hard of hearing and for search and indexing purposes is an excellent application of automatic speech recognition. However, YouTube videos are extremely challenging for automatic speech recognition systems. Standard adapted Gaussian Mixture Model (GMM) based acoustic models can have word error rates above 50%, making this one of the most difficult reported tasks. Since 2009 YouTube has provided automatic generation of closed captions for videos detected to have English speech; the service now exists for Japanese and Korean too. This paper describes recent improvements to the original system, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data and deep neural network acoustic models with large state inventories. Applying an “island of confidence” filtering heuristic to select useful training segments, and increasing the model size by using 44,526 context dependent states with a low-rank final layer weight matrix approximation, improved performance by about 13% relative compared to previously reported sequence trained DNN results for this task.
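The low-rank final-layer approximation mentioned above amounts to replacing the large hidden-to-state weight matrix by a product of two thin matrices, as in the sketch below; the 44,526-state output size is taken from the abstract, while the hidden dimension and rank are assumed values chosen for illustration.

```python
import numpy as np

hidden_dim, n_states, rank = 2048, 44526, 256
rng = np.random.default_rng(0)
A = rng.standard_normal((hidden_dim, rank)) * 0.01   # hidden -> bottleneck
B = rng.standard_normal((rank, n_states)) * 0.01     # bottleneck -> output states

h = rng.standard_normal(hidden_dim)
logits = (h @ A) @ B   # same output shape as a full layer, far fewer parameters

full_params = hidden_dim * n_states
lowrank_params = rank * (hidden_dim + n_states)
print(full_params, lowrank_params)   # parameter-count comparison
```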
NN_18: HIERARCHICAL NEURAL NETWORKS AND ENHANCED CLASS POSTERIORS FOR SOCIAL SIGNAL CLASSIFICATION
With the impressive advances of deep learning in recent years the interest in neural networks has resurged in the fields of automatic speech recognition and emotion recognition. In this paper we apply neural networks to address speaker-independent detection and classification of laughter and filler vocalizations in speech. We first explore modeling class posteriors with standard neural networks and deep stacked autoencoders. Then, we adopt a hierarchical neural architecture to compute enhanced class posteriors and demonstrate that this approach introduces significant and consistent improvements on the Social Signals Sub-Challenge of the Interspeech 2013 Computational Paralinguistics Challenge (ComParE). On this task we achieve a value of 92.4% of the unweighted average area-under-the-curve, which is the official competition measure, on the test set. This constitutes an improvement of 9.1% over the baseline and is the best result obtained so far on this task.
LowZero_01: ACOUSTIC DATA-DRIVEN PRONUNCIATION LEXICON FOR LARGE VOCABULARY SPEECH RECOGNITION
Speech recognition systems normally use handcrafted pronunciation lexicons designed by linguistic experts. Building and maintaining such a lexicon is expensive and time consuming. This paper concerns automatically learning a pronunciation lexicon for speech recognition. We assume the availability of a small seed lexicon and then learn the pronunciations of new words directly from speech that is transcribed at word-level. We present two implementations for refining the putative pronunciations of new words based on acoustic evidence. The first one is an expectation maximization (EM) algorithm based on weighted finite state transducers (WFSTs) and the other is its Viterbi approximation. We carried out experiments on the Switchboard corpus of conversational telephone speech. The expert lexicon has a size of more than 30,000 words, from which we randomly selected 5,000 words to form the seed lexicon. By using the proposed lexicon learning method, we have significantly improved the accuracy compared with a lexicon learned using a grapheme-to-phoneme transformation, and have obtained a word error rate that approaches that achieved using a fully handcrafted lexicon.
LowZero_02: ACOUSTIC UNIT DISCOVERY AND PRONUNCIATION GENERATION FROM A GRAPHEME-BASED LEXICON
We present a framework for discovering acoustic units and generating an associated pronunciation lexicon from an initial grapheme-based recognition system. Our approach consists of two distinct contributions. First, context-dependent grapheme models are clustered using a spectral clustering approach to create a set of phone-like acoustic units. Next, we transform the pronunciation lexicon using a statistical machine translation-based approach. Pronunciation hypotheses generated from a decoding of the training set are used to create a phrase-based translation table. We propose a novel method for scoring the phrase-based rules that significantly improves the output of the transformation process. Results on an English language dataset demonstrate the combined methods provide a 13% relative reduction in word error rate compared to a baseline grapheme-based system. Our approach could potentially be applied to low-resource languages without existing lexicons, such as in the Babel project.
LowZero_03: A HIERARCHICAL SYSTEM FOR WORD DISCOVERY EXPLOITING DTW-BASED INITIALIZATION
Discovering the linguistic structure of a language solely from spoken input asks for two steps: phonetic and lexical discovery. The first is concerned with identifying the categorical subword unit inventory and relating it to the underlying acoustics, while the second aims at discovering words as repeated patterns of subword units. The hierarchical approach presented here accounts for classification errors in the first stage by modelling the pronunciation of a word in terms of subword units probabilistically: a hidden Markov model with discrete emission probabilities, emitting the observed subword unit sequences. We describe how the system can be learned in a completely unsupervised fashion from spoken input. To improve the initialization of the training of the word pronunciations, the output of a dynamic time warping based acoustic pattern discovery system is used, as it is able to discover similar temporal sequences in the input data. This improved initialization, using only weak supervision, has led to a 40% reduction in word error rate on a digit recognition task.
LowZero_04: NMF-BASED KEYWORD LEARNING FROM SCARCE DATA
This research is situated in a project aimed at the development of a vocal user interface (VUI) that learns to understand its users, specifically persons with a speech impairment. The vocal interface adapts to the speech of the user by learning the vocabulary from interaction examples. Word learning is implemented through weakly supervised non-negative matrix factorization (NMF). The goal of this study is to investigate how we can improve word learning when the number of interaction examples is low. We demonstrate two approaches to train NMF models on scarce data: 1) training word models using smoothed training data, and 2) training word models that strictly correspond to the grounding information derived from a few interaction examples. We found that both approaches can substantially improve word learning from scarce training data.
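As a rough picture of weakly supervised NMF word learning, the sketch below stacks a keyword-presence ("grounding") block on top of acoustic co-occurrence features and factorizes the result, so that components become associated with keywords. The synthetic data, the dimensions, and the use of scikit-learn's NMF are assumptions made for illustration, not the VUI system described in the paper.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_keywords, n_acoustic, n_utts, n_components = 5, 200, 30, 8

grounding = rng.integers(0, 2, size=(n_keywords, n_utts)).astype(float)  # keyword presence per utterance
acoustics = rng.random((n_acoustic, n_utts))                             # e.g. acoustic co-occurrence counts
V = np.vstack([grounding, acoustics])                                    # stacked non-negative data matrix

model = NMF(n_components=n_components, init="nndsvda", max_iter=500)
W = model.fit_transform(V)       # (n_keywords + n_acoustic, n_components)
H = model.components_            # per-utterance activations of the components
word_models = W[:n_keywords]     # grounding rows link components to keywords
```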
LowZero_05: DEEP MAXOUT NETWORKS FOR LOW-RESOURCE SPEECH RECOGNITION
As a feed-forward architecture, the recently proposed maxout networks integrate dropout naturally and show state-of-the-art results on various computer vision datasets. This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition (LVCSR) tasks. Our focus is on the particular advantage of DMNs under low-resource conditions with limited transcribed speech. We extend DMNs to hybrid and bottleneck feature systems, and explore optimal network structures (number of maxout layers, pooling strategy, etc) for both setups. On the newly released Babel corpus, behaviors of DMNs are extensively studied under different levels of data availability. Experiments show that DMNs improve low-resource speech recognition significantly. Moreover, DMNs introduce sparsity to their hidden activations and thus can act as sparse feature extractors.
LowZero_06: COMBINATION OF DATA BORROWING STRATEGIES FOR LOW-RESOURCE LVCSR
Large vocabulary continuous speech recognition (LVCSR) is particularly difficult for low-resource languages, where only very limited manually transcribed data are available. However, it is often feasible to obtain large amounts of untranscribed data in the low-resource target language or sufficient transcribed data in some non-target languages. Borrowing data from these additional sources to help LVCSR for low-resource languages has become an important research direction. This paper presents an integrated data borrowing framework for this scenario. Three data borrowing approaches were first investigated in detail, operating at the feature, model, and data corpus levels. They borrow data at different levels from additional sources, and all yield substantial performance improvements. As these strategies work independently, the obtained gains are likely additive. The three strategies are then combined to form an integrated data borrowing framework. Experiments showed that with the integrated data borrowing framework, a significant improvement of more than 10% absolute WER reduction over a conventional baseline was obtained. In particular, the gain under the extremely limited low-resource scenario is 16%.
LowZero_07: FIXED-DIMENSIONAL ACOUSTIC EMBEDDINGS OF VARIABLE-LENGTH SEGMENTS IN LOW-RESOURCE SETTINGS
Measures of acoustic similarity between words or other units are critical for segmental exemplar-based acoustic models, spoken term discovery, and query-by-example search. Dynamic time warping (DTW) alignment cost has been the most commonly used measure, but it has well-known inadequacies. Some recently proposed alternatives require large amounts of training data. In the interest of finding more efficient, accurate, and low-resource alternatives, we consider the problem of embedding speech segments of arbitrary length into fixed-dimensional spaces in which simple distances (such as cosine or Euclidean) serve as a proxy for linguistically meaningful (phonetic, lexical, etc.) dissimilarities. Such embeddings would enable efficient audio indexing and permit application of standard distance learning techniques to segmental acoustic modeling. In this paper, we explore several supervised and unsupervised approaches to this problem and evaluate them on an acoustic word discrimination task. We identify several embedding algorithms that match or improve upon the DTW baseline in low-resource settings.
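A deliberately simple baseline sketch of the idea above is shown below: a variable-length feature segment is mapped to a fixed-dimensional vector (here by resampling to a fixed number of frames and flattening), and segments are then compared with cosine distance instead of DTW. This is an assumed illustrative embedding, not one of the supervised or unsupervised embeddings evaluated in the paper.

```python
import numpy as np

def embed(segment, n_frames=10):
    """segment: (T, d) array of acoustic features with variable T -> fixed-length vector."""
    T, d = segment.shape
    idx = np.linspace(0, T - 1, n_frames)
    resampled = np.stack(
        [np.interp(idx, np.arange(T), segment[:, j]) for j in range(d)], axis=1
    )
    return resampled.ravel()   # fixed dimension: n_frames * d

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

rng = np.random.default_rng(0)
dist = cosine_distance(embed(rng.random((37, 13))), embed(rng.random((52, 13))))
```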
LowZero_08: USING PROXIES FOR OOV KEYWORDS IN THE KEYWORD SEARCH TASK
We propose a simple but effective weighted finite state transducer (WFST) based framework for handling out-of-vocabulary (OOV) keywords in a speech search task. State-of-the-art large vocabulary continuous speech recognition (LVCSR) and keyword search (KWS) systems are developed for conversational telephone speech in Tagalog. Word-based and phone-based indexes are created from word lattices, the latter by using the LVCSR system's pronunciation lexicon. Pronunciations of OOV keywords are hypothesized via a standard grapheme-to-phoneme method. In-vocabulary proxies (word or phone sequences) are generated for each OOV keyword using WFST techniques that permit incorporation of a phone confusion matrix. Empirical results when searching for the Babel/NIST evaluation keywords in the Babel 10 hour development-test speech collection show that (i) searching for word proxies in the word index significantly outperforms searching for phonetic representations of OOV words in a phone index, and (ii) while phone confusion information yields minor improvement when searching a phone index, it yields up to 40% improvement in actual term weighted value when searching a word index with word proxies.
LowZero_09: SEARCH RESULTS BASED N-BEST HYPOTHESIS RESCORING WITH MAXIMUM ENTROPY CLASSIFICATION
We propose a simple yet effective method for improving speech recognition by reranking speech N-best hypotheses with search results. We model N-best reranking as a binary classification problem and select the hypothesis with the highest classification confidence. We use query-specific features extracted from search results to encode domain knowledge and use them with a maximum entropy classifier to rescore the N-best list. We show that by rescoring even only the top 2 hypotheses, we can obtain a significant 3% absolute sentence accuracy (SACC) improvement over a strong baseline on production traffic from an entertainment domain.
LowZero_10: USING WEB TEXT TO IMPROVE KEYWORD SPOTTING IN SPEECH
For low resource languages, collecting sufficient training data to build acoustic and language models is time consuming and often expensive. But large amounts of text data, such as online newspapers, web forums or online encyclopedias, usually exist for languages that have a large population of native speakers. This text data can be easily collected from the web and then used to both expand the recognizer's vocabulary and improve the language model. One challenge, however, is normalizing and filtering the web data for a specific task. In this paper, we investigate the use of online text resources to improve the performance of speech recognition specifically for the task of keyword spotting. For the five languages provided in the base period of the IARPA BABEL project, we automatically collected text data from the web using only LimitedLP resources. We then compared two methods for filtering the web data, one based on perplexity ranking and the other based on out-of-vocabulary (OOV) word detection. By integrating the web text into our systems, we observed significant improvements in keyword spotting accuracy for four out of the five languages. The best approach obtained an improvement in actual term weighted value (ATWV) of 0.0424 compared to a baseline system trained only on LimitedLP resources. On average, ATWV was improved by 0.0243 across five languages.
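A rough sketch of the perplexity-ranking filtering method compared above is given below: each candidate web sentence is scored with an in-domain language model and only the lowest-perplexity fraction is kept. The toy add-one unigram model and the fixed keep-fraction are assumptions for illustration; a real system would use a stronger in-domain model.

```python
import math
from collections import Counter

def unigram_lm(corpus_tokens, vocab_size):
    """Build a toy add-one-smoothed unigram LM from in-domain text."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    return lambda w: (counts[w] + 1) / (total + vocab_size)

def perplexity(sentence, lm):
    logp = sum(math.log(lm(w)) for w in sentence)
    return math.exp(-logp / max(len(sentence), 1))

def filter_web_text(web_sentences, lm, keep_fraction=0.5):
    """Keep the fraction of web sentences best matched by the in-domain LM."""
    ranked = sorted(web_sentences, key=lambda s: perplexity(s, lm))
    return ranked[: int(len(ranked) * keep_fraction)]

# usage sketch: lm = unigram_lm(in_domain_tokens, vocab_size=50_000)
#               kept = filter_web_text(web_sentences, lm)
```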
LowZero_11: MULTI-STREAM TEMPORALLY VARYING WEIGHT REGRESSION FOR CROSS-LINGUAL SPEECH RECOGNITION
Building a good Automatic Speech Recognition (ASR) system with limited resources is a very challenging task due to the many existing speech variations. Multilingual and cross-lingual speech recognition techniques are commonly used for this task. This paper investigates the recently proposed Temporally Varying Weight Regression (TVWR) method for cross-lingual speech recognition. TVWR uses posterior features to implicitly model the long-term temporal structures in acoustic patterns. By leveraging well-trained foreign recognizers, high quality monophone/state posteriors can be easily incorporated into TVWR to boost the ASR performance on low-resource languages. Furthermore, multi-stream TVWR is proposed, where multiple sets of posterior features are used to incorporate richer (temporal and spatial) context information. Finally, a separate state-tying for the TVWR regression parameters is used to better utilize the more reliable posterior features. Experimental results are reported for English and Malay speech recognition with limited resources. By using the Czech, Hungarian and Russian posterior features, TVWR was found to consistently outperform the tandem systems trained on the same features.
LowZero_12: DISCRIMINATIVE SEMI-SUPERVISED TRAINING FOR KEYWORD SEARCH IN LOW RESOURCE LANGUAGES
In this paper, we investigate semi-supervised training for low resource languages where the initial systems may have a high error rate (>= 70.0% word error rate). To handle the lack of data, we study semi-supervised techniques including data selection, data weighting, discriminative training and multi-layer perceptron learning to improve system performance. The entire suite of semi-supervised methods presented in this paper was evaluated under the IARPA Babel program for the keyword spotting tasks. Our semi-supervised system had the best performance in the OpenKWS13 surprise language evaluation for the limited condition.
LowZero_13: PROBABILISTIC LEXICAL MODELING AND UNSUPERVISED TRAINING FOR ZERO-RESOURCED ASR
Standard automatic speech recognition (ASR) systems rely on transcribed speech, language models, and pronunciation dictionaries to achieve state-of-the-art performance. The unavailability of these resources prevents ASR technology from being deployed for many languages. In this paper, we propose a novel zero-resourced ASR approach to train acoustic models that uses only a list of probable words from the language of interest. The proposed approach is based on Kullback-Leibler divergence based hidden Markov models (KL-HMM), grapheme subword units, knowledge of grapheme-to-phoneme mapping, and graphemic constraints derived from the word list. The approach also exploits existing acoustic and lexical resources available in other resource-rich languages. Furthermore, we propose unsupervised adaptation of the KL-HMM acoustic model parameters if untranscribed speech data in the target language is available. We demonstrate the potential of the proposed approach through a simulated study on the Greek language.
LowZero_14: LIGHTLY SUPERVISED AUTOMATIC SUBTITLING OF WEATHER FORECASTS
Since subtitling television content is a costly process, there are large potential advantages to automating it using automatic speech recognition (ASR). However, training the necessary acoustic models can be a challenge, since the available training data usually lacks verbatim orthographic transcriptions. If approximate transcriptions are available, this problem can be overcome using light supervision methods. In this paper, we perform speech recognition on broadcasts of Weatherview, the BBC's daily weather report, as a first step towards automatic subtitling. For training, we use a large set of past broadcasts, using their manually created subtitles as approximate transcriptions. We discuss and compare two different light supervision methods, applying them to this data. The best training set finally obtained with these methods is used to create a hybrid deep neural network-based recognition system, which yields high recognition accuracies on three separate Weatherview evaluation sets.
LowZero_15: UNSUPERVISED WORD SEGMENTATION FROM NOISY INPUT
In this paper we present an algorithm for the unsupervised segmentation of a character or phoneme lattice into words. Using a lattice at the input rather than a single string accounts for the uncertainty of the character/phoneme recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model is known. Recently a Weighted Finite State Transducer (WFST) based approach has been published which we show to suffer from an issue: language model probabilities of known words are computed incorrectly. Fixing this issue leads to greatly improved precision and recall rates, however at the cost of increased computational complexity. It is therefore practical only for single input strings. To allow for a lattice input and thus for errors in the character/phoneme recognizer, we propose a computationally efficient suboptimal two-stage approach, which is shown to significantly improve the word segmentation performance compared to the earlier WFST approach.
LowZero_16: AN EMPIRICAL STUDY OF CONFUSION MODELING IN KEYWORD SEARCH FOR LOW RESOURCE LANGUAGES
Keyword search in the context of low resource languages has emerged as a key area of research. The dominant approach in spoken term detection is to use automatic speech recognition (ASR) as a front end to produce a representation of audio that can be indexed. In this paper we present the gist of our experience in dealing with the biggest drawback of this approach: handling out-of-vocabulary words and terms which the ASR system did not output. We present results across four languages using a range of confusion models as query expansion techniques, which lead to significant improvements in spoken term detection performance as measured by the MTWV metric.
LowZero_17: SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NEURAL NETWORK FEATURE EXTRACTOR TRAINING
This paper presents a bootstrapping approach for neural network training. The neural networks serve as bottle-neck feature extractors for a subsequent GMM-HMM recognizer. The recognizer is also used for transcription and confidence assignment of untranscribed data. Based on the confidence, segments are selected, mixed with the supervised data, and new NNs are trained. With this approach, it is possible to recover 40-55% of the difference between partially and fully transcribed data (a 3 to 5% absolute improvement over NNs trained on supervised data only). Using the 70-85% of automatically transcribed segments with the highest confidence was found optimal to achieve this result.
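The confidence-based selection step can be pictured with the short sketch below, which keeps only the highest-confidence automatically transcribed segments before they are mixed with the supervised data for retraining; the tuple layout and the fixed keep-fraction are illustrative assumptions, not the paper's exact procedure.

```python
def select_segments(decoded, keep_fraction=0.8):
    """decoded: list of (segment_id, hypothesis, confidence) tuples from the seed recognizer."""
    ranked = sorted(decoded, key=lambda x: x[2], reverse=True)
    kept = ranked[: int(len(ranked) * keep_fraction)]
    # These (segment, hypothesis) pairs would be merged with the supervised data
    # before training the next bottle-neck feature extractor.
    return [(seg, hyp) for seg, hyp, _ in kept]
```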