Automatic Speech Recognition and Understanding Workshop
December 8-12, 2013 | Olomouc, Czech Republic
The performance of speech and language processing technologies has improved dramatically over the past decade, with an increasing number of systems being deployed in a large variety of applications, such as spoken dialog systems, speech translation, and information retrieval systems. Most efforts to date have focused on a very small number of languages spoken by a large number of speakers in countries of great economic potential, with populations that have immediate information technology needs. However, speech technology has much to contribute to languages that do not fall into this category. Firstly, languages with a small number of speakers and few linguistic resources may suddenly become of interest for humanitarian, economic, or regional-conflict reasons. Secondly, a large number of languages are in danger of becoming extinct, and ongoing projects for preserving them could benefit from speech and language technologies.
With close to 7,000 languages in the world and the need to support multiple input and output languages, the most important challenge today is to port speech processing systems to new languages rapidly and at reasonable cost. Major bottlenecks are the sparseness of speech and text data, the lack of language conventions, and the gap between technology and language expertise. Large-scale data resources are currently available for only a minority of languages, and the costs of data collection are prohibitive for all but the most widely spoken and economically viable languages. Data sparseness is thus one of the most pressing challenges, and it is further exacerbated by the fact that today's speech technologies rely heavily on statistical modeling schemes, such as Hidden Markov Models and N-gram language models. While statistical modeling algorithms are mostly language independent and have proved to work well for a variety of languages, building a well-performing speech recognition system is far from language independent. The building process must take into account language peculiarities, such as sound systems, phonotactics, word segmentation, and morphology. The lack of language conventions concerns a surprisingly large number of languages and dialects. The lack of a standardized writing system, for example, affects the majority of languages and hinders the web harvesting of large text corpora as well as the construction of vocabularies and dictionaries. Last but not least, despite the well-defined process of system building, handling language-specific peculiarities is not only cost- and time-consuming but also requires substantial language expertise. Unfortunately, it is often difficult to find system developers who simultaneously have the required technical background and significant insight into the language in question. Consequently, one of the central issues in developing speech processing systems for many languages is the challenge of bridging the gap between language and technology expertise.
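To make the data-sparseness point concrete, the following minimal sketch (illustrative only, not tied to any particular system) estimates a bigram language model with add-k smoothing from a toy corpus; any word pair unseen in training receives probability mass from smoothing alone, which is exactly the regime most of the world's languages are in when only small text collections exist.

    # Illustrative sketch: a bigram language model with add-k smoothing.
    # With sparse training text, most bigrams are unseen and receive only
    # the smoothed probability mass.
    from collections import Counter

    def train_bigram_lm(sentences, k=0.1):
        unigrams, bigrams = Counter(), Counter()
        vocab = set(["</s>"])
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            vocab.update(tokens[1:])
            unigrams.update(tokens[:-1])          # history counts
            bigrams.update(zip(tokens[:-1], tokens[1:]))

        def prob(prev, word):
            return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * len(vocab))

        return prob

    # Toy corpus standing in for the small text collections available
    # for most under-resourced languages.
    lm = train_bigram_lm(["the cat sat", "the dog sat"])
    print(lm("the", "cat"))   # seen bigram: relatively well estimated
    print(lm("dog", "cat"))   # unseen bigram: estimate rests on smoothing alone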
In my talk I will present ongoing work at the Cognitive Systems Lab on rapidly building speech recognition systems for as yet unsupported languages when only little speech data and few or no transcripts, texts, or linguistic resources are available. This includes the sharing of data and models across languages, as well as the rapid adaptation of language-independent models to as yet unsupported languages. Techniques and tools will be described that lower the overall cost of system development by automating the system building process, leveraging crowdsourcing, and reducing data needs without significant losses in performance. Finally, I will present the web-based Rapid Language Adaptation Toolkit (RLAT), an online service (http://csl.ira.uka.de/rlat-dev) that enables native language experts to build speech recognition components without requiring detailed technology expertise. RLAT enables the user to collect speech data or leverage multilingual seed models to initialize and train acoustic models, to harvest large amounts of text data in order to create language models, and to automatically derive vocabularies and generate pronunciation models. The resulting components can be evaluated in an end-to-end system, allowing for iterative improvements. By archiving the data gathered on the fly from many cooperative users, we hope to significantly increase the repository of languages and linguistic resources, and to make the data and components available to the community at large. By keeping users in the development loop, RLAT can learn from their expertise to constantly adapt and improve. This will hopefully revolutionize the system development process for as yet under-supported languages.
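As a rough illustration of the kind of bootstrapping such a toolkit automates (a hypothetical sketch, not RLAT code), one can derive a frequency-ranked vocabulary from harvested text and fall back to grapheme-based pronunciations wherever no hand-crafted lexicon entry exists.

    # Hypothetical sketch (not RLAT code): derive a vocabulary from harvested
    # text and generate grapheme-based fallback pronunciations.
    import re
    from collections import Counter

    def build_vocabulary(texts, max_size=50000):
        counts = Counter(w for t in texts for w in re.findall(r"\w+", t.lower()))
        return [w for w, _ in counts.most_common(max_size)]

    def pronunciation(word, exceptions=None):
        # Grapheme fallback: one pseudo-phone per letter, overridden by an
        # exception dictionary wherever expert knowledge is available.
        exceptions = exceptions or {}
        return exceptions.get(word, " ".join(word))

    texts = ["example sentences harvested from the web for the target language"]
    vocab = build_vocabulary(texts)
    lexicon = {w: pronunciation(w) for w in vocab}
    print(list(lexicon.items())[:3])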
Tanja Schultz received her Ph.D. and Master's degrees in Computer Science from the University of Karlsruhe, Germany, in 2000 and 1995, respectively, and passed the German state examination for teachers of Mathematics, Sports, and Educational Science at Heidelberg University in 1990. She joined Carnegie Mellon University in 2000 and became a Research Professor at the Language Technologies Institute. Since 2007 she has been a Full Professor at the Department of Informatics of the Karlsruhe Institute of Technology (KIT) in Germany. She is the director of the Cognitive Systems Lab, where her research activities focus on human-machine interfaces, with particular expertise in the rapid adaptation of speech processing systems to new domains and languages. She co-edited a book on this subject and has received several awards for this work. In 2001 she received the FZI prize for an outstanding Ph.D. thesis. In 2002 she was awarded the Allen Newell Medal for Research Excellence from Carnegie Mellon for her contributions to speech translation, as well as the ISCA best paper award for her publication on language-independent acoustic modeling. In 2005 she received the Carnegie Mellon Language Technologies Institute Junior Faculty Chair. Her recent research on silent speech interfaces based on myoelectric signals received best demo and best paper prizes and was awarded the Alcatel-Lucent Research Award for Technical Communication in 2012. Tanja Schultz is the author of more than 250 articles published in books, journals, and proceedings. She has been a member of the Society for Computer Science (GI) for more than 20 years, and is a member of the IEEE Computer Society and the International Speech Communication Association (ISCA), where she is serving her second term as an elected ISCA board member.
This talk will summarize our experience in developing speech-to-text transcription systems with little or no manually transcribed data and limited resources (lexicon or language-specific knowledge). Since our initial studies in 2000, we have applied such approaches to a number of languages, including Finnish, Ukrainian, Portuguese, Bulgarian, Hungarian, Slovak, and Latvian. We have found significant improvements using acoustic features produced by discriminative classifiers such as multi-layer perceptrons (MLPs) trained on other languages. On the language modeling side we have explored unsupervised morphological decomposition to reduce the need for textual resources. These studies have been carried out in collaboration with colleagues at LIMSI (www.limsi.fr/tlp) and Vocapia Research (www.vocapia.com), most recently in the context of the Quaero program (www.quaero.org).
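To sketch the feature-sharing idea (shapes and weights below are placeholders, and this is not the LIMSI recipe itself; in practice the network would be trained on phone targets of well-resourced source languages), hidden-layer activations of such an MLP can be appended to standard cepstral features of the target language.

    # Sketch with placeholder weights: reuse hidden-layer activations of an
    # MLP trained on other languages as additional acoustic features.
    import numpy as np

    rng = np.random.default_rng(0)
    frames = rng.standard_normal((100, 39))            # stand-in cepstral features

    # Placeholder for a pretrained cross-lingual MLP (input -> hidden -> phone posteriors).
    W1, b1 = rng.standard_normal((39, 40)), np.zeros(40)     # 40-dim hidden/bottleneck layer
    W2, b2 = rng.standard_normal((40, 120)), np.zeros(120)   # source-language phone targets

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    hidden = np.tanh(frames @ W1 + b1)        # activations reused as features
    posteriors = softmax(hidden @ W2 + b2)    # phone posteriors (the MLP's training targets)
    augmented = np.hstack([frames, hidden])   # concatenated features for the target language
    print(augmented.shape)                    # (100, 79)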
Lori Lamel joined the CNRS-LIMSI laboratory in October 1991, where she is now a senior research scientist (DR1). She obtained her Ph.D. degree in EECS from MIT in 1988 and has over 270 reviewed publications. Her research centers on speaker-independent, large-vocabulary continuous speech recognition; studies in acoustic phonetics; lexical and phonological modeling; the design, analysis, and realization of large speech corpora; and speaker and language identification. One focus has been on the rapid development of speech-to-text transcription systems via cross-lingual porting and unsupervised acoustic model training.
This presentation will describe the data resources collected to support the Babel Program, the challenges that performers faced when working with this data in the Base Period of the Program, and lessons learned. The goal of the Babel Program is to rapidly develop speech recognition capability for keyword search in new languages, working with speech recorded in a variety of conditions and with limited amounts of transcription. This effort requires the collection of speech data in a wide variety of languages to facilitate research and to assess progress toward Program objectives. The speech data is recorded in the country where the speakers reside and contains variability in speaker demographics and recording conditions. The Program will ultimately address a broad set of languages with a variety of phonotactic, phonological, tonal, morphological, and syntactic characteristics. In the Base Period, performers worked with four Development Languages (Cantonese, Pashto, Tagalog, and Turkish) and were then evaluated on a Surprise Language, Vietnamese, for which they had to build their systems in four weeks. The Program focused solely on telephone speech in the Base Period, but starting in the Option 1 period performers will also work with speech collected on additional types of devices (e.g., table-top microphones) in order to foster research on channel robustness.
Dr. Mary P. Harper is currently a Program Manager in Incisive Analysis at the Intelligence Advanced Research Projects Activity (IARPA), where she manages the Babel Program. She earned her BA (Psychology, 1976) at Kent State University, an MS (Psychology, 1980) at the University of Massachusetts, and both an MS and a PhD in Computer Science at Brown University (1986, 1990). From 1989 to 2007 she was a professor in the School of Electrical and Computer Engineering at Purdue University. Dr. Harper's academic research has focused on computer modeling of human communication, with an emphasis on methods for incorporating multiple types of knowledge sources, including lexical, syntactic, prosodic, and visual sources. She has published over 100 peer-reviewed articles. Dr. Harper served as a rotating Program Director at the National Science Foundation from 2002 to 2005 while continuing her research activities at Purdue. She joined the Center for the Advanced Study of Language (CASL) at the University of Maryland in 2005 as a senior research scientist investigating the use of human language technology (HLT). From 2006 to 2008 she also served as Area Director for the Technology Use sub-area at CASL, in which role she led a team of researchers in this area. From 2008 to 2010, Dr. Harper, as a principal research scientist, worked with researchers at the Johns Hopkins HLT Center of Excellence to develop next-generation human language technologies.
Our self-organizing unit (SOU) approach to speech recognizer training originally focused on zero-resource training, i.e., training that uses no transcribed speech but does use large amounts of untranscribed speech. In this presentation we discuss not only our zero-resource SOU training but also how we have expanded SOU training to include varying, limited amounts of transcribed speech (up to one hour) in conjunction with large amounts of untranscribed speech. We describe how the training process changes as increasing amounts of transcribed speech become available. We compare the different methods we use and also consider how our approach relates to other methods that work with very limited resources.
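The SOU training procedure itself is not spelled out in this abstract, but a generic ingredient of zero-resource approaches of this kind can be sketched as follows (an illustration under assumed inputs, not the BBN algorithm): cluster frame-level acoustic features into self-organized units and map each utterance to a unit-label sequence that can seed further, lightly supervised training.

    # Generic illustration (not the BBN SOU algorithm): cluster frame-level
    # features into unsupervised acoustic units and derive unit-label
    # sequences per utterance.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    utterances = [rng.standard_normal((n, 39)) for n in (200, 150, 180)]  # stand-in MFCCs

    frames = np.vstack(utterances)
    kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(frames)

    def unit_sequence(utt):
        labels = kmeans.predict(utt)
        # Collapse runs of identical labels into segment-level "units".
        return [int(l) for i, l in enumerate(labels) if i == 0 or l != labels[i - 1]]

    print(unit_sequence(utterances[0])[:10])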
Dr. Gish is a speech researcher at BBN Technologies. He holds a Ph.D. in Applied Mathematics from Harvard University. He currently leads a group working on various aspects of speech and language processing. A primary activity of his is the development of speech recognition technology that can extract information from speech with a minimal amount of training data. His work also includes research on keyword spotting, topic classification and speaker identification/verification.
The development of an automatic speech recognizer is typically a highly supervised process involving the specification of phonetic inventories, lexicons, acoustic and language models, along with annotated training corpora. Although some model parameters may be modified via adaptation, the overall structure of the speech recognizer remains relatively static. While this approach has been effective for problems where there is adequate human expertise and labeled corpora, it is challenged by less-supervised or unsupervised scenarios. It also stands in stark contrast to human processing of speech and language, where learning is an intrinsic capability.
In this talk I will describe some of the speech and language research topics being investigated at MIT that require fewer or even zero conventional linguistic resources. In particular I plan to describe our recent progress in unsupervised spoken term discovery, and an inference-based method to automatically learn sub-word unit inventories from unannotated speech.
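To give a flavor of what unsupervised spoken term discovery involves (a simplified sketch under assumed inputs, not the MIT system), dynamic time warping can be used to score pairs of speech segments; pairs with low alignment cost become candidate recurring terms.

    # Simplified sketch (not the MIT system): dynamic time warping between
    # feature sequences; segment pairs with low alignment cost are candidate
    # recurring spoken terms.
    import numpy as np

    def dtw_cost(a, b):
        # a, b: (T, D) feature matrices (e.g., MFCCs); returns normalized alignment cost.
        T, U = len(a), len(b)
        dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
        acc = np.full((T, U), np.inf)
        acc[0, 0] = dist[0, 0]
        for i in range(T):
            for j in range(U):
                if i == 0 and j == 0:
                    continue
                best_prev = min(
                    acc[i - 1, j] if i > 0 else np.inf,
                    acc[i, j - 1] if j > 0 else np.inf,
                    acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                )
                acc[i, j] = dist[i, j] + best_prev
        return acc[-1, -1] / (T + U)

    rng = np.random.default_rng(0)
    seg = rng.standard_normal((40, 13))
    noisy_repeat = seg + 0.1 * rng.standard_normal((40, 13))
    other = rng.standard_normal((40, 13))
    print(dtw_cost(seg, noisy_repeat) < dtw_cost(seg, other))   # True: likely the same term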
James Glass is a Senior Research Scientist at the MIT Computer Science and Artificial Intelligence Laboratory, where he heads the Spoken Language Systems Group. He is also a Lecturer in the Harvard-MIT Division of Health Sciences and Technology. He received his graduate degrees in Electrical Engineering and Computer Science from MIT in 1985 and 1988. His primary research interests are in the area of speech communication and human-computer interaction, centered on automatic speech recognition and spoken language understanding. He has lectured, taught courses, supervised students, and published extensively in these areas. He is currently a Senior Member of the IEEE, an Associate Editor for the IEEE Transactions on Audio, Speech, and Language Processing, and a member of the Editorial Board of Computer Speech and Language.