INTERSPEECH 2010 Tutorial Program

Conditional Random Fields and Direct Decoding for Speech and Language Processing

  • Eric Fosler-Lussier (Ohio State University)
  • Geoffrey Zweig (Microsoft)


Speech recognition has long been dominated by the hidden Markov model (HMM) paradigm; HMMs provide a flexible statistical model that can capture the relationship between hidden state sequences and observed data. That flexibility, however, comes with assumptions: the probabilistic form of the model assumes that the data are generated by the underlying hidden state sequence, and that the observations are frame-wise independent given the hidden states. Moreover, transition probabilities depend only on the identities of the states involved; the observations cannot influence the transitions of a standard HMM. While discriminative training criteria can be used to change the behavior of generative models, they do not change the set of statistical assumptions the model makes.
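As a concrete reminder, both assumptions are visible in the standard HMM joint factorization (generic notation, not specific to this tutorial):

```latex
P(O, Q) = \prod_{t=1}^{T} P(q_t \mid q_{t-1}) \, P(o_t \mid q_t)
```

The emission term $P(o_t \mid q_t)$ conditions each frame only on the current state (frame-wise independence), and the transition term $P(q_t \mid q_{t-1})$ depends only on state identities, never on the observations.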

Several groups, including our own, have begun investigating models that directly capture the relationship between hidden states and observations: observations directly predict state sequences, rather than being mediated through Bayes' rule. This line of research stems in some ways from the artificial neural network (ANN) models of the 1990s, which sought to directly predict the posterior of single phonetic states given acoustic data. The latest generation of direct models, including variants of the Conditional Random Field (CRF) model, allows for direct prediction of state sequences given data without some of the statistical assumptions of HMMs, in particular allowing the exploration of feature sets that are correlated both within a frame and across time. Segmental CRFs have been further developed to allow segment-level features in place of frame-level features. The topic has seen a growing number of papers at ICASSP and Interspeech over the last few years, including a special session at ICASSP 2010.
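For readers unfamiliar with the model class, the conditional form of a linear-chain CRF can be sketched in a few lines of Python. The toy state set, feature functions, and weights below are illustrative assumptions only, not drawn from the tutorial; the partition function is computed by brute-force enumeration, which is feasible only for tiny examples (real toolkits use forward-backward dynamic programming).

```python
import itertools
import math

# Toy linear-chain CRF over 2 states; features and weights are invented
# for illustration, not taken from any speech system.
STATES = [0, 1]

def score(labels, obs, weights):
    """Unnormalized log-score: weighted sum of emission and transition features."""
    s = 0.0
    prev = None
    for y, x in zip(labels, obs):
        # Emission feature: observation "agrees" with the state.
        s += weights["emit"] * (1.0 if x == y else 0.0)
        # Transition feature: consecutive states are equal.
        # (Unlike an HMM transition, this could also inspect obs.)
        if prev is not None:
            s += weights["trans"] * (1.0 if y == prev else 0.0)
        prev = y
    return s

def crf_probability(labels, obs, weights):
    """P(labels | obs): exponentiated score normalized by Z(obs)."""
    z = sum(math.exp(score(q, obs, weights))
            for q in itertools.product(STATES, repeat=len(obs)))
    return math.exp(score(labels, obs, weights)) / z

obs = [0, 1, 1]
weights = {"emit": 2.0, "trans": 0.5}
total = sum(crf_probability(q, obs, weights)
            for q in itertools.product(STATES, repeat=len(obs)))
# total sums to 1.0 (up to floating-point error) over all label sequences
```

The key contrast with an HMM is that the model normalizes over whole label sequences given the observations, so features may freely overlap and correlate within a frame and across time.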

This tutorial will examine the CRF class of models in detail. The tutorial will present a taxonomy of statistical models and relate CRFs to other commonly used techniques such as HMMs, multi-layer perceptrons (MLPs), and maximum entropy Markov models (MEMMs). It will describe the feature sets commonly used in direct modeling for speech recognition and natural language processing, and present training and decoding methods for these models. On the practical side, the tutorial will contain case studies illustrating examples from our own research, and descriptions of the toolkits currently available for researchers to work with. It will conclude with a survey of active research challenges in the area. We envision the target audience to be students and speech recognition researchers who would like to acquire a more in-depth understanding of this exciting area and the associated research opportunities.


Eric Fosler-Lussier is an Assistant Professor of Computer Science and Engineering, with a courtesy appointment in Linguistics, at The Ohio State University. He received his Ph.D. in 1999 from the University of California, Berkeley, performing his dissertation research at the International Computer Science Institute under the tutelage of Prof. Nelson Morgan. He has also been a Member of Technical Staff at Bell Labs, Lucent Technologies, a Visiting Researcher at Columbia University, and has served on the IEEE Speech and Language Technical Committee (2006-8). In 2006, Prof. Fosler-Lussier was awarded an NSF CAREER award. He has published over 80 papers in speech and language processing, is a member of the ACL and a senior member of the IEEE. He is generally interested in integrating linguistic insights as priors in statistical learning systems.

Geoffrey Zweig studied at the University of California at Berkeley, earning a B.A. in physics (summa cum laude) in 1985 and a Ph.D. in computer science in 1998. After graduating he worked for eight years at the IBM T.J. Watson Research Center, where he led the speech recognition efforts in the DARPA EARS and GALE programs and managed the Advanced Large Vocabulary Continuous Speech Recognition group. In 2006 he joined Microsoft Research in Redmond, WA as a Senior Researcher. His work at Microsoft has revolved around acoustic and language modeling techniques for voice search applications, and most recently the development of a direct modeling framework for voice search. He has published over 50 papers in the area of speech recognition, along with numerous patents. In addition to his position at Microsoft, he is on the affiliate faculty of the University of Washington. Dr. Zweig is a member of the ACM and a senior member of the IEEE. He served from 2003 to 2006 as an associate editor of the IEEE Transactions on Audio, Speech, and Language Processing, and is currently on the editorial board of Computer Speech and Language. Dr. Zweig received a best paper award at WWW6, and an Outstanding Innovation Award from IBM in 2005.

This page was last updated on 21-June-2010 3:00 UTC.