INTERSPEECH 2010 Tutorial Program

Meeting Recognition

  • Thomas Hain (University of Sheffield)
  • Steve Renals (University of Edinburgh)


Meeting processing has been, and remains, a strong focus of large research projects in Europe, Asia, and the U.S. over the last 5 to 10 years. The combination of interesting applications with a large range of complex technologies has attracted research in audio and video processing, in signal processing and human interfaces, and in language analysis and communication. Many of the technologies required are complex and do not perform perfectly; only by combining them can the requirements be met, both for new research into human communication and for practical applications. Furthermore, the complexity of the tasks forces the use of real-world data in all subject areas, demanding robustness and high flexibility from the algorithms used. The relevance of this topic area is also recognised in coordinated NIST evaluations of speech and video processing algorithms, such as the RT and CLEAR competitions. Capturing meetings has become more prevalent with the increased interest in video conferencing, and there is consequently strong commercial interest in meeting recognition.

Meeting recognition has been referred to as an "ASR-complete" problem, with challenges arising from microphone-array-based audio capture, highly overlapped multiparty conversational speech, important non-lexical information relating to social interactions, and a wide range of speech understanding challenges. This is reflected in the tutorial, which will cover meeting capture and annotation, speech processing, representation and transfer of information, and presentation and user interfacing. Obtaining high-quality recordings is a non-trivial task that requires careful planning of the desired outcomes and quality, both for recognition and for classification. The annotation of key events is crucial for conducting research: many types of annotation are desirable, but good annotation quality is hard to achieve. The processing of the speech signals is central, as speech is the main source of content; far-field microphone array recognition, diarisation, automatic speech recognition, and disfluency filtering are the main aspects here, alongside online processing in these areas. Compact representation of content for visualisation is vital for applications such as off-line browsing and search for specific content. Summarisation and content linking (e.g. to presented slides) allow information to be transferred to meeting participants. Finally, how to present this wealth of information to participants is of crucial importance, even more so for remote participants.

Within the EU Integrated Projects AMI and AMIDA (www.amiproject.org) we have worked on recording, annotation, recognition and classification, presentation and interpretation of meeting data as well as application demonstrators. The outcome of these projects will serve as a strong foundation and source of demonstration and examples for this tutorial.

The objective of this tutorial is to present a good overview of the research topics associated with meeting processing, the state-of-the-art in recording and processing technologies involved, as well as successful application scenarios. We will especially focus on issues arising from bringing a wide range of subjects together in single targeted applications. In particular we want to highlight the value of observation of complex communication scenarios and the wealth of information obtainable from work in real world scenarios.


Thomas Hain holds a degree from the University of Technology, Vienna, and a PhD from Cambridge University. He was a Senior Technologist at Philips Speech Processing, and a Research Associate and Lecturer in the Speech, Vision and Robotics Group of the Cambridge University Engineering Department (CUED), before joining the Department of Computer Science (DCS) at the University of Sheffield in 2004. He has a well-established track record in automatic speech recognition, audio processing, and associated machine learning techniques, with more than 50 well-cited publications. He has an extensive history of research on large-scale speech recognition systems for participation in the NIST speech-to-text evaluations, establishing a record of highly competitive systems. He has been involved in the research and site management of large international projects (e.g. the FP6 projects AMI and AMIDA, and the US EARS programme) and participates in the MC-ITN SCALE and the NoE PASCAL2 projects. Meeting speech recognition developed under his leadership is available at webASR (www.webasr.org).
Hain was a member of the IEEE Speech and Language Technical Committee between 2007 and 2009, serves on the Editorial Board of Computer Speech and Language, and has been appointed area chair for speech recognition at several prominent conferences, including EUSIPCO 2009 and ICASSP 2010. He also served as a program committee member for Interspeech 2009.

Steve Renals is director of the Centre for Speech Technology Research (CSTR) and professor of Speech Technology in the School of Informatics at the University of Edinburgh. He received a BSc in Chemistry from the University of Sheffield in 1986, an MSc in Artificial Intelligence from the University of Edinburgh in 1987, and a PhD in Speech Recognition and Neural Networks, also from Edinburgh, in 1990. From 1991 to 1992 he was a postdoctoral fellow at the International Computer Science Institute (ICSI), Berkeley, and was then an EPSRC postdoctoral fellow in Information Engineering at the University of Cambridge (1992-94). From 1994 to 2003 he was a lecturer, then reader, in Computer Science at the University of Sheffield, moving to Edinburgh in 2003. He is an associate editor of ACM Transactions on Speech and Language Processing and IEEE Signal Processing Letters, a former member of the IEEE Technical Committee on Machine Learning and Signal Processing, and a member of the ICMI-MLMI Advisory Board.
Renals has been working on meeting recognition since 2002, and jointly coordinated the European M4, AMI and AMIDA projects, which focused on meeting recognition. He has research interests in speech recognition, statistical language processing, and multimodal interaction, with over 150 refereed publications in these areas.

This page was last updated on 21-June-2010 3:00 UTC.