Special Sessions

Open Vocabulary Spoken Document Retrieval

The variety of multimedia content is increasing explosively, not only on personal computers but also on the World Wide Web. Most of this content is raw data, including speech, music, and video. The available text-based tag information is limited and insufficient for retrieving such content. Recently, retrieval from this raw data has been attempted in the fields of video and speech processing using speech recognition and feature extraction techniques. However, this research has remained confined to each narrow field. Sharing retrieval techniques across audiovisual research fields is an effective way to extend the research further. This special session is concerned with the retrieval of spoken documents, multimodally recorded lectures, movies, and music. Retrieving such content poses a variety of problems, including open-vocabulary speech recognition, language models, acoustic models, multi-speaker speech recognition, speech recognition under background music, use of multimodal information, prosodic feature extraction, musical note recognition, pitch extraction, indexing, search methods, and topic segmentation. "Open vocabulary" is one of the most important issues from the viewpoint of practical information retrieval.
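One common way to handle the open-vocabulary problem is to match query terms at the subword level rather than the word level, so that out-of-vocabulary words can still be found. The following toy sketch (the `SubwordIndex` class, the document IDs, and the phone symbols are all illustrative inventions, not anything prescribed by the session) indexes phone-level transcripts by phone trigrams and answers queries given as phone strings:

```python
from collections import defaultdict


def phone_ngrams(phones, n=3):
    """All length-n subsequences of a phone string."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


class SubwordIndex:
    """Toy open-vocabulary index: documents are phone transcripts
    (e.g. from a phone recognizer), and queries are phone strings,
    so out-of-vocabulary words can still be matched."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)  # phone n-gram -> set of doc ids

    def add(self, doc_id, phones):
        for g in phone_ngrams(phones, self.n):
            self.postings[g].add(doc_id)

    def search(self, query_phones):
        grams = phone_ngrams(query_phones, self.n)
        if not grams:
            return set()
        # documents must contain every n-gram of the query
        hits = [self.postings.get(g, set()) for g in grams]
        return set.intersection(*hits)


# demo with invented phone strings
idx = SubwordIndex(n=3)
idx.add("doc1", ["t", "ow", "k", "y", "ow"])
idx.add("doc2", ["k", "y", "ow", "t", "ow"])
```

A query such as `idx.search(["ow", "k", "y"])` then retrieves only documents containing that phone trigram, regardless of whether the underlying word was in the recognizer's vocabulary.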

  • Seiichi Nakagawa (is2010-ovsdr@cl.ics.tut.ac.jp), Toyohashi University of Technology, Japan
  • Tomoyosi Akiba, Kiyoaki Aikawa, Berlin Chen, Pascale Fung, Xinhui Hu, Yoshiaki Itoh, Tatsuya Kawahara, Haizhou Li, Tomoko Matsui, Hiroaki Nanjo, Hiromitsu Nishizaki, Yoichi Yamashita


Compressive Sensing for Speech and Language Processing

Compressive sensing (CS) has gained popularity in the last few years as a technique for reconstructing a signal from few training examples, a problem that arises in many machine learning applications. This reconstruction can be framed as adaptively finding a dictionary that best represents the signal on a per-sample basis. The dictionary may consist of random projections, as is typical in signal reconstruction, or of actual training samples from the data, as explored in many machine learning applications.
Compressive sensing is a rapidly growing field, with papers appearing in a variety of signal processing and machine learning conferences, but it has so far garnered little attention in the speech community. With the increasing amount of speech data available, the need to represent and search this data space efficiently is becoming vitally important.
Compressive sensing has relevant applications in a variety of speech and language processing areas, including signal compression/reconstruction, identification of novel acoustic features, dimensionality reduction, acoustic modeling, language modeling, and text-based processing, to name a few. We therefore aim to create an atmosphere in which experts in compressive sensing, machine learning, and speech processing can come together and share their ideas on how compressive sensing can be applied to various topics within speech.
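As a minimal illustration of the reconstruction problem described above, the following sketch recovers a sparse vector from random projections using Orthogonal Matching Pursuit, one standard greedy algorithm for compressive sensing. It is a pure-NumPy sketch under illustrative assumptions; the `omp` helper and the problem dimensions are our own choices, not anything prescribed by the session:

```python
import numpy as np


def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedy recovery of a k-sparse x with y ≈ A @ x."""
    residual = y.astype(float)
    support = []
    for _ in range(k):
        # pick the dictionary column most correlated with the residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # re-fit all selected coefficients jointly by least squares
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x


# demo: a 3-sparse signal of length 64 observed through 24 random projections
rng = np.random.default_rng(0)
n_dim, n_meas, sparsity = 64, 24, 3
A = rng.standard_normal((n_meas, n_dim)) / np.sqrt(n_meas)
x_true = np.zeros(n_dim)
x_true[rng.choice(n_dim, size=sparsity, replace=False)] = [1.5, -2.0, 1.0]
y = A @ x_true
x_hat = omp(A, y, sparsity)  # typically matches x_true for such an incoherent A
```

The same greedy selection works when the columns of `A` are training samples rather than random projections, which is the per-sample dictionary view mentioned above.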

  • Tara Sainath (tsainath@us.ibm.com), IBM T.J. Watson Research Center (USA)
  • Bhuvana Ramabhadran (bhuvana@us.ibm.com), IBM T.J. Watson Research Center (USA)
  • Hynek Hermansky (hynek@jhu.edu), The Johns Hopkins University (USA)


Social Signals in Speech

The expressive functions of vocal behavior have been widely investigated and described in the literature. However, most of this research has been limited to the investigation of affective states, with the prototypical emotions such as anger, disgust, and happiness, or the emotional dimensions of arousal and valence, receiving most of the attention. Other expressive dimensions, related to the signalling of social cues in interaction, have received far less attention. Among these dimensions we consider signals of politeness or rudeness, familiarity, (dis-)agreement, rapport, and dominance, as well as social emotions such as being angry at the interlocutor, love and liking, jealousy, or flirting. Unraveling the relation between vocal behavior and social conversational phenomena is relevant for the understanding and automatic analysis of human social signals. In future applications, Embodied Conversational Agents (ECAs) and spoken dialogue systems could be developed that behave in a more human-like manner and that recognize and display social interactional behavior. This special session aims to create a better understanding of how vocal behavior can be used to signal social cues in interaction. We intend to discuss state-of-the-art research on the relation between vocal behavior and social interaction, and we aim to raise discussion about fundamental issues and future challenges in this emerging domain of Social Signal Processing (SSP).

For further details, please check http://www.cs.utwente.nl/~truongkp/is2010sssss.html

  • Khiet Truong (k.p.truong@ewi.utwente.nl), Human Media Interaction, University of Twente
  • Dirk Heylen (d.k.j.heylen@ewi.utwente.nl), Human Media Interaction, University of Twente


Quality of Experiencing Speech Services

Speech services - for communication between humans or between humans and machines - have mainly been evaluated following two paradigms: on the one hand, metrics of system performance have been developed to quantify the characteristics of the system and its underlying components; on the other hand, subjective evaluation has been carried out to analyze the quality perceived by actual users (Quality of Experience, QoE). However, automatic evaluation of speech technology is attractive because it avoids costly and time-consuming subjective tests, and it will ultimately lead to better speech services.

The primary purpose of this special session is to discuss technological and perceptual metrics related to the quality of experiencing speech services. We ask: What conceptions of quality are currently in use, and how do they relate to each other? What information can be extracted from speech signals? What types of speech transmission degradations are covered by standardized prediction models like PESQ and the E-model? Which approaches can be taken to monitor speech quality? What parameters can be used for describing spoken dialogue system performance and user behavior in spoken-dialogue interactions? How can these parameters be related to system quality? Is it possible to simulate user behavior for this purpose?

  • Sebastian Moeller (Sebastian.Moeller@telekom.de), Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany
  • Alexander Raake (Alexander.Raake@telekom.de), Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany
  • Marcel Waeltermann (Marcel.Waeltermann@telekom.de), Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany


Speech Intelligibility Enhancement for All Ages, Health Conditions and Environments

Spoken language is the most direct means of communication between human beings. However, speech communication often "breaks down" because of advanced age, hearing loss, speech disorders, and especially acoustic interference in real adverse environments. To counter these effects, a number of techniques have been developed to facilitate human speech communication. Generally speaking, these techniques aim to improve either speech quality or speech intelligibility. Most traditional techniques aim to improve speech quality. In recent years, increased attention has been paid to techniques that enhance speech intelligibility, motivated especially by hearing prostheses such as hearing aids and cochlear implants.

To stimulate further discussion and to promote new research, this special session provides an opportunity for researchers from various communities, including speech science, medicine, and signal processing, to exchange results and ideas. The session will present the latest research on the enhancement of speech intelligibility in human speech communication for all ages, health conditions, and environments. Topics of interest include, but are not limited to: (1) speech understanding for elderly listeners in challenging listening environments; (2) speech intelligibility enhancement techniques for hearing-impaired listeners, such as hearing aids and cochlear implants; (3) assistive techniques for improving the intelligibility of speech-disordered persons; (4) speech intelligibility enhancement in various real, difficult listening conditions.

  • Qian-Jie Fu (qfu@hei.org), House Ear Institute, USA
  • Junfeng Li (junfeng@jaist.ac.jp), Japan Advanced Institute of Science and Technology, Japan
  • Tetsuya Takiguchi (takigu@kobe-u.ac.jp), Kobe University, Japan


INTERSPEECH 2010 Paralinguistic Challenge
- Age, Gender, and Affect

Most paralinguistic analysis tasks resemble each other not only in their processing and ever-present data sparseness, but also in lacking agreed-upon evaluation procedures and comparability, in contrast to more "traditional" disciplines in speech analysis. At the same time, this is a rapidly emerging field of research, owing to the constantly growing interest in applications for human behaviour analysis and in technologies for human-machine communication and multimedia retrieval. In this respect, the INTERSPEECH 2010 Paralinguistic Challenge shall help bridge the gap between excellent research on paralinguistic information in spoken language and the low comparability of results by addressing three selected sub-challenges: in the Age Sub-Challenge, the age of speakers has to be determined; in the Gender Sub-Challenge, a two-class classification task has to be solved; and finally, the Affect Sub-Challenge asks for the determination of speakers' interest in an ordinal representation - in contrast to last INTERSPEECH's Emotion Challenge, which dealt with emotion in a broader sense. Contributors may use their own features and classification algorithms. However, besides the two corpora "aGender" and "AVIC", a standard feature set will be provided that may be used. Each participation will be accompanied by a paper presenting the results.

Further Details: Paralinguistic Challenge Web Site

  • Bjoern Schuller (schuller@tum.de), CNRS-LIMSI
  • Stefan Steidl (stefan.steidl@informatik.uni-erlangen.de), FAU Erlangen-Nuremberg
  • Anton Batliner (Anton.Batliner@lrz.uni-muenchen.de), FAU Erlangen-Nuremberg
  • Felix Burkhardt (Felix.Burkhardt@telekom.de), Deutsche Telekom
  • Laurence Devillers (devil@limsi.fr), CNRS-LIMSI
  • Christian Mueller (cmueller@dfki.de), DFKI
  • Shrikanth Narayanan (shri@sipi.usc.edu), University of Southern California


Models of Speech - In Search of Better Representations

The evolution of speech technologies has been stimulated by models of speech, such as the acoustic theory of speech production, statistical modeling of speech spectra leading to the LPC, PARCOR, and LSF parameters, the Fujisaki model of F0 dynamics, sinusoidal models, and many others. By providing structured frameworks, these models triggered a surge of research on speech perception and production as well as the development of new algorithms and tools for investigation and application. Recent advances in computational power, data acquisition, and mining technologies, together with the application of statistical approaches such as kernel methods, neural networks, and hidden Markov models, have provided powerful technical solutions to limited problems without necessarily advancing our understanding of how speech works. In our view, only the combination of these technologies with model-based approaches can eventually provide parsimonious descriptions of speech representations closely related to linguistic and paralinguistic units and structures, through "Analysis-by-Synthesis" in a broad sense. This session encourages the application of improved algorithms for extracting the parameters of models of speech, and ways of benchmarking them, in order to outline their usefulness for phonetic research but also to illustrate their limitations.
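As one concrete example of the model-based parameter extraction mentioned above, LPC coefficients can be computed from a signal's autocorrelation sequence with the Levinson-Durbin recursion, which also yields the PARCOR (reflection) coefficients as intermediate quantities. The following is a minimal NumPy sketch; the `lpc` function and the AR(2) test signal are illustrative choices of ours:

```python
import numpy as np


def lpc(x, order):
    """LPC via the autocorrelation method and the Levinson-Durbin recursion.

    Returns coefficients a (with a[0] == 1) of the prediction-error filter,
    i.e. x[n] ≈ -sum(a[j] * x[n - j] for j in 1..order), plus the final
    prediction error energy.
    """
    # autocorrelation lags 0..order
    r = np.array([x[: len(x) - k] @ x[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection (PARCOR) coefficient at this model order
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a_prev = a[1:i].copy()
        a[1:i] = a_prev + k * a_prev[::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


# demo: impulse response of a known all-pole (AR) filter with a = [1, -0.9, 0.5]
true_a = np.array([1.0, -0.9, 0.5])
h = np.zeros(2000)
h[0] = 1.0
h[1] = -true_a[1] * h[0]
for n in range(2, len(h)):
    h[n] = -true_a[1] * h[n - 1] - true_a[2] * h[n - 2]

a_est, err = lpc(h, 2)  # recovers the true coefficients up to truncation error
```

For real speech one would apply this to short windowed frames rather than a whole signal; the point here is only that the model parameters fall out of a compact, well-understood recursion.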



Fact and Replica of Speech Production

Speech production research during the past decade has heightened interest in the underlying processes and their models, both in the biomechanical articulatory structures and in the aero-acoustic phenomena at the larynx and in the vocal tract. The evolution of measurement technologies, such as MRI, ultrasound imaging, and 3D EMA, has contributed to uncovering three-dimensional evidence of speech production mechanisms and has facilitated the construction of numerical and mechanical replicas. The purpose of this special session is to provide an opportunity for the international research community to bring together those who have advanced our skills and knowledge, for further development through the linkage between facts and models.

  • Kiyoshi Honda (khonda@sannet.ne.jp), LPP, CNRS-Univ. Paris3
  • Masaaki Honda (hon@waseda.jp), Waseda University
  • Jianwu Dang (dangjianwu@tju.edu.cn; jdang@jaist.ac.jp), Tianjin University, JAIST