Keynote 1: Steve Young - Still Talking to Machines (Cognitively Speaking)

Time: Monday 11:00  Place: Hall A/B  Type: Keynote
Chair: Isabel Trancoso
11:00  Still Talking to Machines (Cognitively Speaking)
Steve Young (Cambridge University Engineering Department)
This overview article reviews the structure of a fully statistical spoken dialogue system (SDS), using as illustration various systems and components built at Cambridge over the last few years. Most of the components in an SDS are essentially classifiers which can be trained using supervised learning. However, the dialogue management component must track the state of the dialogue and optimise a reward accumulated over time. This requires techniques for statistical inference and policy optimisation using reinforcement learning. The potential advantages of a fully statistical SDS are the ability to train from data without hand-crafting, increased robustness to environmental noise and user uncertainty, and the ability to adapt and learn on-line. Index Terms: spoken dialogue systems, reinforcement learning, speech understanding, speech synthesis, natural language generation

ASR: Acoustic Models I

Time: Monday 13:30  Place: Hall A/B  Type: Oral
Chair: Frank Soong
13:30  A Discriminative Splitting Criterion for Phonetic Decision Trees
Simon Wiesler (RWTH Aachen University)
Georg Heigold (RWTH Aachen University)
Markus Nußbaum-Thom (RWTH Aachen University)
Ralf Schlüter (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)
Phonetic decision trees are a key concept in acoustic modeling for large vocabulary continuous speech recognition. Although discriminative training has become a major line of research in speech recognition and all state-of-the-art acoustic models are trained discriminatively, the conventional phonetic decision tree approach still relies on the maximum likelihood principle. In this paper we develop a splitting criterion based on the minimization of the classification error. An improvement of more than 10% relative over a discriminatively trained baseline system on the Wall Street Journal corpus suggests that the proposed approach is promising.
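The error-counting idea behind a discriminative splitting criterion can be illustrated in miniature. The sketch below is hypothetical and far simpler than the criterion actually proposed in the paper: from a set of phonetic questions, it picks the one whose yes/no split misclassifies the fewest samples when each branch predicts its majority class.

```python
# Toy triphone states: (phonetic-context answers, state class label).
# All names and data are invented for illustration.
samples = [
    ({"left_vowel": True,  "right_nasal": True},  "A"),
    ({"left_vowel": True,  "right_nasal": False}, "A"),
    ({"left_vowel": False, "right_nasal": True},  "B"),
    ({"left_vowel": False, "right_nasal": False}, "B"),
]
ask_left_vowel = lambda ctx: ctx["left_vowel"]
ask_right_nasal = lambda ctx: ctx["right_nasal"]

def split_errors(samples, question):
    """Misclassifications when each branch of the split predicts its majority class."""
    def errors(group):
        counts = {}
        for _, label in group:
            counts[label] = counts.get(label, 0) + 1
        return len(group) - max(counts.values()) if group else 0
    yes = [s for s in samples if question(s[0])]
    no = [s for s in samples if not question(s[0])]
    return errors(yes) + errors(no)

# Choose the question with the fewest classification errors.
best_question = min([ask_left_vowel, ask_right_nasal],
                    key=lambda q: split_errors(samples, q))
```

Here the left-vowel question separates the classes perfectly, so it is preferred over the uninformative right-nasal question.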
13:50  Canonical State Models for Automatic Speech Recognition
Mark Gales (Cambridge University Engineering Department)
Kai Yu (Cambridge University Engineering Department)
Current speech recognition systems are often based on HMMs with state-clustered Gaussian Mixture Models (GMMs) to represent the context-dependent output distributions. Though highly successful, the standard form of model does not exploit any relationships between the states: each has separate model parameters. This paper describes a general class of model where the context-dependent state parameters are a transformed version of one, or more, canonical states. A number of published models sit within this framework, including semi-continuous HMMs, subspace GMMs and the HMM error model. A set of preliminary experiments illustrating some of this model's properties, using CMLLR transformations from the canonical state to the context-dependent state, is described.
14:10  Restructuring Exponential Family Mixture Models
Pierre Dognin (IBM T.J. Watson Research Center)
John Hershey (IBM T.J. Watson Research Center)
Vaibhava Goel (IBM T.J. Watson Research Center)
Peder Olsen (IBM T.J. Watson Research Center)
Variational KL (varKL) divergence minimization was previously applied to restructuring acoustic models (AMs) using Gaussian mixture models by reducing their size while preserving their accuracy. In this paper, we derive a related varKL for exponential family mixture models (EMMs) and test its accuracy using the weighted local maximum likelihood agglomerative clustering technique. Minimizing varKL between a reference and a restructured AM led previously to the variational expectation maximization (varEM) algorithm, which we extend to EMMs. We present results on a clustering task using AMs trained on 50 hrs of Broadcast News (BN). EMMs are trained on fMMI-PLP features combined with frame level phone posterior probabilities given by the recently introduced sparse representation phone identification process. As we reduce model size, we test the word error rate using the standard BN test set and compare with baseline models of the same size, trained directly from data.
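For concreteness, the variational KL approximation of Hershey and Olsen that this line of work builds on can be written down for univariate Gaussian mixtures, a special case of an exponential family mixture. The sketch below is illustrative only; the mixture parameters are invented.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """Closed-form KL divergence between two univariate Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def var_kl(f, g):
    """Variational approximation to KL(f || g) for mixtures of univariate
    Gaussians given as (weight, mean, variance) triples:
      D_var = sum_a pi_a * log( sum_a' pi_a' exp(-KL(f_a || f_a'))
                              / sum_b  w_b   exp(-KL(f_a || g_b)) )."""
    total = 0.0
    for pa, ma, va in f:
        num = sum(pb * math.exp(-kl_gauss(ma, va, mb, vb)) for pb, mb, vb in f)
        den = sum(wb * math.exp(-kl_gauss(ma, va, mb, vb)) for wb, mb, vb in g)
        total += pa * math.log(num / den)
    return total

f = [(0.5, 0.0, 1.0), (0.5, 3.0, 1.0)]     # reference mixture
g = [(0.5, 10.0, 1.0), (0.5, 13.0, 1.0)]   # a very different mixture
```

By construction the approximation is exactly zero when the two mixtures coincide, and it grows as the restructured model drifts from the reference.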
14:30  Unsupervised Discovery and Training of Maximally Dissimilar Cluster Models
Francoise Beaufays (Google)
Vincent Vanhoucke (Google)
Brian Strope (Google)
One of the difficult problems of acoustic modeling for Automatic Speech Recognition (ASR) is how to adequately model the wide variety of acoustic conditions which may be present in the data. The problem is especially acute for tasks such as Google Search by Voice, where the amount of speech available per transaction is small, and adaptation techniques start showing their limitations. However, since training data from a very large user population is available, it is possible to identify and jointly model subsets of the data with similar acoustic qualities. We describe a technique which allows us to perform this modeling at scale on large amounts of data by learning a tree-structured partition of the acoustic space, and we demonstrate that we can significantly improve recognition accuracy in various conditions through unsupervised Maximum Mutual Information (MMI) training. Being fully unsupervised, this technique scales easily to increasing numbers of conditions.
14:50  Probabilistic State Clustering Using Conditional Random Field For Context-Dependent Acoustic Modelling
Khe Chai Sim (National University of Singapore)
Hidden Markov Models are widely used in speech recognition systems. Due to the co-articulation effects of continuous speech, context-dependent models have been found to yield performance improvements. One major issue with context-dependent acoustic modelling is the robust parameter estimation of unseen or rare models in the training data. Typically, decision tree state clustering is used to ensure that there are sufficient data for each physical state. Decision trees based on phonetic questions are used to cluster the states. In this paper, a conditional random field (CRF) is used to perform probabilistic state clustering, where phonetic questions are used as binary feature functions to predict the latent cluster weights. Experimental results on the Wall Street Journal corpus reveal that CRF-based state clustering outperformed conventional maximum likelihood decision tree state clustering with similar model complexities by about 10% relative.
15:10  Integrate Template Matching and Statistical Modeling for Speech Recognition
Xie Sun (University of Missouri)
Yunxin Zhao (University of Missouri)
We propose a novel approach of integrating template matching with statistical modeling to improve continuous speech recognition. We use multiple Gaussian Mixture Model (GMM) indices to represent each frame of speech templates, use agglomerative clustering to generate template representatives, and use log likelihood ratio as the local distance measure for DTW template matching in lattice rescoring. Experimental results on the TIMIT phone recognition task demonstrated that the proposed approach consistently improved several HMM baselines significantly, where the absolute accuracy gain was 1.69%~1.83% if all training templates were used, and the gain was 1.29%~1.37% if template representatives were used.
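A minimal sketch of DTW template matching with a log-likelihood-ratio local distance, in the spirit of the abstract. Representations and numbers are invented for illustration; the real system uses multiple GMM indices per frame and applies the matching inside lattice rescoring.

```python
def dtw(template, test_frames, local_dist):
    """Dynamic time warping: minimal accumulated local distance over all
    monotonic alignments of template frames to test frames."""
    n, m = len(template), len(test_frames)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_dist(template[i - 1], test_frames[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def llr_dist(tpl_gmm_index, frame_logliks):
    """Log likelihood ratio between the frame's best-scoring GMM index and
    the index stored for the template frame (0 when they coincide)."""
    return max(frame_logliks) - frame_logliks[tpl_gmm_index]

# Template: per-frame GMM indices.  Test: per-frame log-likelihood vectors
# under each GMM index (toy values, two indices only).
template = [0, 1, 1]
matching = [[-1.0, -5.0], [-5.0, -1.0], [-5.0, -1.0]]
```

A template whose indices agree with the test frames' best indices accumulates zero distance; a mismatched template is penalized by the likelihood ratio at every misaligned frame.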

Spoken dialogue systems I

Time: Monday 13:30  Place: 201A  Type: Oral
Chair: Ramon Lopez-Cozar Delgado
13:30  Cross-Lingual Spoken Language Understanding from Unaligned Data using Discriminative Classification Models and Machine Translation
Fabrice Lefèvre (Univ. Avignon)
François Mairesse (Univ. Cambridge)
Steve Young (Univ. Cambridge)
This paper investigates several approaches to bootstrapping a new spoken language understanding (SLU) component in a target language given a large dataset of semantically-annotated utterances in some other source language. The aim is to reduce the cost associated with porting a spoken dialogue system from one language to another by minimising the amount of data required in the target language. Since word-level semantic annotations are costly, Semantic Tuple Classifiers (STCs) are used in conjunction with statistical machine translation models, both of which are trained from unaligned data to further reduce development time. The paper presents experiments in which a French SLU component in the tourist information domain is bootstrapped from English data. Results show that training STCs on automatically translated data produced the best performance for predicting the utterance's dialogue act type; however, individual slot/value pairs are best predicted by training STCs on the source language and using them to decode translated utterances.
13:50  Techniques for topic detection based processing in spoken dialog systems
Rajesh Balchandran (IBM T J Watson Research Center)
Leonid Rachevsky (IBM T J Watson Research Center)
Bhuvana Ramabhadran (IBM T J Watson Research Center)
Miroslav Novak (IBM T J Watson Research Center)
In this paper we explore various techniques for topic detection in the context of conversational spoken dialog systems and also propose variants of known techniques to address the constraints of memory, accuracy and scalability associated with their practical implementation in spoken dialog systems. Tests were carried out on a multiple-topic spoken dialog system to compare and analyze these techniques. Results show benefits and compromises with each approach, suggesting that the best choice of technique for topic detection depends on the specific deployment requirements.
14:10  Optimizing Spoken Dialogue Management with Fitted Value Iteration
Senthilkumar Chandramohan (IMS research group, SUPELEC - Metz Campus, France)
Matthieu Geist (IMS research group, SUPELEC - Metz Campus, France)
Olivier Pietquin (IMS research group, SUPELEC - Metz Campus, France)
In recent years machine learning approaches have been proposed for dialogue management optimization in spoken dialogue systems. It is customary to cast the dialogue management problem into a Markov Decision Process and to find the optimal policy using Reinforcement Learning (RL) algorithms. Yet, the dialogue state space is large and standard RL algorithms fail to handle it. In this paper we explore the use of a generalization framework for dialogue management based on a particular fitted value iteration algorithm, namely fitted-Q iteration. We show that fitted-Q, when applied to continuous state space dialogue management problems, can generalize well and makes efficient use of samples to learn the approximate optimal state-action value function. Our experimental results show that fitted-Q performs significantly better than the hand-coded policy and also outperforms the policy learned using least-squares policy iteration, another generalization algorithm.
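The fitted-Q iteration loop itself is compact. Below is a toy sketch on a hypothetical one-dimensional dialogue state (the belief that a slot value is correct), with linear function approximation per action; the MDP, rewards and features are all invented for illustration.

```python
GAMMA = 0.95  # discount factor

def step(s, a):
    """Toy one-step dialogue model (invented): state s in [0,1] is the belief
    that the slot value is correct.  Action 1 'close' ends the dialogue with
    reward +1 if the belief is high enough, else -1; action 0 'confirm'
    costs a turn (-0.1) but raises the belief."""
    if a == 1:
        return (1.0 if s > 0.7 else -1.0), s, True
    return -0.1, min(1.0, s + 0.3), False

def fit_line(xs, ys):
    """Least-squares fit of y ~ w0 + w1*x via the 2x2 normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - w1 * sx) / n, w1

states = [i / 20 for i in range(21)]   # sampled dialogue states
w = {0: (0.0, 0.0), 1: (0.0, 0.0)}     # linear Q-function per action
q = lambda s, a: w[a][0] + w[a][1] * s

for _ in range(30):                    # fitted-Q iterations
    new_w = {}
    for a in (0, 1):
        targets = []
        for s in states:
            r, s2, done = step(s, a)
            # Bellman backup: regression target is r + gamma * max_a' Q(s', a')
            targets.append(r if done else r + GAMMA * max(q(s2, 0), q(s2, 1)))
        new_w[a] = fit_line(states, targets)
    w = new_w

policy = lambda s: "close" if q(s, 1) > q(s, 0) else "confirm"
```

Even with this crude linear approximator, the learned policy confirms when the belief is low and closes when it is high.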
14:30  Natural Belief-Critic: a reinforcement algorithm for parameter estimation in statistical spoken dialogue systems
Filip Jurčíček (Engineering Department, Cambridge University)
Blaise Thomson (Engineering Department, Cambridge University)
Simon Keizer (Engineering Department, Cambridge University)
François Mairesse (Engineering Department, Cambridge University)
Milica Gašić (Engineering Department, Cambridge University)
Kai Yu (Engineering Department, Cambridge University)
Steve Young (Engineering Department, Cambridge University)
This paper presents a novel algorithm for learning parameters in statistical dialogue systems which are modelled as Partially Observable Markov Decision Processes (POMDPs). The three main components of a POMDP dialogue manager are a dialogue model representing dialogue state information; a policy which selects the system's responses based on the inferred state; and a reward function which specifies the desired behaviour of the system. Ideally both the model parameters and the policy would be designed to maximise the reward function. However, whilst there are many techniques available for learning the optimal policy, there are no good ways of learning the optimal model parameters that scale to real-world dialogue systems. The Natural Belief-Critic (NBC) algorithm presented in this paper is a policy gradient method which offers a solution to this problem. Based on observed rewards, the algorithm estimates the natural gradient of the expected reward. The resulting gradient is then used to adapt the prior distribution of the dialogue model parameters. The algorithm is evaluated on a spoken dialogue system in the tourist information domain. The experiments show that model parameters estimated to maximise the reward function result in significantly improved performance compared to the baseline handcrafted parameters.
14:50  Is it Possible to Predict Task Completion in Automated Troubleshooters?
Alexander Schmitt (Ulm University, Germany)
Michael Scholz (Ulm University, Germany)
Wolfgang Minker (Ulm University, Germany)
Jackson Liscombe (SpeechCycle, Inc.)
David Sündermann (SpeechCycle Inc.)
The online prediction of task success in Interactive Voice Response (IVR) systems is a comparatively new field of research. It helps to identify critical calls and enables the system to react before it is too late and the caller hangs up. This publication examines to what extent it is possible to predict task completion and how existing approaches generalize to longer-lasting dialogues. We compare the performance of two different modeling techniques: linear modeling and the new n-gram modeling. The study shows that n-gram modeling significantly outperforms linear modeling at later prediction points. From a comprehensive set of interaction parameters we identify the relevant ones using Information Gain Ratio. New interaction parameters are presented and evaluated. The study is based on 41,422 calls from an automated Internet troubleshooter with an average length of 21.4 turns per call.
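One way to realize n-gram modeling of task success, sketched here with invented dialogue-event names: train one smoothed bigram model on completed calls and one on failed calls, then classify a call prefix by likelihood ratio. This is a toy stand-in for the paper's approach, not its actual feature set.

```python
import math
from collections import defaultdict

def train_bigram(calls):
    """Add-one-smoothed bigram model over dialogue-event tokens."""
    bi, uni, vocab = defaultdict(int), defaultdict(int), set()
    for call in calls:
        seq = ["<s>"] + call
        for a, b in zip(seq, seq[1:]):
            bi[(a, b)] += 1
            uni[a] += 1
            vocab.update((a, b))
    return bi, uni, vocab

def logprob(model, call):
    """Smoothed bigram log-likelihood of a (partial) call."""
    bi, uni, vocab = model
    V = len(vocab) + 1
    seq = ["<s>"] + call
    return sum(math.log((bi[(a, b)] + 1) / (uni[a] + V))
               for a, b in zip(seq, seq[1:]))

# Invented training calls: event sequences for completed vs. failed tasks.
completed = [["greet", "collect", "confirm", "solve"],
             ["greet", "collect", "solve"]] * 5
failed = [["greet", "noinput", "noinput", "operator"],
          ["greet", "nomatch", "operator"]] * 5

m_ok, m_fail = train_bigram(completed), train_bigram(failed)

def predict(prefix):
    """Online prediction: classify the call-so-far by likelihood ratio."""
    return "complete" if logprob(m_ok, prefix) > logprob(m_fail, prefix) else "fail"
```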
15:10  Minimally Invasive Surgery for Spoken Dialog Systems
David Suendermann (SpeechCycle Labs)
Jackson Liscombe (SpeechCycle Labs)
Roberto Pieraccini (SpeechCycle Labs)
We demonstrate three techniques (Escalator, Engager, and EverywhereContender) designed to optimize performance of commercial spoken dialog systems. These techniques have in common that they produce very small or no negative performance impact even during a potential experimental phase. This is because they can either be applied offline to data collected on a deployed system, or they can be incorporated conservatively such that only a low percentage of calls will get affected until the optimal strategy becomes apparent.

Speech Perception I: Factors Influencing Perception

Time: Monday 13:30  Place: 201B  Type: Oral
Chair: Diane Kewley-Port
13:30  Detecting categorical perception in continuous discrimination data
Paul Boersma (University of Amsterdam)
Katerina Chladkova (University of Amsterdam)
We present a method for assessing categorical perception from continuous discrimination data. Until recently, categorical perception of speech has exclusively been measured by discrimination and identification experiments with a small number of repeatedly presented stimuli. Experiments by Rogers and Davis have shown that using non-repeating stimuli along a densely-sampled phonetic continuum yields a more reliable measure of categorization. However, no analysis method has been proposed that would preserve the continuous nature of the obtained discrimination data. In the present study, we describe a method of analysis that can be applied to continuous discrimination data without having to discretize the raw data at any time during the analysis.
13:50  The interaction between stimulus range and the number of response categories in vowel perception
Titia Benders (Amsterdam Center for Language and Communication, University of Amsterdam)
Paola Escudero (Amsterdam Center for Language and Communication, University of Amsterdam)
We investigate the influence of the stimulus range and the number of response categories on the location of perceptual boundaries. The F1 continuum between Spanish /i/ and /e/ was presented to Peruvian listeners in three ranges. Half of the listeners could classify the tokens as /i/ and /e/, the other half chose from the five Spanish vowels. A boundary shift between /i/ and /e/ was observed as a function of the stimulus range, which was larger when listeners were given only two response categories. These results are interpreted as an effect of listeners’ category expectations on speech perception.
14:10  The Relation Between Pitch Perception Preference and Emotion Identification
Marie Nilsenova (Tilburg University)
Martijn Goudbeek (Tilburg University)
Luuk Kempen (Tilburg University)
In our study, we explore the effect of synthetic vs analytic listening mode on the identification of emotions. Numerous psychoacoustic studies have shown that listeners differ in how they process complex sounds; some listeners focus on the fundamental frequency while others attend to the higher harmonics. The difference appears to have a neurological basis, expressed in a leftward (for F0 listeners) or rightward (for spectral listeners) asymmetry of gray matter volume in the lateral Heschl's gyrus. In our experiment we found that spectral listeners performed better in an emotion judgment task, which is what we expected based on the fact that the processing of emotional prosody is relatively right-hemisphere lateralized.
14:30  Competition in the Perception of Spoken Japanese Words
Takashi Otake (E-Listening Laboratory)
James M. McQueen (Radboud University Nijmegen)
Anne Cutler (Max Planck Institute for Psycholinguistics)
Japanese listeners detected Japanese words embedded at the end of nonsense sequences (e.g., kaba 'hippopotamus' in gyachikaba). When the final portion of the preceding context together with the initial portion of the word (e.g., here, the sequence chika) was compatible with many lexical competitors, recognition of the embedded word was more difficult than when such a sequence was compatible with few competitors. This clear effect of competition, established here for preceding context in Japanese, joins similar demonstrations, in other languages and for following contexts, to underline that the functional architecture of the human spoken-word recognition system is a universal one.
14:50  Influence of musical training on perception of L2 speech
Makiko Sadakata (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
Lotte van der Zanden (Department of Psychology, University of Nijmegen, The Netherlands)
Kaoru Sekiyama (Division of Cognitive Psychology, Kumamoto University, Japan)
The current study reports specific cases in which a positive transfer of perceptual ability from the music domain to the language domain occurs. We tested whether musical training enhances discrimination and identification performance of L2 speech sounds (timing features, nasal consonants and vowels). Native Dutch and Japanese speakers with different musical training experience, matched for their estimated verbal IQ, participated in the experiments. Results indicated that musical training strongly increases one’s ability to perceive timing information in speech signals. We also found a benefit of musical training on discrimination performance for a subset of the tested vowel contrasts.
15:10  Full body aero-tactile integration of speech perception
Donald Derrick (Department of Linguistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada)
Gick Bryan (Department of Linguistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada AND Haskins Laboratories, New Haven, Connecticut 06511-6695, USA)
We follow up on our research demonstrating that aero-tactile information can enhance or interfere with accurate auditory perception, even among uninformed and untrained perceivers [1]. Mimicking aspiration, we applied slight, inaudible air puffs to participants' skin at the right ankle, simultaneously with syllables beginning with aspirated ('pa', 'ta') and unaspirated ('ba', 'da') stops, dividing the participants into two groups: those with hairy and those with hairless ankles. Since hair follicle endings (mechanoreceptors) are used to detect air turbulence [2], we expected, and observed, that syllables heard simultaneously with cutaneous air puffs would be more likely to be heard as aspirated, but only among those with hairy ankles. These results demonstrate that information from any part of the body can be integrated in speech perception, but the stimuli must be unambiguously relatable to the speech event in order to be integrated into speech perception.

Prosody: Models

Time: Monday 13:30  Place: 302  Type: Oral
Chair: Aijun Li
13:30  Nucleus position within the intonation phrase: a typological study of English, Czech and Hungarian
Tomáš Duběda (Institute of Translation Studies, Charles University in Prague)
Katalin Mády (Institute of Phonetics and Speech Processing, University of Munich)
In this paper we examine cases of non-final nucleus (or sentence stress) in English, Czech and Hungarian. These three languages differ substantially with respect to word order rules, prosodic plasticity (ability to signal information structure by shifting the nucleus) and the degree of grammaticalization in nucleus position. Recordings of parallel texts are studied with the aim to quantify different categories of shifts, as well as inter-speaker agreement in the position of the nucleus.
13:50  Focus-sensitive Operator or Focus Inducer: Always and Only
Lee Yong-cheol (University of Pennsylvania)
Nambu Satoshi (University of Pennsylvania)
There is a long-standing debate in the literature about whether focus particles function as focus-sensitive operators or focus inducers. However, it has not yet been established which perspective is correct. The current study investigates the effect of focus particles, using four different conditions. The results show that the focus particles are focus-sensitive operators, since the focused words are not affected by the presence of the focus particles. In addition, prosodic differences can be seen between always and only. The former shows higher mean and maximal F0, duration, and intensity values than the latter. This result suggests that always bears notable prosodic features in and of itself. Index Terms: focus-particle, focus-sensitive operator, focus inducer, PENTA, intonational function
14:10  F0 Declination in English and Mandarin Broadcast News Speech
Jiahong Yuan (University of Pennsylvania)
Mark Liberman (University of Pennsylvania)
This study investigates F0 declination in broadcast news speech in English and Mandarin Chinese. The results demonstrate a strong relationship between utterance length and declination slope. Shorter utterances have steeper declination even after excluding the initial rising and final lowering effects. Both topline and baseline show declination, but they are independent. The topline and baseline have different patterns in Mandarin Chinese, whereas in English their patterns are similar. Mandarin Chinese has more and steeper declination than English, as well as wider pitch range and more F0 fluctuations. Index Terms: declination, F0, regression, convex-hull
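A declination slope of the kind analyzed here is typically obtained by regressing F0 turning points against time. The sketch below fits a topline through local F0 maxima and a baseline through local minima; the contour is synthetic and the paper's regression setup (including initial-rise and final-lowering exclusion) is more elaborate.

```python
def slope_intercept(points):
    """Least-squares line through (time, f0) points; returns (slope, intercept)."""
    n = len(points)
    st = sum(t for t, _ in points)
    sf = sum(f for _, f in points)
    stt = sum(t * t for t, _ in points)
    stf = sum(t * f for t, f in points)
    slope = (n * stf - st * sf) / (n * stt - st * st)
    return slope, (sf - slope * st) / n

def declination_lines(times, f0):
    """Fit a topline through local F0 maxima and a baseline through local
    minima -- a crude operationalization of declination slopes."""
    peaks = [(times[i], f0[i]) for i in range(1, len(f0) - 1)
             if f0[i - 1] < f0[i] > f0[i + 1]]
    valleys = [(times[i], f0[i]) for i in range(1, len(f0) - 1)
               if f0[i - 1] > f0[i] < f0[i + 1]]
    return slope_intercept(peaks), slope_intercept(valleys)

# Synthetic declining contour: 20 Hz/s downtrend with alternating accents.
times = [i * 0.1 for i in range(21)]
f0 = [200.0 - 20.0 * t + (10.0 if i % 2 == 0 else -10.0)
      for i, t in enumerate(times)]
(top_slope, top_icpt), (base_slope, base_icpt) = declination_lines(times, f0)
```

On this synthetic contour both lines recover the imposed -20 Hz/s trend, with the topline offset above the baseline.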
14:30  Frequency of occurrence effects on pitch accent realisation
Katrin Schweitzer (Institute for Natural Language Processing, University of Stuttgart)
Michael Walsh (Institute for Natural Language Processing, University of Stuttgart)
Bernd Moebius (Institute of Communication Sciences, University of Bonn)
Hinrich Schuetze (Institute for Natural Language Processing, University of Stuttgart)
This paper presents the results of a corpus study which examines the impact of frequency of occurrence of accented words on the realisation of pitch accents. In particular, statistical analyses explore this influence on pitch accent range and alignment. The results indicate a significant effect of frequency of occurrence on the relative height of L*H and H*L pitch accents, and a significant but more subtle effect on the alignment of L*H accents.
14:50  On the Automatic ToBI Accent Type Identification from Data
César González-Ferreras (University of Valladolid)
Carlos Vivaracho-Pascual (University of Valladolid)
David Escudero-Mancebo (University of Valladolid)
Valentín Cardeñoso-Payo (University of Valladolid)
This contribution addresses the ToBI accent recognition problem with the goal of multiclass identification, as opposed to the more conservative Accent vs. No Accent approach. A neural network and a decision tree are used for automatic recognition of the ToBI accents in the Boston Radio Corpus. Multiclass classification results show the difficulty of the problem and the impact of imbalanced classes. A study of the confusion/similarity between accent types, based on in-pair recognition rates, shows its impact on the overall performance. More expressive F0 contour parametrization techniques have been used to improve recognition rates.
15:10  AuToBI -- A Tool for Automatic ToBI annotation
Andrew Rosenberg (Queens College / CUNY)
This paper describes the AuToBI system for automatic generation of hypothesized ToBI labels. While research on automatic prosodic annotation has been conducted for many years, AuToBI represents the first publicly available tool to automatically detect and classify the prosodic events that make up the ToBI annotation standard. This paper describes the feature extraction routines as well as the classifiers used to detect and classify ToBI tones. Additionally, we report performance evaluating AuToBI models trained on the Boston Directions Corpus on the Columbia Games Corpus. By reporting performance on distinct speakers, domains and recording conditions, this evaluation describes an accurate expectation of the performance of the system when applied to other material.

Speech Synthesis I: Unit Selection and Others

Time: Monday 13:30  Place: International Conference Room A  Type: Poster
Chair: Robert Clark
#1  A classifier-based target cost for unit selection speech synthesis trained on perceptual data
Volker Strom (Centre for Speech Technology Research, University of Edinburgh)
Simon King (Centre for Speech Technology Research, University of Edinburgh)
Our goal is to automatically learn a perceptually-optimal target cost function for a unit selection speech synthesiser. The approach we take here is to train a classifier on human perceptual judgements of synthetic speech. The output of the classifier is used to make a simple three-way distinction rather than to estimate a continuously-valued cost. In order to collect the necessary perceptual data, we synthesised 145,137 short sentences with the usual target cost switched off, so that the search was driven by the join cost only. We then selected the 7200 sentences with the best joins and asked 60 listeners to judge them, providing their ratings for each syllable. From this, we derived a rating for each demiphone. Using as input the same context features employed in our conventional target cost function, we trained a classifier on these human perceptual ratings. We synthesised two sets of test sentences with both our standard target cost and the new target cost based on the classifier. A/B preference tests showed that the classifier-based target cost, which was learned completely automatically from modest amounts of perceptual data, is almost as good as our carefully- and expertly-tuned standard target cost.
Wei Zhang (IBM T. J. Watson Research Center, Yorktown Heights, New York 10598 USA)
Xiaodong Cui (IBM T. J. Watson Research Center, Yorktown Heights, New York 10598 USA)
This paper presents an approach using phonetic context similarity as a cost function in unit selection for concatenative Text-to-Speech. The approach measures the degree of similarity between the desired context and the candidate segment under different phonetic contexts. It considers the impact of relatively distant contexts when plenty of candidates are available and can take advantage of data from other symbolically different contexts when candidates are sparse. Moreover, the cost function also provides an efficient way to prune the search space. Different parameters for modeling, normalization and integerization are discussed. MOS evaluation shows that it can improve synthesis quality significantly.
Mitsuaki Isogai (NTT Cyber Space Laboratories, NTT Corporation)
Hideyuki Mizuno (NTT Cyber Space Laboratories, NTT Corporation)
We propose a new speech database reduction method that can create efficient speech databases for concatenation-type corpus-based TTS systems. Our aim is to create small speech databases that can yield the highest quality speech output possible. The main points of the proposed method are as follows: (1) it has a 2-stage algorithm to reduce speech database size; (2) consideration of the real speech elements needed allows us to select the most suitable subset of a full-size database, yielding scalable downsized speech databases. A listening test shows that the proposed method can reduce a database from 13 hours to 10 hours with no degradation in output quality. Furthermore, synthesized speech using database sizes of 8 and 6 hours keeps a relatively high MOS of more than 3.5, i.e., 95% of the MOS obtained with the full-size database.
#4  Automatic Error Detection for Unit Selection Speech Synthesis Using Log Likelihood Ratio based SVM Classifier
Heng Lu (iFLYTEK Speech Lab, University of Science and Technology of China, P.R.China)
Zhen-hua Ling (iFLYTEK Speech Lab, University of Science and Technology of China, P.R.China)
Si Wei (iFLYTEK Speech Lab, University of Science and Technology of China, P.R.China)
Li-rong Dai (iFLYTEK Speech Lab, University of Science and Technology of China, P.R.China)
Ren-hua Wang (iFLYTEK Speech Lab, University of Science and Technology of China, P.R.China)
This paper proposes a method to automatically detect the errors in synthetic speech of a unit selection speech synthesis system using log likelihood ratio and support vector machine (SVM). For SVM training, a set of synthetic utterances is first generated by a given speech synthesis system and their synthesis errors are labeled by manually annotating the segments that sound unnatural. Then, two context-dependent acoustic models are trained using the natural and unnatural segments of the labeled synthetic speech respectively. The log likelihood ratio of acoustic features between these two models is adopted to train the SVM classifier for error detection. Experimental results show the proposed method is effective in detecting the errors of pitch contour within a word for a Mandarin speech synthesis system. The proposed SVM method using log likelihood ratio between context-dependent acoustic models outperforms an SVM classifier trained on acoustic features directly.
#5  Using Robust Viterbi Algorithm and HMM-Modeling in Unit Selection TTS to Replace Units of Poor Quality
Hanna Silen (Tampere University of Technology, Department of Signal Processing, Tampere, Finland)
Elina Helander (Tampere University of Technology, Department of Signal Processing, Tampere, Finland)
Jani Nurminen (Nokia Devices R&D, Tampere, Finland)
Konsta Koppinen (Tampere University of Technology, Department of Signal Processing, Tampere, Finland)
Moncef Gabbouj (Tampere University of Technology, Department of Signal Processing, Tampere, Finland)
In hidden Markov model-based unit selection synthesis, the benefits of both unit selection and statistical parametric speech synthesis are combined. However, the conventional Viterbi algorithm is forced to make a selection even when no suitable units are available. This can lead the search astray and decrease the overall quality. Consequently, we propose to use a robust Viterbi algorithm that can simultaneously detect bad units and select the best sequence. The unsuitable units are replaced using hidden Markov model-based synthesis. Evaluations indicate that the use of the robust Viterbi algorithm combined with unit replacement increases the quality compared to the traditional algorithm.
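The core idea can be sketched as a Viterbi search in which every position also offers a fixed-cost fallback candidate standing in for an HMM-generated unit; positions where the fallback wins are the ones flagged for replacement. All costs and units below are toy values, not the paper's actual cost functions.

```python
def robust_viterbi(targets, candidates, fallback_cost, join_w=1.0):
    """Unit-selection Viterbi in which every position additionally offers a
    fallback candidate (None) at a fixed cost, standing in for an
    HMM-generated unit.  Units here are plain floats (toy features);
    target cost = |unit - target|, join cost = |unit - prev unit|."""
    prev, first = {None: (0.0, [])}, True    # unit -> (cost, chosen sequence)
    for t, units in zip(targets, candidates):
        cur = {}
        for u in list(units) + [None]:
            tc = fallback_cost if u is None else abs(u - t)
            cost, path = min(
                ((c + tc + (0.0 if (first or u is None or p is None)
                            else join_w * abs(u - p)), pth)
                 for p, (c, pth) in prev.items()),
                key=lambda x: x[0])
            cur[u] = (cost, path + [u])
        prev, first = cur, False
    return min(prev.values(), key=lambda x: x[0])

targets = [1.0, 2.0, 3.0]
candidates = [[1.0, 1.2], [9.0, 8.5], [3.1, 2.9]]  # middle position: only bad units
cost, units = robust_viterbi(targets, candidates, fallback_cost=0.5)
```

Only the middle position, where every database unit is far from the target, gets marked for HMM-based replacement.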
#6Automatic detection of abnormal stress patterns in unit selection synthesis
Yeon-Jun Kim (AT&T Labs-Research)
Mark Beutnagel (AT&T Labs-Research)
This paper introduces a method to detect lexical stress errors in unit selection synthesis automatically using machine learning algorithms. If unintended stress patterns can be detected following unit selection, based on features available in the unit database, it may be possible to modify the units during waveform synthesis to correct errors and produce an acceptable stress pattern. In this paper, three machine learning algorithms were trained with acoustic measurements from natural utterances and corresponding stress patterns: CART, SVM and MaxEnt. Our experimental results showed that MaxEnt performs the best (83.3% for 3-syllable words, 88.7% for 4-syllable words correctly classified) in the natural stress pattern classification. Though classification rates are good, a large number of false alarms are produced. However, there is some indication that signal modifications based on false positives do little harm to the speech output.
#7Enhancements of Viterbi Search for Fast Unit Selection Synthesis
Daniel Tihelka (University of West Bohemia)
Jiri Kala (University of West Bohemia)
Jindrich Matousek (University of West Bohemia)
The paper describes the optimisation of the Viterbi search used in unit selection TTS, whose performance still suffers with the large speech corpus necessary to achieve a high level of naturalness. To improve search speed, a combination of sophisticated stopping schemes and pruning thresholds is incorporated into the baseline search. The optimised search is, moreover, extremely flexible to configure, requiring only three intuitively comprehensible coefficients to be set. This provides the means for tuning the search depending on device resources, while allowing a significant performance increase. To illustrate this, several configuration scenarios, with speed-ups ranging from 6 to 58 times, are presented. Their impact on speech quality is verified by a CCR listening test, taking into account only the phrases with the highest number of differences when compared to the baseline search.
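The pruning side of such a search can be sketched as a beam-pruned Viterbi pass over a lattice of candidate units. The version below is a hypothetical minimal illustration of the principle and does not reproduce the paper's stopping schemes or its three tuning coefficients:

```python
# Minimal beam-pruned Viterbi over a lattice of candidate units: at each
# position, hypotheses whose cost exceeds the best partial cost by more
# than `beam` are discarded. Illustrative only.
def viterbi_beam(target_costs, concat_cost, beam=5.0):
    """target_costs: list (one dict per position) of {unit: target cost};
    concat_cost(u, v): join cost between consecutive units."""
    best = {u: (c, [u]) for u, c in target_costs[0].items()}
    for step in target_costs[1:]:
        new = {}
        for v, tc in step.items():
            # best predecessor among surviving hypotheses
            new[v] = min((cost + concat_cost(path[-1], v) + tc, path + [v])
                         for cost, path in best.values())
        floor = min(c for c, _ in new.values())
        best = {v: (c, p) for v, (c, p) in new.items() if c - floor <= beam}
    return min(best.values())

# toy example: two positions, units are integers, join cost = |u - v|
steps = [{1: 0.0, 5: 0.5}, {2: 0.0, 6: 0.2}]
cost, path = viterbi_beam(steps, lambda u, v: abs(u - v))
print(cost, path)  # -> 1.0 [1, 2]
```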
#8Accurate Pitch Marking for Prosodic Modification of Speech Segments
Thomas Ewender (Speech Processing Group, Computer Engineering and Networks Laboratory ETH Zurich, Switzerland)
Beat Pfister (Speech Processing Group, Computer Engineering and Networks Laboratory ETH Zurich, Switzerland)
This paper describes a new approach to pitch marking. Unlike other approaches that use the same combination of features for the whole signal, we take the signal properties into account and apply different features according to a heuristic. We use the short-term energy as a novel robust feature for placing the pitch marks. Where the energy information turns out not to be a suitable indicator, we resort to the fundamental wave computed from a contiguous F0 contour in combination with detailed voicing information. Our experiments demonstrate that the proposed pitch marking algorithm considerably improves the quality of synthesised speech generated by a concatenative text-to-speech system that uses TD-PSOLA for prosodic modifications.
Shifeng Pan (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
Meng Zhang (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
Jianhua Tao (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
The paper investigates a new method to solve concatenation problems in Mandarin speech synthesis based on the hybrid approach of HMM-based speech synthesis and unit selection. Unlike other work that uses only boundary F0 errors as the concatenation cost, a CART-based F0 dependency model that considers rich context information is trained to measure the smoothness of F0. Instead of phoneme-sized units, the basic units of our HUS system are syllables, which have been shown to be better for prosody stability in Mandarin. The experiments show that the proposed method achieves better performance than the conventional hybrid system and unit selection system.
#10Modeling Liaison in French by Using Decision Trees
Josafa de Jesus Aguiar Pontes (Tokyo Institute of Technology, Japan)
Sadaoki Furui (Tokyo Institute of Technology, Japan)
French is known to be a language with major pronunciation irregularities at word endings with consonants. In particular, the well-known phonetic phenomenon called liaison is one of the major issues for French phonetizers. Rule-based methods have been used to address it, yet current models still produce too many pronunciation errors to be usable in second-language learning applications. In addition, the number of rules tends to be large and their interactions complex, making maintenance a problem. To alleviate these problems, we propose an approach that, starting from a database compiled from cases documented in the literature, allows us to build C4.5 decision trees and subsequently automate the generation of the required rules. A prototype based on our approach has been tested against six other state-of-the-art phonetizers. The comparison shows the prototype system is better than most of them, being equivalent to the second-ranked system.
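A toy version of the tree-learning step might look as follows. The features, examples and labels are invented for illustration, and scikit-learn's tree is CART rather than the C4.5 algorithm used in the paper:

```python
# Toy illustration of learning liaison rules from labeled examples with a
# decision tree. Features and data are invented; scikit-learn implements
# CART, not C4.5.
from sklearn.tree import DecisionTreeClassifier

# features: [word ends in latent consonant, next word starts with vowel,
#            word is a determiner/pronoun (strong liaison context)]
X = [
    [1, 1, 1],  # e.g. "les amis"  -> liaison
    [1, 1, 0],  # e.g. "grand ami" -> liaison
    [1, 0, 1],  # consonant follows -> no liaison
    [0, 1, 0],  # no latent consonant -> no liaison
    [1, 1, 1],
    [0, 0, 0],
]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[1, 1, 1], [0, 1, 1]]))  # -> [1 0]
```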
#11Improvement on Plural Unit Selection and Fusion
Jian Luan (Toshiba (China) Research and Development Center)
Jian Li (Toshiba (China) Research and Development Center)
Plural unit selection and fusion is a successful method for concatenative synthesis, yet its unit fusion algorithm is simple and leaves room for improvement. Previous research on unit fusion has mainly addressed boundary smoothing and is not well suited to the application mentioned above. Therefore, a high-quality unit fusion method is proposed in this paper. More accurate pitch frame alignment and primary unit selection are implemented. In addition, the fusion of pitch frames is performed on FFT spectra for less quality loss. Experimental results indicate that the proposed method clearly outperforms the baseline, with an overall preference ratio of 54:17.
#12Improving Speech Synthesis of Machine Translation Output
Alok Parlikar (Language Technologies Institute, Carnegie Mellon University)
Alan W. Black (Language Technologies Institute, Carnegie Mellon University)
Stephan Vogel (Language Technologies Institute, Carnegie Mellon University)
Speech synthesizers are optimized for fluent natural text. However, in a speech to speech translation system, they have to process machine translation output, which is often not fluent. Rendering machine translations as speech makes them even harder to understand than the synthesis of natural text. A speech synthesizer must deal with the disfluencies in translations in order to be comprehensible and communicate the content. In this paper, we explore three synthesis strategies that address different problems found in translation output. By carrying out listening tasks and measuring transcription accuracies, we find that these methods can make the synthesis of translations more intelligible.
#13Paraphrase generation to improve Text-To-Speech Synthesis
Ghislain Putois (Orange Labs)
Jonathan Chevelu (Orange Labs)
Cédric Boidin (Orange Labs)
Text-to-speech synthesiser systems are generally of good quality, especially when adapted to a specific task. Given this task and an adapted voice corpus, the message quality mainly depends on the wording used. This paper presents how a paraphrase generator can be used in synergy with a text-to-speech synthesis system to improve its overall performance. Our system is composed of a paraphrase generator using a French-to-French corpus learned from a bilingual aligned corpus, a TTS selector based on the unit selection cost, and a TTS synthesiser. We present an evaluation of the system, which highlights the need for systematic subjective evaluation.

ASR: Search, Decoding and Confidence Measures I

Time:Monday 13:30 Place:International Conference Room B Type:Poster
Chair:Takaaki Hori
#1Phone Mismatch Penalty Matrices for Two-Stage Keyword Spotting Via Multi-Pass Phone Recognizer
Han Chang Woo (Seoul National University)
Kang Shin Jae (Seoul National University)
Lee Chul Min (Seoul National University)
Kim Nam Soo (Seoul National University)
In this paper, we propose a novel approach to estimating three types of phone mismatch penalty matrices for two-stage keyword spotting. Given the output of a phone recognizer, text matching against the phone sequences of the specified keyword using the proposed phone mismatch penalty matrices is carried out to detect the keyword. The penalty matrices, which are estimated from the training data through deliberate error generation, account for substitution, insertion and deletion errors. In comparative experiments on a Korean continuous speech recognition task, the proposed approach shows a significant improvement.
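The matching step described here resembles a weighted Levenshtein alignment with separate substitution, insertion and deletion penalties. The sketch below uses invented penalty values, whereas the paper estimates them from training data via deliberate error generation:

```python
# Weighted edit-distance alignment of a recognized phone string against a
# keyword's phone sequence, with per-phone penalty matrices. Penalty
# values below are illustrative, not estimated ones.
def penalty_match(hyp, ref, sub, ins, dele):
    """sub[a][b]: cost of recognizing ref phone b as a; ins[a] / dele[b]:
    insertion and deletion costs. Returns the total alignment penalty."""
    n, m = len(hyp), len(ref)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + ins[hyp[i - 1]]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + dele[ref[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + sub[hyp[i - 1]][ref[j - 1]],
                          d[i - 1][j] + ins[hyp[i - 1]],
                          d[i][j - 1] + dele[ref[j - 1]])
    return d[n][m]

# toy phone set with one confusable pair ('b', 'p')
sub = {a: {b: 0.0 if a == b else (0.3 if {a, b} == {'b', 'p'} else 1.0)
           for b in 'bpk'} for a in 'bpk'}
ins = dict.fromkeys('bpk', 0.8)
dele = dict.fromkeys('bpk', 0.8)
print(penalty_match(list('pk'), list('bk'), sub, ins, dele))  # -> 0.3
```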
#2English Spoken Term Detection in Multilingual Recordings
Petr Motlicek (Idiap Research Institute, Martigny, Switzerland)
Fabio Valente (Idiap Research Institute, Martigny, Switzerland)
Philip Garner (Idiap Research Institute, Martigny, Switzerland)
This paper investigates the automatic detection of English spoken terms in a multi-language scenario over real lecture recordings. Spoken Term Detection (STD) is based on an LVCSR where the output is represented in the form of word lattices. The lattices are then used to search the required terms. Processed lectures are mainly composed of English, French and Italian recordings where the language can also change within one recording. Therefore, the English STD system uses an Out-Of-Language (OOL) detection module to filter out non-English input segments. OOL detection is evaluated w.r.t. various confidence measures estimated from word lattices. Experimental studies of OOL detection followed by English STD are performed on several hours of multilingual recordings. Significant improvement of OOL+STD over a stand-alone STD system is achieved (relatively more than 50% in EER). Finally, an additional modality (text slides in the form of PowerPoint presentations) is exploited to improve STD.
#3A Hybrid Approach to Robust Word Lattice Generation Via Acoustic-Based Word Detection
Icksang Han (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Chiyoun Park (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Jeongmi Cho (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Jeongsu Kim (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
A large-vocabulary continuous speech recognition (LVCSR) system usually utilizes a language model in order to reduce the complexity of the algorithm. However, this constraint also produces side-effects, including low accuracy on out-of-grammar sentences and error propagation from misrecognized words. In order to compensate for these side-effects of the language model, this paper proposes a novel lattice generation method that adopts ideas from keyword detection. By combining word candidates detected mainly from the acoustic aspect of the signal with the word lattice from the ordinary speech recognizer, a hybrid lattice is constructed. The hybrid lattice shows a 33% improvement in lattice accuracy at the same lattice density. In addition, the proposed model is less sensitive to out-of-grammar sentences and to error propagation due to misrecognized words.
#4Direct Observation of Pruning Errors (DOPE): A Search Analysis Tool
Volker Steinbiss (RWTH Aachen University)
Martin Sundermeyer (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)
The search for the optimal word sequence can be performed efficiently even in a speech recognizer with a very large vocabulary and complex models. This is achieved using pruning methods with empirically chosen parameters and a willingness to accept a certain amount of pruning errors. Unsatisfyingly, though, the state of the art is that such pruning errors are not detected directly; only their indirect consequences are, providing a rough picture of what happens during search. With the tool Direct Observation of Pruning Errors (DOPE), described in this paper, pruning errors are detected at the state hypothesis level, a very fine level of granularity, several orders of magnitude finer than the sentence level. This allows much more exact analyses, including the analysis of pruning methods and the effects of pruning parameters.
#5Direct Construction of Compact Context-Dependency Transducers From Data
David Rybach (RWTH Aachen University, Germany)
Michael Riley (Google Inc., USA)
This paper describes a new method for building compact context-dependency transducers for finite-state transducer-based ASR decoders. Instead of the conventional phonetic decision-tree growing followed by FST compilation, this approach incorporates the phonetic context splitting directly into the transducer construction. The objective function of the split optimization is augmented with a regularization term that measures the number of transducer states introduced by a split. We give results on a large spoken-query task for various n-phone orders and other phonetic features that show this method can greatly reduce the size of the resulting context-dependency transducer with no significant impact on recognition accuracy. This permits using context sizes and features that might otherwise be unmanageable.
#6 Incremental composition of static decoding graphs with label pushing
Miroslav Novak (IBM)
We present new results achieved in the application of an incremental graph composition algorithm, in particular using the label pushing method to further reduce the final graph size. In previous work we have shown that incremental composition is an efficient alternative to the conventional finite state transducer (FST) determinization-composition-minimization approach, with some limitations. One limitation was that the word labels had to stay aligned with the actual word ends. We describe an updated version of the algorithm which allows us to push the word labels relative to the word ends to increase the effect of minimization. The size of the resulting graph is now very close to that produced by the conventional FST approach with label pushing.
#7A Novel Path Extension Framework Using Steady Segment Detection for Mandarin Speech Recognition
Zhanlei Yang (National Laboratory of Pattern Recognition (NLPR) Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Wenju Liu (National Laboratory of Pattern Recognition (NLPR) Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Frame-based decoders make little use of long-span temporal knowledge, while segment-based decoders often suffer from complex computation. This paper proposes a novel decoding framework that integrates steady speech segment information into the path extension procedure. Firstly, as the baseline decoding system, a dynamic lexicon-tree copy recognizer is developed, which aims to accelerate the popular frame-based recognizer HTK. Steady segments, where the spectrum is stable, are extracted using landmark detection, and the detection results are provided to the subsequent decoding module. At the decoding stage, the traditional inter-HMM token spreading framework is modified using steady segment knowledge, based on the observation that a steady frame and an inter-HMM extension cannot coexist. Experiments conducted on Mandarin broadcast speech show that the character error rate and run time are reduced by 22.1% and 5.24% relative, respectively.
#8On the relation of Bayes Risk, Word Error, and Word Posteriors in ASR
Ralf Schlueter (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
Markus Nussbaum-Thom (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
Hermann Ney (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
In automatic speech recognition, we are faced with a well-known inconsistency: Bayes decision rule is usually used to minimize sentence (word sequence) error, whereas in practice we want to minimize word error, which also is the usual evaluation measure. Recently, a number of speech recognition approaches to approximate Bayes decision rule with word error (Levenshtein/edit distance) cost were proposed. Nevertheless, experiments show that the decisions often remain the same and that the effect on the word error rate is limited, especially at low error rates. In this work, further analytic evidence for these observations is provided. A set of conditions is presented, for which Bayes decision rule with sentence and word error cost function leads to the same decisions. Furthermore, the case of word error cost is investigated and related to word posterior probabilities. The analytic results are verified experimentally on several large vocabulary speech recognition tasks.
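The inconsistency discussed in this abstract can be written compactly in standard notation (not taken verbatim from the paper): with sentence-error (0-1) cost, the Bayes decision rule selects the most probable word sequence, whereas with word-error cost it minimizes the expected Levenshtein distance over competing hypotheses:

```latex
% Bayes decision rule with sentence-error (0-1) cost:
\hat{W} = \operatorname*{argmax}_{W} \; p(W \mid X)
% Bayes decision rule with word-error (Levenshtein) cost:
\hat{W} = \operatorname*{argmin}_{W} \; \sum_{V} p(V \mid X)\, \mathrm{Lev}(W, V)
```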
#9Time Condition Search in Automatic Speech Recognition Reconsidered
David Nolden (RWTH Aachen)
Hermann Ney (RWTH Aachen)
Ralf Schlueter (RWTH Aachen)
In this paper we re-investigate the time conditioned search (TCS) method in comparison to the well known word conditioned search, and analyze its applicability on state-of-the-art large vocabulary continuous speech recognition tasks. In contrast to current standard approaches, time conditioned search offers theoretical advantages particularly in combination with huge vocabularies and huge language models, but it is difficult to combine with across word modelling, which was proven to be an important technique in automatic speech recognition. Our novel contributions for TCS are a pruning step during the recombination called Early Word End Pruning, an additional recombination technique called Context Recombination, the idea of a Startup Interval to reduce the number of started trees, and a mechanism to combine TCS with across word modelling. We show that, with these techniques, TCS can outperform WCS on a current task.
#10Efficient Data Selection for Speech Recognition Based on Prior Confidence Estimation Using Speech and Context Independent Models
Satoshi KOBASHIKAWA (NTT Cyber Space Laboratories, NTT Corporation)
Taichi ASAMI (NTT Cyber Space Laboratories, NTT Corporation)
Yoshikazu YAMAGUCHI (NTT Cyber Space Laboratories, NTT Corporation)
Hirokazu MASATAKI (NTT Cyber Space Laboratories, NTT Corporation)
Satoshi TAKAHASHI (NTT Cyber Space Laboratories, NTT Corporation)
This paper proposes an efficient data selection technique to identify well-recognized texts in massive volumes of speech data. Conventional confidence measure techniques can be used to obtain such accurate data, but they require speech recognition results to estimate confidence. Without a significant level of confidence, considerable computing resources are wasted, since inaccurate recognition results are generated only to be rejected later. The technique proposed herein rapidly estimates a prior confidence based on just an acoustic likelihood calculation, using speech and context independent models, before speech recognition processing; it then selectively recognizes the data with high confidence. Simulations show that it matches the data selection performance of the conventional posterior confidence measure with less than 2% of the computation time.
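A minimal sketch of the prior-confidence idea, with an invented single-Gaussian stand-in for the paper's models and synthetic "utterances": frames are scored before recognition, and only utterances whose length-normalized log-likelihood clears a threshold are selected:

```python
# Toy sketch of prior confidence estimation: frames of each utterance are
# scored with a single context-independent Gaussian (a stand-in for the
# paper's models); high-scoring utterances are selected for recognition.
import numpy as np
from scipy.stats import multivariate_normal

ci_model = multivariate_normal(mean=np.zeros(3), cov=np.eye(3))

def prior_confidence(frames):
    # average per-frame log-likelihood (normalized by utterance length)
    return ci_model.logpdf(frames).mean()

rng = np.random.default_rng(1)
clean = rng.normal(0.0, 1.0, size=(100, 3))   # matches the model well
noisy = rng.normal(0.0, 3.0, size=(100, 3))   # mismatched utterance
selected = [name for name, utt in [("clean", clean), ("noisy", noisy)]
            if prior_confidence(utt) > -5.0]
print(selected)  # -> ['clean']
```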
#11A Novel Confidence Measure Based on Marginalization of Jointly Estimated Error Cause Probabilities
Atsunori Ogawa (NTT Corporation)
Atsushi Nakamura (NTT Corporation)
We propose a novel confidence measure based on the marginalization of jointly estimated error cause probabilities. Conventional confidence measures directly score the reliability of recognition results. In contrast, our method first calculates joint confidence and error cause probabilities and then sums them with respect to the error cause patterns to obtain the marginal confidence probability. We show experimentally that the confidence estimation accuracy obtained with the proposed method is significantly improved compared with that obtained with the conventional confidence measure.

Special-purpose speech applications

Time:Monday 13:30 Place:International Conference Room C Type:Poster
Chair:Su-Youn Yoon
#1Evaluation of a Silent Speech Interface Based on Magnetic Sensing
Robin Hofe (Department of Computer Science, University of Sheffield, UK)
Stephen R. Ell (Department of Engineering, University of Hull, UK)
Michael J. Fagan (Department of Engineering, University of Hull, UK)
James M. Gilbert (Department of Engineering, University of Hull, UK)
Phil D. Green (Department of Computer Science, University of Sheffield, UK)
Roger K. Moore (Department of Computer Science, University of Sheffield, UK)
Sergey I. Rybchenko (Department of Engineering, University of Hull, UK)
This paper reports on isolated word recognition experiments using a novel silent speech interface. The interface consists of magnetic pellets fixed to relevant speech articulators, and a set of magnetic field sensors that measure changes in the overall magnetic field created by these pellets during speech. The reported experiments demonstrate the effectiveness of this technique and show the suitability of the system, even at this early stage of development, for small vocabulary speech recognition.
#2Advanced Speech Communication System for Deaf People
Rubén San-Segundo (Speech Technology Group at Universidad Politécnica de Madrid)
Verónica López (Speech Technology Group at Universidad Politécnica de Madrid)
Raquel Martín (Speech Technology Group at Universidad Politécnica de Madrid)
Syaheerah Lufti (Speech Technology Group at Universidad Politécnica de Madrid)
Javier Ferreiros (Speech Technology Group at Universidad Politécnica de Madrid)
Ricardo Cordoba (Speech Technology Group at Universidad Politécnica de Madrid)
José Manuel Pardo (Speech Technology Group at Universidad Politécnica de Madrid)
This paper describes the development and field evaluation of an Advanced Speech Communication System for Deaf People. The system has two modules. The first one is a Spanish into Spanish Sign Language (LSE: Lengua de Signos Española) translation module made up of a speech recognizer, a natural language translator (for converting a word sequence into a sequence of signs), and a 3D avatar animation module (for playing back the signs). The second module is a Spoken Spanish generator from sign-writing composed of a visual interface (for specifying a sign sequence), a language translator (for generating a Spanish sentence), and finally, a text to speech converter. The system integrates three translation technologies: an example-based strategy, a rule-based translation method and a statistical translator. The field evaluation was carried out in the Local Traffic Office in the city of Toledo (Spain) involving real government employees and deaf people.
#3Unsupervised Acoustic Model Adaptation for Multi-Origin Non Native ASR
Sethserey Sam (Laboratoir d'Informatique de Grenoble (LIG)-France / MICA Research Center-Vietnam)
Eric Castelli (MICA Research Center-Vietnam)
Laurent Besacier (Laboratoire d'Informatique de Grenoble (LIG)-France)
To date, the performance of speech and language recognition systems is poor on non-native speech. The challenge for non-native speech recognition is to maximize the accuracy of a speech recognition system when only a small amount of non-native data is available. We report on acoustic model adaptation for improving the recognition of non-native speech in English, French and Vietnamese, spoken by speakers of different origins. Using online unsupervised acoustic model adaptation without any additional adaptation data, we investigate how an unsupervised multilingual acoustic model interpolation method can help to improve the phone accuracy of the system. An improvement of 7% absolute in phone-level accuracy (PLA) obtained in the experiments demonstrates the feasibility of the method.
#4Speech-Based Automated Cognitive Status Assessment
Dilek Hakkani-Tür (International Computer Science Institute (ICSI))
Dimitra Vergyri (SRI International, Speech Technology and Research Lab)
Gokhan Tur (SRI International, Speech Technology and Research Lab)
Verbal interviews performed by trained clinicians are a common form of assessment for measuring cognitive decline. The aim of this paper is to study the usability of automated methods for evaluating verbal cognitive status assessment tests for the elderly. If reliable, such methods for cognitive assessment can be used for frequent, non-intrusive, low-cost screenings and provide objective, longitudinal cognitive status monitoring data that can complement regular clinical visits and would be useful for early detection of conditions associated with language and communication impairments. This study focuses on two types of tests: a story-recall test, used for memory and language functioning assessment, and a picture description test, used to assess the information content of speech. A data collection was designed for this study involving recordings of about 100 people, mostly over 70 years old, performing these tests. The speech samples were manually transcribed and annotated with semantic units in order to obtain manual evaluation scores. We explore the use of automatic speech recognition and language processing methods to derive objective, automatically extracted metrics of cognitive status that are highly correlated with the manual scores. We use recall- and precision-based metrics over the semantic content units associated with the tests. Our experiments show high correlation between manually obtained scores and the automatic metrics obtained using either manual or automatic speech transcriptions.
#5Speech Recognition with a Seamlessly Updated Language Model for Real-Time Closed-Captioning
Toru Imai (NHK Science & Technology Research Laboratories)
Shinichi Homma (NHK Science & Technology Research Laboratories)
Akio Kobayashi (NHK Science & Technology Research Laboratories)
Takahiro Oku (NHK Science & Technology Research Laboratories)
Shoei Sato (NHK Science & Technology Research Laboratories)
It is desirable to consistently and seamlessly update the language model of a speech recognizer without stopping it, for online applications such as real-time closed-captioning. This paper proposes a novel speech recognition system that enables the model to be updated at any time, even while the system is running. It can run a second decoder with the latest model in parallel; an additional job process switches the active decoder at a non-speech portion, sending acoustic features only to the active decoder with the latest model and passing recognized words to the back-end manual error correction for closed-captioning. The system thus seamlessly updates the model and ensures uninterrupted speech recognition with the latest model at all times. Our new practical real-time closed-captioning system reduced word errors by two thirds with the proposed language model update mechanism in speech recognition and captioning experiments on Japanese broadcast news programs.
#6The comparison between the deletion-based methods and the mixing-based methods for audio CAPTCHA systems
Takuya Nishimoto (Graduate School of Information Science and Technology, the University of Tokyo)
Takayuki Watanabe (Department of Communication, Division of Human Science, School of Arts and Sciences, Tokyo Woman's Christian University)
Audio CAPTCHA systems, which distinguish between software agents and human beings, are especially important for persons with visual disabilities. The popular approach is based on mixing-based methods (MBM), which use mixed sounds of target speech and noise. We have proposed a deletion-based method (DBM) which uses the phonemic restoration effect. Our approach can control the difficulty of tasks simply through the masking ratio. In this paper, we propose a design principle for CAPTCHAs, according to which tasks should be designed so that a large difference in performance between machines and human beings is obtained. We also present experimental results that support the following hypotheses: (1) using MBM alone, the degree of task difficulty cannot be controlled easily; (2) using DBM, the degree of task difficulty and the safety of the CAPTCHA system can be controlled easily.
#7Comparing mono- and multilingual acoustic seed models for a low e-resourced language: a case-study of Luxembourgish
Martine Adda-Decker (LIMSI-CNRS)
Lori Lamel (LIMSI-CNRS)
Natalie Snoeren (LIMSI-CNRS)
Luxembourgish is embedded in a multilingual context on the divide between Romance and Germanic cultures and has often been viewed as one of Europe's under-resourced languages. We focus on the acoustic modeling of Luxembourgish. By taking advantage of monolingual acoustic seeds selected from German, French or English model sets via IPA symbol correspondences, we investigated whether Luxembourgish spoken words were globally better represented by one of these languages. Although speech in Luxembourgish is frequently interspersed with French words, forced alignments on these data showed a clear preference for Germanic acoustic models, with only limited use of French. German models provided the best match with 54% of the data, versus 35% for English and only 11% for French models. A further set of multilingual acoustic models, estimated from the pooled German, French, and English audio data, captured between 27% and 48% of the data depending on conditions.
#8Manipulating Tracheoesophageal Speech
R.J.J.H. van Son (Netherlands Cancer Institute/ACLC)
Irene Jacobi (Netherlands Cancer Institute)
Frans J. M. Hilgers (Netherlands Cancer Institute)
Speech therapy aiming at improving voice quality and speech intelligibility is often hampered by a lack of knowledge of the underlying deficits. One way to help speech therapists treat patients would be to supply synthetic benchmarks for pathological speech. In a listening experiment testing perceived intelligibility, three types of manipulations of tracheoesophageal speech were evaluated by experienced speech therapists. It was found that modeling the intensity contour of the voice source signal improved speech quality over plain analysis-synthesis. Replacing the voicing source with fully synthetic source periods decreased the perceived intelligibility markedly. Making the source fully periodic with a regular pitch had no effect on perceived intelligibility. Low-quality speech benefited more from manipulations, or deteriorated less, than high-quality speech.
#9Towards mixed language speech recognition systems
David Imseng (Idiap Research Institute, Martigny, Switzerland)
Hervé Bourlard (Idiap Research Institute, Martigny, Switzerland)
Mathew Magimai Doss (Idiap Research Institute, Martigny, Switzerland)
Multilingual speech recognition involves numerous research challenges, including common phoneme sets, adaptation on limited amounts of training data, and mixed language recognition (common in many countries, like Switzerland). In the latter case, it is not even possible to assume that the language being spoken is known in advance. This is the context and motivation of the present work. We investigate how current state-of-the-art speech recognition systems can be exploited in multilingual environments, where the language (from an assumed set of 5 possible languages, in our case) is not known a priori during recognition. We combine monolingual systems and extensively develop and compare different features and acoustic models. On SpeechDat(II) datasets, and in the context of isolated words, we show that it is possible to approach the performance of monolingual systems even when the identity of the spoken language is not known a priori.
#10Voice Search for Development
Etienne Barnard (Human Language Technologies Research Group, Meraka Institute, CSIR)
Johan Schalkwyk (Google Research)
Charl van Heerden (Human Language Technologies Research Group, Meraka Institute, CSIR)
Pedro J Moreno (Google Research)
In light of the serious problems with both illiteracy and information access in the developing world, there is a widespread belief that speech technology can play a significant role in improving the quality of life of developing-world citizens. We review the main reasons why this impact has not occurred to date, and propose that voice-search systems may be a useful tool in delivering on the original promise. The challenges that must be addressed to realize this vision are analyzed, and initial experimental results in developing voice search for two languages of South Africa (Zulu and Afrikaans) are summarized.
#11Cross-cultural Investigation of Prosody in Verbal Feedback in Interactional Rapport
Gina-Anne Levow (University of Chicago)
Susan Duncan (University of Chicago)
Edward King (University of Chicago)
Aspects of speech and non-verbal behavior allow conversational partners to establish and maintain rapport by signaling engagement or endorsement. In the verbal channel, these factors encompass requests for and production of vocal feedback, as well as lexical and grammatical mirroring. However, these cues are often subtle and culture-specific. Here, we present a preliminary investigation of the differences in elicitation and provision of vocal feedback across three diverse language/cultural groups: American English, Gulf/Iraqi Arabic, and Mexican Spanish. Based on a fully-transcribed and aligned sub-corpus of 80 interactions, we identify fundamental contrasts in the production of vocal feedback, including dramatic differences in the rates of listener verbal feedback across the groups. However, we find both similarities and differences in the use of prosodic cues across these groups. These differences will inform the development of culturally-sensitive conversational agents.
#12Using Oriented Optical Flow Histograms for Multimodal Speaker Diarization
Mary Tai Knox (International Computer Science Institute)
Gerald Friedland (International Computer Science Institute)
Speaker diarization is the task of partitioning an input stream into speaker-homogeneous regions, or in other words, determining "who spoke when." While approaches to this problem have traditionally relied entirely on the audio stream, the availability of accompanying video streams in recent diarization corpora has prompted the study of methods based on multimodal audio-visual features. In this work, we propose the use of robust video features based on oriented optical flow histograms. Using the state-of-the-art ICSI diarization system, we show that, when combined with standard audio features, these features improve the diarization error rate by 14% over an audio-only baseline.
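The oriented optical flow histogram feature mentioned above can be sketched compactly. The snippet below is a minimal illustration, not the authors' implementation: it bins the orientations of a precomputed dense flow field (from any optical-flow estimator), weighting each pixel by its flow magnitude; the bin count and normalisation are arbitrary choices.

```python
import numpy as np

def oriented_flow_histogram(flow, n_bins=8):
    """Histogram of optical-flow orientations, weighted by flow
    magnitude, as a compact per-frame motion descriptor.
    `flow` is an (H, W, 2) array of (dx, dy) vectors per pixel."""
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)                      # flow magnitude per pixel
    ang = np.arctan2(dy, dx) % (2 * np.pi)      # orientation in [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins, weights=mag, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist  # normalise to unit mass

# a field moving uniformly to the right puts all mass in bin 0
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
h = oriented_flow_histogram(flow)
```

Sequences of such per-frame histograms could then be fed alongside audio features into a diarization front end.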
#13Towards an ASR-free objective analysis of pathological speech
Catherine Middag (ELIS, Ghent University, Belgium)
Yvan Saeys (VIB, Ghent University, Belgium)
Jean-Pierre Martens (ELIS, Ghent University, Belgium)
Nowadays, intelligibility is a popular measure of the severity of the articulatory deficiencies of a pathological speaker. Usually, this measure is obtained by means of a perceptual test consisting of nonconventional and/or nonconnected words. In previous work, we developed a system incorporating two Automatic Speech Recognizers (ASR) that could fairly accurately estimate phoneme intelligibility (PI). In the present paper, we propose a novel method that aims to assess running speech intelligibility (RSI) as a more relevant indicator of the communication efficiency of a speaker in a natural setting. The proposed method computes a phonological characterization of the speaker by means of a statistical analysis of frame-level phonological features. Importantly, this analysis requires no knowledge of what the speaker was supposed to say. The new characterization is demonstrated to predict PI and to provide valuable information about the nature and severity of the pathology.

Speech analysis

Time:Monday 13:30 Place:International Conference Room D Type:Poster
Chair:Torbjorn Svendsen
#1Session Variability Contrasts in the MARP Corpus
Keith W. Godin (Center for Robust Speech Systems, The University of Texas at Dallas, U.S.A.)
John H. L. Hansen (Center for Robust Speech Systems, The University of Texas at Dallas, U.S.A.)
Intra-session and inter-session variability in the Multi-session Audio Research Project (MARP) corpus are contrasted in two experiments that exploit the long-term nature of the corpus. In the first experiment, Gaussian Mixture Models (GMMs) model 30-second session chunks, clustering chunks using the Kullback-Leibler (KL) divergence. Cross-session relationships are found to dominate the clusters. Secondly, session detection with 3 variations in training subsets is performed. Results showed that small changes in long-term characteristics are observed throughout the sessions. These results enhance understanding of the relationship between long-term and short-term variability in speech and will find application in speaker and speech recognition systems.
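As background for the clustering step described above, the KL divergence between diagonal-covariance Gaussians has a simple closed form, and its symmetrised version can serve directly as a clustering distance. This is an illustrative sketch only; the corpus-specific GMM modelling of 30-second chunks is not reproduced here.

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """Closed-form KL divergence D(N1 || N2) between two
    diagonal-covariance Gaussians."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    return 0.5 * np.sum(
        np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0
    )

def symmetric_kl(mu1, var1, mu2, var2):
    """Symmetrised divergence, usable as a clustering distance."""
    return (kl_diag_gauss(mu1, var1, mu2, var2)
            + kl_diag_gauss(mu2, var2, mu1, var1))

# identical distributions have zero divergence ...
d0 = symmetric_kl([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
# ... and a shifted mean increases the distance
d1 = symmetric_kl([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0])
```

A pairwise matrix of such distances between session chunks can then be passed to any agglomerative clustering routine.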
#2Estimation of Two-to-One Forced Selection Intelligibility Scores by Speech Recognizers Using Noise-Adapted Models
Kazuhiro Kondo (Yamagata University)
Takano Yusuke (Yamagata University)
We attempted to estimate subjective scores of the Japanese Diagnostic Rhyme Test (DRT), a two-to-one forced selection speech intelligibility test, using automatic speech recognizers with language models that force one of the words in the word-pair. The acoustic models were adapted to the speaker and then adapted to noise at a specified SNR. The match between subjective and recognition scores improved significantly when the adaptation and test noise levels matched. However, when the SNR conditions did not match, the recognition scores degraded, especially when the test SNR was higher than the adaptation level.
#3Analysis of Gender Normalization using MLP and VTLN Features
Thomas Schaaf (Multimodal Technologies Inc.)
Florian Metze (Carnegie Mellon University)
This paper analyzes the capability of multilayer perceptron frontends to perform speaker normalization. We find the phonetic context decision tree to be a very useful tool for assessing the speaker normalization power of different frontends. We introduce a gender question into the training of the phonetic context decision tree; after the context clustering, the gender-specific models are counted. We compare this for the following frontends: (1) Bottle-Neck (BN) with and without vocal tract length normalization (VTLN), (2) standard MFCC, (3) stacking of multiple MFCC frames with linear discriminant analysis (LDA). We find the BN frontend to be even more effective in reducing the number of gender questions than VTLN. From this we conclude that a Bottle-Neck frontend is more effective for gender normalization. Combining VTLN and BN features reduces the number of gender-specific models further.
#4Discovering an Optimal Set of Minimally Contrasting Acoustic Speech Units: A Point of Focus for Whole-Word Pattern Matching
Guillaume Aimetti (University of Sheffield)
Roger Moore (University of Sheffield)
Louis ten Bosch (Radboud University)
This paper presents a computational model that can automatically learn words, made up from emergent sub-word units, with no prior linguistic knowledge. This research is inspired by current cognitive theories of human speech perception and therefore strives for ecological plausibility, with the desire to build more robust speech recognition technology. First, the particulate structure of the raw acoustic speech signal is derived through a novel acoustic segmentation process, the `acoustic DP-ngram algorithm'. Then, using a cross-modal association learning mechanism, word models are derived as sequences of the segmented units. An efficient set of sub-word units emerges as a result of a general-purpose lossy compression mechanism and the algorithm's sensitivity to discriminating acoustic differences. The results show that the system can automatically derive robust word representations and dynamically build re-usable sub-word acoustic units with no pre-defined language-specific rules.
#5Improvements to the equal-parameter BIC for Speaker Diarization
Themos Stafylakis (National Technical University of Athens, Greece)
Xavier Anguera (Multimedia Research Group, Telefonica Research, Spain)
This paper discusses a set of modifications to the use of the Bayesian Information Criterion (BIC) for the speaker diarization task. We focus on the specific variant of the BIC that deploys models of equal - or roughly equal - statistical complexity under partitions with different numbers of speakers, and we examine three modifications. The first investigates a way to deal with the permutation-invariance property of the estimators when dealing with mixture models; the second is derived by attaching a weakly informative prior over the space of speaker-level state sequences. Finally, building on the recently proposed segmental-BIC approach, we examine its effectiveness when mixtures of Gaussians are used to model the emission probabilities of a speaker. The experiments are carried out on NIST Rich Transcription evaluation campaign meeting data and show improvement over the baseline setting.
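For readers unfamiliar with BIC-based segment merging, a minimal sketch of the classic (non-equal-parameter) delta-BIC between two segments follows; it is background for the abstract, not the authors' modified criterion. Negative values favour merging the two segments into a single speaker model; in the equal-parameter BIC variants discussed in the paper, the complexity penalty cancels between partitions.

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Classic full-covariance delta-BIC between two feature
    segments (frames x dims). Negative values favour merging."""
    x = np.vstack([x1, x2])
    n1, n2, n = len(x1), len(x2), len(x1) + len(x2)
    d = x.shape[1]

    def logdet(y):
        # log-determinant of the sample covariance of a segment
        return np.linalg.slogdet(np.cov(y, rowvar=False))[1]

    # penalty: extra parameters of the two-model hypothesis
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(x)
                  - n1 * logdet(x1)
                  - n2 * logdet(x2)) - penalty

rng = np.random.default_rng(0)
same = delta_bic(rng.normal(0, 1, (200, 4)), rng.normal(0, 1, (200, 4)))
diff = delta_bic(rng.normal(0, 1, (200, 4)), rng.normal(5, 1, (200, 4)))
```

Segments drawn from the same distribution yield a negative score (merge), while well-separated segments yield a positive one (keep apart).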
#6A Multistream Multiresolution Framework for Phoneme Recognition
Nima Mesgarani (Johns Hopkins University)
Samuel Thomas (Johns Hopkins University)
Hynek Hermansky (Johns Hopkins University)
Spectrotemporal representations of speech have already shown promising results in speech processing technologies; however, inherent issues of such representations, such as high dimensionality, have limited their use in speech and speaker recognition. A multistream framework fits such representations well, since different regions can be separately mapped into posterior probabilities of classes before merging. In this study, we investigated effective ways of forming streams out of this representation for robust phoneme recognition. We also investigated multiple ways of fusing the posteriors of different streams based on their individual confidence or on interactions between them. We observed a relative improvement of 8.6% in clean conditions and 4% in noise. We also developed a simple yet effective linear combination technique that provides an intuitive understanding of stream combinations and of how even systematic errors can be learned to reduce confusions.
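A linear combination of stream posteriors, as mentioned above, can be as simple as a weighted sum followed by renormalisation. The sketch below is illustrative only; the confidence-based weighting schemes studied in the paper are not reproduced.

```python
import numpy as np

def combine_streams(posteriors, weights=None):
    """Linear (weighted-sum) fusion of per-stream class posterior
    estimates. Each element of `posteriors` is a (frames x classes)
    array; rows of the fused output sum to one."""
    posteriors = [np.asarray(p, float) for p in posteriors]
    if weights is None:
        weights = np.full(len(posteriors), 1.0 / len(posteriors))
    fused = sum(w * p for w, p in zip(weights, posteriors))
    return fused / fused.sum(axis=1, keepdims=True)

# two streams disagreeing on a single 3-class frame
s1 = np.array([[0.7, 0.2, 0.1]])
s2 = np.array([[0.3, 0.6, 0.1]])
fused = combine_streams([s1, s2])
```

With equal weights the fused frame is the per-class average; per-stream confidences would simply replace the uniform `weights` vector.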
#7Cluster Analysis of Differential Spectral Envelopes on Emotional Speech
Giampiero Salvi (KTH, School of Computer Science and Communication, Dept. of Speech, Music and Hearing, Stockholm, Sweden)
Fabio Tesser (Institute of Cognitive Sciences and Technologies, Italian National Research Council, Padova, Italy)
Enrico Zovato (Loquendo S.p.A., Torino, Italy)
Piero Cosi (Institute of Cognitive Sciences and Technologies, Italian National Research Council, Padova, Italy)
This paper reports on the analysis of the spectral variation of emotional speech. Spectral envelopes of time aligned speech frames are compared between emotionally neutral and active utterances. Statistics are computed over the resulting differential spectral envelopes for each phoneme. Finally, these statistics are classified using agglomerative hierarchical clustering and a measure of dissimilarity between statistical distributions and the resulting clusters are analysed. The results show that there are systematic changes in spectral envelopes when going from neutral to sad or happy speech, and those changes depend on the valence of the emotional content (negative, positive) as well as on the phonetic properties of the sounds such as voicing and place of articulation.
#8Modeling pronunciation variation using context-dependent articulatory feature decision trees
Samuel Bowman (Linguistics, The University of Chicago)
Karen Livescu (TTI-Chicago)
We consider the problem of predicting the surface pronunciations of a word in conversational speech, using a feature-based model of pronunciation variation. We build context-dependent decision trees for both phone-based and feature-based models, and compare their perplexities on conversational data from the Switchboard Transcription Project. We find that feature-based decision trees using feature bundles based on articulatory phonology outperform phone-based decision trees, and are much more robust to reductions in training data. We also analyze the usefulness of various context variables.
#9Ungrounded independent factor analysis
Bhiksha Raj (Carnegie Mellon University)
Kevin Wilson (Mitsubishi Electric Research Labs)
Alexander Krueger (University of Paderborn)
Reinhold Haeb-Umbach (University of Paderborn)
We describe an algorithm that performs regularized non-negative matrix factorization (NMF) to find independent components in non-negative data. Previous techniques proposed for this purpose require the data to be grounded, with support that goes down to 0 along each dimension; in our work, this requirement is eliminated. Building on this, we present a technique to find a low-dimensional decomposition of spectrograms by casting it as the problem of discovering independent non-negative components, implemented as regularized NMF. Unlike other ICA algorithms, this algorithm computes the mixing matrix rather than an unmixing matrix. It provides a better decomposition than standard NMF when the underlying sources are independent, and makes better use of additional observation streams than previous non-negative ICA algorithms.
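For context, standard (unregularized) NMF with Euclidean multiplicative updates looks as follows; the paper's regularized, independence-seeking variant is not reproduced here, and all sizes below are illustrative.

```python
import numpy as np

def nmf(V, r, iters=500, seed=0):
    """Standard multiplicative-update NMF (Euclidean loss):
    V ~= W @ H with all factors kept non-negative."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 1e-3
    H = rng.random((r, m)) + 1e-3
    for _ in range(iters):
        # Lee-Seung multiplicative updates preserve non-negativity
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H

# exact rank-2 non-negative data is recovered with low error
rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 30))
W, H = nmf(V, r=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Applied to a magnitude spectrogram, the columns of `W` play the role of spectral basis vectors and the rows of `H` their activations over time.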
#10Signal interaction and the Devil Function
John R. Hershey (IBM T. J. Watson Research Center)
Peder A. Olsen (IBM T. J. Watson Research Center)
Steven J. Rennie (IBM T. J. Watson Research Center)
It is common in signal processing to model signals in the log power spectrum domain. In this domain, when multiple signals are present, they combine in a nonlinear way. If the phases of the signals are independent, then we can analyze the interaction in terms of a probability density we call the "devil function," after its treacherous form. This paper derives an analytical expression for the devil function, and discusses its properties with respect to model-based signal enhancement. Exact inference in this problem requires integrals involving the devil function that are intractable. Previous methods have used approximations to derive closed-form solutions. However it is unknown how these approximations differ from the true interaction function in terms of performance. We propose Monte-Carlo methods for approximating the required integrals. Tests are conducted on a speech separation and recognition problem to compare these methods with past approximations.
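The phase-uncertainty interaction described above can be illustrated with a small Monte-Carlo experiment (a sketch, not the paper's estimator). For two signals with fixed log-powers and an independent uniform relative phase, |x+y|^2 = p_x + p_y + 2*sqrt(p_x*p_y)*cos(phi), and the expected log-power works out to max(log p_x, log p_y) exactly, which is one reason the common "max" approximation behaves well in expectation.

```python
import numpy as np

def mc_log_power_sum(log_px, log_py, n=200_000, seed=0):
    """Monte-Carlo estimate of E[log |x + y|^2] when x and y have
    fixed log-powers but an independent uniform random phase:
    |x + y|^2 = px + py + 2*sqrt(px*py)*cos(phi)."""
    rng = np.random.default_rng(seed)
    px, py = np.exp(log_px), np.exp(log_py)
    phi = rng.uniform(0.0, 2 * np.pi, n)
    return np.mean(np.log(px + py + 2 * np.sqrt(px * py) * np.cos(phi)))

# the sample mean lands on the "max" value max(log_px, log_py)
exact = mc_log_power_sum(0.0, -3.0)
approx = max(0.0, -3.0)
```

The higher moments of the interaction (and hence the full density) are where the closed-form approximations discussed in the paper diverge from the true function.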

Special Session: Models of Speech - In Search of Better Representations

Time:Monday 13:30 Place:301 Type:Special
Chair:Hideki Kawahara & Hansjoerg Mixdorff
13:30A procedure for estimating gestural scores from natural speech
Hosung Nam (Haskins Laboratories)
Vikramjit Mitra (Institute for Systems Research, Department of Electrical and Computer Engineering, University of Maryland, College Park)
Mark Tiede (Haskins Laboratories, R.L.E., MIT)
Elliot Saltzman (Department of Electrical & Computer Engineering, Univ. of Illinois, Urbana-Champaign)
Louis Goldstein (Department of Linguistics, Univ. of Southern California)
Carol Espy-Wilson (Institute for Systems Research, Department of Electrical and Computer Engineering, University of Maryland, College Park)
Mark Hasegawa-Johnson (Department of Electrical & Computer Engineering, Univ. of Illinois, Urbana-Champaign)
Speech can be represented as a constellation of constriction events (gestures), defined over vocal tract variables, in the form of a gestural score. Gestures and their output trajectories (tract variables), which are available only for synthetic speech, have recently been shown to improve ASR performance. We introduce a landmark-based time-warping procedure for annotating gestures on a natural speech database. For a given utterance, the Haskins Laboratories TADA model is used to generate a gestural score and acoustic output, and an optimal gestural score is estimated through iterative time-warping based on landmark (phone) comparison.
13:50On the interdependencies between voice quality, glottal gaps, and voice-source related acoustic measures
Yen-Liang Shue (University of California, Los Angeles)
Gang Chen (University of California, Los Angeles)
Abeer Alwan (University of California, Los Angeles)
In human speech production, the voice source contains important non-lexical information, especially relating to a speaker's voice quality. In this study, direct measurements of the glottal area waveforms were used to examine the effects of voice quality and glottal gaps on voice source model parameters and various acoustic measures. Results showed that the open quotient parameter, cepstral peak prominence (CPP) and most spectral tilt measures were affected by both voice quality and glottal gaps, while the asymmetry parameter was predominantly affected by voice quality, especially of the breathy type. This was also the case with the harmonic-to-noise ratio measures, indicating the presence of more spectral noise for breathy phonations. Analysis showed that the acoustic measure H1-H2 was correlated with both the open quotient and asymmetry source parameters, which agrees with existing theoretical studies.
14:10Simplification and extension of non-periodic excitation source representations for high-quality speech manipulation systems
Hideki Kawahara (Wakayama University)
Masanori Morise (Ritsumeikan University)
Toru Takahashi (Kyoto University)
Hideki Banno (Meijo University)
Ryuichi Nisimura (Wakayama University)
Toshio Irino (Wakayama University)
A systematic framework for non-periodic excitation source representation is proposed for high-quality speech manipulation systems such as TANDEM-STRAIGHT, which is basically a channel VOCODER. The proposed method consists of two subsystems for non-periodic components; a colored noise source and an event analyzer/generator. The colored noise source is represented by using a sigmoid model with non-linear level conversion. Two model parameters, boundary frequency and slope parameters, are estimated based on pitch range linear prediction combined with F0 adaptive temporal axis warping and those on the original temporal axis. The event subsystem detects events based on kurtosis of filtered speech signals. The proposed framework provides significant quality improvement for high-quality recorded speech materials.
14:30Phase equalization-based autoregressive model of speech signals
Sadao Hiroya (NTT Communication Science Laboratories)
Takemi Mochida (NTT Communication Science Laboratories)
This paper presents a novel method for estimating a vocal-tract spectrum from speech signals, based on a modeling of excitation signals of voiced speech. A formulation of linear prediction coding with impulse train is derived and applied to the phase-equalized speech signals, which are converted from the original speech signals by phase equalization. Preliminary results show that the proposed method improves the robustness of the estimation of a vocal-tract spectrum and the quality of re-synthesized speech compared with the conventional method. This technique will be useful for speech coding, speech synthesis, and real-time speech conversion.
14:50Articulatory-Functional Modeling of Speech Prosody: A Review
Yi Xu (University College London, UK)
Santitham Prom-on (King Mongkut’s University of Technology Thonburi, Thailand)
Natural prosody is produced by an articulatory system to convey communicative meanings. It is therefore desirable for prosody modeling to represent both articulatory mechanisms and communicative functions. There are doubts, however, as to whether such representation is necessary or beneficial if the aim of modeling is to just generate perceptually acceptable output. In this paper we briefly review models that have attempted to implement representations of either or both aspects of prosody. We show that, at least theoretically, it is beneficial to represent both articulatory mechanisms and communicative functions even if the goal is to just simulate surface prosody.
15:10Two new estimation methods for a superpositional intonation model
Humberto Maximiliano Torres (Laboratorio de Investigaciones Sensoriales, Hospital de Clínicas, UBA, Argentina)
Hansjörg Mixdorff (Department of Computer, BHT Berlin University of Applied Sciences, Germany)
Jorge Alberto Gurlekian (Laboratorio de Investigaciones Sensoriales, Hospital de Clínicas, UBA, Argentina)
Hartmut Pfitzinger (Inst. of Phonetics and Digital Speech Processing, Christian-Albrechts-University, Germany)
This work presents two new approaches to parameter estimation for the superpositional intonation model for German. These approaches introduce linguistic and paralinguistic assumptions that allow the initialization of a previous standard method. Additionally, all restrictions on the configuration of accents were eliminated. The proposed linguistic hypotheses can be based on either tonal or lexical accent, which gives rise to two different estimation methods. These two kinds of hypotheses were validated by comparing estimation performance against two standard methods, one manual and one automatic. The results show that the proposed methods far exceed the performance of the automatic method and slightly surpass the manual reference method.

Systems for LVCSR

Time:Monday 16:00 Place:Hall A/B Type:Oral
Chair:Lori Lamel
16:00Semi-automated Update of Automatic Transcription System for the Japanese National Congress
Yuya Akita (Kyoto University)
Masato Mimura (Kyoto University)
Graham Neubig (Kyoto University)
Tatsuya Kawahara (Kyoto University)
Updating acoustic and language models is vital to maintaining the performance of automatic speech recognition (ASR) systems. To reduce the effort of updating models, we propose a "semi-automated" framework for the ASR system of the Japanese National Congress. The framework consists of our speaking-style transformation (SST) and lightly-supervised training approaches, which can automatically generate spoken-style training texts and labels from documents such as meeting minutes. An experimental evaluation demonstrated that this update framework improved the ASR performance on the latest meeting data. We also present an estimation method for ASR accuracy based on SST, which uses minutes as reference texts and does not require verbatim transcripts.
16:20Language Model Cross Adaptation For LVCSR System Combination
Xunying Liu (Cambridge University)
Mark Gales (Cambridge University)
Phil Woodland (Cambridge University)
State-of-the-art large vocabulary continuous speech recognition (LVCSR) systems often combine outputs from multiple sub-systems developed at different sites. Cross-system adaptation can be used as an alternative to direct hypothesis-level combination schemes such as ROVER. In normal cross adaptation it is assumed that useful diversity among systems exists only at the acoustic level. However, complementary features among complex LVCSR systems also manifest themselves in other layers of the modelling hierarchy, e.g., the subword and word levels. It is thus interesting to also cross-adapt language models (LMs) to capture them. In this paper, cross adaptation of multi-level LMs modelling both syllable and word sequences was investigated to improve LVCSR system combination. Significant error rate gains of 6.7% relative were obtained over ROVER and acoustic-model-only cross adaptation when combining 13 Chinese LVCSR sub-systems used in the 2010 DARPA GALE evaluation.
16:40Large Vocabulary Continuous Speech Recognition Using WFST-based Linear Classifier for Structured Data
Shinji Watanabe (NTT Corporation)
Takaaki Hori (NTT Corporation)
Atsushi Nakamura (NTT Coporation)
This paper describes a discriminative approach that further advances the framework for Weighted Finite State Transducer (WFST) based decoding. The approach introduces additional linear models for adjusting the scores of a decoding graph composed of conventional information source models, and reviews the WFST-based decoding process as a linear classifier for structured data. The difficulty with the approach is that the number of dimensions of the additional linear models becomes very large in proportion to the number of arcs in a WFST, and our previous study only applied it to a small task. This paper proposes a training method for a large-scale linear classifier employed in WFST-based decoding by using a distributed perceptron algorithm. The experimental results show that the proposed approach was successfully applied to a large vocabulary continuous speech recognition task, and achieved an improvement compared with the performance of the discriminative training of acoustic models.
17:00Accelerating Hierarchical Acoustic Likelihood Computation on Graphics Processors
Pavel Kveton (IBM)
Miroslav Novak (IBM)
The paper presents a method for performance improvements of a speech recognition system by moving a part of the computation - acoustic likelihood computation - onto a Graphics Processor Unit (GPU). In the system, GPU operates as a low cost powerful coprocessor for linear algebra operations. The paper compares GPU implementation of two techniques of acoustic likelihood computation: full Gaussian computation of all components and a significantly faster Gaussian selection method using hierarchical evaluation. The full Gaussian computation is an ideal candidate for GPU implementation because of its matrix multiplication nature. The hierarchical Gaussian computation is a technique commonly used on a CPU since it leads to much better performance by pruning the computation volume. Pruning techniques are generally much harder to implement on GPUs, nevertheless, the paper shows that hierarchical Gaussian computation can be efficiently implemented on GPUs.
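The matrix-multiplication structure of full Gaussian likelihood computation can be seen by expanding the diagonal-covariance log-density: the frame-by-component log-likelihood matrix reduces to two matrix multiplies plus a per-component constant, which is exactly the kind of GEMM workload GPUs excel at. Below is a CPU-side numpy sketch of that reformulation (illustrative; the GPU kernel and the hierarchical pruning are not shown).

```python
import numpy as np

def diag_gauss_loglik_matrix(X, means, variances):
    """Log-likelihoods of T frames under M diagonal-covariance
    Gaussians, computed as two matrix multiplies plus a constant:
    log N(x) = x @ (mu/var) - 0.5 * x^2 @ (1/var) + c.
    Returns a (T, M) matrix."""
    A = means / variances                 # (M, D) linear term
    B = -0.5 / variances                  # (M, D) quadratic term
    c = -0.5 * np.sum(np.log(2 * np.pi * variances)
                      + means ** 2 / variances, axis=1)  # (M,)
    return X @ A.T + (X ** 2) @ B.T + c   # (T, M)

# agrees with the direct per-frame, per-component formula
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
mu = rng.normal(size=(4, 3))
var = rng.random((4, 3)) + 0.5
L = diag_gauss_loglik_matrix(X, mu, var)
direct = -0.5 * (np.log(2 * np.pi * var[0]).sum()
                 + ((X[0] - mu[0]) ** 2 / var[0]).sum())
```

Hierarchical Gaussian selection, by contrast, prunes which columns of this matrix ever get computed, which is why it maps less naturally onto GPUs.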
17:20Search by Voice in Mandarin Chinese
Jiulong Shan (Google)
Genqing Wu (Google)
Martin Jansche (Google)
Pedro J. Moreno (Google)
In this paper we describe our efforts to build a Mandarin Chinese voice search system. We describe our strategies for data collection, language, lexicon and acoustic modeling, as well as issues related to text normalization that are an integral part of building voice search systems. We show excellent performance on typical spoken search queries under a variety of accents and acoustic conditions. The system has been in operation since October 2009 and has received very positive user reviews.
17:40The AMIDA 2009 Meeting Transcription System
Thomas Hain (Univ Sheffield)
Lukas Burget (Brno Univ. of Technology)
John Dines (Idiap)
Philip N. Garner (Idiap)
Asmaa El Hannani (Univ. Sheffield)
Marijn Huijbregts (Univ. Twente)
Martin Karafiat (Brno Univ. of Technology)
Mike Lincoln (Univ. of Edinburgh)
Wan Vincent (Univ. Of Sheffield)
We present the AMIDA 2009 system for participation in the NIST RT'2009 STT evaluations. Systems for close-talking, far field and speaker attributed STT conditions are described. Improvements to our previous systems are: segmentation and diarisation; stacked bottle-neck posterior feature extraction; fMPE training of acoustic models; adaptation on complete meetings; improvements to WFST decoding; automatic optimisation of decoders and system graphs. Overall these changes gave a 6-13% relative reduction in word error rate while at the same time reducing the real-time factor by a factor of five and using considerably less data for acoustic model training.

Speaker characterization and recognition I

Time:Monday 16:00 Place:201A Type:Oral
Chair:William Campbell
16:00Simple and Efficient Speaker Comparison using Approximate KL Divergence
William Campbell (MIT Lincoln Laboratory)
Zahi Karam (MIT Lincoln Laboratory, DSPG Research Laboratory of Electronics at MIT)
We describe a simple, novel, and efficient system for speaker comparison with two main components. First, the system uses a new approximate KL divergence distance extending earlier GMM parameter vector SVM kernels. The approximate distance incorporates data-dependent mixture weights as well as the standard MAP-adapted GMM mean parameters. Second, the system applies a weighted nuisance projection method for channel compensation. A simple eigenvector method of training is presented. The resulting speaker comparison system is straightforward to implement and is computationally simple---only two low-rank matrix multiplies and an inner product are needed for comparison of two GMM parameter vectors. We demonstrate the approach on a NIST 2008 speaker recognition evaluation task. We provide insight into what methods, parameters, and features are critical for good performance.
16:20The IIR NIST SRE 2008 and 2010 Summed Channel Speaker Recognition Systems
Hanwu Sun (Institute for Infocomm Research)
Bin Ma (Institute for Infocomm Research)
Chien-Lin Huang (Institute for Infocomm Research)
Trung Hieu Nguyen (Institute for Infocomm Research)
Haizhou Li (Institute for Infocomm Research)
This paper describes the IIR speaker recognition system for the summed channel evaluation tasks in the 2008 and 2010 NIST SREs. The system includes three main modules: voice activity detection, speaker diarization and speaker recognition. The front-end employs a spectral-subtraction-based voice activity detection algorithm for effective speech frame selection. The speaker diarization system applied in the 2007 and 2009 NIST RTs is adopted for the summed channel speech segmentation, and a hybrid purifying and clustering algorithm is used to cluster the summed channel speech into two speaker clusters. The GMM-SVM speaker recognition system is adopted to evaluate the performance with both MFCC and LPCC features. The system achieves competitive overall EER rates of 3.46% on the 1conv-summed task and 1.87% on the 8conv-summed task, respectively, when only English trials are involved.
16:40Speaker Characterization Using Long-Term and Temporal Information
Chien-Lin Huang (Department of Human Language Technology, Institute for Infocomm Research, Singapore)
Hanwu Sun (Department of Human Language Technology, Institute for Infocomm Research, Singapore)
Bin Ma (Department of Human Language Technology, Institute for Infocomm Research, Singapore)
Haizhou Li (Department of Human Language Technology, Institute for Infocomm Research, Singapore)
This paper presents new techniques for front-end analysis using long-term and temporal information for speaker recognition. We propose a long-term feature analysis strategy that averages short-time spectral features over a period of time in an effort to capture the speaker traits that are manifested over a speech segment longer than a spectral frame. We found that the moving averages of temporal information are effective in speaker recognition as well. The experiments on the 2008 NIST Speaker Recognition Evaluation dataset show the long-term and temporal information contribute to substantial EER reductions.
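The long-term feature analysis described above amounts to averaging short-time features over windows much longer than a spectral frame; a minimal sketch with illustrative window and hop sizes follows (the paper's exact configuration is not reproduced).

```python
import numpy as np

def long_term_average(features, win=30, hop=10):
    """Average short-time features over longer windows.
    `features` is (frames x dims); returns (segments x dims),
    one averaged vector per window position."""
    T = features.shape[0]
    starts = range(0, max(T - win + 1, 1), hop)
    return np.array([features[s:s + win].mean(axis=0) for s in starts])

# 100 frames of a 1-dimensional ramp feature
feats = np.arange(100, dtype=float).reshape(100, 1)
lt = long_term_average(feats, win=30, hop=10)
```

The moving averages of temporal (delta-style) trajectories mentioned in the abstract would be computed the same way over difference features.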
17:00Score-level Compensation of Extreme Speech Duration Variability in Speaker Verification
Sergio Perez-Gomez (Universidad Autonoma de Madrid)
Daniel Ramos-Castro (Universidad Autonoma de Madrid)
Javier Gonzalez-Dominguez (Universidad Autonoma de Madrid)
Joaquin Gonzalez-Rodriguez (Universidad Autonoma de Madrid)
In this work we aim at compensating for the degrading effects of utterance length variability on speaker verification systems, which appear in many typical applications such as forensics. The paper concentrates on the score misalignments due to different utterance lengths, proposing several algorithms for their normalization. In order to test the proposed methods, we built two corpora from NIST SRE 2006 and 2008 data to simulate high utterance length variability. Results show an improvement in overall system performance for all the proposed algorithms, which is significant even when score normalization techniques such as T-Norm are used.
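For reference, the T-Norm mentioned above standardises a trial score using the scores of the same test utterance against a cohort of impostor models; a minimal sketch follows (the paper's duration-compensation algorithms are separate from, and complementary to, this).

```python
import numpy as np

def t_norm(score, cohort_scores):
    """Test normalisation (T-Norm): standardise a trial score by
    the mean and standard deviation of the same test utterance
    scored against a cohort of impostor models."""
    cohort = np.asarray(cohort_scores, float)
    return (score - cohort.mean()) / cohort.std(ddof=0)

# a trial score well above its impostor cohort
z = t_norm(2.5, [0.5, 1.0, 1.5])
```

After normalisation, a single global decision threshold can be applied across test utterances with different score distributions.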
17:20Speaker Recognition Experiments using Connectionist Transformation Network Features
Alberto Abad (INESC-ID Lisboa, Portugal)
Isabel Trancoso (IST/INESC-ID Lisboa, Portugal)
The use of adaptation transforms common in speech recognition systems as features for speaker recognition is an appealing alternative approach to conventional short-term cepstral modelling of speaker characteristics. Recently, we have shown that it is possible to use transformation weights derived from adaptation techniques applied to the Multi Layer Perceptrons that form a connectionist speech recognizer. The proposed method - named Transformation Network features with SVM modelling (TN-SVM) - showed promising results on a sub-set of NIST SRE 2008 and allowed further improvements when it was combined with baseline systems. In this paper, we summarize the recently proposed TN-SVM approach and present new results. First, we explore two alternative approaches that may be used in the absence of high quality speech transcriptions. Second, we present results of the proposed approach with Nuisance Attribute Projection for session variability compensation.
17:40Speaker Recognition using Supervised Probabilistic Principal Component Analysis
Yun Lei (University of Texas at Dallas)
John Hansen (University of Texas at Dallas)
In this study, a supervised probabilistic principal component analysis (SPPCA) model is proposed in order to integrate speaker label information into a factor analysis approach, using the well-known probabilistic principal component analysis (PPCA) model under a support vector machine (SVM) framework. The latent factor from the proposed model is believed to be more discriminative than that from the PPCA model. The proposed model, combined with different types of intersession compensation techniques in the back-end, is evaluated on the National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) 2008 corpus, along with a comparison to the PPCA model.

Source separation

Time:Monday 16:00 Place:201B Type:Oral
Chair:Masashi Unoki
Robert Peharz (Graz University of Technology)
Michael Stark (Graz University of Technology)
Franz Pernkopf (Graz University of Technology)
Yannis Stylianou (University of Crete)
We propose a probabilistic factorial sparse coder model for single channel source separation in the magnitude spectrogram domain. The mixture spectrogram is assumed to be the sum of the sources, which are assumed to be generated frame-wise as the output of sparse coders plus noise. For dictionary training we use an algorithm which can be described as non-negative matrix factorization with ℓ0 sparseness constraints. In order to infer likely source spectrogram candidates, we approximate the intractable exact inference by maximizing the posterior over a plausible subset of solutions. We compare our system to the factorial-max vector quantization model, over which the proposed method shows superior performance in terms of signal-to-interference ratio. Finally, the low computational requirements of the algorithm allow close-to-real-time applications.
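NMF with an ℓ0 sparseness constraint, as used above for dictionary training, can be sketched by zeroing all but the k largest activations per frame after each multiplicative update. This is a simplified toy illustration; the rank, sparsity level k, and iteration count are assumptions, not the authors' configuration.

```python
import numpy as np

def nmf_l0(V, rank=5, k=2, iters=100, seed=0):
    """NMF with a hard l0 constraint: after each multiplicative
    update of the activations H, keep only the k largest entries
    per column (frame). V is a non-negative magnitude spectrogram."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 1e-3   # dictionary (spectral atoms)
    H = rng.random((rank, T)) + 1e-3   # per-frame activations
    eps = 1e-9
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # l0 projection: zero all but the k largest entries per column
        idx = np.argsort(H, axis=0)[:-k, :]
        np.put_along_axis(H, idx, 0.0, axis=0)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).standard_normal((20, 30)))
W, H = nmf_l0(V, rank=5, k=2)
```

The projection step is what distinguishes this from plain NMF: every frame is explained by at most k dictionary atoms.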
Yasmina Benabderrahmane (INRS-EMT Telecommunications Canada)
Sid Ahmed Selouani (Université de Moncton Canada)
Douglas O’Shaughnessy (INRS-EMT Telecommunications Canada)
This paper deals with blind speech separation of convolutive mixtures of sources. The separation criterion is based on Oriented Principal Component Analysis (OPCA) in the frequency domain. OPCA is a (second order) extension of standard Principal Component Analysis (PCA) aiming at maximizing the power ratio of a pair of signals. The convolutive mixing is obtained by modeling the Head Related Transfer Function (HRTF). Experimental results show the efficiency of the proposed approach in terms of subjective and objective evaluation, when compared to the Degenerate Unmixing Estimation Technique (DUET) and the widely used C-FICA (Convolutive Fast-ICA) algorithm.
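The OPCA criterion above, maximizing the power ratio of a pair of signals, reduces to a generalized eigenvalue problem. The sketch below shows only that core step, not the full frequency-domain separation pipeline; the covariance matrices are toy values.

```python
import numpy as np
from scipy.linalg import eigh

def opca_direction(A, B):
    """Oriented PCA: find the direction w maximizing the power
    ratio (w' A w) / (w' B w), i.e. the top generalized
    eigenvector of the matrix pencil (A, B)."""
    vals, vecs = eigh(A, B)           # eigenvalues in ascending order
    return vecs[:, -1], vals[-1]      # top eigenvector and its ratio

# toy covariance matrices (B = identity reduces this to ordinary PCA)
A = np.array([[4.0, 1.0], [1.0, 2.0]])
B = np.eye(2)
w, ratio = opca_direction(A, B)
```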
16:40Online Gaussian Process for Nonstationary Speech Separation
Hsin-Lung Hsieh (National Cheng Kung University)
Jen-Tzung Chien (National Cheng Kung University)
A practical speech enhancement system must enhance speech from mixed signals corrupted by nonstationary source signals and mixing conditions. The source voices may come from different moving speakers; speakers may abruptly appear or disappear and may be permuted continuously. To deal with such scenarios with a varying number of sources, we present a new method for nonstationary speech separation. An online Gaussian process independent component analysis (OLGP-ICA) is developed to characterize the real-time temporal structure of the time-varying mixing system and to capture the evolving statistics of independent sources from online observed signals. A variational Bayes algorithm is established to estimate the evolving parameters for dynamic source separation. In the experiments, the proposed OLGP-ICA is compared with other ICA methods and is shown to be effective in recovering speech and music signals in a nonstationary speaking environment.
17:00Convexity and Fast Speech Extraction by Split Bregman Method
Meng Yu (Department of Mathematics, University of California, Irvine, USA)
Wenye Ma (Department of Mathematics, University of California, Los Angeles, USA)
Jack Xin (Department of Mathematics, University of California, Irvine, USA)
Stanley Osher (Department of Mathematics, University of California, Los Angeles, USA)
A fast speech extraction (FSE) method is presented using convex optimization made possible by pause detection of the speech sources. Sparse unmixing filters are sought by L1 regularization and the split Bregman method. A subdivided split Bregman method is developed for efficiently estimating long reverberations in real room recordings. The speech pause detection is based on a binary mask source separation method. The FSE method is evaluated and found to outperform existing blind speech separation approaches on both synthetic and room recorded data in terms of the overall computational speed and separation quality.
17:20Reducing Musical Noise in Blind Source Separation by Time-Domain Sparse Filters and Split Bregman Method
Wenye Ma (Department of Mathematics, University of California, Los Angeles, USA)
Meng Yu (Department of Mathematics, University of California, Irvine, USA)
Jack Xin (Department of Mathematics, University of California, Irvine, USA)
Stanley Osher (Department of Mathematics, University of California, Los Angeles, USA)
Musical noise often arises in the outputs of time-frequency binary mask based blind source separation approaches. Post-processing is desired to enhance the separation quality. An efficient musical noise reduction method by time-domain sparse filters is presented using convex optimization. The sparse filters are sought by L1 regularization and the split Bregman method. The proposed musical noise reduction method is evaluated by both synthetic and room recorded speech and music data, and found to outperform existing musical noise reduction methods in terms of the objective and subjective measures.
17:40Combining Monaural and Binaural Evidence for Reverberant Speech Segregation
John Woodruff (Department of Computer Science and Engineering, The Ohio State University, United States)
Rohit Prabhavalkar (Department of Computer Science and Engineering, The Ohio State University, United States)
Eric Fosler-Lussier (Department of Computer Science and Engineering, The Ohio State University, United States)
DeLiang Wang (Department of Computer Science and Engineering, The Ohio State University, United States and Center for Cognitive Science, The Ohio State University, United States)
Most existing binaural approaches to speech segregation rely on spatial filtering. In environments with minimal reverberation and when sources are well separated in space, spatial filtering can achieve excellent results. However, in everyday environments performance degrades substantially. To address these limitations, we incorporate monaural analysis within a binaural segregation system. We use monaural cues to perform both local and across frequency grouping of mixture components, allowing for a more robust application of spatial filtering. We propose a novel framework in which we combine monaural grouping evidence and binaural localization evidence in a linear model for the estimation of the ideal binary mask. Results indicate that with appropriately designed features that capture both monaural and binaural evidence, an extremely simple model achieves a signal-to-noise ratio improvement of up to 4 dB relative to using spatial filtering alone.
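The ideal binary mask targeted above has a simple definition: a time-frequency unit is kept when the target-to-interference energy ratio exceeds a local criterion. A minimal sketch, assuming magnitude spectrograms as inputs (the local criterion value is an illustrative choice):

```python
import numpy as np

def ideal_binary_mask(target_spec, interference_spec, lc_db=0.0):
    """Ideal binary mask: 1 where the target-to-interference ratio
    (in dB) exceeds the local criterion lc_db, else 0. Inputs are
    magnitude spectrograms of the premixed target and interference."""
    snr_db = 20.0 * np.log10((target_spec + 1e-12) /
                             (interference_spec + 1e-12))
    return (snr_db > lc_db).astype(float)

target = np.array([[3.0, 0.1], [2.0, 0.05]])
noise = np.array([[1.0, 1.0], [0.5, 1.0]])
mask = ideal_binary_mask(target, noise)
```

In the paper's setting the mask must be *estimated* from mixture features; this sketch only shows the oracle target that the monaural-plus-binaural model is trained to predict.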

Speech Synthesis II: HMM-based Speech Synthesis

Time:Monday 16:00 Place:302 Type:Oral
Chair:Keiichi Tokuda
16:00Speaker and Language Adaptive Training for HMM-Based Polyglot Speech Synthesis
Heiga Zen (Toshiba Research Europe Ltd.)
This paper proposes a technique for speaker and language adaptive training for HMM-based polyglot speech synthesis. Language-specific context-dependencies in the system are captured using CAT with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by CMLLR-based transforms. This framework allows multi-speaker/multi-language adaptive training and synthesis to be performed. Experimental results show that the proposed technique achieves better synthesis performance than both speaker-adaptively trained language-dependent and language-independent models.
16:20Context Adaptive Training with Factorized Decision Trees for HMM-Based Speech Synthesis
Kai Yu (Cambridge University Engineering Department)
Heiga Zen (Toshiba Research Europe Ltd.)
Francois Mairesse (Cambridge University Engineering Department)
Steve Young (Cambridge University Engineering Department)
To achieve high quality synthesised speech in HMM-based speech synthesis, the effective modelling of complex contexts is critical. Traditional approaches use context-dependent HMMs with decision tree based clustering to model the full contexts. However, weak contexts are difficult to capture using this approach. Context adaptive training provides a structured framework for this, whereby standard HMMs represent normal contexts and linear transforms represent the additional effects of weak contexts. In contrast to speaker adaptive training, separate decision trees have to be built for the weak and normal context factors. This paper describes the general framework of context adaptive training and investigates three concrete forms: MLLR, CMLLR and CAT based systems. Experiments on a word-level emphasis synthesis task show that all context adaptive training approaches can outperform the standard full-context-dependent HMM approach. The MLLR based system achieved the best performance.
16:40Roles of the Average Voice in Speaker-adaptive HMM-based Speech Synthesis
Junichi Yamagishi (The Centre for Speech Technology Research, University of Edinburgh)
Oliver Watts (The Centre for Speech Technology Research, University of Edinburgh)
Simon King (The Centre for Speech Technology Research, University of Edinburgh)
Bela Usabaev (Universitat Tubingen)
In speaker-adaptive HMM-based speech synthesis, there are typically a few speakers for which the output synthetic speech sounds worse than that of other speakers, despite having the same amount of adaptation data from within the same corpus. This paper investigates these fluctuations in quality and concludes that as mel-cepstral distance from the average voice becomes larger, the MOS naturalness scores generally become worse. Although this negative correlation is not that strong, it suggests a way to improve the training and adaptation strategies. We also draw comparisons between our findings and the work of other researchers regarding "vocal attractiveness."
17:00An HMM Trajectory Tiling (HTT) based Approach to High Quality TTS
Yao Qian (Microsoft Research Asia, Beijing, China)
Zhi-Jie Yan (Microsoft Research Asia, Beijing, China)
Yijian Wu (Microsoft China, Beijing, China)
Frank K Soong (Microsoft Research Asia, Beijing, China)
The current state-of-the-art HMM-based speech synthesis can produce highly intelligible speech but still carries an intrinsic vocoding flavor due to its simple excitation model. In this paper, we propose a new HMM trajectory tiling approach to high-quality TTS. The trajectory generated by the refined HMM is used to guide the search for the closest waveform segment "tiles" for rendering highly intelligible and natural-sounding speech. Normalized distances between the HMM trajectory and those of waveform unit candidates are used to construct a unit sausage, and normalized cross-correlation is used to find the best unit sequence in the sausage. This sequence serves as the set of segment tiles that tracks the HMM trajectory guide most closely. Tested on two British English databases, our approach can render natural-sounding speech without sacrificing the high intelligibility achieved by HMM-based TTS. These results are confirmed subjectively by the corresponding AB preference and intelligibility tests.
17:20A Perceptual Study of Acceleration Parameters in HMM-based TTS
Yi-Ning Chen (Microsoft Research Asia)
Zhi-Jie Yan (Microsoft Research Asia)
Frank K. Soong (Microsoft Research Asia)
Previous study in HMM-based TTS has shown that the acceleration parameters are able to generate smoother trajectories with less distortion. However, the effect has never been investigated in formal objective and subjective tests. In this paper, the acceleration parameters in trajectory generation are studied in depth. We show that discarding acceleration parameters only introduces small additional distortion. But human subjects can easily perceive the quality degradation, because saw-tooth-like trajectories are commonly generated. Therefore, we choose the upper- and lower-bounded envelopes of the saw-tooth trajectories for further analysis. Experimental results show that both envelope trajectories have larger objective distortions. However, the speech synthesized using the envelope trajectories becomes perceptually transparent to the reference. This perceptual study facilitates efficient implementation of low-cost TTS systems, as well as low bit rate speech coding and reconstruction.
17:40Evaluation of Prosodic Contextual Factors for HMM-based Speech Synthesis
Shuji Yokomizo (Tokyo Institute of Technology)
Takashi Nose (Tokyo Institute of Technology)
Takao Kobayashi (Tokyo Institute of Technology)
We explore the effect of prosodic contextual factors in HMM-based speech synthesis. In a baseline system, many contextual factors are used during model training, and the cost of parameter tying by context clustering becomes relatively high compared to that in speech recognition. We examine the choice of prosodic contexts using objective measures on English and Japanese speech data. The experimental results show that more compact context sets give comparable or close performance to the conventional full context set.

Multi-modal signal processing

Time:Monday 16:00 Place:301 Type:Oral
Chair:Pedro J. Moreno
16:00Learning words and speech units through natural interactions
Jonas Hörnstein (Institute for System and Robotics (ISR), Instituto Superior Técnico, Lisbon, Portugal)
José Santos-Victor (Institute for System and Robotics (ISR), Instituto Superior Técnico, Lisbon, Portugal)
This work provides an ecological approach to learning words and speech units through natural interactions, without the need for preprogrammed linguistic knowledge in the form of phonemes. Interactions such as imitation games and multimodal word learning create an initial set of words and speech units. These sets are then used to train statistical models in an unsupervised way.
16:20Bimodal Coherence based Scale Ambiguity Cancellation for Target Speech Extraction and Enhancement
Qingju Liu (Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK)
Wenwu Wang (Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK)
Philip Jackson (Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK)
We present a novel method for extracting target speech from auditory mixtures using bimodal coherence, which is statistically characterised by a Gaussian mixture model (GMM) in an off-line training process, using robust features obtained from the audio-visual speech. We then adjust the ICA-separated spectral components using the bimodal coherence in the time-frequency domain, to mitigate the scale ambiguities in different frequency bins. We tested our algorithm on the XM2VTS database, and the results show a performance improvement with the proposed algorithm in terms of signal-to-interference ratio measurements.
16:40Speech Estimation in Non-Stationary Noise Environments Using Timing Structures between Mouth Movements and Sound Signals
Hiroaki Kawashima (Kyoto University)
Yu Horii (Kyoto University)
Takashi Matsuyama (Kyoto University)
A variety of methods for audio-visual integration, which integrate audio and visual information at the level of either features, states, or classifier outputs, have been proposed for the purpose of robust speech recognition. However, these methods do not always fully utilize auditory information when the signal-to-noise ratio becomes low. In this paper, we propose a novel approach to estimating speech signals in noisy environments. The key idea behind this approach is to exploit clean speech candidates generated using timing structures between mouth movements and sound signals. We first extract a pair of feature sequences of media signals and segment each sequence into temporal intervals. Then, we construct a cross-media timing-structure model of human speech by learning the temporal relations of overlapping intervals. Based on the learned model, we generate clean speech candidates from the observed mouth movements.
17:00Synthesizing Photo-Real Talking Head via Trajectory-Guided Sample Selection
Lijuan Wang (Microsoft Research Asia)
Frank Soong (Microsoft Research Asia)
Xiaojun Qian (Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong, China)
Wei Han (Department of Computer Science, Shanghai Jiao Tong University, China)
We propose a trajectory-guided, real-sample concatenation approach for synthesizing high-quality photo-real talking head animation. It renders a photo-real video of articulators in sync with given speech signals by searching for the real image sample sequence in the library closest to the HMM-predicted trajectory. Objectively, we evaluated the performance of our system in terms of MSE and investigated pruning strategies in terms of storage and processing speed. Our talking head took part in the LIPS2009 Challenge contest and won first place, with a subjective MOS score of 4.15 for audio-visual match as evaluated by 20 human subjects.
17:20Silent vs Vocalized Articulation for a Portable Ultrasound-Based Silent Speech Interface
Victoria-M. Florescu (SIGMA Laboratory, ESPCI ParisTech, CNRS-UMR 7084, Paris, France)
Lise Crevier-Buchman (Laboratoire de Phonétique et Phonologie, CNRS-UMR 7018, Paris, France)
Bruce Denby (SIGMA Laboratory, ESPCI ParisTech, CNRS-UMR 7084, Paris, France; Université Pierre et Marie Curie, Paris, France)
Thomas Hueber (GIPSA-Lab, Département Parole & Cognition, CNRS-UMR 5216, Grenoble, France)
Antonia Colazo-Simon (Laboratoire de Phonétique et Phonologie, CNRS-UMR 7018, Paris, France)
Claire Pillot-Loiseau (Laboratoire de Phonétique et Phonologie, CNRS-UMR 7018, Paris, France)
Pierre Roussel (SIGMA Laboratory, ESPCI ParisTech, CNRS-UMR 7084, Paris, France)
Cédric Gendrot (Laboratoire de Phonétique et Phonologie, CNRS-UMR 7018, Paris, France)
Sophie Quattrocchi (Laboratoire de Phonétique et Phonologie, CNRS-UMR 7018, Paris, France)
Silent Speech Interfaces have been proposed for communication in silent conditions or as a new means of restoring the voice of persons who have undergone a laryngectomy. To operate such a device, the user must articulate silently. Isolated word recognition tests performed with fixed and portable ultrasound-based silent speech interface equipment show that systems trained on vocalized speech exhibit reduced performance when tested on silent articulation, but that training with silently articulated speech allows much of this loss to be recovered.
17:40Comparison of HMM and TMDN Methods for Lip Synchronisation
Gregor Hofer (Centre for Speech Technology Research, Edinburgh University)
Korin Richmond (Centre for Speech Technology Research, Edinburgh University)
This paper presents a comparison between a hidden Markov model (HMM) based method and a novel artificial neural network (ANN) based method for lip synchronisation. Both model types were trained on motion tracking data, and a perceptual evaluation was carried out comparing the output of the models, both to each other and to the original tracked data. It was found that the ANN-based method was judged significantly better than the HMM based method. Furthermore, the original data was not judged significantly better than the output of the ANN method.


Time:Monday 16:00 Place:International Conference Room A Type:Poster
Chair:Kikuo Maekawa
#1Rhythm and Formant Features for Automatic Alcohol Detection
Florian Schiel (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universitaet Muenchen)
Christian Heinrich (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universitaet Muenchen)
Veronika Neumeyer (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universitaet Muenchen)
Two speech feature sets, RMS rhythmicity and formant frequencies F1-F4, are analyzed for their ability to distinguish alcoholized from sober speech. We describe the statistical framework based on the Alcohol Language Corpus (ALC), including other factors such as gender, age and speaking style, and its application to our case. Rhythm features are calculated using a new method based solely on the short-time energy function; formant features are derived using the standard formant tracker SNACK. Our findings indicate that three rhythm and three formant features have high potential to detect intoxication within the speech data of a subject. We also tested the hypothesis that vowels are more centralized in the F1/F2 space in alcoholized speech, but found that, on the contrary, subjects tend to hyper-articulate when tested for intoxication.
#2An exploration of voice source correlates of focus
Irena Yanushevskaya (Trinity College Dublin)
Christer Gobl (Trinity College Dublin)
John Kane (Trinity College Dublin)
Ailbhe Ní Chasaide (Trinity College Dublin)
This pilot study explores how the voice source parameters vary in focally accented syllables. It examines the dynamics of the voice source parameters in an all-voiced short declarative utterance in which the focus placement was varied. The voice source parameters F0, EE, UP, OQ, RG, RA, RK and RD were obtained through inverse filtering and subsequent parameterisation using the LF-model. The results suggest that the focally accented syllables are marked not only by increased F0 but also by boosted EE, RG and UP. The non-focal realisations show reduced values for the above parameters along with a tendency towards higher OQ values, suggesting a more lax mode of phonation.
#3Modeling perceived vocal age in American English
James Harnsberger (University of Florida)
Rahul Shrivastav (University of Florida)
W.S. Brown, Jr. (University of Florida)
An acoustic analysis of voice, articulatory, and prosodic cues to perceived age was completed for a speech database of 150 American English speakers. Perceived ages were submitted to multiple linear regression analyses with measures of acoustic correlates of: voice quality, articulation, fundamental frequency, and prosody. The fit between predicted and actual perceived ages from the resulting models varied by speech material and gender, with female vocal ages being the easiest to predict. Articulation, pitch, and speaking rate measures were the most predictive in female voices, while, for male voices, the observed ranking was: speaking rate, voice quality, and pitch.
#4Multivariate Analysis of Vocal Fatigue in Continuous Reading
Marie-José Caraty (Paris Descartes University - LIPADE)
Claude Montacié (Paris Sorbonne University - STIH)
We present an experimental paradigm to measure changes in the characteristics of speech under vocal fatigue. For the speech corpora, we chose a vocal load (3 hours) and a cognitive process (reading aloud continuously) that can induce fatigue in the reader. Fatigue is verified by an analysis of reading errors and disfluencies. A multivariate analysis of 169,042 phoneme occurrences, based on Wilks' lambda test, reveals the spectral and prosodic changes of each phonetic class. Based on six readers, the results show that nasals (vowels and consonants) are the most discriminant phonemes in vocal fatigue.
#5Frequency-Domain Delexicalization using Surrogate Vowels
Alexander Kain (Center for Spoken Language Understanding, Oregon Health & Science University, Portland, Oregon, USA)
Jan van Santen (Center for Spoken Language Understanding, Oregon Health & Science University, Portland, Oregon, USA)
We propose a delexicalization algorithm that renders the lexical content of an utterance unintelligible, while preserving important acoustic prosodic cues, as well as naturalness and speaker identity. This is achieved by replacing voiced regions by spectral slices from a surrogate vowel, and by averaging the magnitude spectrum during unvoiced regions. Perceptual tests were carried out comparing sentences that were either unprocessed or delexicalized, using a baseline or the proposed method. An intelligibility test resulted in a keyword recall rate of 92% for the unprocessed sentences, and near complete unintelligibility for both delexicalization methods. Affect recognition was at 65% for unprocessed sentences, and 46% and 49% for the baseline and the proposed method, respectively. Preference tests showed that the proposed method preserved drastically more speaker identity, and sounded more natural than the baseline.
#6Emotion Recognition using Imperfect Speech Recognition
Florian Metze (Carnegie Mellon University)
Anton Batliner (Friedrich-Alexander-Universitaet Erlangen-Nuernberg)
Florian Eyben (Technische Universitaet Muenchen)
Tim Polzehl (Technische Universitaet Berlin)
Bjoern Schuller (Technische Universitaet Muenchen)
Stefan Steidl (Friedrich-Alexander-Universitaet Erlangen-Nuernberg)
This paper investigates the use of speech-to-text methods for assigning an emotion class to a given speech utterance. Previous work shows that an emotion extracted from text can convey evidence complementary to the information extracted by classifiers based on spectral or other non-linguistic features. As speech-to-text usually requires significantly more computational effort, in this study we investigate the degree of speech-to-text accuracy needed for reliable detection of emotions from an automatically generated transcription of an utterance. We evaluate the use of hypotheses in both training and testing, and compare several classification approaches on the same task. Our results show that emotion recognition performance stays roughly constant as long as word accuracy does not fall below a reasonable value, making the use of speech-to-text viable for training emotion classifiers based on linguistic features.
#7A Novel Feature Extraction Strategy for Multi-stream Robust Emotion Identification
Gang Liu (CRSS: Center for Robust Speech Systems, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas)
Yun Lei (CRSS: Center for Robust Speech Systems, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas)
John H. L. Hansen (CRSS: Center for Robust Speech Systems, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas)
In this study, we investigate an effective feature extraction front-end for improved emotion identification from speech in clean and noisy conditions. First, we explore the application of the PMVDR feature to emotion characterization. Originally developed for accent/dialect and language identification (LID), PMVDR features are less sensitive to noise. The shifted delta cepstral (SDC) approach, also developed for LID, can be used as a means of incorporating additional temporal information about the speech into the feature vectors. Suprasegmental characteristics such as pitch and intensity are known to provide information beneficial to emotion recognition, and we believe similar improvements can be acquired from improved features. We performed an evaluation on the Berlin database of emotional speech. The proposed system, PMVDR-SDC, outperforms the baseline system by 10.1% absolute, which demonstrates the validity of the approach. Furthermore, we find that both PMVDR and SDC offer much better robustness in noisy conditions than other features, which is critical for real applications.
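The SDC computation referenced above follows the standard N-d-P-k scheme: at each frame, k shifted delta vectors are stacked into one feature vector. A sketch follows; the parameter values are illustrative, and clamping at utterance boundaries is an implementation choice, not a detail from the paper.

```python
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """Shifted delta cepstra (N-d-P-k): at frame t, stack the k
    delta vectors c(t+iP+d) - c(t+iP-d) for i = 0..k-1, folding
    longer-span temporal information into each feature vector.
    `cepstra` is a (frames x N) matrix of cepstral coefficients."""
    T, N = cepstra.shape
    feats = []
    for t in range(T):
        blocks = []
        for i in range(k):
            a = min(T - 1, max(0, t + i * P + d))  # clamp at edges
            b = min(T - 1, max(0, t + i * P - d))
            blocks.append(cepstra[a] - cepstra[b])
        feats.append(np.concatenate(blocks))
    return np.array(feats)

c = np.random.default_rng(2).standard_normal((50, 7))  # toy cepstra
s = sdc(c, d=1, P=3, k=7)   # 50 frames, each of dimension 7*7
```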
#8Setup for Acoustic-Visual Speech Synthesis by Concatenating Bimodal Units
Asterios Toutios (Université Nancy 2, LORIA)
Utpala Musti (Université Nancy 2, LORIA)
Slim Ouni (Université Nancy 2, LORIA)
Vincent Colotte (Université Henri Poincaré Nancy 1, LORIA)
Brigitte Wrobel-Dautcourt (Université Henri Poincaré Nancy 1, LORIA)
Marie-Odile Berger (INRIA, LORIA)
This paper presents preliminary work on building a system able to synthesize concurrently the speech signal and a 3D animation of the speaker's face. This is done by concatenating bimodal diphone units, that is, units that comprise both acoustic and visual information. The latter is acquired using a stereovision technique. The proposed method addresses the problems of asynchrony and incoherence inherent in classic approaches to audiovisual synthesis. Unit selection is based on classic target and join costs from acoustic-only synthesis, which are augmented with a visual join cost. Preliminary results indicate the benefits of the approach, since both the synthesized speech signal and the face animation are of good quality. Planned improvements and enhancements to the system are outlined.
#9Towards Affective State Modeling in Narrative and Conversational Settings
Bart Jochems (University of Twente)
Martha Larson (Delft Unversity of Technology)
Roeland Ordelman (University of Twente)
Ronald Poppe (University of Twente)
Khiet Truong (University of Twente)
We carry out two studies on affective state modeling for communication settings that involve unilateral intent on the part of one participant (the evoker) to shift the affective state of another participant (the experiencer). The first investigates viewer response in a narrative setting using a corpus of documentaries annotated with viewer-reported narrative peaks. The second investigates affective triggers in a conversational setting using a corpus of recorded interactions, annotated with continuous affective ratings, between a human interlocutor and an emotionally colored agent. In each case, we build a “one-sided” model using indicators derived from the speech of one participant. Our classification experiments confirm the viability of our models and provide insight into useful features.
#10Detection of anger emotion in dialog speech using prosody feature and temporal relation of utterances
Narichika Nomoto (NTT Cyber Space Laboratories, NTT Corporation)
Hirokazu Masataki (NTT Cyber Space Laboratories, NTT Corporation)
Osamu Yoshioka (NTT Cyber Space Laboratories, NTT Corporation)
Satoshi Takahashi (NTT Cyber Space Laboratories, NTT Corporation)
This paper proposes a novel feature for detecting anger in dialog speech. Anger is classified into two types: loud HotAnger and calm ColdAnger. Prosody can reliably detect the former but not the latter. We analyze both types of anger dialog in the two-party setting and discover that they exhibit some differences from neutral dialog in the temporal relation of utterances. We create a dialog feature that reflects these differences and investigate its effectiveness in detecting both types of anger. Tests show that the proposed feature combination improves the F-measure of ColdAnger and HotAnger by 24.4 points and 8.8 points, respectively, against a baseline technique that uses only prosody.
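The F-measure used as the evaluation metric above is the harmonic mean of precision and recall; a minimal sketch with invented counts, not figures from the paper:

```python
def f_measure(tp, fp, fn):
    """F-measure: harmonic mean of precision (tp/(tp+fp)) and
    recall (tp/(tp+fn)), balancing false alarms and misses."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# hypothetical detection counts: 40 hits, 10 false alarms, 20 misses
score = f_measure(tp=40, fp=10, fn=20)
```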
#11Gesture and Speech Coordination: The Influence of the Relationship Between Manual Gesture and Speech
Benjamin Roustan (gipsa-lab, UMR5216 CNRS)
Marion Dohen (gipsa-lab, UMR5216 CNRS)
Communication is multimodal. In particular, speech is often accompanied by manual gestures, and their coordination with speech has often been related to prosody. The aim of this study was to further explore the coordination between prosodic focus and different manual gestures (pointing, beat and control gestures) on ten speakers using motion capture.
#12Analysis and Detection of Cognitive Load and Frustration in Drivers’ Speech
Hynek Boril (Center for Robust Speech Systems (CRSS), The University of Texas at Dallas)
Seyed Omid Sadjadi (Center for Robust Speech Systems (CRSS), The University of Texas at Dallas)
Tristan Kleinschmidt (Speech and Audio Research Laboratory, Queensland University of Technology, Brisbane, Australia)
John H.L. Hansen (Center for Robust Speech Systems (CRSS), The University of Texas at Dallas)
Non-driving-related cognitive load and variations in emotional state may impact the driver's capability to control a vehicle and introduce driving errors. Reliable detection of cognitive load and emotion in drivers would benefit the design of active safety systems and other intelligent in-vehicle interfaces. In this study, speech produced by 68 subjects while driving in urban areas is analyzed. A particular focus is on speech production differences in two secondary cognitive tasks, interactions with a co-driver and calls to automated spoken dialog systems (SDS), and two emotional states during the SDS interactions (neutral/negative). A number of speech parameters are found to vary across the cognitive/emotion classes. The suitability of selected spectral- and production-based features for automatic cognitive task/emotion classification is investigated. A fusion of GMM/SVM classifiers yields an accuracy of 89% in cognitive task and 76% in emotion classification.
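The GMM/SVM fusion reported above can be illustrated with a minimal score-level fusion sketch; the class names, scores, and the single mixing weight `alpha` are hypothetical placeholders, since the abstract does not specify the fusion rule:

```python
def fuse_scores(gmm_scores, svm_scores, alpha=0.5):
    """Linearly combine per-class scores from two classifiers."""
    return {c: alpha * gmm_scores[c] + (1 - alpha) * svm_scores[c]
            for c in gmm_scores}

def classify(gmm_scores, svm_scores, alpha=0.5):
    """Pick the class with the highest fused score."""
    fused = fuse_scores(gmm_scores, svm_scores, alpha)
    return max(fused, key=fused.get)
```

In practice the weight would be tuned on held-out data; this is only one plausible scheme.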
#13Acoustic-Based Recognition of Head Gestures Accompanying Speech
Akira Sasou (Advanced Industrial Science and Technology, AIST)
Yasuharu Hashimoto (Advanced Industrial Science and Technology, AIST)
Katsuhiko Sakaue (Advanced Industrial Science and Technology, AIST)
Head movements are linked not only to symbolic gestures, such as head-nodding to represent “yes” or head-shaking to represent “no,” but also to the production of suprasegmental features of speech, such as stress, prominence, and other aspects of prosody. Recent studies have shown that head movements play a more direct role in the perception of speech. In this paper, we propose a novel method for recognizing head gestures that accompany speech. The proposed method tracks head movements that accompany speech by localizing the mouth position with a microphone array system. We also propose a recognition method for the mouth-position trajectory, in which Higher-Order Local Cross Correlation is applied to the trajectory. The recognition accuracy of the proposed method was on average 90.25% for nineteen kinds of head gesture recognition tasks conducted in an open test manner, outperforming a Hidden Markov Model-based method.
#14Multimodal Dialog in the Car: Combining Speech and Turn-And-Push Dial to Control Comfort Functions
Angela Mahr (German Research Center for Artificial Intelligence)
Sandro Castronovo (German Research Center for Artificial Intelligence)
Margarita Pentcheva (University of the Saarland)
Christian Müller (German Research Center for Artificial Intelligence)
In this paper, we address the question of how speech and tangible interfaces can be combined to provide effective multimodal interaction in vehicles, taking into account the special requirements induced by the circumstances of driving. Speech is used to set the interaction context, and a turn-and-push dial is used to manipulate/adjust. An experimental study is presented that measures the distraction induced by manual, speech-only, and multimodal interaction (a combination of speech and turn-and-push dial). Results show that while subjects were able to perform more tasks in the manual condition, their driving was significantly safer when using speech-only or multimodal dialog. Supplemental contributions of this paper are descriptions of how a multimodal dialog manager and a driving simulation software are connected to the CAN vehicle bus, as well as how driver distraction caused by interacting with a system is measured using the standardized lane change task.
#15Hands Free Audio Analysis from Home Entertainment
Danil Korchagin (Idiap Research Institute)
Philip N. Garner (Idiap Research Institute)
Petr Motlicek (Idiap Research Institute)
In this paper, we describe a system developed for hands-free audio analysis in a living room environment. It comprises detection and localisation of verbal and paralinguistic events, which can augment the behaviour of a virtual director and improve the overall experience of interactions between spatially separated families and friends. The results show good performance in reverberant environments and fulfil real-time requirements.
#16Affective Story Teller: A TTS System for Emotional Expressivity
Mostafa Al Masum Shaikh (Department of Information and Communication Engineering, University of Tokyo, Japan)
Antonio Rui Ferreira Rebordao (Department of Information and Communication Engineering, University of Tokyo, Japan)
Keikichi Hirose (Department of Information and Communication Engineering, University of Tokyo, Japan)
This paper describes a system, Affective Story Teller (AST), as an example of an emotionally expressive speech synthesizer. Our technique uses several linguistic resources to recognize emotions in the input text according to its emotional affinity, and assigns appropriate prosodic parameters as well as pitch accents via XML-based tagging to generate a synthesized speech sample. The synthesized sample is then re-synthesized through TD-PSOLA-based pitch manipulation in accordance with its emotional connotation. The system employs the MARY TTS system to read out a folk tale. The preliminary perceptual test results are encouraging: by listening to the re-synthesized speech samples of AST, human judges could perceive “happy”, “sad”, and “fear” emotions much better than when they listened to non-affective synthesized speech.

ASR: Speaker Adaptation, Robustness Against Reverberation

Time:Monday 16:00 Place:International Conference Room B Type:Poster
Chair:Shigeki Sagayama
#1Enhancing Children's Speech Recognition under Mismatched Condition by Explicit Acoustic Normalization
Shweta Ghai (Department of Electronics and Communication Engineering, IIT Guwahati, Guwahati-781039, Assam, India.)
Rohit Sinha (Department of Electronics and Communication Engineering, IIT Guwahati, Guwahati-781039, Assam, India.)
Most commonly used model adaptation techniques employ linear/affine transformations on models/features to address the gross acoustic mismatch between adults' and children's speech data. Since all sources of acoustic mismatch may not be appropriately modeled by just a linear transformation, in this work the efficacy of our recently proposed explicit acoustic (pitch and speaking rate) normalization, in combination with existing normalization/adaptation techniques, is explored for mismatched children's speech recognition. The study shows that explicit normalization of the pitch and speaking rate of children's speech further improves the effectiveness of the adaptation methods. With explicit acoustic normalization, significant relative improvements of 13% and 5% are obtained over those achieved with combined VTLN and CMLLR for children's speech recognition on adult-speech-trained models, for connected digit and continuous speech recognition tasks, respectively.
#2Comparison of Discriminative Input and Output Transformations for Speaker Adaptation in the Hybrid NN/HMM Systems
Li Bo (National University of Singapore)
Sim Khe Chai (National University of Singapore)
Speaker variability is one of the major error sources for ASR systems. Speaker adaptation estimates speaker-specific models from speaker-independent ones to minimize the mismatch between training and testing conditions arising from speaker variability. One of the commonly adopted approaches is the transformation-based method. In this paper, discriminative input and output transforms for speaker adaptation in hybrid NN/HMM systems are compared and further investigated with both structural and data-driven constraints. Experimental results show that the data-driven constrained discriminative transforms are much more robust for unsupervised adaptation.
#3Augmentation of adaptation data
Ravichander Vipperla (School of Informatics, University of Edinburgh)
Steve Renals (School of Informatics, University of Edinburgh)
Joe Frankel (School of Informatics, University of Edinburgh)
Linear regression based speaker adaptation approaches can improve Automatic Speech Recognition (ASR) accuracy significantly for a target speaker. However, when the available adaptation data is limited to a few seconds, the accuracy of the speaker adapted models is often worse compared with speaker independent models. In this paper, we propose an approach to select a set of reference speakers acoustically close to the target speaker whose data can be used to augment the adaptation data. To determine the acoustic similarity of two speakers, we propose a distance metric based on transforming sample points in the acoustic space with the regression matrices of the two speakers. We show the validity of this approach through a speaker identification task. ASR results on SCOTUS and AMI corpora with limited adaptation data of 10 to 15 seconds augmented by data from selected reference speakers show a significant improvement in Word Error Rate over speaker independent and speaker adapted models.
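The proposed speaker distance can be sketched as follows: map a set of sample points through each speaker's affine regression matrix and average the point-wise distances. The matrix layout (bias column first) and the plain Euclidean norm are assumptions for illustration, not necessarily the paper's exact metric:

```python
import numpy as np

def speaker_distance(W_a, W_b, samples):
    """Average Euclidean distance between sample points transformed by two
    speakers' affine regression matrices of shape (d, d+1), bias first."""
    ext = np.hstack([np.ones((samples.shape[0], 1)), samples])  # extend with bias
    return float(np.mean(np.linalg.norm(ext @ W_a.T - ext @ W_b.T, axis=1)))
```

Identical transforms give zero distance, so acoustically close speakers (similar transforms) score low, which is what the reference-speaker selection relies on.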
#4Discriminative adaptation based on fast combination of DMAP and DfMLLR
Lukas Machlica (University of West Bohemia in Pilsen, Faculty of Applied Sciences)
Zbynek Zajic (University of West Bohemia in Pilsen, Faculty of Applied Sciences)
Ludek Muller (University of West Bohemia in Pilsen, Faculty of Applied Sciences)
This paper investigates the combination of discriminative adaptation techniques. Discriminative MAP adaptation (DMAP) and discriminative feature MLLR (DfMLLR) are examined. Since each method is designed for a different amount of adaptation data, it is useful to combine them in order to preserve system performance in situations with varying amounts of adaptation data. Generally, DfMLLR and DMAP are executed sequentially (DfMLLR followed by DMAP), which requires two passes over the data. Since both methods access the data through the same statistics, a one-pass combination was proposed to reduce the time consumption. The one-pass combination exploits the ability of DfMLLR to transform the feature vectors directly; however, instead of the feature vectors, the statistics are transformed, which allows already computed statistics to be reused without processing the data again. All approaches are also compared to their non-discriminative alternatives.
#5Revisiting VTLN Using Linear Transformation on Conventional MFCC
Rama Sanand Doddipatla (Lehrstuhl für Informatik 6, RWTH Aachen University, Aachen, Germany)
Ralf Schlueter (Lehrstuhl für Informatik 6, RWTH Aachen University, Aachen, Germany)
Hermann Ney (Lehrstuhl für Informatik 6, RWTH Aachen University, Aachen, Germany)
In this paper, we revisit the linear transformation for VTLN on conventional MFCC proposed by Sanand et al. in [Interspeech 2008], using the idea of band-limited interpolation. The filter-bank is modified to include half-filters at the zero and Nyquist frequencies, as the full symmetric spectrum is required for performing band-limited interpolation. We show that the filter-bank with half-filters does not affect the recognition performance on clean speech (as also shown in [Interspeech 2008]), but does affect the recognition performance on noisy speech. This motivated us to revisit the linear transformation for VTLN in [Interspeech 2008] and propose modifications to undo the effect of the half-filters during feature extraction. We show through recognition experiments that the proposed modifications to the linear transformation achieve performance comparable to the conventional VTLN approach, while still enabling us to perform VTLN using a linear transformation on conventional MFCC.
#6Speaker Adaptation Based on Nonlinear Spectral Transform for Speech Recognition
Toyohiro Hayashi (Nagoya Institute of Technology)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Akinobu Lee (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)
This paper proposes a speaker adaptation technique using nonlinear spectral transform based on GMMs. One of the most popular forms of speaker adaptation is based on linear transforms, e.g., MLLR. Although MLLR uses multiple transforms according to regression classes, only a single linear transform is applied to each state. The proposed method performs nonlinear speaker adaptation based on a new likelihood function combining HMMs for recognition with GMMs for spectral transform. Moreover, the context dependency of transforms can also be estimated in the integrated ML fashion. In phoneme recognition experiments, the proposed technique shows better performance than the conventional approaches.
#7Speaker Adaptation Based on System Combination Using Speaker-Class Models
Tetsuo Kosaka (Yamagata University)
Takashi Ito (Yamagata University)
Masaharu Kato (Yamagata University)
Masaki Kohda (Yamagata University)
In this paper, we propose a new system combination approach for an LVCSR system using speaker-class (SC) models and a speaker adaptation technique based on these SC models. The basic concept of the SC-based system is to select speakers who are acoustically similar to a target speaker to train acoustic models. One of the major problems regarding the use of the SC model is determining the selection range of the speakers; in other words, it is difficult to determine how many speakers should be selected. In order to solve this problem, several SC models, trained with varying numbers of speakers, are prepared in advance. In the recognition step, acoustically similar models are selected from these SC models, and the scores obtained from them are merged using a word graph combination technique. The proposed method was evaluated using the Corpus of Spontaneous Japanese (CSJ) and showed significant improvement in a lecture speech recognition task.
#8Speaker Adaptation in Transformation Space Using Two-dimensional PCA
Yongwon Jeong (Pusan National University)
Young Rok Song (Pusan National University)
Hyung Soon Kim (Pusan National University)
This paper describes a principled application of two-dimensional principal component analysis (2DPCA) to the decomposition of transformation matrices of maximum likelihood linear regression (MLLR) and its application to speaker adaptation using the bases derived from the analysis. Our previous work applied 2DPCA to speaker-dependent (SD) models to obtain the bases for state space. In this work, we apply 2DPCA to a set of MLLR transformation matrices of training speakers to obtain the bases for transformation space, since the matrices are 2-D in nature, and 2DPCA can decompose a set of matrices without vectorization. Here, we present two approaches using 2DPCA: One in eigenspace-based MLLR (ES-MLLR) framework and the other one in maximum a posteriori linear regression (MAPLR) framework. The experimental results showed that the proposed methods outperformed ES-MLLR for the adaptation data of about 10 seconds or longer.
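2DPCA on a set of transformation matrices can be sketched as below: the image covariance is accumulated directly from the matrices (no vectorization), and its leading eigenvectors serve as bases. This is a generic 2DPCA sketch, not the paper's ES-MLLR/MAPLR estimation itself:

```python
import numpy as np

def twod_pca_bases(mats, k):
    """Top-k 2DPCA bases: eigenvectors of G = E[(A - mean)^T (A - mean)]."""
    mats = np.asarray(mats, dtype=float)
    mean = mats.mean(axis=0)
    G = sum((A - mean).T @ (A - mean) for A in mats) / len(mats)
    _, vecs = np.linalg.eigh(G)      # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]      # keep the k leading eigenvectors

def project(A, bases):
    """Project a matrix onto the 2DPCA bases: (m, n) -> (m, k)."""
    return A @ bases
```

Because the matrices are never vectorized, the eigenproblem is only n-by-n for (m, n) matrices, which is the computational appeal of 2DPCA over classical PCA here.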
#9On Speaker Adaptive Training of Artificial Neural Networks
Jan Trmal (Department of Cybernetics, University of West Bohemia)
Jan Zelinka (Department of Cybernetics, University of West Bohemia)
Ludek Muller (Department of Cybernetics, University of West Bohemia)
In the paper we present two techniques improving the recognition accuracy of multilayer perceptron neural networks (MLP ANN) by means of adopting Speaker Adaptive Training (SAT). The use of MLP ANNs, usually in combination with the TRAPS parametrization, includes applications in speech recognition tasks, discriminative feature production, and others. In the first SAT experiments, we used VTLN as a speaker normalization technique. Moreover, we developed a novel speaker normalization technique called Minimum Error Linear Transform (MELT) that resembles the cMLLR/fMLLR method with respect to its possible application either to the model or to the features. We tested these two methods extensively on the telephone speech corpus SpeechDat-East. The results obtained in these experiments suggest that incorporating SAT into the MLP ANN training process is beneficial and, depending on the setup, leads to a significant decrease in phoneme error rate (3%-8% absolute, 12%-25% relative).
#10Model Synthesis for Band-limited Speech Recognition
Yongjun He (School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China)
Jiqing Han (School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China)
A recognizer trained on full-bandwidth speech performs badly when recognizing band-limited speech because of the environment mismatch. In this paper, we propose a novel model synthesis method for band-limited speech recognition. It automatically detects the speech bandwidth and, when the bandwidth has changed, synthesizes a new acoustic model using only the full-bandwidth model. Experiments conducted on the TIMIT/NTIMIT databases show that the proposed method achieves substantial improvement over the baseline speech recognizer.
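Bandwidth detection, the first step described above, could look roughly like this toy version, which takes the highest frequency bin whose energy stays within a fixed dynamic range of the spectral peak; the threshold and the averaged-spectrum input are assumptions, not the paper's detector:

```python
import numpy as np

def estimate_bandwidth(spectrum, freqs, floor_db=-40.0):
    """Highest frequency whose power spectrum stays within floor_db of the
    peak; spectrum and freqs are parallel 1-D arrays (long-term average)."""
    mag_db = 10.0 * np.log10(np.maximum(spectrum, 1e-12))
    mask = mag_db >= mag_db.max() + floor_db
    return float(freqs[np.where(mask)[0].max()])
```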
#11Performance Estimation of Reverberant Speech Recognition Based on Reverberant Criteria RSR-Dn with Acoustic Parameters
Takahiro Fukumori (Graduate School of Science and Engineering, Ritsumeikan University)
Masanori Morise (College of Information Science and Engineering, Ritsumeikan University)
Takanobu Nishiura (College of Information Science and Engineering, Ritsumeikan University)
Reverberation-robust speech recognition has become important in the field of distant-talking speech recognition. However, as no common reverberation criteria for the recognition of reverberant speech have yet been proposed, it has been difficult to estimate its effectiveness. We propose new reverberation criteria RSR-Dn (Reverberant Speech Recognition criteria with Dn) based on ISO3382 acoustic parameters. We first designed the criteria using the relationship between speech recognition performance and ISO3382 acoustic parameters. We then estimated the speech recognition performance obtained with the criteria. Evaluation experiments confirmed that the recognition performance can be accurately and robustly estimated with our proposed criterion.
#12A Novel Approach for Matched Reverberant Training of HMMs using Data Pairs
Armin Sehr (Multimedia Communications and Signal Processing, University Erlangen-Nuremberg)
Christian Hofmann (Multimedia Communications and Signal Processing, University Erlangen-Nuremberg)
Roland Maas (Multimedia Communications and Signal Processing, University Erlangen-Nuremberg)
Walter Kellermann (Multimedia Communications and Signal Processing, University Erlangen-Nuremberg)
For robust distant-talking speech recognition, a novel HMM training approach using data pairs is proposed. The data pairs of clean and reverberant feature vectors, also called stereo data, are used for deriving the HMM parameters of a matched-condition reverberant HMM from a well-trained clean-speech HMM in two steps. In the first step, the alignment of the frames to the states is determined from the clean data and the clean-speech HMM. This state-frame alignment (SFA) is then used in the second step to estimate the Gaussian mixture densities for each state of the reverberant HMM by applying the Expectation Maximization (EM) algorithm to the reverberant data. Thus, a more accurate temporal alignment is achieved than by standard matched condition training, and the discrimination capability of the HMMs is increased. Connected digit recognition experiments show that the proposed approach decreases the word error rate (WER) by up to 44% while substantially reducing the training complexity.
#13An Auditory Based Modulation Spectral Feature for Reverberant Speech Recognition
Hari Krishna Maganti (Fondazione Bruno Kessler - IRST, Trento, Italy)
Marco Matassoni (Fondazione Bruno Kessler - IRST, Trento, Italy)
In this paper, an auditory-based modulation spectral feature is presented to improve automatic speech recognition performance in the presence of room reverberation. The solution is based on extracting features from auditory processing characteristics, specifically gammatone-filtering-based long-term modulation spectral features, to reduce sensitivity to environmental noise while preserving the important speech intelligibility information in the speech signal essential for ASR. Experiments are performed on the Aurora-5 meeting recorder digit task, recorded with four different microphones in hands-free mode in a real meeting room. For comparison purposes, recognition results obtained using the standard ETSI basic and advanced front-ends and conventional features with standard feature compensation are reported. The experimental results reveal that the proposed features provide reliable and considerable improvements with respect to state-of-the-art feature extraction techniques.
#14On the Potential of Channel Selection for Recognition of Reverberated Speech with Multiple Microphones
Martin Wolf (Universitat Politècnica de Catalunya, Barcelona, Spain)
Climent Nadeu (Universitat Politècnica de Catalunya, Barcelona, Spain)
The performance of ASR systems in a room environment with distant microphones is strongly affected by reverberation. As the degree of signal distortion varies among acoustic channels (i.e. microphones), recognition accuracy can benefit from a proper channel selection. In this paper, we experimentally show that there exists a large margin for WER reduction by channel selection, and discuss several possible methods which do not require any a priori classification. Moreover, on an LVCSR task, a significant WER reduction is shown with a simple technique which uses a measure computed from the sub-band time envelope of the various microphone signals.
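A channel-selection scheme of this kind can be sketched with a toy per-channel measure; the normalised-variance score below only reflects the intuition that reverberation smears the sub-band time envelope (the paper's actual measure is not reproduced here):

```python
import numpy as np

def channel_score(envelope):
    """Toy measure: normalised variance of a sub-band time envelope;
    smeared (more reverberant) envelopes score lower."""
    env = np.asarray(envelope, dtype=float)
    return float(np.var(env) / (np.mean(env) ** 2 + 1e-12))

def select_channel(envelopes):
    """Pick the microphone channel whose envelope scores highest."""
    return int(np.argmax([channel_score(e) for e in envelopes]))
```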
#15An Improved Wavelet-based Dereverberation for Robust Automatic Speech Recognition
Randy Gomez (ACCMS, Kyoto University)
Tatsuya Kawahara (ACCMS, Kyoto University)
This paper presents an improved wavelet-based dereverberation method for automatic speech recognition (ASR). Dereverberation is based on filtering reverberant wavelet coefficients with Wiener gains to suppress the effect of the late reflections. Optimizing the wavelet parameters using the acoustic model enables the system to estimate the clean speech and late reflections effectively. This results in a better estimate of the Wiener gains for dereverberation in the ASR application. Additional tuning of the Wiener gain parameters in relation to the acoustic model further improves the dereverberation process for ASR. In experiments with real reverberant data, we achieved a significant improvement in ASR accuracy.
#16Methods for Robust Speech Recognition in Reverberant Environments: A Comparison
Rico Petrick (Dresden University of Technology)
Thomas Feher (Dresden University of Technology)
Masashi Unoki (Japan Advanced Institute of Science and Technology)
Rüdiger Hoffmann (Dresden University of Technology)
In this article the authors continue previous studies investigating methods that aim to increase the recognition rate (RR) of automatic speech recognition systems in reverberant environments. Previously, three robust front-end methods were tested: the harmonicity-based feature analysis (HFA), the temporal power envelope feature analysis, and their combination. This paper additionally introduces two well-known methods into the comparison: the dereverberation method using the inverse modulation transfer function (IMTF) and the delay-and-sum beamformer (DSB). Recognition experiments are carried out for command word recognition. The results of this first comparison of such methods experimentally confirm several assumptions, e.g. that the IMTF method achieves robustness only in the far field, and that the DSB improves the RR slightly but is outperformed by the HFA due to its lack of directivity at low frequencies.

Language learning, TTS, and other applications

Time:Monday 16:00 Place:International Conference Room C Type:Poster
Chair:Helen Meng
#1Integration of Multilayer Regression with Structure-based Pronunciation Assessment
Masayuki Suzuki (The University of Tokyo)
Yu Qiao (Shenzhen Institutes of Advanced Technology)
Nobuaki Minematsu (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)
Automatic pronunciation assessment has several difficulties. Adequacy in controlling the vocal organs is often estimated from the spectral envelopes of input utterances but the envelope patterns are also affected by other factors such as speaker identity. Recently, a new method of speech representation was proposed where these non-linguistic variations are effectively removed through modeling only the contrastive aspects of speech features. This speech representation is called speech structure. However, the often excessively high dimensionality of the speech structure can degrade the performance of structure-based pronunciation assessment. To deal with this problem, we integrate multilayer regression analysis with the structure-based assessment. The results show higher correlation between human and machine scores and also show much higher robustness to speaker differences compared to widely used GOP-based analysis.
#2Using Non-Native Error Patterns to Improve Pronunciation Verification
Joost van Doremalen (Centre for Language and Speech Technology, Radboud University Nijmegen)
Catia Cucchiarini (Centre for Language and Speech Technology, Radboud University Nijmegen)
Helmer Strik (Centre for Language and Speech Technology, Radboud University Nijmegen)
In this paper we show how a pronunciation quality measure can be improved by making use of information on frequent pronunciation errors made by non-native speakers. We propose a new measure, called weighted Goodness of Pronunciation (wGOP), and compare it to the widely used GOP measure. We applied this measure to the task of discriminating correctly from incorrectly realized Dutch vowels produced by non-native speakers and observed a substantial increase in performance when sufficient training material is available.
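The baseline GOP measure (in one common formulation) and a hypothetical weighted variant can be sketched as follows; the way competitors are weighted here, by shifting their scores with log confusion priors, is an illustrative assumption, not the paper's wGOP definition:

```python
def gop(loglik, phone, duration):
    """Goodness of Pronunciation: intended-phone log-likelihood minus the
    best score over all phones, normalised by segment duration."""
    best = max(loglik.values())
    return (loglik[phone] - best) / duration

def weighted_gop(loglik, phone, duration, log_weight):
    """Hypothetical weighted variant: competitor scores are shifted by the
    log of assumed non-native confusion priors before taking the maximum."""
    best = max(ll + log_weight.get(q, 0.0) for q, ll in loglik.items())
    return (loglik[phone] - best) / duration
```

Down-weighting a competitor that non-natives rarely actually produce raises the score of the intended phone, which is the intuition behind exploiting error-pattern information.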
#3Regularized-MLLR Speaker Adaptation for Computer-Assisted Language Learning System
Dean Luo (University of Tokyo)
Yu Qiao (University of Tokyo)
Nobuaki Minematsu (University of Tokyo)
Yutaka Yamauchi (Tokyo International University)
Keikichi Hirose (University of Tokyo)
In this paper, we propose a novel speaker adaptation technique, regularized-MLLR, for Computer Assisted Language Learning (CALL) systems. This method represents each target learner's transformation matrix as a linear combination of a group of teachers' transformation matrices, thus avoiding the over-adaptation problem whereby erroneous pronunciations come to be judged as good pronunciations after conventional MLLR speaker adaptation, which uses learners' “imperfect” speech as the target utterances of adaptation. Experiments on automatic scoring and error detection on public databases show that the proposed method outperforms conventional MLLR adaptation in pronunciation evaluation and can avoid the problem of over-adaptation.
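The core constraint of regularized-MLLR, expressing the learner's transform as a combination of teachers' matrices, can be sketched as below; how the combination weights are estimated from the learner's speech is the substance of the method and is not shown here, and the convex-combination normalisation is an assumption:

```python
import numpy as np

def regularized_transform(teacher_mats, weights):
    """Learner transform constrained to the span of teacher MLLR matrices:
    W_learner = sum_k w_k * W_k, with the weights normalised to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.asarray(teacher_mats, dtype=float), axes=1)
```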
#4Automatic Evaluation of English Pronunciation by Japanese Speakers Using Various Acoustic Features and Pattern Recognition Techniques
Kuniaki Hirabayashi (Toyohashi University of Technology, Japan)
Seiichi Nakagawa (Toyohashi University of Technology, Japan)
In this paper, we propose a method for estimating a score for English pronunciation. Scores estimated by the proposed method were evaluated by correlating them with the teacher's pronunciation scores. The average correlation between the estimated pronunciation scores and the teacher's pronunciation scores over 1, 5, and 10 sentences was 0.807, 0.873, and 0.921, respectively. When the text of the spoken sentence was unknown, we obtained a correlation of 0.878 for 10 utterances. For English phonetic evaluation, we classified English phoneme pairs that are difficult for Japanese speakers to pronounce, using SVM, NN, and HMM classifiers. The correct classification ratios for native English and Japanese English phonemes were 94.6% and 92.3% for SVM, 96.5% and 87.4% for NN, and 85.0% and 69.2% for HMM, respectively. We then investigated the relationship between the classification rate and a native English teacher's pronunciation score, and obtained a high correlation of 0.6-0.7.
#5Decision Tree Based Tone Modeling with Corrective Feedbacks for Automatic Mandarin Tone Assessment
Hsien-Cheng Liao (Information and Communications Research Laboratories, Industrial Technology Research Institute)
Jiang-Chun Chen (Information and Communications Research Laboratories, Industrial Technology Research Institute)
Sen-Chia Chang (Information and Communications Research Laboratories, Industrial Technology Research Institute)
Ying-Hua Guan (Department of Applied Chinese Language and Literature, National Taiwan Normal University)
Chin-Hui Lee (School of Electrical and Computer Engineering, Georgia Institute of Technology)
We propose a novel decision-tree-based approach to Mandarin tone assessment. In most conventional computer-assisted pronunciation training (CAPT) scenarios, a tone production template is prepared as a reference, with only numeric scores as feedback for tone learning. In contrast, decision trees trained with an annotated tone-balanced corpus make use of a collection of questions related to important cues in categories of tone production. By traversing the paths and nodes associated with a test utterance, a sequence of corrective comments can be generated to guide the learner toward potential improvement. Therefore a detailed pronunciation indication, or a comparison between two paths, can be provided to learners, which is usually unavailable in score-based CAPT systems.
#6CASTLE: a Computer-Assisted Stress Teaching and Learning Environment for Learners of English as a Second Language
Jingli Lu (Massey University, New Zealand)
Ruili Wang (Massey University, New Zealand)
Liyanage C De Silva (University of Brunei Darussalam, Brunei Darussalam)
Yang Gao (State Key Laboratory for Novel Software Technology, Nanjing University, China)
Jia Liu (Tsinghua University, China)
In this paper, we describe the principles and functionality of the Computer-Assisted Stress Teaching and Learning Environment (CASTLE) that we have proposed and developed to help learners of English as a Second Language (ESL) learn the stress patterns of English. There are three modules in the CASTLE system. The first, the individualised speech learning material module, provides learners with individualised speech material that possesses their preferred voice features, e.g., gender, pitch and speech rate. The second, the perception assistance module, is intended to help learners correctly perceive English stress patterns by automatically exaggerating the differences between stressed and unstressed syllables in a teacher's voice. The third, the production assistance module, is developed to help learners become aware of the rhythm of English and to provide learners with feedback in order to improve their production of stress patterns.
#7Automatic reference independent evaluation of prosody quality using multiple knowledge fusions
Shen Huang (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Hongyan Li (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Shijin Wang (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Jiaen Liang (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Bo Xu (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Automatic evaluation of GOR (Goodness Of pRosody) is an advanced and challenging task for CALL (Computer-Aided Language Learning) systems. Beyond traditional prosodic features, we develop a method based on multiple knowledge sources that requires no prior knowledge of the reading text. After speech recognition, in addition to most state-of-the-art prosodic-analysis features, we derive a more concise and effective feature set from a generation model of prosody based on the Fujisaki model, and from the influence of tempo on prosody, i.e., the variability of prosodic components measured with the PVI method. We also propose a boosting-style training method that mines a larger corpus without requiring any annotation. Experiments on the GOR scores of 1297 speech samples from an excellent group of Chinese students aged 14-16 support several conclusions: on the one hand, adding the knowledge sources from prosody generation and tempo contributes a 1.76% reduction in EER and a 0.036 increase in correlation over prosodic features alone; on the other hand, the final result can be considerably improved by the boosting training approach and a topic-dependent scheme.
#8Landmark-based Automated Pronunciation Error Detection
Su-Youn Yoon (Educational Testing Service)
Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)
Richard Sproat (Oregon Health and Science University)
We present a pronunciation error detection method for second language learners of English (L2 learners). The method combines confidence scoring at the phone level with landmark-based Support Vector Machines (SVMs). The landmark-based SVMs were implemented to focus the method on specific phonemes in which L2 learners make frequent errors. The method was trained on the phonemes that are difficult for Korean learners and tested on intermediate Korean learners. On data with a high proportion of non-phonemic errors, the SVM method achieved a significantly higher F-score (0.67) than confidence scoring (0.60). However, combining the two methods without appropriate training data did not lead to improvement. Even for intermediate learners, a high proportion of errors (40%) was related to these difficult phonemes. Therefore, a method specialized for these phonemes would benefit both beginners and intermediate learners.
#9HMM based TTS for Mixed language text
Zhiwei Shuang (University of Science and Technology of China)
Shiying Kang (Department of Computer Science, Tsinghua University, China)
Yong Qin (IBM Research China)
Lirong Dai (University of Science and Technology of China)
Lianhong Cai (Department of Computer Science, Tsinghua University)
In current text content, especially web content, mixed-language material is common, e.g., Mandarin text mixed with English words. To make the synthesized speech of such content sound natural, we need to synthesize the mixed-language content with a single voice. However, this task is very challenging because it is hard to find a voice talent who speaks both languages well enough. The synthesized speech will sound unnatural if an HMM-based TTS system is built directly on a non-native speaker's training corpus. In this paper, we propose to use speaker adaptation technology to leverage a native speaker's data to generate more natural speech for the non-native speaker. Evaluation results show that the proposed method can significantly improve the speaker consistency and naturalness of synthesized speech for mixed-language text.
#10An Analysis of Language Mismatch in HMM State Mapping-Based Cross-Lingual Speaker Adaptation
Hui Liang (Idiap Research Institute & Ecole Polytechnique Fédérale de Lausanne)
John Dines (Idiap Research Institute)
This paper provides an in-depth analysis of the impact of language mismatch on the performance of cross-lingual speaker adaptation. Our work confirms the influence of language mismatch between the average voice distributions used for synthesis and for transform estimation, and the necessity of eliminating this mismatch in order to effectively utilize multiple transforms for cross-lingual speaker adaptation. Specifically, we show that language mismatch introduces unwanted language-specific information when estimating multiple transforms, making these transforms detrimental to adaptation performance. Our analysis demonstrates that speaker characteristics should be separated from language characteristics in order to improve cross-lingual adaptation performance.
#11Classroom Note-taking System for Hearing Impaired Students using Automatic Speech Recognition Adapted to Lectures
Tatsuya Kawahara (Kyoto University)
Norihiro Katsumaru (Kyoto University)
Yuya Akita (Kyoto University)
Shinsuke Mori (Kyoto University)
We are developing a real-time lecture transcription system for hearing impaired students in university classrooms. The automatic speech recognition (ASR) system is adapted to individual lecture courses and lecturers to enhance recognition accuracy. The ASR results are selectively corrected by a human editor, through a dedicated interface, before being presented to the students. An efficient adaptation scheme for the ASR modules has been investigated in this work. The system was tested with a hearing-impaired student in a lecture course on civil engineering. Compared with the current manual note-taking scheme offered by two volunteers, the proposed system generated almost double the amount of text with a single human editor.
#12Exploring Web-Browser based Runtime Engines for Creating Ubiquitous Speech Interfaces
Paul Richard Dixon (National Institute of Information and Communications Technology)
Sadaoki Furui (Tokyo Institute of Technology)
This paper describes an investigation into current browser-based runtimes, including Adobe's Flash and Microsoft's Silverlight, as platforms for delivering web-based speech interfaces. The key difference here is that the browser plugin performs all of the computation, without any server-side processing. The first application is an HMM-based text-to-speech engine running in the Adobe Flash plugin. The second is a WFST-based large vocabulary speech recognition decoder written in C# running inside the Silverlight plugin.

Pitch and glottal-waveform estimation and modeling I

Time:Monday 16:00 Place:International Conference Room D Type:Poster
Chair:T.V. Sreenivas
#1Efficient Three-stage Pitch Estimation for Packet Loss Concealment
Xuejing Sun (Cambridge Silicon Radio)
Sameer Gadre (Cambridge Silicon Radio)
This paper presents a low-complexity pitch estimation algorithm for packet loss concealment. The algorithm divides pitch estimation into three stages, with each additional stage providing further accuracy. Compared with a system based on G.711 Appendix I, the proposed algorithm requires approximately 32 percent fewer cycles on a DSP processor integrated in a Bluetooth chip. Furthermore, objective evaluation of voice quality using PESQ showed that the algorithm yields substantially higher scores.
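The staged coarse-to-fine idea described above can be sketched as follows. This is an illustrative reconstruction under assumed design choices (autocorrelation search, decimation factor, search band, parabolic refinement), not the proprietary algorithm from the paper:

```python
import numpy as np

def three_stage_pitch(x, sr, fmin=60.0, fmax=400.0, decim=4):
    """Illustrative coarse-to-fine pitch search in the spirit of a
    three-stage estimator:
    1) coarse autocorrelation search on a decimated signal,
    2) refined search at the full rate around the coarse lag,
    3) parabolic interpolation for sub-sample lag accuracy."""
    # stage 1: coarse search on the decimated signal (cheap, low resolution)
    xd = x[::decim]
    lo, hi = int(sr / decim / fmax), int(sr / decim / fmin)
    ac = np.array([np.dot(xd[:-k], xd[k:]) for k in range(lo, hi)])
    coarse = (lo + int(np.argmax(ac))) * decim
    # stage 2: refine at the original rate around the coarse lag
    span = range(max(2, coarse - decim), coarse + decim + 1)
    ac2 = np.array([np.dot(x[:-k], x[k:]) for k in span])
    best = list(span)[int(np.argmax(ac2))]
    # stage 3: parabolic interpolation around the best integer lag
    ym1 = np.dot(x[:-(best - 1)], x[best - 1:])
    y0 = np.dot(x[:-best], x[best:])
    yp1 = np.dot(x[:-(best + 1)], x[best + 1:])
    denom = ym1 - 2 * y0 + yp1
    shift = 0.5 * (ym1 - yp1) / denom if denom != 0 else 0.0
    return sr / (best + shift)
```

Each stage narrows the search, which is where the cycle savings on a DSP would come from: the expensive full-rate correlations are only evaluated in a small window around the coarse candidate.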
#2On Evaluation of the F0 estimation based on time-varying complex speech analysis
Keiichi Funaki (University of the Ryukyus)
We have previously proposed a robust fundamental frequency (F0) estimation method based on robust ELS (Extended Least Squares) time-varying complex-valued speech analysis of an analytic speech signal. It has been reported that the method performs better for IRS-filtered speech corrupted by white Gaussian noise or pink noise, since the speech spectrum can be accurately estimated at low frequencies. However, the evaluation used only time-invariant speech analysis, in which the order of the basis expansion was 1. In this paper, the performance of time-varying speech analysis is evaluated on the Keele pitch database with respect to the degree of stationarity of voiced frames. The evaluation demonstrates that the time-varying ELS-based robust complex analysis performs best for strongly stationary voiced frames, although it does not perform better for non-stationary voiced frames.
#3Pitch Estimation in Noisy Speech Based on Temporal Accumulation of Spectrum Peaks
Feng Huang (The Chinese University of Hong Kong)
Tan Lee (The Chinese University of Hong Kong)
We present a study on robust pitch estimation by integrating spectral and temporal information. Spectrum harmonics are important representations of the speech fundamental frequency (F0). Harmonic-related spectral peaks of speech evolve much more slowly than the spectral peaks of noise. This motivates the proposed temporally accumulated peak spectrum (TAPS), which is computed by accumulating spectrum peaks over consecutive analysis frames. In TAPS, harmonic-related peaks are concentrated around the F0 and its multiples, while the peaks caused by noise are irregularly distributed with relatively small amplitude. A pitch estimation method is derived from TAPS: peak locations in the autocorrelation of TAPS indicate the frequency separations between the harmonic peaks, which are used to estimate the F0. The proposed method is evaluated on speech signals corrupted by white noise, speech noise and babble noise. The results show that our method performs more robustly and reliably than conventional methods.
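A minimal sketch of the TAPS idea, with assumed analysis parameters (FFT size, Hann window, search band) that are not taken from the paper:

```python
import numpy as np

def taps_pitch(frames, sr, n_fft=2048, fmin=60.0, fmax=400.0):
    """Rough sketch of pitch estimation via a temporally accumulated
    peak spectrum (TAPS): keep only the local maxima of each frame's
    magnitude spectrum, sum them over consecutive frames, then read the
    F0 off the autocorrelation of the accumulated peak spectrum."""
    acc = np.zeros(n_fft // 2 + 1)
    for frame in frames:
        mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
        peaks = np.zeros_like(mag)
        # keep spectral peaks only (strict local maxima)
        idx = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
        peaks[idx] = mag[idx]
        acc += peaks          # harmonic peaks reinforce across frames; noise peaks do not
    # peaks of the autocorrelation of TAPS sit at multiples of the harmonic spacing
    ac = np.correlate(acc, acc, mode='full')[len(acc) - 1:]
    df = sr / n_fft           # frequency resolution per bin
    lo, hi = int(fmin / df), int(fmax / df)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return lag * df           # harmonic spacing in Hz ~ estimated F0
```

The robustness claim rests on the accumulation step: harmonic peaks recur at the same bins frame after frame, so they dominate `acc`, whereas noise peaks land at different bins each frame and stay small.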
#4Multi-Pitch Estimation by a Joint 2-D Representation of Pitch and Pitch Dynamics
Tianyu T. Wang (MIT Lincoln Laboratory)
Thomas F. Quatieri (MIT Lincoln Laboratory)
Multi-pitch estimation of co-channel speech is especially challenging when the underlying pitch tracks are close in pitch value (e.g., when pitch tracks cross). Building on our previous work in [1], we demonstrate the utility of a two-dimensional (2-D) analysis method of speech for this problem by exploiting its joint representation of pitch and pitch-derivative information from distinct speakers. Specifically, we propose a novel multi-pitch estimation method consisting of 1) a data-driven classifier for pitch candidate selection, 2) local pitch and pitch-derivative estimation by k-means clustering, and 3) a Kalman filtering mechanism for pitch tracking and assignment. We evaluate our method on a database of all-voiced speech mixtures and illustrate its capability to estimate pitch tracks both when the tracks are well separated and when they are close in pitch value (e.g., at crossings).
#5On the Effect of Fundamental Frequency on Amplitude and Frequency Modulation Patterns in Speech Resonances
Pirros Tsiakoulis (National Technical University of Athens, Greece)
Alexandros Potamianos (Technical University of Crete, Greece)
Amplitude modulation (AM) and frequency modulation (FM) in speech signals are believed to reflect various non-linear phenomena during the speech production process. In this paper, the amplitude and frequency modulation patterns are analyzed for the first three speech resonances in relation to the fundamental frequency (F0). The formant tracks are estimated, and the resonant signals are extracted and demodulated. The Amplitude Modulation Index (AMI) and Frequency Modulation Index (FMI) are computed, and examined in relation to the F0 value, as well as the relation between F0 and the first formant value (F1). Both AMI and FMI are significantly affected by pitch, with modulations being more frequently present in low F0 conditions. Evidence of non-linear interaction between the glottal source and the vocal tract is found in the dependence of the modulation patterns on the ratio of F1 over F0. AMI is amplified when pitch harmonics coincide with F1, while FMI shows complementary behavior.
#6Pitch Determination Using Autocorrelation Function in Spectral Domain
M. Shahidur Rahman (Saitama University, Japan)
Tetsuya Shimamura (Saitama University, Japan)
This paper proposes a pitch determination method utilizing the autocorrelation function in the spectral domain. The autocorrelation function is a popular measurement for estimating pitch in the time domain. The performance of the method, however, is affected by the position of the dominant harmonics (usually the first formant) and by spurious peaks introduced in noisy conditions. We apply a series of operations to obtain a noise-compensated and flattened version of the amplitude spectrum, which takes the shape of a harmonic train. Applying the autocorrelation function to this preconditioned spectrum produces a sequence in which the true pitch peak can be readily located. Experiments on long speech signals spoken by a number of male and female speakers have been conducted. The results demonstrate the merit of the proposed method compared with other popular methods.
#7Chirp Complex Cepstrum-based Decomposition for Asynchronous Glottal Analysis
Thomas Drugman (University of Mons)
Thierry Dutoit (University of Mons)
It was recently shown that the complex cepstrum can be used effectively for glottal flow estimation by separating the causal and anticausal components of speech. To guarantee a correct estimation, some constraints on the analysis window have been derived; among these, the window has to be synchronized on a Glottal Closure Instant. This paper proposes an extension of the complex cepstrum-based decomposition that incorporates a chirp analysis. The resulting method is shown to give a reliable estimation of the glottal flow wherever the window is located. This technique is thus well suited for integration into usual speech processing systems, which generally operate in an asynchronous way. Besides, its potential for automatic voice quality analysis is highlighted.
#8Exploiting Glottal Formant Parameters for Glottal Inverse Filtering and Parameterization
Alan O Cinneide (Dublin Institute of Technology)
David Dorran (Dublin Institute of Technology)
Mikel Gainza (Dublin Institute of Technology)
Eugene Coyle (Dublin Institute of Technology)
For many inverse filtering methods it is crucial that time-domain information about the glottal source waveform is known, e.g. the location of the instant of glottal closure. It is often the case that this information is unknown and/or cannot be determined due to, e.g., recording conditions that corrupt the phase spectrum. In these scenarios, alternative strategies are required. This paper describes a method which, given the parameters of the glottal formant of the signal frame, can accurately parameterize the glottal source shape and the vocal tract filter for a broad range of voice quality types, and which is robust to corruption of the phase spectrum.
#9Glottal parameters estimation on speech using the Zeros of the Z Transform
Nicolas Sturmel (LIMSI-CNRS, B.P. 133, F-91403, ORSAY FRANCE)
Christophe d'Alessandro (LIMSI-CNRS, B.P. 133, F-91403, ORSAY FRANCE)
Boris Doval (LAM-IJLRA, UPMC Univ Paris 06, 11 Rue de Lourmel, F-75015, PARIS FRANCE)
This paper presents a method for the joint estimation of the open quotient and the asymmetry quotient of the open phase of the glottal flow from speech. An algorithm based on a source/filter decomposition (the Zeros of the Z Transform - ZZT) is presented. This algorithm is first tested on a database of sustained vowels produced with different voice qualities, then on running speech. Results are evaluated against the value of the open quotient obtained by analysis of the synchronous ElectroGlottoGraphic (EGG) signal. The test shows that the open quotient is estimated within the just noticeable difference of the EGG reference in more than 60% of the cases, and that 75% of the estimates give a value within 25% of the reference. The estimation results for asymmetry are also discussed and confirm previous studies.
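The core ZZT operation, factoring the windowed frame's polynomial and splitting its roots about the unit circle, can be sketched as follows (windowing and the mapping from root sets to open-quotient and asymmetry measures are omitted):

```python
import numpy as np

def zzt_decompose(frame):
    """Sketch of the Zeros of the Z-Transform (ZZT) decomposition: the
    frame x[0..N-1] is viewed as the polynomial
    X(z) = sum_n x[n] z^(N-1-n).  Roots outside the unit circle are
    associated with the anticausal (glottal open-phase) contribution,
    roots inside with the causal (vocal-tract) contribution."""
    roots = np.roots(frame)                 # zeros of the frame polynomial
    outside = roots[np.abs(roots) > 1.0]    # anticausal part
    inside = roots[np.abs(roots) <= 1.0]    # causal part
    return outside, inside
```

In practice the frame must be carefully positioned and windowed (typically GCI-synchronously) for the two root groups to separate cleanly; this sketch only shows the factor-and-split step.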
#10Significance of Pitch Synchronous Analysis for Speaker Recognition using AANN Models
Sri Harish Reddy Mallidi (Speech and Vision lab., International Institute of Information Technology, Hyderabad, India)
Kishore Prahallad (Speech and Vision lab., International Institute of Information Technology, Hyderabad, India)
Suryakanth V Gangashetty (Speech and Vision lab., International Institute of Information Technology, Hyderabad, India)
Yegnanarayana B (Speech and Vision lab., International Institute of Information Technology, Hyderabad, India)
For speaker recognition studies, it is necessary to process the speech signal suitably to capture the speaker-specific information. There is complementary speaker-specific information in the excitation source and the vocal tract system characteristics, so it is necessary to separate these components, even approximately, from the speech signal. We propose the linear prediction (LP) residual and LP coefficients to represent these two components. Analysis is performed in a pitch synchronous manner in order to focus on the significant portion of the speech signal in each glottal cycle, and also to reduce the artifacts of digital signal processing on the extracted features. Finally, the speaker-specific information is captured from the excitation and vocal tract system components using autoassociative neural network (AANN) models. We show that pitch synchronous extraction of information from the residual and the vocal tract system brings out the speaker-specific information much better than pitch asynchronous analysis, as in traditional block processing with an analysis window of fixed size.
Gang Chen (Department of Electrical Engineering, University of California, Los Angeles)
Xue Feng (Department of Electrical Engineering, University of California, Los Angeles)
Yen-Liang Shue (Department of Electrical Engineering, University of California, Los Angeles)
Abeer Alwan (Department of Electrical Engineering, University of California, Los Angeles)
Acoustic characteristics of speech signals differ with gender due to physiological differences of the glottis and the vocal tract. Previous research showed that adding the voice-source related measures H_1^*-H_2^* and H_1^*-A_3^* improved gender classification accuracy compared to using only the fundamental frequency (F0) and formant frequencies. H_i^* refers to the i-th source spectral harmonic magnitude, and A_i^* refers to the magnitude of the source spectrum at the i-th formant. In this paper, three other voice source related measures: CPP, HNR and H_2^*-H_4^* are used in gender classification of children's voices. CPP refers to the Cepstral Peak Prominence, HNR refers to the harmonic-to-noise ratio, and H_2^*-H_4^* refers to the difference between the 2nd and the 4th source spectral harmonic magnitudes. Results show that using these three features improves gender classification accuracy compared with [1].

Keynote 2: Tohru Ifukube - Sound-based Assistive Technology Supporting "Seeing", "Hearing" and "Speaking" for the Disabled and the Elderly

Time:Tuesday 08:30 Place:Hall A/B Type:Keynote
Chair:Keikichi Hirose
08:30Sound-based Assistive Technology Supporting "Seeing", "Hearing" and "Speaking" for the Disabled and the Elderly
Tohru Ifukube (Research Institute for Advanced Science and Technology, University of Tokyo, Japan)
With the rapid increase in the proportion of elderly people, the number of disabled people has also been increasing in Japan. Over a period of 40 years, the author has developed a basic research approach to assistive technology, especially for people with seeing, hearing, and speaking disorders. Although some of the resulting tools have come into practical use for disabled people in Japan, the author has experienced how insufficient the tools' functions are for supporting them. Moreover, the author has been impressed by the amazing potential of the human brain to compensate for these disorders.

Robust ASR

Time:Tuesday 10:00 Place:Hall A/B Type:Oral
Chair:Mike Seltzer
10:00Asymptotically Exact Noise-Corrupted Speech Likelihoods
Rogier van Dalen (Cambridge University Engineering Department)
Mark Gales (Cambridge University Engineering Department)
Model compensation techniques for noise-robust speech recognition approximate the corrupted speech distribution. This paper introduces a sampling method that, given speech and noise distributions and a mismatch function, in the limit calculates the corrupted speech likelihood exactly. Though it is too slow to compensate a speech recognition system, it enables a more fine-grained assessment of compensation techniques, based on the KL divergence of individual components. This makes it possible to evaluate the impact of approximations that compensation schemes make, such as the form of the mismatch function.
10:20A MMSE Estimator in Mel-Cepstral Domain for Robust Large Vocabulary Automatic Speech Recognition using Uncertainty Propagation
Ramón Fernández Astudillo (Chair of Electronics and Medical Signal Processing, Technical University Berlin, Germany)
Reinhold Orglmeister (Chair of Electronics and Medical Signal Processing, Technical University Berlin, Germany)
Uncertainty propagation techniques achieve more robust automatic speech recognition by modeling, in probabilistic form, the information missing after speech enhancement in the short-time Fourier transform (STFT) domain. This information is then propagated into the feature domain, where recognition takes place, and combined with observation uncertainty techniques such as uncertainty decoding. In this paper we show how uncertainty propagation can also be used to yield minimum mean square error (MMSE) estimates of the clean speech directly in the recognition domain. We develop an MMSE estimator for the Mel-cepstral features by propagating the Wiener filter posterior distribution, and show how it outperforms conventional MMSE methods in the STFT domain on the AURORA4 large vocabulary test environment.
10:40Non-negative Matrix Factorization Based Compensation of Music for Automatic Speech Recognition
Bhiksha Raj (Carnegie Mellon University)
Tuomas Virtanen (Tampere University of Technology)
Sourish Chaudhuri (Carnegie Mellon University)
Rita Singh (Carnegie Mellon University)
This paper proposes to use non-negative matrix factorization based speech enhancement in robust automatic recognition of mixtures of speech and music. We represent the magnitude spectra of noisy speech signals as non-negative weighted linear combinations of speech and noise spectral basis vectors, which are obtained from training corpora of speech and music. We use overcomplete dictionaries consisting of random exemplars of the training data. The method is tested on the Wall Street Journal large vocabulary speech corpus, artificially corrupted with polyphonic music from the RWC music database. Various music styles and speech-to-music ratios are evaluated. The proposed methods are shown to produce a consistent, significant improvement in recognition performance compared with the baseline method.
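The decomposition described above can be illustrated with a toy supervised NMF. The Euclidean cost, the multiplicative update rule and the Wiener-style mask are assumptions for illustration; the paper's exact cost function and exemplar dictionaries are not reproduced here:

```python
import numpy as np

def nmf_separate(Y, B_speech, B_music, n_iter=200, eps=1e-9):
    """Toy supervised NMF enhancement: the magnitude spectrogram Y
    (freq x time) is modelled as [B_speech B_music] @ W with fixed
    dictionaries and non-negative activations W, estimated with
    standard multiplicative updates under a Euclidean cost."""
    B = np.hstack([B_speech, B_music])
    W = np.random.default_rng(0).uniform(size=(B.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        # multiplicative update keeps W element-wise non-negative
        W *= (B.T @ Y) / (B.T @ B @ W + eps)
    k = B_speech.shape[1]
    S = B_speech @ W[:k]          # speech part of the model
    M = B_music @ W[k:]           # music part of the model
    return Y * S / (S + M + eps)  # Wiener-style soft mask on the mixture
```

The enhanced magnitude spectrogram returned here would then be converted to recognition features; with overcomplete exemplar dictionaries the same factorization applies, only with many more columns in each dictionary.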
11:00Feature versus Model Based Noise Robustness
Kris Demuynck (Katholieke Universiteit Leuven, Dept. ESAT)
Xueru Zhang (Katholieke Universiteit Leuven, Dept. ESAT)
Dirk Van Compernolle (Katholieke Universiteit Leuven, Dept. ESAT)
Hugo Van hamme (Katholieke Universiteit Leuven, Dept. ESAT)
Over the years, the focus in noise robust speech recognition has shifted from noise robust features to model based techniques such as parallel model combination and uncertainty decoding. In this paper, we contrast prime examples of both approaches in the context of large vocabulary recognition systems such as used for automatic audio indexing and transcription. We look at the approximations the techniques require to keep the computational load reasonable, the resulting computational cost, and the accuracy measured on the Aurora4 benchmark. The results show that a well designed feature based scheme is capable of providing recognition accuracies at least as good as the model based approaches at a substantially lower computational cost.
11:20SNR–Based Mask Compensation for Computational Auditory Scene Analysis Applied to Speech Recognition in a Car Environment
Ji Hun Park (Gwangju Institute of Science and Technology)
Seon Man Kim (Gwangju Institute of Science and Technology)
Jae Sam Yoon (Gwangju Institute of Science and Technology)
Hong Kook Kim (Gwangju Institute of Science and Technology)
Sung Joo Lee (Electronics and Telecommunications Research Institute)
Yunkeun Lee (Electronics and Telecommunications Research Institute)
In this paper, we propose a computational auditory scene analysis (CASA)-based front-end for two-microphone speech recognition in a car environment. One of the important issues in CASA is the accurate estimation of mask information for target speech separation from multi-microphone noisy speech. To this end, the time-frequency mask information is compensated using the signal-to-noise ratio obtained from a beamformer, to adjust for the amount of noise included in the noisy speech. We evaluate the performance of an automatic speech recognition system employing a CASA-based front-end with the proposed mask compensation method, and compare it with systems employing a CASA-based front-end without mask compensation and a beamforming-based front-end. The CASA-based front-end with the proposed method achieves relative WER reductions of 26.52% and 8.57% compared with the beamforming-based front-end and the CASA-based front-end alone, respectively.
11:40Automatic Selection of Thresholds for Signal Separation Algorithms Based on Interaural Delay
Chanwoo Kim (Carnegie Mellon University)
Kiwan Eom (Samsung Electronics)
Jaewon Lee (Samsung Electronics)
Richard Stern (Carnegie Mellon University)
In this paper we describe a system that separates signals by comparing the interaural time delays (ITDs) of their time-frequency components to a fixed threshold ITD. While in previous algorithms the fixed threshold ITD had been obtained empirically from training data in a specific environment, in real environments the characteristics that affect the optimal value of this threshold are unknown and possibly time-varying. If these conditions differ from the environment in which the ITD threshold was pre-computed, the performance of the source separation system degrades. In this paper, we present an algorithm that chooses the threshold ITD that minimizes the cross-correlation of the target and interfering signals after a compressive nonlinearity. We demonstrate that the algorithm described in this paper provides speech recognition accuracy that is much more robust to changes in environment than would be obtained using a fixed threshold ITD.
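The threshold-selection criterion can be sketched as follows. The stream construction from a binary ITD mask and the particular compressive nonlinearity (a small power law on stream energy envelopes) are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def best_itd_threshold(itd, spec_l, spec_r, candidates, power=0.1):
    """Pick the ITD threshold whose separated streams are least
    correlated.  `itd` holds per-cell interaural time delays and
    `spec_l`/`spec_r` the two channels' spectrograms (freq x time)."""
    energy = np.abs(spec_l) ** 2 + np.abs(spec_r) ** 2
    best, best_corr = None, np.inf
    for th in candidates:
        mask = np.abs(itd) < th                      # cells assigned to the target
        tgt = (energy * mask).sum(axis=0) ** power   # compressed target envelope
        itf = (energy * ~mask).sum(axis=0) ** power  # compressed interference envelope
        tgt = tgt - tgt.mean()
        itf = itf - itf.mean()
        denom = np.linalg.norm(tgt) * np.linalg.norm(itf)
        # a degenerate split (one empty stream) is treated as maximally bad
        corr = abs(np.dot(tgt, itf)) / denom if denom > 0 else 1.0
        if corr < best_corr:
            best, best_corr = th, corr
    return best
```

The intuition matches the abstract: a well-placed threshold yields two streams driven by different sources, whose (compressed) energy envelopes are nearly uncorrelated; a badly placed threshold leaks one source into both streams and raises their correlation.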

Language and dialect identification

Time:Tuesday 10:00 Place:201A Type:Oral
Chair:Tomoko Matsui
10:00Channel Detectors for System Fusion in the Context of NIST LRE 2009
Florian Verdet (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, France)
Driss Matrouf (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, France)
Jean-François Bonastre (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, France)
Jean Hennebert (Département d'Informatique, Université de Fribourg, Fribourg, Switzerland)
One of the difficulties in language recognition is the variability of the speech signal due to speakers and channels. If the channel mismatch is too large and different categories of channels can be identified, one possibility is to build a specific language recognition system for each category and then fuse them together. This article uses a system selector that takes, for each utterance, the scores of one of the channel-category-dependent systems. The selection is guided by a channel detector. We analyze different ways to design such channel detectors, based either on cepstral features or on the Factor Analysis channel variability term. The systems are evaluated in the context of NIST's LRE 2009 and run at 1.65% min-C_avg on a subset of 8 languages and at 3.85% min-C_avg on the 23-language protocol.
10:20Selecting Phonotactic Features for Language Recognition
Rong Tong (Institute for Infocomm Research, Singapore)
Bin Ma (Institute for Infocomm Research, Singapore)
Haizhou Li (Institute for Infocomm Research, Singapore)
Eng Siong Chng (Nanyang Technological University, Singapore)
This paper studies feature selection in phonotactic language recognition. The phonotactic features are represented by n-gram statistics derived from one or more phone recognizers in the form of high-dimensional feature vectors. Two feature selection strategies are proposed to select the n-gram statistics and reduce the dimension of the feature vectors, so that higher-order n-gram features can be adopted in language recognition. With the proposed feature selection techniques, we achieved an equal error rate (EER) of 1.84% with 4-gram statistics on the 2007 NIST Language Recognition Evaluation 30s closed test set.
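The n-gram statistics and one simple frequency-based selection strategy might be sketched as below. Real phonotactic systems typically use expected counts from phone lattices and more elaborate selection criteria; plain 1-best counts and a top-k cutoff are used here for brevity:

```python
from collections import Counter
from itertools import islice

def ngram_stats(phones, n=3):
    """Relative frequencies of phone n-grams in a decoded phone string."""
    grams = Counter(zip(*(islice(phones, i, None) for i in range(n))))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def select_features(utterances, top_k):
    """Keep the top_k n-grams with the largest total relative frequency
    across utterances; each utterance then becomes a fixed-dimension
    vector over this selected set (one simple selection strategy)."""
    pool = Counter()
    for u in utterances:
        pool.update(ngram_stats(u))
    keep = [g for g, _ in pool.most_common(top_k)]
    return [[ngram_stats(u).get(g, 0.0) for g in keep] for u in utterances]
```

Selection is what makes higher-order n-grams tractable: the full 4-gram space over a phone set is enormous, but only a small fraction of entries carries mass, so a selected subset keeps the vectors at a dimension an SVM can handle.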
10:40Improved Language Recognition using Mixture Components Statistics
Abualsoud Hanani (Department of Electronic, Electrical and Computer Engineering - The University of Birmingham - UK)
Michael Carey (Department of Electronic, Electrical and Computer Engineering - The University of Birmingham - UK)
Martin Russell (Department of Electronic, Electrical and Computer Engineering - The University of Birmingham - UK)
One successful approach to language recognition is to focus on the most discriminative high-level features of languages, such as phones and words. In this paper, we apply a similar approach to acoustic features using a single GMM tokenizer followed by discriminatively trained language models. A feature selection technique based on the Support Vector Machine (SVM) is used to model higher-order n-grams. Three different ways to build the tokenizer are explored and compared using a discriminative uni-gram model and a generative GMM-UBM. A discriminative uni-gram system using a very large GMM tokenizer with 24,576 components yields an EER of 1.66%, improving to 0.71% when fused with other acoustic approaches, on the NIST'03 LRE 30s evaluation.
11:00Using Cross-Decoder Co-Occurrences of Phone N-Grams in SVM-based Phonotactic Language Recognition
Mikel Penagarikano (University of the Basque Country)
Amparo Varona (University of the Basque Country)
Luis Javier Rodriguez-Fuentes (University of the Basque Country)
German Bordel (University of the Basque Country)
In common approaches to phonotactic language recognition, decodings are processed and scored in a fully uncoupled way, their time alignment being completely lost. Recently, we presented a new approach to phonotactic language recognition which takes time alignment information into account by considering cross-decoder co-occurrences of phones or phone n-grams at the frame level. In this work, the approach based on cross-decoder co-occurrences of phone n-grams is further developed and evaluated. Systems were built by means of open-source software, and experiments were carried out on the NIST LRE2007 database. A system based on co-occurrences of phone n-grams (up to 4-grams) outperformed the baseline phonotactic system, yielding around 8% relative improvement in terms of EER. The best fused system attained 1.90% EER, which supports the use of cross-decoder dependencies for improved language modeling.
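The frame-level co-occurrence counting at the heart of this approach can be sketched as below; the normalisation to relative frequencies stands in for the full SVM feature preparation, and unigram labels are used where the paper also pairs phone n-grams:

```python
from collections import Counter

def cooccurrence_features(stream_a, stream_b):
    """Sketch of frame-level cross-decoder co-occurrences: given two
    frame-synchronous phone label sequences (one per phone decoder),
    count how often each pair of labels is active at the same frame and
    normalise to relative frequencies.  These pair statistics play the
    role of n-gram statistics in an SVM-based phonotactic system."""
    assert len(stream_a) == len(stream_b)
    counts = Counter(zip(stream_a, stream_b))
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}
```

Unlike plain per-decoder n-gram counts, these features only exist because the time alignment between decoders is retained, which is exactly the information conventional uncoupled scoring discards.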
11:20Exploiting variety-dependent Phones in Portuguese Variety Identification applied to Broadcast News Transcription
Oscar Koller (Berlin University of Technology / INESC-ID Lisboa)
Alberto Abad (INESC-ID Lisboa)
Isabel Trancoso (INESC-ID Lisboa / IST Lisboa)
Céu Viana (CLUL)
This paper presents a Variety IDentification (VID) approach and its application to broadcast news transcription for Portuguese. The phonotactic VID system, based on Phone Recognition and Language Modelling, focuses on a single tokenizer that combines distinctive knowledge about differences between the target varieties. This knowledge is introduced into a Multi-Layer Perceptron phone recognizer by training mono-phone models for two varieties as contrasting phone-like classes. Significant improvements in terms of identification rate were achieved compared to conventional single and fused phonotactic and acoustic systems. The VID system is used to select data to automatically train variety-specific acoustic models for broadcast news transcription. The impact of the selection is analyzed and variety-specific recognition is shown to improve results by up to 13% compared to a standard variety baseline.
11:40Dialect Recognition Using a Phone-GMM-Supervector-Based SVM Kernel
Fadi Biadsy (Columbia University)
Julia Hirschberg (Columbia University)
Michael Collins (MIT Computer Science and Artificial Intelligence Laboratory)
In this paper, we introduce a new approach to dialect recognition which relies on the hypothesis that certain phones are realized differently across dialects. Given a speaker's utterance, we first obtain the most likely phone sequence using a phone recognizer. We then extract GMM Supervectors for each phone instance. Using these vectors, we design a kernel function that computes the similarities of phones between pairs of utterances. We employ this kernel to train SVM classifiers that estimate posterior probabilities, used during recognition. Testing our approach on four Arabic dialects from 30s cuts, we compare our performance to five approaches: PRLM; GMM-UBM; our own improved version of GMM-UBM which employs fMLLR adaptation; our recent discriminative phonotactic approach; and a state-of-the-art system: SDC-based GMM-UBM discriminatively trained. Our kernel-based technique outperforms all these previous approaches; the overall EER of our system is 4.9%.
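The phone-based supervector similarity the abstract describes can be sketched as follows: per-phone GMM supervectors are averaged within each utterance and compared across utterances over the phones both share. The cosine similarity and the summation over shared phones are simplifying assumptions for illustration; the paper's kernel may differ in detail.

```python
import numpy as np

def phone_supervector_kernel(utt1, utt2):
    """Illustrative similarity between two utterances from per-phone supervectors.

    utt1, utt2: dicts mapping phone label -> list of GMM supervectors
    (np arrays), one per phone instance in the utterance. Instances are
    averaged per phone; cosine similarities are summed over phones that
    occur in both utterances.
    """
    score = 0.0
    for phone in set(utt1) & set(utt2):
        v1 = np.mean(utt1[phone], axis=0)
        v2 = np.mean(utt2[phone], axis=0)
        score += float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return score
```

A kernel of this shape can be plugged into any SVM implementation that accepts precomputed Gram matrices.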

Technologies for learning and education

Time:Tuesday 10:00 Place:201B Type:Oral
Chair:Jared Bernstein
10:00Discriminative Acoustic Model for Improving Mispronunciation Detection and Diagnosis in Computer-Aided Pronunciation Training (CAPT)
Xiaojun Qian (Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong)
Frank Soong (Speech Group, Microsoft Research Asia)
Helen Meng (Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong)
How to correctly detect mispronounced words/phonemes and provide appropriate, to-the-point diagnostic feedback is a challenging task in Computer-Aided Pronunciation Training. In this study, we propose a discriminative training algorithm to jointly optimize error detection performance (i.e., false rejection and false acceptance) and diagnostic feedback accuracy (i.e., pinpointing accurately the mispronounced words/phones and providing proper feedback). An optimization procedure, similar to Minimum Word Error (MWE) discriminative training, is developed to refine the ML-trained HMMs. The errors to be minimized are obtained by comparing phoneticians' hand transcriptions of the training utterances with the canonical pronunciations of the words and common mispronunciations, which are embedded in a “confusion network” (compiled from hand-crafted rules or data-driven rules derived from labeled training data). A database of 8,575 English utterances (split into 5,988 for training and 2,587 for testing) spoken by 100 Cantonese learners of English is used to measure the performance of the new algorithm. Several conclusions can be drawn from the experiments: (1) data-driven rules are more effective than hand-crafted ones in capturing (modeling) mispronunciations; (2) compared with the ML training baseline, discriminative training can reduce the false rejection and diagnostic errors while degrading the false acceptance performance slightly, due to the smaller number of false-acceptance samples in the training set.
10:20Automatic Pronunciation Scoring using Learning to Rank and DP-based Score Segmentation
Liang-Yu Chen (Institute of Information Systems and Applications, National Tsing Hua University, Taiwan)
Jyh-Shing Roger Jang (Department of Computer Science, National Tsing Hua University, Taiwan)
This paper proposes a novel automatic pronunciation scoring framework using learning to rank. Human scores of the utterances are treated as ranks and are used as the ranking ground truths. Scores generated from various existing scoring methods are used as the features to train the learning to rank function. The output of the function is then segmented by the proposed DP-based method and hence boundaries between clusters can be used to determine the discrete computer scores. Experimental results show that the proposed framework improves upon the existing scoring methods. A non-native corpus with human ranks is also released.
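The DP-based score segmentation step can be illustrated with the classic 1-D optimal segmentation: split the sorted outputs of the ranking function into k contiguous segments minimizing within-segment squared error, so that segment boundaries give the cut points between discrete score clusters. This is a generic sketch under that assumption; the paper's exact cost function is not reproduced here.

```python
def dp_segment(scores, k):
    """Split sorted 1-D scores into k contiguous segments minimizing the
    total within-segment sum of squared deviations (classic 1-D DP)."""
    xs = sorted(scores)
    n = len(xs)

    def sse(i, j):
        # Sum of squared deviations of segment xs[i:j] from its mean.
        seg = xs[i:j]
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)

    INF = float("inf")
    best = [[INF] * (k + 1) for _ in range(n + 1)]   # best[end][j]: cost of xs[:end] in j segments
    back = [[0] * (k + 1) for _ in range(n + 1)]     # back[end][j]: start of the last segment
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for end in range(1, n + 1):
            for start in range(j - 1, end):
                c = best[start][j - 1] + sse(start, end)
                if c < best[end][j]:
                    best[end][j] = c
                    back[end][j] = start
    # Recover segment boundaries by backtracking.
    bounds, end = [], n
    for j in range(k, 0, -1):
        start = back[end][j]
        bounds.append((start, end))
        end = start
    return [xs[a:b] for a, b in reversed(bounds)]
```

Each recovered segment would then be assigned one discrete computer score.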
10:40Automatic Derivation of Phonological Rules for Mispronunciation Detection in a Computer-Assisted Pronunciation Training System
Wai-Kit Lo (The Chinese University of Hong Kong)
Shuang Zhang (The Chinese University of Hong Kong)
Helen Meng (The Chinese University of Hong Kong)
Computer-Assisted Pronunciation Training (CAPT) systems have become an important learning aid in second language (L2) learning. Our approach to CAPT is based on the use of phonological rules to capture language transfer effects that may cause mispronunciations. This paper presents an approach for the automatic derivation of phonological rules from L2 speech. The rules are used to generate an extended recognition network (ERN) that captures the canonical pronunciations of words, as well as possible mispronunciations. The ERN is used with automatic speech recognition for mispronunciation detection. Experimentation with an L2 speech corpus containing recordings from 100 speakers compares the automatically derived rules with manually authored rules. Comparable performance is achieved in mispronunciation detection (i.e., telling which phone is wrong). The automatically derived rules also offer improved diagnostic accuracy (i.e., identifying how the phone is wrong).
11:00Adapting a Prosodic Synthesis Model to Score Children’s Oral Reading
Minh Duong (Carnegie Mellon University)
Jack Mostow (Carnegie Mellon University)
We describe an automated method to assess children’s oral reading using a prosodic synthesis model trained on multiple adults’ speech. We evaluate it against a previous method that correlated the prosodic contours of children’s oral reading against adult narrations of the same sentences. We compare how well the two methods predict fluency and comprehension test scores and gains of 55 children ages 7-10 who used Project LISTEN’s Reading Tutor. The new method does better on both tasks without requiring an adult narration of every sentence.
11:20Predicting Word Accuracy for Automatic Speech Recognition of Non-Native Speech
Su-Youn Yoon (Educational Testing Service)
Lei Chen (Educational Testing Service)
Klaus Zechner (Educational Testing Service)
We have developed an automated method that predicts the word accuracy of a speech recognition system for non-native speech, in the context of speaking proficiency scoring. A model was trained using features based on speech recognizer scores, function word distributions, prosody, background noise, and speaking fluency. Since the method was implemented for non-native speech, fluency features, which have been used for non-native speakers' proficiency scoring, were implemented along with several feature groups used from past research. The fluency features showed promising performance by themselves, and improved the overall performance in tandem with other more traditional features. A model using stepwise regression achieved a correlation with word accuracy rates of 0.76, compared to a baseline of 0.63 using only confidence scores. A binary classifier for placing utterances in high-or low-word accuracy bins achieved an accuracy of 84%, compared to a majority class baseline of 64%.
11:40A New Approach for Automatic Tone Error Detection in Strong Accented Mandarin Based on Dominant Set
Taotao Zhu (Institute of Automation, Chinese Academy of Sciences)
Dengfeng Ke (Institute of Automation, Chinese Academy of Sciences)
Zhenbiao Chen (Institute of Automation, Chinese Academy of Sciences)
Bo Xu (Institute of Automation, Chinese Academy of Sciences)
In this paper, we propose a new approach based on dominant sets [1] for tone error detection in strongly accented Mandarin. First, the final boundary generated from forced alignment is adjusted using the F0 contour in order to locate the final domain more accurately. Next, suitable normalization techniques are explored for the tone features. Finally, clustering and classification methods based on dominant sets are used for tone error detection. Compared with the traditional k-means-based method, the proposed approach achieves better performance, with an average cross-correlation of 0.84 between human and machine scores, close to that between human raters, which verifies its effectiveness. A further advantage of this approach is that not only can tone mispronunciations be reliably identified, but the F0 pattern of each tone error can also be provided as informative feedback.

Emotional Speech

Time:Tuesday 10:00 Place:302 Type:Oral
Chair:Laurence Devillers
10:00Analysis of Excitation Source Information in Emotional Speech
S R M Prasanna (Indian Institute of Technology Guwahati)
D Govind (Indian Institute of Technology Guwahati)
The objective of this work is to analyze the effect of emotions on the excitation source of speech production. The neutral, angry, happy, boredom and fear emotions are considered for the study. Initially, the electroglottogram (EGG) and its derivative signals are compared across the different emotions. The mean, standard deviation and contour of the instantaneous pitch, and the strength of excitation, are derived by processing the derivative of the EGG and also the speech signal using the zero-frequency filtering (ZFF) approach. A comparative study of these features across emotions reveals that the effect of emotions on the excitation source is distinct and significant. Comparing the parameters derived from the derivative of the EGG and from the speech waveform indicates that both show the same trend and range, suggesting that either may be used. The computed parameters are also found to be effective in a prosodic modification task.
10:20Acoustic Feature Analysis in Speech Emotion Primitives Estimation
Dongrui Wu (University of Southern California)
Thomas Parsons (University of Southern California)
Shrikanth Narayanan (University of Southern California)
We recently proposed a family of robust linear and nonlinear estimation techniques for recognizing the three emotion primitives--valence, activation, and dominance--from speech. These were based on both local and global speech duration, energy, MFCC and pitch features. This paper aims to study the relative importance of these four categories of acoustic features in this emotion estimation context. Three measures are considered: the number of features from each category when all features are used in selection, the mean absolute error (MAE) when each category is used separately, and the MAE when a category is excluded from feature selection. We find that the relative importance is in the order of MFCC > Energy = Pitch > Duration. Additionally, estimator fusion almost always improves performance, and locally weighted fusion always outperforms average fusion regardless of the number of features used.
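The category-ablation methodology described above (MAE with each feature category used alone, and with the category excluded) can be sketched generically as below. The least-squares linear estimator and same-data evaluation are simplifications of the paper's robust estimators and evaluation protocol, used here only to show the mechanics.

```python
import numpy as np

def mae_by_category(X, y, categories):
    """Mean absolute error of a least-squares linear estimator trained on
    (a) each feature category alone and (b) all-but-one category.

    X: (n, d) feature matrix; y: (n,) targets;
    categories: dict name -> list of column indices in X.
    Training and scoring on the same data here is purely illustrative.
    """
    def fit_mae(cols):
        A = np.column_stack([X[:, cols], np.ones(len(X))])  # add intercept
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        return float(np.mean(np.abs(A @ w - y)))

    alone = {name: fit_mae(cols) for name, cols in categories.items()}
    excluded = {name: fit_mae([c for n, cols in categories.items()
                               if n != name for c in cols])
                for name in categories}
    return alone, excluded
```

A category whose "alone" MAE is low and whose "excluded" MAE is high is the most important one, matching the ranking logic the abstract reports.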
10:40Spectro-Temporal Modulations for Robust Speech Emotion Recognition
Lan-Ying Yeh (Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan 300, R.O.C.)
Tai-Shih Chi (Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan 300, R.O.C.)
Speech emotion recognition is mostly considered in clean speech. In this paper, joint spectro-temporal features (RS features) are extracted from an auditory model and are applied to detect the emotion status of noisy speech. The noisy speech is derived from the Berlin Emotional Speech database with added white and babble noises under various SNR levels. The clean train/noisy test scenario is investigated to simulate conditions with unknown noisy sources. The sequential forward floating selection (SFFS) method is adopted to demonstrate the redundancy of RS features and further dimensionality reduction is conducted. Compared to conventional MFCCs plus prosodic features, RS features show higher recognition rates especially in low SNR conditions.
11:00Quantification of Prosodic Entrainment in Affective Spontaneous Spoken Interactions of Married Couples
Chi-Chun Lee (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California, USA)
Matthew Black (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California, USA)
Athanasios Katsamanis (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California, USA)
Adam Lammert (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California, USA)
Brian Baucom (Department of Psychology, University of Southern California, Los Angeles, CA, USA)
Andrew Christensen (Department of Psychology, University of California, Los Angeles, Los Angeles, CA, USA)
Panayiotis G. Georgiou (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California, USA)
Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL) and Department of Psychology, University of Southern California, Los Angeles, California, USA)
Interaction synchrony among interlocutors happens naturally as people gradually adapt their speaking style to promote efficient communication. In this work, we quantify one aspect of interaction synchrony - prosodic entrainment, specifically of pitch and energy - in married couples' problem-solving interactions using speech signal-derived measures. Statistical tests demonstrate that some of these measures capture useful information: they show higher values in interactions with couples having a highly positive attitude compared to a highly negative attitude. Further, by using quantized entrainment measures with statistical symbol sequence matching in a maximum likelihood framework, we obtained 76% accuracy in predicting positive vs. negative affect.
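One simple signal-derived entrainment measure of the kind the abstract describes - correlating turn-level prosodic statistics across the two interlocutors - can be sketched as below. The choice of statistic (e.g. median pitch per turn) and the one-to-one pairing of turns are illustrative assumptions; the paper uses several measures.

```python
import math

def entrainment_correlation(speaker1_turns, speaker2_turns):
    """Pearson correlation between turn-level prosodic statistics
    (e.g. median pitch or mean energy per turn) of two interlocutors
    over successive turn pairs: one crude entrainment measure.
    Assumes turns are paired in order and neither series is constant."""
    n = min(len(speaker1_turns), len(speaker2_turns))
    a, b = speaker1_turns[:n], speaker2_turns[:n]
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

Higher values indicate one speaker's prosody tracking the other's across the interaction.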
11:20A Cluster-Profile Representation of Emotion Using Agglomerative Hierarchical Clustering
Emily Mower (University of Southern California)
Kyu Han (University of Southern California)
Sungbok Lee (University of Southern California)
Shrikanth Narayanan (University of Southern California)
The proper representation of emotion is critical in classification systems. In previous research, we demonstrated that emotion profile (EP) based representations are effective for this task. In EP-based representations, emotions are expressed in terms of underlying affective components from the subset of anger, happiness, neutrality, and sadness. The current study explores cluster profiles (CP), an alternate profile representation in which the components are no longer semantic labels, but clusters inherent in the feature space. This unsupervised clustering of the feature space permits the application of a system-level semi-supervised learning paradigm. The results demonstrate that CPs are similarly discriminative to EPs (EP classification accuracy: 68.37% vs. 69.25% for the CP-based classification). This suggests that exhaustive labeling of a representative training corpus may not be necessary for emotion classification tasks.
11:40Incremental Acoustic Valence Recognition: an Inter-Corpus Perspective on Features, Matching, and Performance in a Gating Paradigm
Bjoern Schuller (CNRS-LIMSI)
Laurence Devillers (CNRS-LIMSI)
It is not fully known how long it takes a human to reliably recognize emotion in speech from the beginning of a phrase. However, many technical applications demand very quick system responses, e.g. to prepare different feedback alternatives before the end of a speaker turn in a dialog system. We therefore investigate this ‘gating paradigm’, employing two spoken language resources in a cross-corpus and combined manner with a focus on valence: we determine how quickly a reliable estimate is obtainable and whether matching by models trained on the same length of speech prevails. In addition, we analyze how individual feature groups, by type and derived functionals, respond, and find considerably different behavior. The language resources have been chosen to cover both manually segmented and automatically segmented speech. In the result, one second of speech is sufficient on the datasets considered.

Speech Synthesis III: HMM-based Speech Synthesis

Time:Tuesday 10:00 Place:International Conference Room A Type:Poster
Chair:Takao Kobayashi
#1Sinusoidal model parameterization for HMM-based TTS system
Slava Shechtman (IBM Research, Haifa Research Lab, Israel)
Alex Sorin (IBM Research, Haifa Research Lab, Israel)
A sinusoidal representation of speech is an alternative to the source-filter model. It is widely used in speech coding and unit-selection TTS, but is less common in statistical TTS frameworks. In this work we utilize Regularized Cepstral Coefficients (RCC), estimated on a mel-frequency scale, for amplitude spectrum envelope modeling within an HMM-based TTS platform. Improved subjective quality is reported for mel-frequency RCC (MRCC) combined with sinusoidal-model-based reconstruction, compared to the state-of-the-art MGC-LSP parameters.
#2Improved Training of Excitation for HMM-based Parametric Speech Synthesis
Yoshinori Shiga (National Institute of Information and Communications Technology, Japan)
Tomoki Toda (Nara Institute of Science and Technology, Japan)
Shinsuke Sakai (National Institute of Information and Communications Technology, Japan)
Hisashi Kawai (National Institute of Information and Communications Technology, Japan)
This paper presents an improved method of training the unvoiced filter that forms part of an excitation model, within the framework of parametric speech synthesis based on hidden Markov models. The conventional approach calculates the unvoiced filter response from the differential signal between the residual and the voiced excitation estimate. The differential signal, however, includes the error generated by the voiced excitation estimate. Contaminated by this error, the unvoiced filter tends to be overestimated, which causes the synthetic speech to be noisy. In order for unvoiced filter training to obtain targets that are free from this contamination, the improved approach first separates the non-periodic component of the residual signal from the periodic component. The unvoiced filter is then trained from the non-periodic component signals. Experimental results show that unvoiced filter responses trained with the new approach are clearly noiseless, in contrast to those trained with the conventional approach.
#3Excitation Modeling Based on Waveform Interpolation for HMM-based Speech Synthesis
June Sig Sung (Seoul National University)
Doo Hwa Hong (Seoul National University)
Kyung Hwan Oh (Seoul National University)
Nam Soo Kim (Seoul National University)
It is generally known that a well-designed excitation produces high-quality signals in hidden Markov model (HMM)-based speech synthesis systems. This paper proposes a novel technique for generating excitation based on waveform interpolation (WI). For modeling the WI parameters, we employ a statistical method, principal component analysis (PCA). The parameters of the proposed excitation modeling technique can easily be combined with a conventional speech synthesis system under the HMM framework. In a number of experiments, the proposed method has been found to generate more natural-sounding speech.
#4Formant-based Frequency Warping for Improving Speaker Adaptation in HMM TTS
Xin Zhuang (Microsoft Research Asia, Beijing, China)
Yao Qian (Microsoft Research Asia, Beijing, China)
Frank K Soong (Microsoft Research Asia, Beijing, China)
Yijian Wu (Microsoft China, Beijing, China)
In this paper we investigate frequency warping based explicitly on the mapping between the first four formant frequencies of 5 long vowels recorded by source and target speakers. A universal warping function is constructed for improving MLLR-based speaker adaptation performance in TTS. The function is used to warp the frequency scale of a source speaker’s data toward that of the target speaker’s data, and an HMM is trained on the frequency-warped features of the source speaker. Finally, MLLR-based speaker adaptation is applied to the trained HMM for synthesizing the target speaker’s speech. When tested on a database of 4,000 sentences from the source speaker and 100 sentences each from a male and a female target speaker, formant-based frequency warping has been found very effective in reducing log spectral distortion over the system without it, an improvement also confirmed subjectively in AB preference and ABX speaker similarity listening tests.
#5Improved modelling of speech dynamics using non-linear formant trajectories for HMM-based speech synthesis
Hongwei Hu (University of Birmingham)
Martin Russell (University of Birmingham)
This paper describes the use of non-linear formant trajectories to model speech dynamics. The performance of the non-linear formant dynamics model is evaluated using HMM-based speech synthesis experiments, in which the 12 dimensional parallel formant synthesiser control parameters and their time derivatives are used as the feature vectors in the HMM. Two types of formant synthesiser control parameters, named piecewise constant and smooth trajectory parameters, are used to drive the classic parallel formant synthesiser. The quality of the synthetic speech is assessed using three kinds of subjective tests. This paper shows that the non-linear formant dynamics model can improve the performance of HMM-based speech synthesis.
#6Global Variance Modeling on the Log Power Spectrum of LSPs for HMM-based Speech Synthesis
Zhen-Hua Ling (iFLYTEK Speech Lab, University of Science and Technology of China)
Yu Hu (iFLYTEK Speech Lab, University of Science and Technology of China)
Li-Rong Dai (iFLYTEK Speech Lab, University of Science and Technology of China)
This paper presents a method to model the global variance (GV) of the log power spectra derived from line spectral pairs (LSPs) over a sentence for HMM-based parametric speech synthesis. Unlike the conventional GV method, where the observations for GV model training are the variances of the spectral parameters for each training sentence, the proposed method directly models the temporal variance of each frequency point in the spectral envelope reconstructed from the LSPs. At the synthesis stage, the likelihood function of the trained GV model is integrated into the maximum likelihood parameter generation algorithm to alleviate the over-smoothing effect on the generated spectral structures. Experimental results show that the proposed method outperforms the conventional GV method when LSPs are used as the spectral parameters, and significantly improves the naturalness of the synthetic speech.
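The quantity a GV model is trained on - the per-dimension temporal variance of a parameter trajectory over one sentence - can be computed as below. That in the proposed method the dimensions would be frequency bins of the log power spectrum reconstructed from the LSPs, rather than the LSPs themselves, is our reading of the abstract, not code from the paper.

```python
import numpy as np

def global_variance(frames):
    """Per-dimension temporal variance over one sentence: the observation
    vector used to train a GV model. `frames` is (num_frames, num_dims);
    each row is one frame's parameter vector (e.g. per-bin log power)."""
    frames = np.asarray(frames, dtype=float)
    return frames.var(axis=0)  # shape (num_dims,)
```

At synthesis time, the GV model's likelihood over this vector would be added to the parameter generation objective to counteract over-smoothing.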
#7Autoregressive clustering for HMM speech synthesis
Matt Shannon (Cambridge University Engineering Department)
William Byrne (Cambridge University Engineering Department)
The autoregressive HMM has been shown to provide efficient parameter estimation and high-quality synthesis, but in previous experiments decision trees derived from a non-autoregressive system were used. In this paper we investigate the use of autoregressive clustering for autoregressive HMM-based speech synthesis. We describe decision tree clustering for the autoregressive HMM and highlight differences to the standard clustering procedure. Subjective listening evaluation results suggest that autoregressive clustering improves the naturalness of the resulting speech. We find that the standard minimum description length (MDL) criterion for selecting model complexity is inappropriate for the autoregressive HMM. Investigating the effect of model complexity on naturalness, we find that a large degree of overfitting is tolerated without a substantial decrease in naturalness.
#8An Implementation of Decision Tree-Based Context Clustering on Graphics Processing Units
Nicholas Pilkington (University of Cambridge)
Heiga Zen (Toshiba Research Europe Ltd.)
Decision tree-based context clustering is essential but time-consuming while building HMM-based speech synthesis systems. It seeks to cluster HMM states (or streams) based on their context to maximize the log likelihood of the model to the training data. Its widely used implementation is not designed to take advantage of highly parallel architectures, such as GPUs. This paper shows an implementation of tree-based clustering for these highly parallel architectures. Experimental results showed that the new implementation running on GPUs was an order of magnitude faster than the conventional one running on CPUs.
#9Quantized HMMs for Low Footprint Text-To-Speech Synthesis
Alexander Gutkin (Phonetic Arts, Ltd.)
Xavi Gonzalvo (Phonetic Arts, Ltd.)
Stefan Breuer (Phonetic Arts, Ltd.)
Paul Taylor (Phonetic Arts, Ltd.)
This paper proposes the use of Quantized Hidden Markov Models (QHMMs) for reducing the footprint of a conventional parametric HMM-based TTS system. Previously, this technique was successfully applied to automatic speech recognition on embedded devices without loss of recognition performance. In this paper we investigate the construction of different quantized HMM configurations that serve as input to the standard ML-based parameter generation algorithm. We use both subjective and objective tests to compare the resulting systems. Subjective results for specific compression configurations show no significant preference, although some spectral distortion is reported. We conclude that a trade-off is necessary in order to satisfy both speech quality and low memory footprint requirements.
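A minimal sketch of the idea behind quantized HMMs - replacing floating-point model parameters with small integer codes plus a dequantization rule - assuming simple uniform scalar quantization; the paper's actual configurations may use codebooks or subvector schemes instead.

```python
import numpy as np

def quantize_params(values, bits):
    """Uniform scalar quantization of HMM parameters (e.g. Gaussian means)
    to a 2**bits-level grid, shrinking storage from floats to small ints.
    Returns integer codes plus the (lo, step) needed to dequantize."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((values - lo) / step).astype(np.uint16)
    return codes, lo, step

def dequantize(codes, lo, step):
    """Recover approximate parameter values from the integer codes."""
    return lo + codes * step
```

The reconstruction error of uniform quantization is bounded by half a step, which is the kind of distortion the subjective tests in the paper would trade off against footprint.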
#10The role of higher-level linguistic features in HMM-based speech synthesis
Oliver Watts (Centre for Speech Technology Research, University of Edinburgh, UK)
Junichi Yamagishi (Centre for Speech Technology Research, University of Edinburgh, UK)
Simon King (Centre for Speech Technology Research, University of Edinburgh, UK)
We analyse the contribution of higher-level elements of the linguistic specification of a data-driven speech synthesiser to the naturalness of the synthetic speech which it generates. The system is trained using various subsets of the full feature-set, in which features relating to syntactic category, intonational phrase boundary, pitch accent and boundary tones are selectively removed. Utterances synthesised by the different configurations of the system are then compared in a subjective evaluation of their naturalness. The work presented forms background analysis for an on-going set of experiments in performing text-to-speech (TTS) conversion based on shallow features: features that can be trivially extracted from text. By building a range of systems, each assuming the availability of a different level of linguistic annotation, we obtain benchmarks for our on-going work.
#11HMM-based singing voice synthesis system using pitch-shifted pseudo training data
Ayami Mase (Nagoya Institute of Technology)
Keiichiro Oura (Nagoya Institute of Technology)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)
A statistical parametric approach to singing voice synthesis based on hidden Markov models (HMMs) has grown over the last few years. In this approach, the spectrum, excitation, and duration of singing voices are simultaneously modeled by context-dependent HMMs, and waveforms are generated from the HMMs themselves. However, pitches which hardly appear in the training data cannot be generated properly, because the system cannot model their fundamental frequency (F0) contours. In this paper, we propose a technique for training HMMs using pitch-shifted pseudo data. Subjective listening test results show that the proposed technique improves the naturalness of the synthesized singing voices.
#12An unsupervised approach to creating web audio contents-based HMM voices
Jinfu Ni (Spoken language communication group, MASTAR project, National Institute of Information and Communications Technology, Japan)
Hisashi Kawai (Spoken language communication group, MASTAR project, National Institute of Information and Communications Technology, Japan)
This paper presents an approach toward the rapid creation of varied synthetic voices at low cost. It consists of amassing audio web content, extracting usable speech from it, transcribing the speech to surface text and performing phone-time alignment, and using the speech and transcripts to build HMM-based voices. A set of experiments was conducted to evaluate this approach. The results indicate that large volumes of audio content are available on the internet, although more than 33.3% of web radio data are unusable for building voices due to noise, music, and overlapping speakers. Among the 14 voices built from limited radio monologues in Japanese, three are fair (middle of the five-point scale) but two are bad (the lowest level). The influence of erroneous transcripts on voice quality is significant: in order to achieve fair voice quality with limited speech data, the phone and word accuracy of the speech transcriptions must be higher than 80% and 50%, respectively.
#13Conversational Spontaneous Speech Synthesis Using Average Voice Model
Tomoki Koriyama (Tokyo Institute of Technology)
Takashi Nose (Tokyo Institute of Technology)
Takao Kobayashi (Tokyo Institute of Technology)
This paper describes conversational spontaneous speech synthesis based on hidden Markov model (HMM). To reduce the amount of data required for model training, we utilize average-voice-based speech synthesis framework, which has been shown to be effective for synthesizing speech with arbitrary speaker's voice using a small amount of training data. We examine several kinds of average voice model using reading-style speech and/or conversational speech. We also examine an appropriate utterance unit for conversational speech synthesis. Experimental results show that the proposed two-stage model adaptation method improves the quality of synthetic conversational speech.

New Paradigms in ASR I

Time:Tuesday 10:00 Place:International Conference Room B Type:Poster
Chair:Hermann Ney
#1Mandarin Digit Recognition Assisted by Selective Tone Distinction
Xiao-Dong WANG (Information Technology Laboratory, Asahi Kasei Corporation)
Kunihiko OWA (Information Technology Laboratory, Asahi Kasei Corporation)
Makoto SHOZAKAI (Information Technology Laboratory, Asahi Kasei Corporation)
Continuous Mandarin digit recognition is an important function for providing a useful user interface in in-car applications. In this paper, as opposed to conventional N-best rescoring, we propose a direct modification approach on the 1-best recognition hypothesis using selective tone distinction. Experiments were performed on noisy speech at SNRs of 20dB and 9dB. Over the baseline without tone information, our proposal achieved error reductions of 24%~27% at both SNRs, which is significantly better than the error reduction from 10-best rescoring. Moreover, the relatively constant error reduction across this wide range of SNRs demonstrates the robustness of our proposal.
#2Brazilian Portuguese Acoustic Model Training Based on Data Borrowing From Other Languages
Kazuhiko Abe (National Institute of Information and Communications Technology)
Sakriani Sakti (National Institute of Information and Communications Technology)
Ryosuke Isotani (National Institute of Information and Communications Technology)
Hisashi Kawai (National Institute of Information and Communications Technology)
Satoshi Nakamura (National Institute of Information and Communications Technology)
This paper presents an acoustic modeling method for Portuguese speech recognizers. To improve the acoustic model, data from other languages are used to offset the lack of training data. In using this data-borrowing approach, we select training data with consideration given to the influence of the other language. A simple solution is to minimize the volume of data borrowed. We developed a data selection strategy based on two principles: the Phonetic Frequency Principle and the Maximum Entropy Principle. Refining the acoustic model with this strategy improves word accuracy, especially for words that contain low-frequency phonemes.
#3Rapid Bootstrapping of five Eastern European languages using the Rapid Language Adaptation Toolkit
Ngoc Thang Vu (Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT))
Tim Schlippe (Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT))
Franziska Kraus (Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT))
Tanja Schultz (Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT))
This paper presents our latest efforts toward large vocabulary speech recognition systems for five Eastern European languages, namely Russian, Bulgarian, Czech, Croatian and Polish, using the Rapid Language Adaptation Toolkit (RLAT) [1]. We investigated the possibility of crawling large quantities of text material from the Internet, which is very cheap but requires text post-processing steps due to the varying text quality. The goal of this study is to determine the best strategy for language model optimization on the given domain in a short time period with minimal human effort. Our results show that we can build an initial ASR system for these five languages in only ten days using RLAT. On the multilingual GlobalPhone speech corpus [2] we achieved a Word Error Rate (WER) of 16.9% for Bulgarian, 23.5% for Czech, 20.4% for Polish, 32.8% for Croatian and 36.2% for Russian. [1] T. Schultz and A. Black. Rapid Language Adaptation Tools and Technologies for Multilingual Speech Processing. In: Proc. ICASSP, Las Vegas, NV, 2008. [2] T. Schultz. GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University. In: Proc. ICSLP, Denver, CO, 2002.
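The WER figures quoted above follow the standard definition: the word-level Levenshtein distance between hypothesis and reference, divided by the reference length. A minimal sketch (the example word sequences are invented, not from the corpus):

```python
# Word Error Rate as Levenshtein distance over words / reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("dobry den pane", "dobry den pan"))  # one substitution -> 1/3
```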
#4Cross-lingual Speaker Adaptation via Gaussian Component Mapping
Houwei Cao (Department of Electronic Engineering, The Chinese University of Hong Kong)
Tan Lee (Department of Electronic Engineering, The Chinese University of Hong Kong)
P. C. Ching (Department of Electronic Engineering, The Chinese University of Hong Kong)
This paper focuses on the use of acoustic information from an existing source language (Cantonese) to implement speaker adaptation for a new target language (English). Speaker-independent (SI) model mapping between Cantonese and English is investigated at different levels of acoustic units: phones, states, and Gaussian mixture components are used as the mapping units respectively. With the model mapping, cross-lingual speaker adaptation can be performed. The performance of the proposed cross-lingual speaker adaptation system is determined by two factors: model mapping effectiveness and speaker adaptation effectiveness. Experimental results show that model mapping effectiveness increases with the refinement of the mapping units, and that speaker adaptation effectiveness depends on model mapping effectiveness. Mapping between Gaussian mixture components proves effective for various speech recognition tasks. A relative error reduction of 10.12% on English words is achieved using a small amount (4 minutes) of Cantonese adaptation data, compared with the SI English recognizer.
#5Cross-Lingual Acoustic modeling for Dialectal Arabic Speech Recognition
Mohamed Elmahdy (German University in Cairo)
Rainer Gruhn (SVOX AG)
Wolfgang Minker (University of Ulm)
Slim Abdennadher (German University in Cairo)
A major problem with dialectal Arabic acoustic modeling is the sparsity of available speech resources. In this paper, we have chosen Egyptian Colloquial Arabic (ECA) as a typical dialect. In order to benefit from existing Modern Standard Arabic (MSA) resources, a cross-lingual acoustic modeling approach is proposed that is based on supervised model adaptation. MSA acoustic models were adapted using MLLR and MAP with an in-house collected ECA corpus. Phoneme-based and grapheme-based acoustic modeling were investigated. To make phoneme-based adaptation feasible, we normalized the phoneme sets of MSA and ECA. Since dialectal Arabic is mainly spoken, its graphemic form usually does not match the actual pronunciation as in MSA; a graphemic MSA acoustic model was therefore used to force-align and choose the correct ECA spelling from an automatically generated lexicon of spelling variants. Results show that the adapted MSA acoustic models outperformed acoustic models trained with only ECA data.
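As background for the adaptation step, the standard MAP update of a Gaussian mean interpolates between the prior (here, MSA) mean and the maximum-likelihood estimate from the adaptation (ECA) frames, weighted by the soft occupation count. A minimal sketch with invented values; `tau` is the usual relevance factor, and the paper's exact configuration is not reproduced:

```python
import numpy as np

# MAP update of one Gaussian mean: interpolate between the prior mean
# and the ML estimate from adaptation frames, weighted by occupancy.
def map_mean(mu_prior, frames, gamma, tau=10.0):
    gamma = np.asarray(gamma, dtype=float)    # frame posteriors
    frames = np.asarray(frames, dtype=float)
    occ = gamma.sum()                         # soft occupation count
    ml_mean = (gamma[:, None] * frames).sum(axis=0) / occ
    return (tau * mu_prior + occ * ml_mean) / (tau + occ)

mu_msa = np.zeros(2)            # invented prior (MSA) mean
eca_frames = np.ones((2, 2))    # invented adaptation (ECA) frames
print(map_mean(mu_msa, eca_frames, [1.0, 1.0], tau=2.0))  # [0.5 0.5]
```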
#6Cross-lingual and Multi-stream Posterior Features for Low-resource LVCSR Systems
Samuel Thomas (Johns Hopkins University)
Sriram Ganapathy (Johns Hopkins University)
Hynek Hermansky (Johns Hopkins University)
We investigate approaches for building large vocabulary continuous speech recognition (LVCSR) systems for new languages or new domains using limited amounts of transcribed training data. In these low-resource conditions, the performance of conventional LVCSR systems degrades significantly. We propose to train low-resource LVCSR systems with additional sources of information such as annotated data from other languages (German and Spanish) and various acoustic feature streams (short-term and modulation features). We train multilayer perceptrons (MLPs) on these sources of information and use Tandem features derived from the MLPs for low-resource LVCSR. In our experiments, the proposed system, trained using only one hour of English conversational telephone speech (CTS), provides a relative improvement of 11% over the baseline system.
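For context, the Tandem recipe takes MLP phone posteriors, applies a log to Gaussianize them, and decorrelates them (e.g. with PCA) before use as acoustic features. A toy sketch with scikit-learn on invented data; real systems train the MLPs on forced-aligned (here, cross-lingual) speech, not random vectors:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Invented stand-in for acoustic frames: 200 frames of 13-dim features
# with 3 "phone" classes (real labels come from forced alignments).
y = rng.integers(0, 3, size=200)
X = rng.normal(size=(200, 13)) + y[:, None]

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                    random_state=0).fit(X, y)
post = mlp.predict_proba(X)                  # per-frame phone posteriors
logpost = np.log(post + 1e-10)               # log Gaussianizes the posteriors
tandem = PCA(n_components=2).fit_transform(logpost)  # decorrelate / reduce
print(tandem.shape)                          # (200, 2)
```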
#7Latent Perceptual Mapping: A New Acoustic Modeling Framework for Speech Recognition
Shiva Sundaram (Deutsche Telekom Laboratories, Ernst-Reuter-Platz-7, Berlin 10587. Germany)
Jerome Bellegarda (Apple Inc., 3 Infinite Loop, Cupertino, 95014 California. USA.)
While hidden Markov modeling is still the dominant paradigm for speech recognition, in recent years there has been renewed interest in alternative, template-like approaches to acoustic modeling. Such methods sidestep usual HMM limitations as well as inherent issues with parametric statistical distributions, though typically at the expense of large amounts of memory and computing power. This paper introduces a new framework, dubbed latent perceptual mapping, which naturally leverages a reduced dimensionality description of the observations. This allows for a viable parsimonious template-like solution where models are closely aligned with perceived acoustic events. Context-independent phoneme classification experiments conducted on the TIMIT database suggest that latent perceptual mapping achieves results comparable to conventional acoustic modeling but at potentially significant savings in online costs.
#8Unsupervised model adaptation on targeted speech segments for LVCSR system combination
Richard Dufour (LIUM - University of Le Mans)
Fethi Bougares (LIUM - University of Le Mans)
Yannick Estève (LIUM - University of Le Mans)
Paul Deléglise (LIUM - University of Le Mans)
In the context of Large-Vocabulary Continuous Speech Recognition, systems can reach a high level of performance when dealing with prepared speech, while their performance drops on spontaneous speech. This decrease is due to the fact that these two kinds of speech are marked by strong acoustic and linguistic differences. Previous research has sought to detect and repair some peculiarities of spontaneous speech, such as disfluencies, and to create specific models to improve recognition accuracy; however, a large amount of data, expensive to collect, is needed to see improvements. In this paper, we present a solution that creates specialized acoustic and language models by automatically extracting a data subset containing spontaneous speech from the initial training corpus and adapting the initial acoustic and linguistic models on it. As we assume these models can be complementary, we propose to combine the general and adapted ASR system outputs. Experimental results show statistically significant gains, for a negligible cost (no additional training data and no human intervention).
#9Incremental word learning using large-margin discriminative training and variance floor estimation
Irene Ayllon Clemente (Research Institute for Cognition and Robotics, Bielefeld University, Germany)
Martin Heckmann (Honda Research Institute Europe GmbH, Offenbach am Main, Germany)
Alexander Denecke (Research Institute for Cognition and Robotics, Bielefeld University, Germany)
Britta Wrede (Research Institute for Cognition and Robotics, Bielefeld University, Germany)
Christian Goerick (Honda Research Institute Europe GmbH, Offenbach am Main, Germany)
We investigate incremental word learning in a hidden Markov model (HMM) framework suitable for human-robot interaction. In interactive learning, the tutoring time is a crucial factor. Hence our goal is to use as few training samples as possible while maintaining a good performance level. To adapt the states of the HMMs, different large-margin discriminative training strategies for increasing the separability of the classes are proposed. We also present a novel estimation of the variance floor for the case when very few training samples are available. Finally, our approach is successfully evaluated on isolated digits taken from the TIDIGITS database.
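As background, the conventional variance floor that such a novel estimator would replace simply clamps each per-dimension variance at a fraction of the global data variance, since with very few samples the per-state variances collapse toward zero. A minimal sketch (the factor 0.01 is a typical illustrative choice, not the paper's estimate):

```python
import numpy as np

# Baseline variance flooring: clamp each state's per-dimension variance
# at a fraction of the global data variance.
def floor_variances(state_vars, data, factor=0.01):
    floor = factor * data.var(axis=0)
    return np.maximum(state_vars, floor)

data = np.random.default_rng(1).normal(size=(500, 4))  # invented frames
state_vars = np.array([1e-6, 0.5, 1e-6, 2.0])  # degenerate, few samples
print(floor_variances(state_vars, data))
```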
#10State-based labelling for a sparse representation of speech and its application to robust speech recognition
Tuomas Virtanen (Department of Signal Processing, Tampere University of Technology, Finland)
Jort F. Gemmeke (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Antti Hurmalainen (Department of Signal Processing, Tampere University of Technology, Finland)
This paper proposes a state-based labelling for acoustic patterns of speech and a method for using this labelling in noise-robust automatic speech recognition. Acoustic time-frequency segments of speech, exemplars, are obtained from a training database and associated with time-varying state labels using the transcriptions. In the recognition phase, noisy speech is modelled by a sparse linear combination of noise and speech exemplars. The likelihoods of states are obtained by a linear combination of the exemplar weights, and can then be used to estimate the most likely state transition path. The proposed method was tested on the connected digit recognition task with noisy speech material from the Aurora-2 database, where it is shown to produce better results than the existing histogram-based labelling method.
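The core step (modelling a noisy observation as a sparse non-negative combination of labelled exemplars, then pooling exemplar weights into state scores) can be sketched with non-negative least squares on invented data; the paper's actual sparsity solver, exemplar set, and labelling are not reproduced:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
# Invented dictionary: 5 speech and 3 noise exemplars (columns) in a
# 20-dim magnitude-spectrogram-like space.
speech = np.abs(rng.normal(size=(20, 5)))
noise = np.abs(rng.normal(size=(20, 3)))
A = np.hstack([speech, noise])

# Observation = 2 * speech exemplar 1 + 1 * noise exemplar 0
obs = 2 * speech[:, 1] + noise[:, 0]

x, _ = nnls(A, obs)                  # sparse non-negative weights
speech_w = x[:5]                     # speech-exemplar activations

# Each speech exemplar carries a state label; state scores are the
# label-pooled sums of the activations.
labels = np.array([0, 1, 1, 2, 2])   # hypothetical HMM state per exemplar
state_scores = np.zeros(3)
np.add.at(state_scores, labels, speech_w)
print(state_scores.argmax())         # -> 1
```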
#11Similarity Scoring for Recognizing Repeated Out-of-Vocabulary Words
Mirko Hannemann (Brno University of Technology, Speech@FIT, Czech Republic)
Stefan Kombrink (Brno University of Technology, Speech@FIT, Czech Republic)
Martin Karafiat (Brno University of Technology, Speech@FIT, Czech Republic)
Lukas Burget (Brno University of Technology, Speech@FIT, Czech Republic)
We develop a similarity measure to detect repeatedly occurring Out-of-Vocabulary words (OOV), since these carry important information. Sub-word sequences in the recognition output from a hybrid word/sub-word recognizer are taken as detected OOVs and are aligned to each other with the help of an alignment error model. This model is able to deal with partial OOV detections and tries to reveal more complex word relations such as compound words. We apply the model to a selection of conversational phone calls to retrieve other examples of the same OOV, and to obtain a higher-level description of it such as being a derivation of a known word.
#12Data Pruning for Template-based Automatic Speech Recognition
Dino Seppi (ESAT, Katholieke Universiteit Leuven)
Dirk Van Compernolle (ESAT, Katholieke Universiteit Leuven)
In this paper we describe and analyze a data pruning method in combination with template-based automatic speech recognition. We demonstrate the positive effects of polishing the template database by minimizing word error rate scores. Data pruning allowed us to effectively reduce the database size, and therefore the model size, by an impressive 30%, with consequent benefits in computation time and memory usage.

Speech Production I: Various Approaches

Time:Tuesday 10:00 Place:International Conference Room C Type:Poster
Chair:G Ananthakrishnan
#1Speaking style dependency of formant targets
Akiko Amano-Kusumoto (Oregon Health & Science University)
John-Paul Hosom (Oregon Health & Science University)
Alexander Kain (Oregon Health & Science University)
Previous work on formant targets has assumed that these targets are independent of the speaking style. In this paper, we estimate consonant and vowel targets in a database of “clear” and “conversational” speech, using both style-independent and style-dependent models. The test-set errors and clustering of the estimated target values indicate that for this corpus, formant targets depend on the speaking style. As an application, vowel classification accuracy was tested with both style-independent and style-dependent models, based on observed formant values and estimated target values. Token-based style-independent classification shows greater accuracy for conversational speech (82.19%) than observed-value classification (73.97%).
#2Similarity of effects of emotions on the speech organ configuration with and without speaking
Tatsuya Kitamura (Konan University)
In this work we propose and verify a hypothesis on emotional speech production: emotions induce physical and physiological changes in the whole body including the speech organs, regardless of whether or not the person is speaking, and as a side effect, this changes the voice quality. To verify this hypothesis, we measured the speech organ configuration of actors simulating four emotions (neutral, hot anger, joy, and sadness) with and without speaking by MRI. The results showed that emotions affect the speech organ configuration, and the same tendency of changes was found regardless of whether or not the person was speaking.
#3A Study of Intra-Speaker and Inter-Speaker Affective Variability using Electroglottograph and Inverse Filtered Glottal Waveforms
Daniel Bone (Viterbi School of Engineering, University of Southern California, CA, USA)
Samuel Kim (Viterbi School of Engineering, University of Southern California, CA, USA)
Sungbok Lee (Department of Linguistics, University of Southern California, CA, USA)
Shrikanth Narayanan (Viterbi School of Engineering, University of Southern California, CA, USA)
It is well known that different speakers utilize their vocal instruments in diverse ways to express linguistic intention with some paralinguistic coloring such as emotional quality. The study of voice source features, which describe the action of the vocal folds, is important for a deeper understanding of emotion encoding in speech. In this study we investigate inter- and intra-speaker differences in voicing activity as a function of emotion using electroglottography (EGG) and an inverse filtering technique. Results demonstrate that while voice quality features are good indicators of affective state, voice source descriptors vary in affective information across speakers. Glottal ratio measurements taken directly from the EGG signal are more reliable than measurements from the inverse-filtered glottal airflow signal, but the spectral harmonic amplitude differences of the EGG are less useful than those from inverse filtering.
#4Modal analysis of vocal fold vibrations using laryngotopography
Ken-Ichi Sakakibara (Department of Communication Disorders, Health Sciences University of Hokkaido)
Hiroshi Imagawa (Department of Otolaryngology, University of Tokyo)
Miwako Kimura (Department of Otolaryngology, University of Tokyo)
Hisayuki Yokonishi (Department of Otolaryngology, University of Tokyo)
Niro Tayama (Department of Otolaryngology, Head and Neck Surgery, National Center for Global Health and Medicine)
In this paper, we propose a method for analyzing spatial characteristics of the larynx during phonation by high-speed digital imaging. Laryngotopography was applied to high-speed digital images of normal subjects and of patients with paralysis and cyst. The results reveal various modes of vocal fold vibration particular to the patients with paralysis and cyst, and demonstrate the usefulness of laryngotopography for clinical purposes.
#5Laryngeal Voice Quality in the Expression of Focus
Martti Vainio (University of Helsinki, Institute of Behavioural Sciences)
Matti Airas (Nokia Corp.)
Juhani Järvikivi (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
Paavo Alku (Department of Signal Processing and Acoustics, Aalto University, Finland)
Prominence relations in speech are signaled in various ways, including such phonetic means as voice fundamental frequency, intensity, and duration. A less studied acoustic feature affecting prominence is so-called voice quality, which is determined by changes in the airflow caused by different laryngeal settings. We investigated the changes in voice quality with respect to linguistic prosodic signaling of focus in simple three-word utterances. We used inverse filtering based methods for calculating and parametrizing the glottal flow in several different vowels and focus conditions. The results supported our hypothesis -- formed by an earlier study of voice quality changes in running speech -- that more prominent syllables are produced with a less tense voice quality and less prominent ones with a more tense phonation. We provide both physiological and linguistic explanations for the phenomena.
#6Laryngeal Characteristics during the Production of Geminate Consonants
Masako Fujimoto (Center for Corpus Development, National Institute for Japanese Language and Linguistics, Japan)
Kikuo Maekawa (Center for Corpus Development, National Institute for Japanese Language and Linguistics, Japan)
Seiya Funatsu (Science Information Center, Prefectural University of Hiroshima, Japan)
Analysis of high-speed digital video images showed that no apparent constriction or tension appeared in the larynx and glottis during the production of geminate consonants. Glottal width for geminate consonants is slightly, but not markedly, wider than for their singleton counterparts; rather, the degree depends largely on the consonant type. However, analysis of the photo-electric glottogram suggested that an interruption of the glottal opening movement and/or an abrupt cessation of the preceding vowel are involved in the production of geminate consonants.
#7Numerical study of turbulent flow-induced sound production in presence of a tooth-shaped obstacle: towards sibilant [s] physical modeling.
Julien Cisonni (The Center for Advanced Medical Engineering and Informatics, Osaka University, Japan)
Kazunori Nozaki (The Center for Advanced Medical Engineering and Informatics, Osaka University, Japan)
Annemie Van Hirtum (GIPSA-lab, UMR CNRS 5216, Grenoble Universities, France)
Shigeo Wada (The Center for Advanced Medical Engineering and Informatics, Osaka University, Japan)
The sound generated during the production of the sibilant [s] results from the impact of a turbulent jet on the incisors. Physical modeling of this phenomenon depends on the characterization of the properties of the turbulent flow within the vocal tract and of the acoustic sources resulting from the presence of an obstacle in the path of the flow. The properties of the flow-induced noise strongly depend on several geometric parameters whose influence has to be determined. In this paper, a simplified vocal tract/tooth geometric model is used to carry out a numerical study of the flow-induced noise generated by a tooth-shaped obstacle placed in a channel. The performed simulations reveal a link between the level of the generated noise and the aperture of the constriction formed by the obstacle.
#8Morphological and predictability effects on schwa reduction: The case of Dutch word-initial syllables
Iris Hanique (Radboud University Nijmegen, The Netherlands; Max Planck Institute for Psycholinguistics, The Netherlands)
Barbara Schuppler (Radboud University Nijmegen, The Netherlands)
Mirjam Ernestus (Radboud University Nijmegen, The Netherlands; Max Planck Institute for Psycholinguistics, The Netherlands)
This corpus-based study shows that the presence and duration of schwa in Dutch word-initial syllables are affected by a word’s predictability and its morphological structure. Schwa is less reduced in words that are more predictable given the following word. In addition, schwa may be longer if the syllable forms a prefix, and in prefixes the duration of schwa is positively correlated with the frequency of the word relative to its stem. Our results suggest that the conditions which favor reduced realizations are more complex than one would expect on the basis of the current literature.
#9Acoustic-to-Articulatory Inversion based on Local Regression
Samer Al Moubayed (Centre for Speech Technology, Royal Institute of Technology (KTH), Stockholm, Sweden)
Ananthakrishnan G (Centre for Speech Technology, Royal Institute of Technology (KTH), Stockholm, Sweden)
This paper presents an Acoustic-to-Articulatory inversion method based on local regression. Two types of local regression, a non-parametric and a local linear regression have been applied on a corpus containing simultaneous recordings of positions of articulators and the corresponding acoustics. A maximum likelihood trajectory smoothing using the estimated dynamics of the articulators is also applied on the regression estimates. The average root mean square error in estimating articulatory positions, given the acoustics, is 1.56 mm for the non-parametric regression and 1.52 mm for the local linear regression. The local linear regression is found to perform significantly better than regression using Gaussian Mixture Models using the same acoustic and articulatory features.
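The local linear regression idea can be sketched as an affine least-squares fit over the k nearest training frames in acoustic space, evaluated at the query point. The data below is an invented smooth acoustic-to-articulatory mapping, not EMA recordings, and the dimensions and k are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented stand-in: 300 "acoustic" frames (4-dim) paired with
# "articulator" positions (2-dim) via a smooth nonlinear map.
X = rng.uniform(-1, 1, size=(300, 4))
Y = np.stack([np.sin(X[:, 0]) + X[:, 1] ** 2, X[:, 2] * X[:, 3]], axis=1)

def local_linear(query, X, Y, k=20):
    # k nearest neighbours of the query in acoustic space
    idx = np.argsort(((X - query) ** 2).sum(axis=1))[:k]
    Xk = np.hstack([X[idx], np.ones((k, 1))])     # affine design matrix
    W, *_ = np.linalg.lstsq(Xk, Y[idx], rcond=None)
    return np.append(query, 1.0) @ W              # evaluate fit at query

q = np.array([0.1, 0.2, 0.3, 0.4])
pred = local_linear(q, X, Y)
print(pred)   # close to [sin(0.1)+0.04, 0.12]
```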
#10Korean lenis, fortis, and aspirated stops: Effect of place of articulation on acoustic realization
Mirjam Broersma (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
Unlike most of the world's languages, Korean distinguishes three types of voiceless stops, namely lenis, fortis, and aspirated stops. All occur at three places of articulation. In previous work, acoustic measurements are mostly collapsed over the three places of articulation. This study therefore provides acoustic measurements of Korean lenis, fortis, and aspirated stops at all three places of articulation separately. Clear differences are found among the acoustic characteristics of the stops at the different places of articulation.
#11Speech Synthesis by Modeling Harmonics Structure with Multiple Function
Toru Nakashika (Kobe University)
Ryuki Tachibana (IBM Research - Tokyo)
Masafumi Nishimura (IBM Research - Tokyo)
Tetsuya Takiguchi (Kobe University)
Yasuo Ariki (Kobe University)
In this paper, we present a new approach to speech synthesis in which speech utterances are synthesized using the parameters of a spectro-modeling function (Multiple function). With this approach, only the harmonic parts are extracted from the phoneme spectrum, and the time-varying spectrum corresponding to the harmonic or sinusoidal components is modeled using the Multiple function. We introduce two types of functions and present a method to estimate the parameters of each function from the observed phoneme spectrum. In the synthesis stage, speech signals are generated from the parameters of the Multiple function. The advantage of this method is that it requires only a few speech synthesis parameters. We discuss the effectiveness of our proposed method through experimental results.
#12Physics of Body-Conducted Silent Speech – Production, Propagation and Representation of Non-Audible Murmur
Makoto Otani (Faculty of Engineering, Shinshu University)
Tatsuya Hirahara (Faculty of Engineering, Toyama Prefectural University)
The physical nature of weak body-conducted vocal-tract resonance signals called non-audible murmur (NAM) was investigated using numerical simulation and acoustic analysis of NAM signals. Computational fluid dynamics simulation reveals that a weak vortex flow occurs in the supraglottal region when uttering NAM; the source of NAM is turbulent noise produced by this vortex flow. Furthermore, computational acoustics simulation reveals that NAM signals attenuate by 50 dB at 1 kHz, consisting of a 30-dB full-range attenuation due to air-to-body transmission loss and a –10-dB/octave spectral decay due to sound propagation loss within the body, which roughly agrees with the measurement results.

Speech Enhancement

Time:Tuesday 10:00 Place:International Conference Room D Type:Poster
Chair:Tetsuya Shimamura
#1Multichannel Noise Reduction using low order RTF estimate
Subhojit Chakladar (Seoul National University)
Nam Soo Kim (Seoul National University)
Yu Gwang Jin (Seoul National University)
Tae Gyoon Kang (Seoul National University)
The relative transfer function generalized sidelobe canceller (RTF-GSC) is a popular method for implementing multichannel speech enhancement. However, accurate estimation of channel transfer function ratios poses a challenge, especially in noisy environments. In this work, we demonstrate that even a very low order RTF estimate can give superior performance in terms of noise reduction without incurring excessive speech distortion. We show that noise reduction depends on the correlation between the input noise and the noise reference generated by the Blocking Matrix (BM), and that a low order RTF estimate preserves this correlation better than a high order one. The performance of high and low order RTF estimates is compared using output SNR, noise reduction, and a perceptual measure of speech quality.
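To illustrate what a "low order RTF estimate" means: the relative transfer function between two microphones can be approximated by a short FIR filter fitted by least squares, where the order is the number of taps. A toy time-domain sketch with an invented 3-tap relative response (the paper's estimator and STFT-domain details are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)
# Channel 2 is channel 1 passed through a short relative impulse response.
s = rng.normal(size=2000)
h_true = np.array([1.0, 0.5, -0.2])          # invented "true" 3-tap RTF
x1 = s
x2 = np.convolve(s, h_true)[:2000] + 0.01 * rng.normal(size=2000)

def estimate_rtf(x1, x2, order):
    # Least-squares FIR fit: x2[n] ~= sum_k h[k] * x1[n-k]
    cols = [np.concatenate([np.zeros(k), x1[:len(x1) - k]])
            for k in range(order)]
    A = np.stack(cols, axis=1)
    h, *_ = np.linalg.lstsq(A, x2, rcond=None)
    return h

h_low = estimate_rtf(x1, x2, order=3)        # low-order estimate
print(np.round(h_low, 2))                    # ~ [1.  0.5 -0.2]
```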
#2Reinforced Blocking Matrix with Cross Channel Projection for Speech Enhancement
Inho Lee (Department of Visual Information Processing Engineering, Korea University)
Jongsung Yoon (Department of Visual Information Processing Engineering, Korea University)
Yoonjae Lee (School of Electrical Engineering, Korea University)
Hanseok Ko (Department of Visual Information Processing Engineering, Korea University & School of Electrical Engineering, Korea University)
In this paper, we propose a reinforced Blocking Matrix for the TF-GSC by incorporating a cross channel projection for speech enhancement. The Transfer Function GSC (TF-GSC) proposed by Gannot was aimed at improving speech quality, but the desired speech signal becomes somewhat distorted since the reference signal resulting from the blocking matrix still contains a significant amount of the desired signal. The proposed reinforcement of the Blocking Matrix is a scheme to remove the highly correlated components between the inter-channel reference signals using orthogonal projection, thereby completely eliminating the desired signal. Representative experiments show that the proposed scheme is effective, and its strength is demonstrated in terms of improved averaged signal-to-noise ratio (SNR) and Log Spectral Distance (LSD).
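The orthogonal projection at the heart of such a scheme removes from a reference signal the component correlated with an estimate of the desired signal. A minimal sketch with invented signals (not the paper's multichannel formulation):

```python
import numpy as np

# Remove the desired-signal component from a blocking-matrix reference
# by projecting out the (estimated) desired signal s.
def project_out(reference, s):
    return reference - (reference @ s) / (s @ s) * s

rng = np.random.default_rng(0)
s = rng.normal(size=1000)                  # desired-speech estimate
leak = 0.3 * s                             # leakage into the reference
noise_ref = rng.normal(size=1000) + leak   # contaminated noise reference

cleaned = project_out(noise_ref, s)
print(abs(cleaned @ s))                    # ~0: desired component removed
```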
#3Masking Property Based Microphone Array Post-filter Design
Ning Cheng (1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; 2 The Chinese University of Hong Kong, Shatin, Hong Kong)
Wen-ju Liu (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
Lan Wang (1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; 2 The Chinese University of Hong Kong, Shatin, Hong Kong)
This paper presents a novel post-filter for noise reduction. A subspace based noise estimation method is developed with the use of multiple statistical distributions to model the speech and noise. The signal-plus-noise subspace dimension is determined by maximizing the target speech presence probability in noisy frames, so as to estimate the noise power spectrum for post-filter design. Then, masking property is incorporated in the post-filter technique for residual noise shaping. Experimental results show that the proposed scheme outperforms the baseline systems in terms of various quality measurements of the enhanced speech.
#4Reduction of Broadband Noise In Speech Signals by Multilinear Subspace Analysis
Yusuke Sato (Department of Mathematics, College of Science and Technology, Nihon University)
Tetsuya Hoya (Department of Mathematics, College of Science and Technology, Nihon University)
Hovagim Bakardjian (Laboratory for Advanced Brain Signal Processing, BSI RIKEN)
Andrzej Cichocki (Laboratory for Advanced Brain Signal Processing, BSI RIKEN)
A new noise reduction method for speech signals is proposed in this paper. The method is based upon the N-mode singular value decomposition algorithm, which exploits the multilinear subspace analysis of given speech data. Simulation results using both synthetically generated and real broadband noise components show that the enhancement quality obtained by the multilinear subspace analysis method, in terms of segmental gain and cepstral distance as well as informal listening tests, is superior to that of a conventional nonlinear spectral subtraction method and a previously proposed approach based upon sliding subspace projection.
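The N-mode SVD underlying such methods generalizes matrix SVD: each mode of the data tensor is projected onto its leading left singular vectors, and the tensor is reconstructed from the truncated core. A small numpy sketch on an invented low-multilinear-rank tensor plus noise (not the paper's speech data or exact algorithm):

```python
import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: move the mode to the front and flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_truncate(T, ranks):
    # Truncated higher-order SVD: project each mode onto its leading
    # left singular vectors, then reconstruct from the core.
    Us = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        Us.append(U[:, :r])
    def mode_mult(T, M, mode):   # mode-n product T x_n M
        return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0),
                                        axes=1), 0, mode)
    core = T
    for mode, U in enumerate(Us):
        core = mode_mult(core, U.T, mode)
    out = core
    for mode, U in enumerate(Us):
        out = mode_mult(out, U, mode)
    return out

rng = np.random.default_rng(0)
# Invented multilinear-rank-(2,2,2) tensor plus small "noise"
A = np.einsum('ir,jr,kr->ijk', rng.normal(size=(6, 2)),
              rng.normal(size=(7, 2)), rng.normal(size=(8, 2)))
T = A + 0.01 * rng.normal(size=A.shape)
D = hosvd_truncate(T, (2, 2, 2))
print(np.linalg.norm(D - A) / np.linalg.norm(A))  # small: noise suppressed
```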
#5Novel Probabilistic Control of Noise Reduction for Improved Microphone Array Beamforming
Jungpyo Hong (Korea Advanced Institute of Science and Technology)
Seungho Han (Korea Advanced Institute of Science and Technology)
Sangbae Jeong (Gyeongsang National University)
Minsoo Hahn (Korea Advanced Institute of Science and Technology)
In this paper, a novel speech enhancement algorithm is proposed. The algorithm controls the amount of noise reduction according to speech absence or presence in noisy environments. Based on the estimated speech absence probability (SAP), the amount of noise reduction is adaptively controlled. To calculate the SAP, the normalized cross correlation of linear predictive residual signals is utilized instead of that of the original input signals. This is especially robust and effective in reverberant and realistic environments. Experimental results show that the proposed algorithm improves speech recognition rates compared with conventional linearly constrained minimum variance beamforming.
#6Speech Enhancement Using Improved Generalized Sidelobe Canceller in Frequency Domain With Multi-channel Postfilter
Kai Li (Institute of Acoustics, Chinese Academy of Sciences)
Qiang Fu (Institute of Acoustics, Chinese Academy of Sciences)
Yonghong Yan (Institute of Acoustics, Chinese Academy of Sciences)
In this paper, we propose a speech enhancement algorithm featuring interaction between adaptive beamforming and a multi-channel postfilter. A novel subband feedback controller based on speech presence probability is applied to the Generalized Sidelobe Canceller algorithm to obtain more robust adaptive beamforming in adverse environments and alleviate the problem of signal cancellation. A multi-channel postfilter is used not only to further suppress diffuse noises and some transient interferences, but also to provide the speech presence probability in each subband. Experimental results show that the proposed algorithm achieves considerable improvement in preserving the desired speech in adverse noise environments, consisting of both directional and diffuse noises, over the comparative algorithms.
#7Close speaker cancellation for suppression of non-stationary background noise for hands-free speech interface
Jani Even (ATR-IRC)
Carlos Ishi (ATR-IRC)
Hiroshi Saruwatari (NAIST)
Norihiro Hagita (ATR-IRC)
This paper presents a noise cancellation method based on the ability to efficiently cancel a close target speaker's contribution from the signals observed at a microphone array. The proposed method exploits this specificity in the case of the hands-free speech interface. This method is in particular able to deal with non-stationary noise. The method can be divided into three steps. First, the steering vector pointing at the target user is estimated from the covariance of the observed signals. Then the noise estimate is obtained by cancelling the user's contribution. During this step the speech pauses are also estimated. Finally, a post-filter is used to suppress this estimated noise from the observed signals. The post-filter strength is controlled by using the estimated noise during the speech pauses as a reference. A 20k-word dictation task in the presence of non-stationary diffuse background noise at different SNR levels illustrates the effectiveness of the proposed method.
#8Multi-channel Iterative Dereverberation based on Codebook Constrained Iterative Multi-channel Wiener Filter
Ajay Srinivasamurthy (Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore-560012, India)
Thippur Sreenivas (Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore-560012, India)
A novel Multi-channel Iterative Dereverberation (MID) algorithm based on a Codebook Constrained Iterative Multi-channel Wiener Filter (CCIMWF) is proposed. We extend the classical iterative Wiener filter (IWF) to the multi-channel dereverberation case. The late reverberations are estimated using Long-term Multi-step Linear Prediction (LTMLP). This estimate is used in the CCIMWF framework through a doubly iterative formulation. A clean speech VQ codebook is effective for inducing intra-frame constraints and improving the convergence of the IWF; thus, a joint-CCIMWF algorithm is proposed for the multi-channel case. The signal-to-reverberation ratio (SRR) and log spectral distortion (LSD) measures improve through the double iterations, showing that the algorithm suppresses the effect of late reverberations and improves speech quality and intelligibility. The algorithm also has fair convergence properties across the iterations.
#9Speaker Dependent Mapping of Source and System features for Enhancement of Throat Microphone Speech
Anand Joseph Xavier Medabalimi (International Institute of Information Technology)
Sri Harish Reddy Mallidi (International Institute of Information Technology)
Yegnanarayana Bayya (International Institute of Information Technology)
A throat microphone (TM) produces speech which is perceptually poorer than that produced by a close speaking microphone (CSM). Many attempts at improving the quality of TM speech have been made by mapping the features corresponding to the vocal tract system. These techniques are limited by the methods used to generate the excitation signal. In this paper, a method to map the source (excitation) using multilayer feed-forward neural networks (MLFFNNs) is proposed for voiced segments. This method anchors the analysis windows at the regions around the instants of glottal closure, so that the non-linear characteristics of TM and CSM speech in these regions are emphasized in the mapping process. The features obtained from these regions for both TM and CSM speech are used to train an MLFFNN to capture the non-linear relation between them. An improved technique for mapping the system features is also proposed. Speech synthesized using the proposed techniques was evaluated through subjective tests and was found to be significantly better than TM speech.
#10An Analytic Modeling Approach to Enhancing Throat Microphone Speech Commands for Keyword Spotting
Jun Cai (Faculté des Sciences Appliquées, Université Libre de Bruxelles, Belgium)
Stefano Marini (Faculté des Sciences Appliquées, Université Libre de Bruxelles, Belgium)
Pierre Malarme (Faculté des Sciences Appliquées, Université Libre de Bruxelles, Belgium)
Francis Grenez (Faculté des Sciences Appliquées, Université Libre de Bruxelles, Belgium)
Jean Schoentgen (Faculté des Sciences Appliquées, Université Libre de Bruxelles, Belgium)
This research was carried out on enhancing throat microphone speech for noise-robust spoken keyword spotting. The enhancement was performed by mapping the log-energies in the Mel-frequency bands of throat microphone speech to those of the corresponding close-talk microphone speech. An analytic equation detection system, Eureqa, which can infer nonlinear relations directly from observed data, was used to identify the enhancement models. Speech recognition experiments with the enhanced throat microphone speech keywords indicate that the analytic enhancement models performed well in terms of recognition accuracy. Unvoiced consonants, however, could not be enhanced well enough, mostly because they were not effectively recorded by the throat microphone.
#11Single-channel speech enhancement using Kalman filtering in the modulation domain
Stephen So (Signal Processing Laboratory, Griffith University)
Kamil K. Wojcicki (Signal Processing Laboratory, Griffith University)
Kuldip K. Paliwal (Signal Processing Laboratory, Griffith University)
In this paper, we propose the modulation-domain Kalman filter (MDKF) for speech enhancement. In contrast to previous modulation-domain enhancement methods based on bandpass filtering, the MDKF is an adaptive and linear MMSE estimator that uses models of the temporal changes of the magnitude spectrum for both speech and noise. Also, because the Kalman filter is a joint magnitude and phase spectrum estimator, under non-stationarity assumptions, it is highly suited to modulation-domain processing, as modulation phase tends to contain more speech information than acoustic phase. Experimental results on the NOIZEUS corpus show the ideal MDKF (with clean speech parameters) to outperform all the acoustic- and time-domain enhancement methods that were evaluated, including the conventional time-domain Kalman filter with clean speech parameters. A practical MDKF that uses the MMSE-STSA method to enhance noisy speech in the acoustic domain prior to LPC analysis was also evaluated and showed promising results.
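As a much-reduced illustration of Kalman filtering applied to a spectral trajectory, the following sketch tracks a single magnitude trajectory with a scalar AR(1) state model (the actual MDKF uses LPC models of speech and noise modulation; the function name and parameters here are illustrative assumptions):

```python
import numpy as np

def kalman_track(y, a, q, r):
    """Scalar Kalman filter: state x_t = a*x_{t-1} + w_t (variance q),
    observation y_t = x_t + v_t (variance r). Returns filtered estimates."""
    x, p = y[0], 1.0
    out = []
    for obs in y:
        # predict step
        x, p = a * x, a * a * p + q
        # update step with Kalman gain k
        k = p / (p + r)
        x = x + k * (obs - x)
        p = (1.0 - k) * p
        out.append(x)
    return np.array(out)
```

In an MDKF-style system one such filter would run per frequency bin over the sequence of short-time magnitude values, with the state and noise variances derived from speech/noise modulation models rather than fixed as here.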
#12Integrated Feedback and Noise Reduction Algorithm In Digital Hearing Aids Via Oscillation Detection
Miao Yao (Institute of Microelectronics, Tsinghua University, China)
Weiqian Liang (Department of Electronic Engineering, Tsinghua University, China)
In this paper, an integrated feedback and noise reduction scheme for hearing aids is developed. The presented technique is based on adaptive feedback cancellation (AFC) and a generalized sidelobe canceller (GSC), with a band-limited adaptation method to improve the convergence behavior of both the AFC and the GSC. A band-pass pre-filter is applied to the AFC and a band-stop pre-filter is applied to the GSC to increase the proportion of the desired signal. An oscillation detector based on the zero-crossing rates of the autocorrelation of sub-band signals is designed to calculate the center frequency of the oscillation, making the band-limited adaptation more robust. Convergence analysis and computer simulation illustrate that the proposed algorithm effectively reduces both feedback and noise.
#13A blind signal-to-noise ratio estimator for high noise speech recordings
Charles Mercier (Université de Sherbrooke)
Roch Lefebvre (Université de Sherbrooke)
Blind estimation of the signal-to-noise ratio in noisy speech recordings is useful to enhance the performance of many speech processing algorithms. Most current techniques are efficient in low noise environments only, justifying the need for a high noise estimator, such as the one presented here. A pitch tracker robust in high noise was developed and is used to create a two-dimensional representation of the audio input. Signal-to-noise ratio estimation is then performed using an image processing algorithm, effectively combining the short-term and long-term properties of speech. The proposed technique is shown to perform accurately even in high noise situations.

Special Session: Open Vocabulary Spoken Document Retrieval

Time:Tuesday 10:00 Place:301 Type:Special
Chair:Seiichi Nakagawa & Kiyoaki Aikawa
10:00Constructing Japanese Test Collections for Spoken Term Detection
Yoshiaki Itoh (Iwate Prefectural University)
Hiromitsu Nishizaki (University of Yamanashi)
Xinhui Hu (NICT)
Hiroaki Nanjo (Ryukoku University)
Tomoyosi Akiba (Toyohashi University of Technology)
Tatsuya Kawahara (Kyoto University)
Seiichi Nakagawa (Toyohashi University of Technology)
Tomoko Matsui (The Institute of Statistical Mathematics)
Yoichi Yamashita (Ritsumeikan University)
Kiyoaki Aikawa (Tokyo University of Technology)
Spoken Document Retrieval (SDR) and Spoken Term Detection (STD) have been among the hottest topics in the spoken document processing community. TREC (Text Retrieval Conference) has dealt with SDR since 1996 [1], and NIST has already set up STD test collections and collected the results of participants [2]. Japanese spoken document processing has also needed such test collections for SDR and STD. We set up a working group for this purpose in SIG-SLP (Spoken Language Processing) of the Information Processing Society of Japan. The working group has constructed and offered a test collection for SDR [3]. We are now constructing new test collections for STD that are going to be open to researchers. This paper introduces the policy, the outline, and the schedule of the new test collections. Some comparison is performed with the NIST STD tasks.
10:15Japanese Spoken Term Detection Using Syllable Transition Network Derived from Multiple Speech Recognizers' Outputs
Satoshi Natori (University of Yamanashi)
Hiromitsu Nishizaki (University of Yamanashi)
Yoshihiro Sekiguchi (University of Yamanashi)
This paper proposes spoken term detection (STD) using a syllable transition network (STN) derived from multiple speech recognizers. An STN is similar to a sub-word based confusion network, which is derived from the output of a speech recognizer. The one we propose is derived from the outputs of multiple speech recognition systems, a combination well known to be robust to certain recognition errors and the out-of-vocabulary problem. Therefore, the STN should also be robust to recognition errors in STD. Our experiments showed that the STN was very effective at detecting out-of-vocabulary terms, improving the detection rate to 83%, which was as high as the in-vocabulary term detection performance.
10:30Combining Chinese Spoken Term Detection Systems via Side-information Conditioned Linear Logistic Regression
Sha Meng (Spoken Language Processing Group, LIMSI-CNRS, France)
Wei-Qiang Zhang (Tsinghua University)
Jia Liu (Tsinghua University)
This paper examines the task of Spoken Term Detection (STD) for the Chinese language. We propose to use Linear Logistic Regression (LLR) to combine various Chinese STD systems built with different decoding units, detection units, features and phone sets. In order to solve the missing-sample problem in STD system combination, side-information reflecting the reliability of the scores for fusion is used to condition the parameters of the standard LLR model. In addition, a two-stage combination solution is proposed to overcome the data-sparse problem. The experimental results show that the proposed methods improve the overall detection performance significantly. Compared with the best single system, a relative 11.3% improvement is achieved.
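Plain (unconditioned) linear logistic regression fusion of detection scores can be sketched as follows; the side-information conditioning and two-stage combination proposed in the paper are not shown, and all names and hyperparameters are illustrative:

```python
import numpy as np

def fit_llr(scores, labels, lr=0.1, iters=2000):
    """Fit w, b so that p(hit) = sigmoid(w . s + b), by gradient descent
    on the log-loss. scores: (N, K) detection scores from K systems."""
    X = np.asarray(scores, float)
    y = np.asarray(labels, float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y                      # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def fuse(scores, w, b):
    """Fused detection posterior for each candidate."""
    s = np.asarray(scores, float) @ w + b
    return 1.0 / (1.0 + np.exp(-s))
```

The learned weights play the role of a calibrated linear combination of the component STD systems' scores; in the paper, those weights are additionally conditioned on side-information about score reliability.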
10:45Metric Subspace Indexing for Fast Spoken Term Detection
Taisuke Kaneko (Toyohashi University of Technology)
Tomoyosi Akiba (Toyohashi University of Technology)
In this paper, we propose a novel indexing method for Spoken Term Detection (STD). The proposed method can be considered as using metric space indexing for the approximate string-matching problem, where the distance between a phoneme and a position in the target spoken document is defined. The proposed method does not require the use of thresholds to limit the output, instead being able to output the results in increasing order of distance. It can also deal easily with the multiple candidates obtained via Automatic Speech Recognition (ASR). The results of preliminary experiments show promise for achieving fast STD.
11:00Unsupervised Spoken-Term Detection with Spoken Queries Using Segment-based Dynamic Time Warping
Chun-an Chan (National Taiwan University)
Lin-shan Lee (National Taiwan University)
Spoken term detection is important for retrieval of multimedia and spoken content over the Internet. Because it is difficult to have acoustic/language models well matched to the huge quantities of spoken documents produced under various conditions, unsupervised approaches using frame-based dynamic time warping (DTW) have been proposed to compare the spoken query with spoken documents frame by frame. In this paper, we propose a new approach to unsupervised spoken term detection using segment-based DTW. Speech signals are segmented into sequences of acoustically similar segments using hierarchical agglomerative clustering, and a DTW procedure is formulated for segment sequences along with the clustering tree structures. In this way, the number of highly redundant parameters can be reduced, and the relatively unstable feature vectors can be replaced by more stable segments which describe the sequence of vocal tract states during the uttering process. Preliminary experiments indicate a large reduction of computation time as compared to frame-based DTW, although the slightly degraded detection performance implies much room for further improvements.
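A toy version of the segment-then-DTW idea might look like the following (a naive greedy segmenter stands in for the hierarchical agglomerative clustering that the paper actually uses; names and the threshold are illustrative):

```python
import numpy as np

def segment(frames, thresh=1.0):
    """Greedy segmentation: start a new segment when the current frame moves
    far from the running segment mean; return the segment mean vectors."""
    segs, cur = [], [frames[0]]
    for f in frames[1:]:
        m = np.mean(cur, axis=0)
        if np.linalg.norm(f - m) > thresh:
            segs.append(m)
            cur = [f]
        else:
            cur.append(f)
    segs.append(np.mean(cur, axis=0))
    return np.array(segs)

def dtw(a, b):
    """Standard DTW distance between two sequences of vectors."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Running DTW over segment means instead of raw frames is what yields the computation saving the abstract reports: the sequence lengths, and hence the DTW table, shrink roughly by the average segment length.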
11:15Contextual Verification for Open Vocabulary Spoken Term Detection
Daniel Schneider (Fraunhofer IAIS, Germany)
Timo Mertens (Norwegian University of Science and Technology, Norway)
Martha Larson (Delft University of Technology, Netherlands)
Joachim Köhler (Fraunhofer IAIS, Germany)
In spoken term detection, subword speech recognition is a viable means for addressing the out-of-vocabulary (OOV) problem at query time. Applying fuzzy error compensation techniques is needed for coping with inevitable recognition errors, but can lead to high false alarm rates especially for short queries. We propose two novel methods which reject false alarms based on the context of the hypothesized result and the distance to phonetically similar queries. Using the proposed methods, we obtain an increase in precision of 11% absolute at equal recall.
11:30Augmented set of features for confidence estimation in spoken term detection
Javier Tejedor (HCTLab-UAM)
Doroteo Torre (ATVS-UAM)
Miguel Bautista (ATVS-UAM)
Simon King (CSTR-University of Edinburgh)
Dong Wang (CSTR-University of Edinburgh)
Jose Colas (HCTLab-UAM)
Discriminative confidence estimation along with confidence normalisation has been shown to construct robust decision-maker modules in spoken term detection (STD) systems. Discriminative confidence estimation, making use of term-dependent features, has been shown to improve the widely used lattice-based confidence estimation in STD. In this work, we augment the set of these term-dependent features and show a significant improvement in STD performance, both in terms of ATWV and DET curves, in experiments conducted on a Spanish geographical corpus. This work also proposes a multiple linear regression analysis to carry out the feature selection. The most informative features derived from it are then used within the discriminative confidence estimation of the STD system.
11:45Cluster-Based Language Model for Spoken Document Retrieval Using NMF-based Document Clustering
Xinhui Hu (National Institute of Information and Communications Technology, Japan)
Ryosuke Isotani (National Institute of Information and Communications Technology, Japan)
Hisashi Kawai (National Institute of Information and Communications Technology, Japan)
Satoshi Nakamura (National Institute of Information and Communications Technology, Japan)
In this paper, a non-negative matrix factorization (NMF)-based document clustering approach is proposed for the cluster-based language model for spoken document retrieval. The retrieval language model comprises three different unigram models: a whole-collection-based unigram, a document-based unigram, and a document-clustering-based unigram. They are combined with double linear interpolation. Document clustering is realized via the NMF method; each document is assigned to the axis on which it has maximum projection in the latent semantic space derived by the NMF. The initialization of NMF, which is an important factor influencing NMF performance, is based on the clustered results of the K-means clustering approach. Using these approaches, retrieval experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ). It is found that the proposed method significantly outperforms the conventional vector space model (VSM); the maximum improvement of retrieval performance (mean average precision: MAP) exceeds 36%, outstripping the conventional query likelihood model, which achieves an improvement of 7.4%. It is also found that the proposed method surpasses the K-means clustering method when adequate initialization of NMF is used.
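The clustering step, factorizing the term-document matrix and assigning each document to its maximum-projection axis, can be sketched with basic multiplicative-update NMF (random initialization here, whereas the paper initializes from K-means results; all names are illustrative):

```python
import numpy as np

def nmf(V, k, iters=500, seed=0):
    """Multiplicative-update NMF: V ~= W @ H with nonnegative factors."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

def cluster_docs(term_doc, k):
    """Assign each document (column) to the latent axis with max projection."""
    W, H = nmf(term_doc, k)
    return np.argmax(H, axis=0)
```

Each cluster then contributes a cluster-level unigram, interpolated with the document-level and collection-level unigrams as described above.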

ASR: Language Modeling

Time:Tuesday 13:30 Place:Hall A/B Type:Oral
Chair:Renato De Mori
13:30Decoding with Shrinkage-based Language Models
Ahmad Emami (IBM T J Watson Research Center)
Stanley Chen (IBM T J Watson Research Center)
Abraham Ittycheriah (IBM T J Watson Research Center)
Hagen Soltau (IBM T J Watson Research Center)
Bing Zhao (IBM T J Watson Research Center)
In this paper we investigate the use of a class-based exponential language model when directly integrated into speech recognition or machine translation decoders. Recently, a novel class-based language model, Model M, was introduced and was shown to outperform regular n-gram models on moderate amounts of Wall Street Journal data. This model was motivated by the observation that shrinking the sum of the parameter magnitudes in an exponential language model leads to better performance on unseen data. In this paper we directly integrate the shrinkage-based language model into two different state-of-the-art machine translation engines as well as a large-scale dynamic speech recognition decoder. Experiments on official GALE and NIST development and evaluation sets show considerable and consistent improvement in both machine translation quality and speech recognition word error rate.
13:50Enhanced Word Classing for Model M
Stanley F. Chen (IBM T. J. Watson Research Center)
Stephen M. Chu (IBM T. J. Watson Research Center)
Model M is a superior class-based n-gram model that has shown improvements on a variety of tasks and domains. In previous work with Model M, bigram mutual information clustering has been used to derive word classes. In this paper, we introduce a new word classing method designed to closely match with Model M. The proposed classing technique achieves gains in speech recognition word-error rate of up to 1.1% absolute over the baseline clustering, and a total gain of up to 3.0% absolute over a Katz-smoothed trigram model, the largest such gain ever reported for a class-based language model.
14:10Improved Neural Network Based Language Modelling and Adaptation
Junho Park (University of Cambridge)
Xunying Liu (University of Cambridge)
Mark Gales (University of Cambridge)
Philip Woodland (University of Cambridge)
Neural network language models (NNLMs) have become an increasingly popular choice for large vocabulary continuous speech recognition (LVCSR) tasks, due to their inherent generalisation and discriminative power. This paper presents two techniques to improve the performance of standard NNLMs. First, the form of the NNLM is modified by introducing an additional output-layer node to model the probability mass of out-of-shortlist (OOS) words. An associated probability normalisation scheme is explicitly derived. Second, a novel NNLM adaptation method using a cascaded network is proposed. Consistent WER reductions were obtained on a state-of-the-art Arabic LVCSR task over conventional NNLMs. Further performance gains were also observed after NNLM adaptation.
14:30Recurrent neural network based language model
Tomas Mikolov (Speech@FIT, Brno University of Technology, Czech Republic)
Martin Karafiat (Speech@FIT, Brno University of Technology, Czech Republic)
Lukas Burget (Speech@FIT, Brno University of Technology, Czech Republic)
Jan Cernocky (Speech@FIT, Brno University of Technology, Czech Republic)
Sanjeev Khudanpur (Department of Electrical and Computer Engineering, Johns Hopkins University, USA)
A new recurrent neural network based language model (RNN LM) with applications to speech recognition is presented. Results indicate that it is possible to obtain around a 50% reduction of perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model. Speech recognition experiments show around an 18% reduction of word error rate on the Wall Street Journal task when comparing models trained on the same amount of data, and around 5% on the much harder NIST RT05 task, even when the backoff model is trained on much more data than the RNN LM. We provide ample empirical evidence to suggest that connectionist language models are superior to standard n-gram techniques, except for their high computational (training) complexity.
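A bare-bones recurrent (Elman-style) network language model can be sketched as below; to keep the sketch short, gradients are truncated at a single time step rather than using full backpropagation through time, and all hyperparameters are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_rnn_lm(seq, V, H=16, lr=0.1, epochs=30, seed=0):
    """Toy recurrent LM predicting the next symbol of an integer sequence.
    Returns per-epoch average cross-entropy (nats/symbol)."""
    rng = np.random.default_rng(seed)
    Wx = rng.normal(0.0, 0.1, (H, V))   # input (one-hot) -> hidden
    Wh = rng.normal(0.0, 0.1, (H, H))   # hidden -> hidden (recurrence)
    Wo = rng.normal(0.0, 0.1, (V, H))   # hidden -> output logits
    losses = []
    for _ in range(epochs):
        h = np.zeros(H)
        total = 0.0
        for t in range(len(seq) - 1):
            x, y = seq[t], seq[t + 1]
            h_new = np.tanh(Wx[:, x] + Wh @ h)
            p = softmax(Wo @ h_new)
            total -= np.log(p[y])
            dz = p.copy()
            dz[y] -= 1.0                           # d loss / d logits
            dh = (Wo.T @ dz) * (1.0 - h_new ** 2)  # back through tanh only
            Wo -= lr * np.outer(dz, h_new)
            Wx[:, x] -= lr * dh
            Wh -= lr * np.outer(dh, h)
            h = h_new
        losses.append(total / (len(seq) - 1))
    return losses
```

The recurrent hidden state is what distinguishes this class of models from n-grams: context of unbounded length is compressed into `h` rather than enumerated explicitly.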
14:50Discriminative Language Modeling Using Simulated ASR Errors
Preethi Jyothi (Department of Computer Science and Engineering, The Ohio State University, USA)
Eric Fosler-Lussier (Department of Computer Science and Engineering, The Ohio State University, USA)
In this paper, we approach the problem of discriminatively training language models using a weighted finite state transducer (WFST) framework that does not require acoustic training data. The phonetic confusions prevalent in the recognizer are modeled using a confusion matrix that takes into account information from the pronunciation model (word-based phone confusion log likelihoods) and information from the acoustic model (distances between the phonetic acoustic models). This confusion matrix, within the WFST framework, is used to generate confusable word graphs that serve as inputs to the averaged perceptron algorithm to train the parameters of the discriminative language model. Experiments on a large vocabulary speech recognition task show significant word error rate reductions when compared to a baseline using a trigram model trained with the maximum likelihood criterion.
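The averaged perceptron step, reranking confusable hypotheses so that the reference scores highest, can be sketched as follows (the WFST-based generation of confusable word graphs is assumed to have already produced per-hypothesis feature vectors; all names are illustrative):

```python
import numpy as np

def averaged_perceptron(candidates, iters=10):
    """candidates: list of (feats, gold) where feats is an (n_hyp, d) array of
    hypothesis feature vectors and gold is the index of the reference.
    Learn w so the gold hypothesis scores highest; return averaged weights."""
    d = candidates[0][0].shape[1]
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(iters):
        for feats, gold in candidates:
            pred = int(np.argmax(feats @ w))
            if pred != gold:
                # standard perceptron update toward the gold hypothesis
                w += feats[gold] - feats[pred]
            w_sum += w
    return w_sum / (iters * len(candidates))
```

Averaging the weight vector over all updates is the usual regularization trick that makes the perceptron-trained discriminative LM stable on held-out data.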
15:10Learning a Language Model from Continuous Speech
Graham Neubig (Graduate School of Informatics, Kyoto University)
Masato Mimura (Graduate School of Informatics, Kyoto University)
Shinsuke Mori (Graduate School of Informatics, Kyoto University)
Tatsuya Kawahara (Graduate School of Informatics, Kyoto University)
This paper presents a new approach to language model construction, learning a language model not from text, but directly from continuous speech. A phoneme lattice is created using acoustic model scores, and Bayesian techniques are used to robustly learn a language model from this noisy input. A novel sampling technique is devised that allows for the integrated learning of word boundaries and an n-gram language model with no prior linguistic knowledge. The proposed techniques were used to learn a language model directly from continuous, potentially large-vocabulary speech. This language model was able to significantly reduce the ASR phoneme error rate over a separate set of test data, and the proposed lattice processing and lexical acquisition techniques were found to be important factors in this improvement.

Speaker characterization and recognition II

Time:Tuesday 13:30 Place:201A Type:Oral
Chair:Haizhou Li
13:30Looking for relevant features for speaker role recognition
Benjamin Bigot (IRIT- Université de Toulouse)
Isabelle Ferrané (IRIT- Université de Toulouse)
Julien Pinquier (IRIT- Université de Toulouse)
Régine André-Obrecht (IRIT- Université de Toulouse)
When listening to foreign radio or TV programs, we are able to pick up some information from the way people are interacting with each other, and easily identify the most dominant speaker or the person who is interviewed. Our work relies on the existence of clues about speaker roles in acoustic and prosodic low-level features extracted from audio files and from speaker segmentations. In this paper we describe an original language-independent method which achieves the recognition of 5 roles (Anchor, Journalist, Other, Punctual Journalist, Punctual Other) with an accuracy of 85% on a 13-hour corpus composed of 46 documents, among which can be found different radio shows. A feature selection method is exploited in order to highlight the most relevant features for every speaker role.
13:50Prosodic Speaker Verification using Subspace Multinomial Models with Intersession Compensation
Marcel Kockmann (Brno University of Technology)
Lukas Burget (Brno University of Technology)
Ondrej Glembek (Brno University of Technology)
Luciana Ferrer (Speech Technology and Research Laboratory, SRI International)
Honza Cernocky (Brno University of Technology)
We propose a novel approach to modeling prosodic features. Inspired by the Joint Factor Analysis (JFA) model, our model is based on the same idea of introducing a subspace of model parameters. However, the underlying Gaussian mixture distribution of JFA is replaced by a multinomial distribution to model sequences of discrete units rather than continuous features. In this work, we use the subspace model as a feature extractor for support vector machines (SVMs), similar to the recently proposed JFA in total variability space. We show the capability to reduce high-dimensional count vectors to low dimension while keeping system performance stable. With additional intersession compensation, we improve by 30% relative to the baseline system and reach an equal error rate of 8.8% on the NIST 2006 SRE dataset.
14:10The Estimation and Kernel Metric of Spectral Correlation for Text-Independent Speaker Verification
Eryu Wang (iFly Speech Lab, University of Science and Technology of China, China&Institute for Infocomm Research,Agency for Science, Technology and Research (A*STAR), Singapore)
Kong Aik Lee (Institute for Infocomm Research,Agency for Science, Technology and Research (A*STAR), Singapore)
Bin Ma (Institute for Infocomm Research,Agency for Science, Technology and Research (A*STAR), Singapore)
Haizhou Li (Institute for Infocomm Research,Agency for Science, Technology and Research (A*STAR), Singapore)
Wu Guo (iFly Speech Lab, University of Science and Technology of China, China)
Lirong Dai (iFly Speech Lab, University of Science and Technology of China, China)
Gaussian mixture models (GMMs) are commonly used in text-independent speaker verification for modeling the spectral distribution of speech. Recent studies have shown the effectiveness of characterizing speaker information using just the mean vectors of the GMM in conjunction with support vector machines (SVMs). This paper advocates the use of spectral correlation captured by covariance matrices, and investigates its effectiveness compared to and in complement with the mean vectors. We examine two approaches, i.e., homoscedastic and heteroscedastic modeling, in estimating the spectral correlation. We introduce two kernel metrics, i.e., Frobenius angle and log-Euclidean inner product, for measuring the similarity between speech utterances in terms of spectral correlation. Experiments conducted on the NIST 2006 speaker verification task show that approximately 10% improvement is achieved by using the spectral correlation in conjunction with the mean vectors.
14:30Improving Monaural Speaker Identification by Double-Talk Detection
Rahim Saeidi (School of Computing, University of Eastern Finland)
Pejman Mowlaee (Dept. of Electronic Systems, Aalborg University, Denmark)
Tomi Kinnunen (School of Computing, University of Eastern Finland)
Zheng-Hua Tan (Dept. of Electronic Systems, Aalborg University, Denmark)
Mads Græsbøll Christensen (Dept. of Media Technology, Aalborg University, Denmark)
Søren Holdt Jensen (Dept. of Electronic Systems, Aalborg University, Denmark)
Pasi Fränti (School of Computing, University of Eastern Finland)
This paper describes a novel approach to improving monaural speaker identification where two speakers are present in a single-microphone recording. The goal is to identify both of the underlying speakers in the given mixture. The proposed approach is composed of a double-talk detector (DTD) as a pre-processor and a speaker identification back-end. We demonstrate that including the double-talk detector improves the speaker identification accuracy. Experiments on the GRID corpus show that including the DTD improves average recognition accuracy from 96.53% to 97.43%.
14:50Exploring subsegmental and suprasegmental features for a text-dependent speaker verification in distant speech signals
Avinash B. (International Institute of Information Technology, Hyderabad, India)
Guruprasad S. (Department of Computer Science and Engineering, Indian Institute of Technology Madras, India)
Yegnanarayana B. (International Institute of Information Technology, Hyderabad, India)
Existing automatic speaker verification (ASV) systems perform with high accuracy when the speech signal is collected close to the mouth of the speaker (< 1 ft). However, the performance of these systems reduces significantly when speech signals are collected at a distance from the speaker (2-6 ft). The objective of this paper is to address some issues in the processing of speech signals collected at a distance from the speaker, for text-dependent ASV system. An acoustic feature derived from short segments of speech signals is proposed for the ASV task. The key idea is to exploit the high signal-to-noise nature of short segments of speech in the vicinity of impulse-like excitations. We show that the proposed feature yields better performance of speaker verification than the mel-frequency cepstral coefficients (MFCCs). In addition, regions of high signal-to-reverberation ratio, duration and pitch information are used to improve the performance of the ASV system for distant speech.
15:10A Fast Implementation of Factor Analysis for Speaker Verification
Qingsong Liu (University of Science and Technology of China)
Wei Huang (Shanda Innovation Institute)
Dongxing Xu (Shanda Innovation Institute)
Hongbin Cai (Shanda Innovation Institute)
Beiqian Dai (University of Science and Technology of China)
The problem of session variability in text-independent speaker verification has been tackled actively for a few years. The factor analysis approach has been successfully applied to solving the session variability problem. However, it suffers from a large amount of computational overhead. In this paper, a fast implementation of the factor analysis approach with GMM Gaussian pre-selection is proposed. In our method, the EM statistics are calculated using only the Gaussians selected by a cluster UBM, to improve the speed of estimating the factor analysis model. Experimental results on the NIST SRE 2006 evaluation show that the presented approach can provide as much as a 7 to 8x speedup over the baseline factor analysis system with similar performance.

Single-channel speech enhancement

Time:Tuesday 13:30 Place:201B Type:Oral
Chair:Tan Lee
13:30Fast converging iterative Kalman filtering for speech enhancement using long and overlapped tapered windows with large side lobe attenuation
Stephen So (Signal Processing Laboratory, Griffith University)
Kuldip K Paliwal (Signal Processing Laboratory, Griffith University)
In this paper, we propose an iterative Kalman filtering scheme that has faster convergence and introduces less residual noise, when compared with the iterative scheme of Gibson et al. This is achieved via the use of long and overlapped frames, as well as a tapered window with a large side lobe attenuation for linear prediction analysis. We show that the Dolph-Chebychev window with -200 dB side lobe attenuation tends to enhance the dynamic range of the formant structure of speech corrupted with white noise, reduce prediction error variance bias, and provide some spectral smoothing, while the long overlapped frames provide reliable autocorrelation estimates and temporal smoothing. Speech enhancement experiments on the NOIZEUS corpus show that the proposed method outperformed conventional iterative and non-iterative Kalman filters, as well as other enhancement methods such as MMSE-STSA and PSC.
13:50Robust Noise Estimation Using Minimum Correction with Harmonicity Control
Xuejing Sun (Cambridge Silicon Radio)
Kuan-Chieh Yen (Cambridge Silicon Radio)
Rogerio Alves (Cambridge Silicon Radio)
In this paper a new noise spectrum estimation algorithm is described for single-channel acoustic noise suppression systems. To achieve fast convergence during abrupt changes of the noise floor, the proposed algorithm uses a minimum correction module to adjust an adaptive noise estimator. The minimum search duration is controlled by a harmonicity module for improved noise tracking under continuous voicing conditions. Objective test results show that the proposed algorithm consistently outperforms competitive noise estimation methods.
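The minimum-correction idea can be sketched roughly as follows (a simplified illustration under assumed smoothing and window parameters, not the authors' algorithm; the harmonicity flag is taken as a given per-frame input):

```python
import numpy as np
from collections import deque

def track_noise_floor(frame_powers, harmonic, alpha=0.95,
                      base_win=30, voiced_win=120):
    """Minimum-controlled noise tracking: a slow recursive estimator is
    corrected downwards to the minimum of recent frame powers; the
    minimum-search window is lengthened while the frame is voiced
    (harmonic), so speech harmonics are not mistaken for noise."""
    noise = frame_powers[0]
    history = deque(maxlen=voiced_win)
    estimates = []
    for p, is_harm in zip(frame_powers, harmonic):
        history.append(p)
        win = voiced_win if is_harm else base_win
        recent_min = min(list(history)[-win:])
        noise = alpha * noise + (1.0 - alpha) * p   # slow adaptation
        noise = min(noise, recent_min)              # minimum correction
        estimates.append(noise)
    return estimates
```

A sudden drop in the noise floor is tracked immediately through the minimum correction, while voiced stretches only adapt over the longer window.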
14:10New Insights into Subspace Noise Tracking
Mahdi Triki (Philips Research Laboratories)
Various speech enhancement techniques rely on knowledge of the clean signal and noise statistics. In practice, however, these statistics are not explicitly available, and the overall enhancement accuracy critically depends on the estimation quality of the unknown statistics. The estimation of noise (and speech) statistics is particularly challenging under non-stationary noise conditions. In this respect, subspace-based approaches have been shown to provide a good tracking vs. final misadjustment tradeoff. Subspace-based techniques hinge critically on a rank-limited assumption for the speech DFT matrix and a sphericity assumption for the noise DFT matrix. The speech rank-limited assumption was previously experimentally tested and validated. In this paper, we investigate the structure of nuisance sources. We discuss the validity of the spherical assumption for a variety of nuisance sources (environmental noise, reverberation) and preprocessing (overlapping segmentation).
14:30Bias Considerations for Minimum Subspace Noise Tracking
Mahdi Triki (Philips Research Laboratories)
Kees Janse (Philips Research Laboratories)
Speech enhancement schemes generally rely on knowledge of the noise power spectral density. The estimation of these statistics is a particularly critical issue and a challenging problem under non-stationary noise conditions. In this respect, subspace-based approaches have been shown to allow for reduced estimation delay and to provide a good tracking vs. final misadjustment tradeoff. One key attribute for noise floor tracking is the estimation bias: an overestimate leads to over-suppression and more speech distortion, while an underestimate leads to a high level of residual noise. The present paper investigates the bias of the subspace-based scheme, and particularly the robustness of the bias compensation factor to the desired speaker characteristics and the input SNR.
14:50A Corpus-Based Approach to Speech Enhancement from Nonstationary Noise
Ming Ji (Queen's University Belfast)
Ramji Srinivasan (Queen's University Belfast)
Danny Crookes (Queen's University Belfast)
This paper addresses single-channel speech enhancement assuming difficulties in predicting the noise statistics. We describe an approach which aims to maximally extract the two features of speech - its temporal dynamics and speaker characteristics - to improve the noise immunity. This is achieved by recognizing long speech segments as whole units from noise. In the recognition, clean speech sentences, taken from a speech corpus, are used as examples. Experiments have been conducted on the TIMIT database for separating various types of nonstationary noise including song, music, and crosstalk speech. The new approach has demonstrated improved performance over conventional speech enhancement algorithms in both objective and subjective evaluations.
15:10Bandwidth Expansion of Speech Based on Wavelet Transform Modulus Maxima Vector Mapping
Chen Zhe (School of Electronic & Information Engineering, Dalian University of Technology, Dalian, China)
Cheng You-Chi (School of Electrical & Computer Engineering, Georgia Institute of Technology, Atlanta, USA)
Yin Fuliang (School of Electronic & Information Engineering, Dalian University of Technology, Dalian, China)
Lee Chin-Hui (School of Electrical & Computer Engineering, Georgia Institute of Technology, Atlanta, USA)
A novel approach to speech bandwidth expansion based on wavelet transform modulus maxima vector mapping is proposed. By taking advantage of the similarity of the modulus maxima vectors between narrowband and wideband wavelet-analyzed signals, a neural network mapping structure can be established to perform bandwidth expansion given only the narrowband version of speech. Since the proposed algorithm works on time-domain waveforms, it offers the flexibility of variable-length frame selection, which facilitates low delay and potentially data-dependent speech segment processing to further improve speech quality. Evaluations based on both objective and subjective measures show that the proposed bandwidth expansion approach results in high-quality synthesized wideband speech with little perceivable distortion from the original wideband speech signals.

Speech Synthesis IV: Miscellaneous Topics

Time:Tuesday 13:30 Place:302 Type:Oral
Chair:Jan van Santen
13:30Hidden Markov Models with Context-Sensitive Observations for Grapheme-to-Phoneme Conversion
Udochukwu Kalu Ogbureke (CNGL, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland)
Peter Cahill (CNGL, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland)
Julie Carson-Berndsen (CNGL, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland)
Hidden Markov models (HMMs) have proven useful in various aspects of speech technology from automatic speech recognition through speech synthesis, speech segmentation and grapheme-to-phoneme conversion to part-of-speech tagging. Traditionally, context is modelled at the hidden states in the form of context-dependent models. This paper constitutes an extension to this approach; the underlying concept is to model context at the observations for HMMs with discrete observations and discrete probability distributions. The HMMs emit context-sensitive discrete observations and are evaluated with a grapheme-to-phoneme conversion system.
13:50Evaluating a Dialog Language Generation System: Comparing the MOUNTAIN System to other NLG Approaches
Brian Langner (Carnegie Mellon University)
Stephan Vogel (Carnegie Mellon University)
Alan W Black (Carnegie Mellon University)
This paper describes the MOUNTAIN language generation system, a fully-automatic, data-driven approach to natural language generation aimed at spoken dialog applications. MOUNTAIN uses statistical machine translation techniques and natural corpora to generate human-like language from a structured internal language, such as a representation of the dialog state. We briefly describe the training process for the MOUNTAIN approach, and show results of automatic evaluation in a standard language generation domain: the METEO weather forecasting corpus. Further, we compare output from the MOUNTAIN system to several other NLG systems in the same domain, using both automatic and human-based evaluation metrics; our results show our approach is comparable in quality to other advanced approaches. Finally, we discuss potential extensions, improvements, and other planned tests.
14:10Active Appearance Models for Photorealistic Visual Speech Synthesis
Wesley Mattheyses (Vrije Universiteit Brussel)
Lukas Latacz (Vrije Universiteit Brussel)
Werner Verhelst (Vrije Universiteit Brussel)
The perceived quality of a synthetic visual speech signal greatly depends on the smoothness of the presented visual articulators. This paper explains how concatenative visual speech synthesis systems can apply active appearance models to achieve a smooth and natural visual output speech. By modeling the visual speech contained in the system's speech database, a diversification between the synthesis of the shape and the texture of the talking head is feasible. This allows the system to accurately balance between the articulation strength of the visual articulators and the signal smoothness of the visual mode in order to optimize the synthesis. To improve the synthesis quality, an automatic database normalization strategy has been designed that removes variations from the database which are not related to speech production. As was verified by a perception experiment, this normalization strategy significantly improves the perceived signal quality.
14:30Latent Affective Mapping: A Novel Framework for the Data-Driven Analysis of Emotion in Text
Jerome Bellegarda (Apple Inc.)
A necessary step in the generation of expressive speech synthesis is the automatic detection and classification of emotions most likely to be present in textual input. We have recently advocated [1] a new emotion analysis strategy leveraging two separate semantic levels: one that encapsulates the foundations of the domain considered, and one that specifically accounts for the overall affective fabric of the language. This paper expands this premise into a more general framework, dubbed latent affective mapping, to expose the emergent relationship between these two levels. Such connection in turn advantageously informs the emotion classification process. The benefits gained through a richer description of the underlying affective space are illustrated via an empirical comparison of two different mapping instantiations (latent affective folding and latent affective embedding) with more conventional techniques based on expert knowledge of emotional keywords and keysets.
14:50Native and Non-native Speaker Judgements on the Quality of Synthesized Speech
Anna Janska (IMPRS NeuroCom, University of Leipzig, Germany)
Robert Clark (CSTR, The University of Edinburgh, U.K.)
The difference between native speakers’ and non-native speakers’ naturalness judgements of synthetic speech is investigated. Similar/different judgements are analysed via a multidimensional scaling analysis and compared to mean opinion scores. It is shown that although the two groups generally behave in a similar manner, the variance of non-native speaker judgements is generally higher. While both groups of subjects can clearly distinguish natural speech from the best synthetic examples, the groups’ responses to different artefacts present in the synthetic speech can vary.
15:10Machine Learning for Text Selection with Expressive Unit-Selection Voices
Dominic Espinosa (The Ohio State University)
Michael White (The Ohio State University)
Eric Fosler-Lussier (The Ohio State University)
Chris Brew (The Ohio State University)
We show that a ranking model produced by machine learning outperforms two baselines when applied to the task of selecting texts for use in creating a unit-selection synthesis voice with good domain coverage. The model learns to predict the estimated utility of an utterance based on features relating it to the utterances selected so far and a corpus of target utterances. Our analyses indicate that our discriminative approach continues to work well even though the presence of rich prosodic and non-prosodic features significantly expands the search space beyond what has previously been handled by greedy methods.

Prosody: Basics & Applications

Time:Tuesday 13:30 Place:International Conference Room A Type:Poster
Chair:Nick Campbell
#1Acoustic Correlates of Meaning Structure in Conversational Speech
Alexei V. Ivanov (DISI, University of Trento, Italy)
Giuseppe Riccardi (DISI, University of Trento, Italy)
Sucheta Ghosh (DISI, University of Trento, Italy)
Sara Tonelli (FBK-IRST, Trento, Italy)
Evgeny Stepanov (DISI, University of Trento, Italy)
We are interested in the problem of extracting meaning structures from spoken utterances in human communication. In SLU systems, parsing of meaning structures is carried out over the word hypotheses generated by the ASR. This approach suffers from high word error rates and ad-hoc conceptual representations. In contrast, in this paper we aim at discovering meaning components from direct measurements of acoustic and non-verbal linguistic features. The meaning structures are taken from the frame semantics model proposed in FrameNet. We give a quantitative analysis of meaning structures in terms of speech features across human-human dialogs from the manually annotated LUNA corpus. We show that the acoustic correlations between pitch, formant trajectories, intensity and harmonicity and meaning features are statistically significant over the whole corpus, as well as relevant in classifying the target words evoked by a semantic frame.
#2HMM-based Prosodic Structure Model Using Rich Linguistic Context
Nicolas Obin (IRCAM)
Xavier Rodet (IRCAM)
Anne Lacheret (Modyco Lab., University of Paris-La Défense)
This paper presents a study on the use of deep syntactical features to improve prosody modeling. A French linguistic processing chain based on linguistic preprocessing, morpho-syntactical labeling, and deep syntactical parsing is used in order to extract syntactical features from an input text. These features are used to define more or less high-level syntactical feature sets. Such feature sets are compared on the basis of an HMM-based prosodic structure model. High-level syntactical features are shown to significantly improve the performance of the model (up to 21% error reduction combined with 19% BIC reduction).
#3Audiovisual Congruence and Pragmatic Focus Marking
Charlotte Wollermann (Institute of Communication Sciences, University of Bonn, Germany)
Bernhard Schröder (German Linguistics, University of Duisburg-Essen, Germany)
Ulrich Schade (Fraunhofer Institute for Communication, Information Processing and Ergonomics FKIE, Germany)
This paper presents an empirical study on the interplay between audio and visual information in pragmatic focus marking. Nine German speakers were instructed to read dialogues with embedded question-answer pairs and varied context regarding certainty and exhaustivity. Results show that H* accompanied by a raising of the eyebrows or head occurs significantly more often in the context intended to favour uncertainty and non-exhaustivity. Furthermore, when two noun phrases are coordinated, a higher number of audiovisually congruent realizations occurs in the context intended to favour certainty and exhaustivity, whereas audiovisually incongruent cues occur more often in the context intended to favour uncertainty and non-exhaustivity.
#4Redescribing Intonational Categories with Functional Data Analysis
Margaret Zellers (Research Centre for English & Applied Linguistics, University of Cambridge)
Michele Gubian (Centre for Language & Speech Technology, Radboud University)
Brechtje Post (Research Centre for English & Applied Linguistics, University of Cambridge)
Intonational research is often dependent upon hand-labeling by trained listeners, which can be prone to bias or error. We apply tools from Functional Data Analysis (FDA) to a set of fundamental frequency (F0) data to demonstrate how these tools can provide a less theory-dependent way of investigating F0 contours by allowing statistical analyses of whole contours rather than depending on theoretically-determined “important” parts of the signal. The results of this analysis support the predictions of current intonational phonology while also providing additional information about phonetic variability in the F0 contours that these theories do not currently model.
#5Exploring goodness of prosody by diverse matching templates
Shen Huang (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Honagyan Li (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Shijin Wang (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Jiaen Liang (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Bo Xu (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
In automatic speech grading systems, little research has addressed the issue of GOR (Goodness Of pRosody). In this paper we propose a novel method that takes advantage of our QBH (Query By Humming) techniques from the 2008 MIREX evaluation task. A set of standard samples from the top-performing students is initially picked as templates; a cascade QBH structure is then built from two metrics: MOMEL stylization followed by DTW distance, and the Fujisaki model followed by EMD distance. Sentence GOR is obtained from the fused confidence between the target and each template, which is then weighted by a PRI factor at the passage level. Experimental results indicate that performance increases with the number of templates, and that the Fujisaki-EMD metric outperforms the MOMEL-DTW one in terms of correlation. Their combination can be treated as a template-based GOR score which, compensated with our previous feature-based GOR score, achieves 0.432 in correlation and 17.90% in EER on our corpus.
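The DTW distance used in the first metric can be illustrated with a standard implementation (a generic sketch over 1-D contours, not the authors' code):

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two 1-D sequences
    (e.g. stylized F0 contours), with absolute-difference local cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # best of: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because the warping path can stretch or compress either sequence, two contours with the same shape but different timing score near zero.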
#6A Language identification inspired method for spontaneous speech detection
Mickael Rouvier (CERI-LIA, University of Avignon)
Richard Dufour (LIUM, University of Le Mans)
Georges Linarès (CERI-LIA, University of Avignon)
Yannick Estève (LIUM, University of Le Mans)
Most spontaneous speech detection systems rely on disfluency analysis or on a combination of acoustic and linguistic features. This paper presents a method that considers spontaneous speech as a specific language which can be identified using language-recognition methods, such as shifted delta cepstrum parameters, dimensionality reduction by linear discriminant analysis, and a factor-analysis-based filtering process. Experiments are conducted on the French EPAC corpus. On a 3-level spontaneity task, this approach obtains a relative gain of about 22% in identification rate compared to the classical MFCC/GMM technique. We then combine these techniques with others previously proposed for spontaneous speech detection. Finally, the proposed system obtains a recognition rate of 65% on highly spontaneous speech segments.
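The shifted delta cepstrum parameterization borrowed from language recognition can be sketched as follows (the d, P, k values shown are common defaults in the literature, not necessarily those used in the paper):

```python
import numpy as np

def sdc(c, d=1, P=3, k=7):
    """Shifted delta cepstra: for each frame, stack k delta vectors
    computed over a +/-d span at offsets spaced P frames apart.
    c is a (T, N) array of cepstral frames; returns (T, N*k)."""
    T, N = c.shape
    clamp = lambda t: min(max(t, 0), T - 1)   # replicate edges
    out = np.zeros((T, N * k))
    for t in range(T):
        for i in range(k):
            delta = c[clamp(t + i * P + d)] - c[clamp(t + i * P - d)]
            out[t, i * N:(i + 1) * N] = delta
    return out
```

The stacked deltas give each frame a long temporal context (here about k*P frames), which is what makes the representation effective for language-style identification.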
#7Speech dominoes and phonetic convergence
Gérard Bailly (GIPSA-Lab, Grenoble-France)
Amélie Lelong (GIPSA-Lab, Grenoble-France)
Interlocutors are known to mutually adapt during conversation. Recent studies have questioned the adaptation of phonological representations and the kinematics of phonetic variables such as loudness, speech rate or fundamental frequency. Results are often contradictory, and the effectiveness of phonetic convergence during conversation is still an open issue. This paper describes an original experimental paradigm – a game played in primary schools known as verbal dominoes – that enables us to collect several hundred syllables uttered by both speakers in different conditions: alone, in ambient speech or in full interaction. Speech recognition techniques are then applied to globally characterize phonetic convergence, if any. We hypothesize here that convergence of phonetic representations such as vocalic dispersions is not immediate, especially when considering common words of the target language.
#8A Quick Sequential Forward Floating Feature Selection Algorithm for Emotion Detection from Speech
Mátyás Brendel (LIMSI-CNRS)
Riccardo Zaccarelli (LIMSI-CNRS)
Laurence Devillers (LIMSI-CNRS)
In this paper we present an improved Sequential Forward Floating Search algorithm. Extensive tests are carried out on a selection of French emotional language resources well suited for a first impression of general applicability. A detailed analysis is presented to test the various suggested modifications one by one. Our conclusion is that the modification of the forward step results in a considerable improvement in speed (~80%), while no considerable or systematic loss in quality is observed. The modifications of the backward step appear to matter only when a larger number of features is selected; the final clarification of this issue remains a task for future work. As a result we can suggest a quick feature selection algorithm that is practically better suited to the state of the art, larger corpora and wider feature banks. Our quick SFFS is general: it can also be used in any other field of application.
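The baseline SFFS procedure that the paper speeds up can be sketched as follows (a textbook-style illustration, not the authors' modified algorithm; the score function is a placeholder for, e.g., cross-validated classifier accuracy):

```python
def sffs(candidates, score, target_size):
    """Sequential Forward Floating Selection (sketch).
    `score` maps a feature subset (list) to a real number to maximise."""
    selected = []
    while len(selected) < target_size:
        # forward step: add the single best remaining feature
        rest = [f for f in candidates if f not in selected]
        best = max(rest, key=lambda f: score(selected + [f]))
        selected = selected + [best]
        # floating (backward) step: drop a feature if that raises the
        # score, never removing the feature that was just added
        improved = True
        while improved and len(selected) > 2:
            improved = False
            for f in selected:
                if f == best:
                    continue
                reduced = [g for g in selected if g != f]
                if score(reduced) > score(selected):
                    selected = reduced
                    improved = True
                    break
    return selected
```

The floating backward step is what dominates runtime on large feature banks, which is why the paper's modifications target both steps.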
#9Automated Vocal Emotion Recognition Using Phoneme Class Specific Features
Géza Kiss (Center for Spoken Language Understanding, Oregon Health & Science University)
Jan van Santen (Center for Spoken Language Understanding, Oregon Health & Science University)
Methods for automated vocal emotion recognition often use acoustic feature vectors that are computed for each frame in an utterance, and global statistics based on these acoustic feature vectors. However, at least two considerations argue for usage of phoneme class specific features for emotion recognition. First, there are well-known effects of phoneme class on some of these features. Second, it is plausible that emotion influences the speech signal in ways that differ between phoneme classes. A new method based on the concept of phoneme class specific features is proposed in which different features are selected for regions associated with different phoneme classes and then optimally combined, using machine learning algorithms. A small but significant improvement was found when this method was compared with an otherwise identical method in which features were used uniformly over different phoneme classes.
#10Feature Selection for Pose Invariant Lip Biometrics
Adrian Pass (Queen's University Belfast)
Jianguo Zhang (Queen's University Belfast)
Darryl Stewart (Queen's University Belfast)
For the first time in this paper we present results showing the effect of out of plane speaker head pose variation on a lip based speaker verification system. Using appearance DCT based features, we adopt a Mutual Information analysis technique to highlight the class discriminant DCT components most robust to changes in out of plane pose. Experiments are conducted using the initial phase of a new multi view Audio-Visual database designed for research and development of pose-invariant speech and speaker recognition. We show that verification performance can be improved by substituting higher order horizontal DCT components for vertical, particularly in the case of a train/test pose angle mismatch. We further show that the best performance can be achieved by combining this alternative feature selection with multi view training, reporting a relative 45% Equal Error Rate reduction over a common energy based selection.
#11Signal-Based Accent and Phrase Marking Using the Fujisaki Model
Hussein Hussein (Laboratory of Acoustics and Speech Communication, Dresden University of Technology, 01062 Dresden, Germany)
Automatic prosodic marking is important in speech signal processing, since its results are required in many areas, e.g. speech synthesis and speech recognition. The most important prosodic features at the linguistic level are the marking of accents and phrases. In this paper, we develop an automatic algorithm for marking accents and phrases which analyzes the F0 contour using the quantitative Fujisaki model. The results of the automatic extraction of accents and phrases have been compared to human labeling performance. The success rates of accent and phrase marking amount to 77.11% and 67.12%, respectively.
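The quantitative Fujisaki model used here decomposes the log-F0 contour into a baseline value, phrase commands and accent commands; a minimal sketch (illustrative parameter values, not those fitted in the paper):

```python
import numpy as np

def phrase_component(t, alpha=2.0):
    """Impulse response of the phrase-control mechanism:
    Gp(t) = alpha^2 * t * exp(-alpha*t) for t >= 0, else 0."""
    tc = np.clip(t, 0.0, None)
    return np.where(t >= 0, alpha ** 2 * tc * np.exp(-alpha * tc), 0.0)

def accent_component(t, beta=20.0, gamma=0.9):
    """Step response of the accent-control mechanism, ceiling-limited:
    Ga(t) = min(1 - (1 + beta*t)*exp(-beta*t), gamma) for t >= 0."""
    tc = np.clip(t, 0.0, None)
    g = 1.0 - (1.0 + beta * tc) * np.exp(-beta * tc)
    return np.where(t >= 0, np.minimum(g, gamma), 0.0)

def fujisaki_lnf0(t, fb, phrases, accents, alpha=2.0, beta=20.0):
    """ln F0(t) = ln Fb + sum of phrase commands (onset, amplitude)
    + sum of accent commands (onset, offset, amplitude)."""
    y = np.full_like(t, np.log(fb))
    for t0, ap in phrases:
        y += ap * phrase_component(t - t0, alpha)
    for t1, t2, aa in accents:
        y += aa * (accent_component(t - t1, beta)
                   - accent_component(t - t2, beta))
    return y
```

Marking accents and phrases then amounts to estimating the command onsets and amplitudes that best explain an observed F0 contour.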
#12A Study of Interplay between Articulatory Movement and Prosodic Characteristics in Emotional Speech Production
Jangwon Kim (Department of Electrical Engineering, USC)
Sungbok Lee (Department of Linguistics, USC)
Shrikanth Narayanan (Department of Electrical Engineering, USC)
This paper investigates the interplay between articulatory movement and voice source activity as a function of emotions in speech production. Our hypothesis is that humans use different modulation methods in which articulatory movements and prosodic modulations are differently weighted across different emotions. This hypothesis was examined by joint analysis of the two domains, using two statistical representations: (1) the sample distribution comparison using two-sigma ellipses of the articulatory speed statistics and prosodic feature (pitch or intensity) statistics, (2) the comparison of correlation coefficients. In the articulatory-prosodic spaces, we find (1) distinctive weighting patterns for angry and happy emotional speech and (2) distinctive correlation patterns depending on articulators and target emotions. These findings support the hypothesis that humans use different modulation methods of emphasizing articulatory motions and/or prosodic activities depending on emotion.

ASR: Feature Extraction I

Time:Tuesday 13:30 Place:International Conference Room B Type:Poster
Chair:Tetsuya Takiguchi
#1Improved Phoneme Recognition by Integrating Evidence from Spectro-temporal and Cepstral Features
Shang-wen Li (Graduate Institute of Communication Engineering, National Taiwan University, Taiwan)
Liang-che Sun (Graduate Institute of Communication Engineering, National Taiwan University, Taiwan)
Lin-shan Lee (Graduate Institute of Communication Engineering, National Taiwan University, Taiwan)
Gabor features have been proposed for extracting spectro-temporal modulation information, yielding significant improvements in recognition performance. In this paper, we propose the integration of Gabor posteriors with MFCC posteriors, yielding a relative improvement of 14.3% over an MFCC Tandem system. We analyze, for different types of acoustic units, the complementarity between Gabor features with long-term spectro-temporal modulation information in the mel-spectrogram and MFCC features with short-term temporal information in the cepstral domain. It is found that Gabor features are better for vowel recognition while MFCCs are better for consonants. This explains why their integration offers improvements.
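A spectro-temporal Gabor filter of the kind referred to here can be sketched as follows (a simplified real-valued kernel and brute-force filtering, not the authors' feature pipeline; kernel size and modulation frequencies are illustrative):

```python
import numpy as np

def gabor_kernel(size, omega_t, omega_f, sigma):
    """2-D Gabor kernel over (time, frequency): a cosine carrier with
    temporal/spectral modulation frequencies omega_t/omega_f, under a
    Gaussian envelope of width sigma."""
    t = np.arange(size) - size // 2
    f = np.arange(size) - size // 2
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-(T ** 2 + F ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(omega_t * T + omega_f * F)

def filter_spectrogram(S, kern):
    """Valid-mode 2-D correlation of a (time, frequency) spectrogram
    with one Gabor kernel; each output value is one filter response."""
    n = kern.shape[0]
    T, F = S.shape
    out = np.zeros((T - n + 1, F - n + 1))
    for i in range(T - n + 1):
        for j in range(F - n + 1):
            out[i, j] = np.sum(S[i:i + n, j:j + n] * kern)
    return out
```

A bank of such kernels at different (omega_t, omega_f) pairs responds selectively to different spectro-temporal modulation patterns in the mel-spectrogram.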
#2Using Spectro-Temporal Features to Improve AFE Feature Extraction for ASR
Suman Ravuri (International Computer Science Institute/University of California - Berkeley)
Nelson Morgan (International Computer Science Institute/University of California - Berkeley)
Previous work has shown that spectro-temporal features reduce WER for automatic speech recognition under noisy conditions. The spectro-temporal framework, however, is not the only way to process features in order to reduce errors due to noise in the signal. The two-stage mel-warped Wiener filtering method used in the "Advanced Front End" (AFE), now a standard front end for robust recognition, is another way. Since the spectro-temporal approach can be applied to a noise-reduced spectrum, we wanted to explore whether spectro-temporal features could improve the performance of AFE for ASR. We show that computing spectro-temporal features after AFE processing results in a 45% relative improvement compared to AFE in clean conditions and a 6% to 30% improvement in noisy conditions on the Aurora2 clean training setup.
#3Using Harmonic Phase Information to Improve ASR Rate
Ibon Saratxaga (Aholab Signal Processing Laboratory, University of the Basque Country)
Inma Hernáez (Aholab Signal Processing Laboratory, University of the Basque Country)
Igor Odriozola (Aholab Signal Processing Laboratory, University of the Basque Country)
Eva Navas (Aholab Signal Processing Laboratory, University of the Basque Country)
Iker Luengo (Aholab Signal Processing Laboratory, University of the Basque Country)
Daniel Erro (Aholab Signal Processing Laboratory, University of the Basque Country)
Spectral phase information is usually discarded in automatic speech recognition (ASR). The Relative Phase Shift (RPS), a novel representation of the phase information of the speech, has features which seem to be appropriate to improve the ASR recognition rate. In this paper we describe the RPS representation, discuss different ways to parameterize this information in a suitable way for the HMM modelling, and present the results of the evaluation experiments. WER improvements ranging from 12 to 22% open promising perspectives for the use of this information jointly with the classical MFCC parameterization. Index Terms: ASR, phase spectrum, harmonic analysis
#4Speech Recognition using Long-Term Phase Information
Kazumasa Yamamoto (Toyohashi University of Technology)
Eiichi Sueyoshi (Toyohashi University of Technology)
Seiichi Nakagawa (Toyohashi University of Technology)
Current speech recognition systems mainly use amplitude spectrum-based features such as MFCC for acoustic feature parameters, while discarding phase spectral information. The results of perceptual experiments, however, suggest that phase spectral information based on long-term analysis includes certain linguistic information. In this paper, we propose the use of phase features based on long-term analysis for speech recognition. We use two types of parameters: the delta phase parameter as a group delay, and analytic group delay features. Isolated word and continuous digit recognition experiments were performed, resulting in greater than 90% word or digit accuracy for each of the experiments. The experimental results confirmed that the long-term phase spectrum includes sufficient information for recognizing speech. Furthermore, combining the likelihoods of MFCC and long-term group delay cepstrum outperformed the MFCC baseline by 20% relative for clean speech.
#5Low-dimensional Space Transforms of Posteriors in Speech Recognition
Jan Zelinka (Department of Cybernetics, University of West Bohemia)
Jan Trmal (Department of Cybernetics, University of West Bohemia)
Ludek Muller (Department of Cybernetics, University of West Bohemia)
In this paper we present three novel posterior transforms whose primary goal is to achieve a large reduction of the feature vector size. The presented methods transform the posteriors to a 1-D or 2-D space. For such a high reduction ratio, the usually applied methods fail to keep the discriminative information; in contrast, the presented methods were specifically designed to retain most of it. In our experiments, we used several different combinations of commonly used feature extraction methods, i.e. the PLP features (augmented with delta and acceleration coefficients) and two kinds of MLP-ANN features: the bottleneck (BN) and posterior estimate (PE) features. The experiments were designed with special attention to the assessment of possible performance improvements when the PLP features are combined either with the BN features or with the PE features whose dimensionality was reduced using the proposed feature transforms. The performance of the designed transforms was tested on two different speech corpora: the telephone speech SpeechDat-East corpus and the multi-modal Czech Audio-Visual corpus.
#6Hierarchical Bottle Neck Features for LVCSR
Christian Plahl (RWTH Aachen University)
Ralf Schlüter (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)
This paper investigates the combination of different neural network topologies for probabilistic feature extraction. On the one hand, a five-layer neural network used in bottle neck feature extraction makes it possible to obtain an arbitrary feature size without dimensionality reduction by transform, independently of the training targets. On the other hand, a hierarchical processing technique is effective and robust over several conditions. Even though the hierarchical and bottle neck processing perform equally well, the combination of both topologies improves the system by 5% relative. Furthermore, the MFCC baseline system is improved by up to 20% relative. This behaviour was confirmed on two different tasks. In addition, we analyse the influence of multi-resolution RASTA filtering and long-term spectral features as input for the neural network feature extraction.
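The bottle neck idea, reading features off a narrow interior layer of a multi-layer network, can be sketched as follows (untrained random weights and hypothetical layer sizes, purely for illustration; in practice the net is first trained on phone targets):

```python
import numpy as np

def init_mlp(dims, seed=0):
    """Random weight matrices and zero biases for consecutive layers."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def bottleneck_features(x, layers, bottleneck_layer=2):
    """Forward pass through the net, stopping at the narrow layer and
    returning its activations as the feature vector."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = np.tanh(h @ W + b)
        if i + 1 == bottleneck_layer:
            return h
    return h
```

Because the feature size equals the width of the bottleneck layer, it can be chosen freely, independent of the number of training targets at the output layer.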
#7Hierarchical Neural Net Architectures for Feature Extraction in ASR
Frantisek Grezl (Brno University of Technology, Brno, Czech Republic)
Martin Karafiat (Brno University of Technology, Brno, Czech Republic)
This paper presents the use of a neural net hierarchy for feature extraction in ASR. The recently proposed Bottle-Neck feature extraction is extended and used in hierarchical structures to enhance the discriminative property of the features. Although many forms of hierarchical classification/feature extraction have been proposed, we restricted ourselves to using the outputs of the first-stage neural network together with its inputs. This approach is evaluated on meeting speech recognition using the RT'05 and RT'07 test sets. The evaluated hierarchical feature extraction brings consistent improvement over the use of just the first-level neural net.
#8Mutual Information analysis for feature and sensor subset selection in surface electromyography based speech recognition
Vivek Kumar Rangarajan Sridhar (Raytheon BBN Technologies)
Rohit Prasad (Raytheon BBN Technologies)
Prem Natarajan (Raytheon BBN Technologies)
In this paper, we investigate the use of surface electromyographic (sEMG) signals collected from articulatory muscles on the face and neck for performing automatic speech recognition. We present a systematic information-theoretic analysis for feature selection and optimal sensor subset selection. Our results indicate that Mel-cepstral features are best suited for sEMG-based discrimination. Further, the sensor subset ranking obtained through the mutual information experiments is consistent with the results obtained from hidden Markov model based recognition. The framework presented here can be used to determine the best feature and sensor subset for a given speaker a priori, instead of determining them a posteriori from recognition experiments. We achieve a mean recognition accuracy of 80.6% with the best 5-sensor subset chosen by the MI analysis, in comparison with 79.6% obtained using all the sensors.
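The mutual-information ranking of features described in this abstract can be illustrated in miniature. The sketch below is a generic empirical MI computation over discretized feature streams with invented toy data; it is not the paper's sEMG pipeline, and the feature names are hypothetical:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits between two
    discrete sequences of equal length."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def rank_features(features, labels):
    """Rank feature streams by their MI with the class labels,
    most discriminative first."""
    scored = [(name, mutual_information(vals, labels))
              for name, vals in features.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Toy data: feature "a" perfectly predicts the label; "b" is only
# weakly informative, so it ranks below "a".
labels = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "a": [0, 0, 1, 1, 0, 1, 0, 1],   # identical to labels -> 1 bit
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
}
ranking = rank_features(features, labels)
```

The same scoring extends to sensor subsets by computing MI between a joint (tuple-valued) feature stream and the labels.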
#9Learning from human errors: Prediction of phoneme confusions based on modified ASR training
Bernd T. Meyer (University of Oldenburg)
Birger Kollmeier (University of Oldenburg)
In an attempt to improve models of human perception, the recognition of phonemes in nonsense utterances was predicted with automatic speech recognition (ASR) in order to analyze its applicability for modeling human speech recognition (HSR) in noise. In a first set of experiments, several feature types are used as input for an ASR system; the resulting phoneme scores are compared to listening experiments using the same speech data. With conventional training, the highest correlation between predicted and measured recognition was observed for perceptual linear prediction features (r = 0.84). Second, a new training paradigm for ASR is proposed with the aim of improving the prediction of phoneme intelligibility. For this ‘perceptual training’, the original utterance labels are modified based on the confusions measured in HSR tests. The modified ASR training improved the overall prediction, with the best models (r = 0.89) exceeding those obtained with conventional training (r = 0.80).

Speech Perception II: Cross Language and Age

Time:Tuesday 13:30 Place:International Conference Room C Type:Poster
Chair:Robert Port
#1Speech Intelligibility of Diagonally Localized Speech with Competing Noise Using Bone-Conduction Headphones
Kazuhiro Kondo (Yamagata University)
Takayuki Kanda (Tohoku University)
Yosuke Kobayashi (Yamagata University)
Hiroyuki Yagyu (Tohoku University)
We investigated the speech intelligibility differences between normal and bone-conduction stereo headphones for target speech localized at 45 degrees on the horizontal plane when competing noise is present. This was part of our effort to study the possible effect of the crosstalk found in bone-conduction headphones on speech intelligibility. All sound sources were localized on the horizontal plane. Target speech was localized at 45 degrees diagonally relative to the listener, while the noise was localized at various azimuths and distances from the listener. The SNR was set to 0, -6, and -12 dB. We found little difference in intelligibility between headphone types, suggesting that crosstalk in bone-conduction headphones has a negligible effect on intelligibility.
#2Masking of vowel-analog transitions by vowel-analog distracters
Pierre Divenyi (VA Northern California Health Care System)
Single-formant, dynamically changing harmonic vowel analogs (a target with a single frequency excursion and a longer distracter with a different fundamental frequency and repeated excursions) were generated to assess informational and energetic masking of target transitions in young and elderly listeners. Results indicate the presence of informational masking that is significant only for formant excursions of sub-phonemic extent. Elderly listeners perform similarly to the young, except that they require a target/distracter ratio about 10 to 20 dB larger.
#32010, a speech oddity: Phonetic transcription of reversed speech
François Pellegrino (Laboratoire Dynamique Du Langage, CNRS – Université de Lyon, France)
Emmanuel Ferragne (CLILLAC-ARP - Université Paris 7, France)
Fanny Meunier (Laboratoire Dynamique Du Langage, CNRS – Université de Lyon, France)
Time reversal is often used in experimental studies on language perception and understanding, but little is known about its precise impact on speech sounds. Strikingly, some studies consider reversed speech chunks as “speech” stimuli lacking lexical information, while others use them as “non-speech” control conditions. The phonetic perception of reversed speech has not been thoroughly studied so far, and only impressionistic evaluations have been proposed. To fill this gap, we give here the results of a phonetic transcription task of time-reversed French pseudo-words performed by 4 expert phoneticians. Results show that for most phonemes (except unvoiced stops), several phonetic features are preserved by time reversal, leading to rather accurate transcriptions of reversed words. Other phenomena are also investigated, such as the emergence of epenthetic segments, and discussed with insights from the neurocognitive bases of the perception of time-varying sounds.
#4Perception on Pitch Reset at Discourse Boundaries
Hsin-Yi Lin (Graduate Institute of Linguistics, National Taiwan University)
Janice Fon (Graduate Institute of Linguistics, National Taiwan University)
This study investigates the role of pitch reset in discourse boundary perception. Previous production studies showed that pitch reset is a robust correlate of discourse boundaries: it not only signals boundary location but also reflects boundary size. In this study, we aim to investigate how listeners perceive and utilize this cue for boundary detection. Results showed that listeners’ perception of this cue corresponded to the patterns found in speech production. Moreover, evidence showed that what listeners rely on is the amount of reset, rather than the reset pitch height.
#5Effect of spatial separation on speech-in-noise comprehension in dyslexic adults
Marjorie Dole (Laboratoire Dynamique du Langage)
Michel Hoen (Stem-Cells and Brain Research Institute)
Fanny Meunier (Laboratoire Dynamique du Langage)
This study tested the use of binaural cues by adult dyslexic listeners during speech-in-noise comprehension. Participants listened to words presented in three different noise types (Babble-, Fluctuating- and Stationary-noise) in three different listening configurations: dichotic, monaural and binaural. In controls, we observed substantial informational masking in the monaural configuration, mostly attributable to linguistic interference. This was not observed with binaural noise, suggesting that this interference was suppressed by spatial separation. Dyslexic listeners showed a monaural deficit in Babble, but no deficit in binaural processing, suggesting compensation based on the use of spatial cues.
#6Speech categorization context effects in seven- to nine-month-old infants
Ellen Marklund (Stockholm University)
Francisco Lacerda (Stockholm University)
Anna Ericsson (Stockholm University)
Adults have been shown to categorize an ambiguous syllable differently depending on which sound precedes it. The present paper reports preliminary results from an on-going experiment investigating seven- to nine-month-olds’ sensitivity to non-speech contexts when perceiving an ambiguous syllable. The results suggest that the context effect is already present in infancy. Additional data are currently being collected, and full results will be presented at the conference.
#7Changes in Temporal Processing of Speech Across the Adult Lifespan
Diane Kewley-Port (Department of Speech and Hearing Sciences, Indiana University, Bloomington, Indiana)
Larry Humes (Department of Speech and Hearing Sciences, Indiana University, Bloomington, Indiana)
Daniel Fogerty (Department of Speech and Hearing Sciences, Indiana University, Bloomington, Indiana)
Speech is a rapidly varying signal. Temporal processing generally slows with age and many older adults experience difficulties in understanding speech. This research involved over 250 young, middle-aged and older listeners. Temporal processing abilities were assessed in numerous vowel sequence tasks, and analyses examined several factors that might contribute to performance. Significant factors included age and cognitive function as measured by the WAIS-III, but not hearing status for the audible vowels. In addition, learning effects were assessed by retesting two tasks. All groups significantly improved vowel temporal-order identification to a similar degree, but large differences in performance between groups were still observed.
#8Fluency and Structural Complexity as Predictors of L2 Oral Proficiency
Jared Bernstein (Knowledge Technologies, Pearson)
Cheng Jian (Knowledge Technologies, Pearson)
Masanori Suzuki (Knowledge Technologies, Pearson)
Automaticity and real-time aspects of performance are directly relevant to L2 spoken language proficiency. This paper analyzes data from L2 speakers of English and Spanish spread over a range of proficiency levels as identified by traditional holistic, rubric-based human ratings. In spontaneous speech samples from these L2 populations, we studied timed measures of spoken fluency (linguistic units per unit time) that co-vary with proficiency level and compared the timed measures to indices of the linguistic complexity of the same spoken material. Results indicate that duration-based fluency measures yield as much information about proficiency as structural complexity measures do, or more. These empirical findings suggest that expert perception of oral proficiency relates to automatic, real-time aspects of speaking and that the oral proficiency construct may be enriched by adding timing to its communicative/functional framework.
#9Semantic facilitation in bilingual everyday speech comprehension
Marco van de Ven (Max Planck Institute for Psycholinguistics, The Netherlands)
Benjamin V. Tucker (University of Alberta, Canada)
Mirjam Ernestus (Radboud University Nijmegen, The Netherlands; Max Planck Institute for Psycholinguistics, The Netherlands)
Previous research suggests that bilinguals presented with low- and high-predictability sentences benefit from semantics in clear but not in conversational speech [1]. In everyday speech, however, many words are not highly predictable. Previous research has shown that native listeners can also use more subtle semantic contextual information [2]. The present study reports two auditory lexical decision experiments investigating to what extent late Asian-English bilinguals benefit from subtle semantic cues in their processing of English unreduced and reduced speech. Our results indicate that these bilinguals are less sensitive to semantic cues than native listeners for both speech registers.
#10L2 Experience and Non-Native Vowel Categorization of L1-Mandarin Speakers
Bo-ren Hsieh (National Chiao Tung University, Taiwan)
Ho-hsien Pan (National Chiao Tung University, Taiwan)
This study investigates the effect of L2-English experience on the perception of the English tense-lax high vowel contrast. Experienced L1-Mandarin, inexperienced L1-Mandarin, and L1-English listeners identified and discriminated synthetic heed-hid and who’d-hood continua varying in five steps of F1, F2, and F3 and seven steps of duration. The results show a strong reliance on formant variations by L1-English listeners, a reliance on duration variations by the inexperienced L1-Mandarin listeners, and a more dominant reliance on formant cues than on duration cues by the experienced L1-Mandarin listeners.
#11Cross-lingual talker discrimination
Mirjam Wester (Centre for Speech Technology Research, University of Edinburgh, UK)
This paper describes a talker discrimination experiment in which native English listeners were presented with two sentences spoken by bilingual talkers (English/German and English/Finnish) and were asked to judge whether they thought the sentences were spoken by the same person or not. Equal numbers of cross-lingual and matched-language trials were presented. The experiments showed that listeners are able to complete this task well: they can discriminate between talkers significantly better than chance. However, listeners are significantly less accurate on cross-lingual trials than on matched-language pairs. No significant differences were found on this task between German and Finnish. Bias (B'') and sensitivity (A') values are presented to analyse the listeners' behaviour in more detail. The results are promising for the evaluation of EMIME, a project on speech-to-speech translation with speaker adaptation.
#12Dajare is not the lowest form of wit
Takashi Otake (E-Listening Laboratory)
The Japanese form of word play called dajare was investigated in the light of the concurrent multiple word activation mechanism proposed for spoken-word recognition. Analyses of a dajare database revealed distinct types of punning strategies, each of which can be seen as reflecting the activation mechanism. In the present study, spontaneous conversations compiled from a live Tokyo radio talk show provided the dajare evidence. These results in spontaneous Japanese again confirm the activation predictions, suggesting that dajare is not a low form of wit: instead, it is a clever exploitation of the natural availability of multiple words in spoken-word recognition.

SLP systems

Time:Tuesday 13:30 Place:International Conference Room D Type:Poster
Chair:Gary Geunbae Lee
#1Comparison of Methods for Topic Classification in a Speech-Oriented Guidance System
Rafael Torres (Graduate School of Information Science, Nara Institute of Science and Technology, Japan)
Shota Takeuchi (Graduate School of Information Science, Nara Institute of Science and Technology, Japan)
Hiromichi Kawanami (Graduate School of Information Science, Nara Institute of Science and Technology, Japan)
Tomoko Matsui (Department of Statistical Modeling, The Institute of Statistical Mathematics, Japan)
Hiroshi Saruwatari (Graduate School of Information Science, Nara Institute of Science and Technology, Japan)
Kiyohiro Shikano (Graduate School of Information Science, Nara Institute of Science and Technology, Japan)
This work addresses the topic classification of Japanese utterances received by a speech-oriented guidance system operating in a real environment. For this, we compare the performance of Support Vector Machines and PrefixSpan Boosting against a conventional Maximum Entropy classification method. We are interested in evaluating their robustness against automatic speech recognition (ASR) errors and the sparseness of the features present in spontaneous speech. To deal with the shortness of the utterances, we also propose using characters as features instead of words, which is possible in Japanese because of kanji: ideograms derived from Chinese characters that represent not only sound but also meaning. Experimental results show a classification performance improvement from 92.2% to 94.4% with a Support Vector Machine using character unigrams and bigrams as features, in comparison to the conventional method.
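Character unigrams and bigrams of the kind this abstract uses as classifier features can be extracted in a few lines. The sketch below is a generic bag-of-character-n-grams counter with an invented example string, not the paper's actual feature pipeline:

```python
from collections import Counter

def char_ngram_features(text, max_n=2):
    """Bag of character unigrams and bigrams for an utterance; with
    kanji, even single characters carry meaning, which is what makes
    character features viable for short Japanese utterances."""
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            feats[text[i:i + n]] += 1
    return feats

# Example: "図書館" ("library") yields 3 unigrams and 2 bigrams.
feats = char_ngram_features("図書館")
```

Such count dictionaries would then be fed to the classifier (e.g. an SVM) as sparse vectors.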
#2Using Dependency Parsing and Machine Learning for Factoid Question Answering on Spoken Documents
Pere R. Comas (TALP Research Center, Technical University of Catalonia (UPC))
Lluís Màrquez (TALP Research Center, Technical University of Catalonia (UPC))
Jordi Turmo (TALP Research Center, Technical University of Catalonia (UPC))
This paper presents our experiments in question answering for speech corpora, focusing on improving the answer extraction step of the QA process. We present two approaches to answer extraction that apply machine learning to improve the coverage and precision of the extraction. The first is a reranker that uses only lexical information; the second uses dependency parsing to score similarity between syntactic structures. Our experimental results show that the proposed learning models improve on our previous results, which used only hand-made ranking rules with limited syntactic information. We evaluate the system on manual transcripts of speech from the EPPS English corpus and a set of questions transcribed from spontaneous oral questions. This data belongs to the CLEF 2009 evaluation track on QA on speech transcripts (QAst).
#3A Spoken Term Detection Framework for Recovering Out-of-Vocabulary Words Using the Web
Carolina Parada (Johns Hopkins University)
Abhinav Sethy (IBM TJ Watson Research Center)
Mark Dredze (Johns Hopkins University)
Frederick Jelinek (Johns Hopkins University)
Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information rich terms (often named entities) and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We propose a novel approach to OOV recovery that uses a spoken term detection (STD) framework. Given an identified OOV region in the LVCSR output, we recover the uttered OOVs by utilizing contextual information and the vast and constantly updated vocabulary on the Web. Discovered words are integrated into the system output, recovering up to 40% of the OOV terms and resulting in a reduction in system error.
#4Improved Spoken Term Detection by Discriminative Training of Acoustic Models based on User Relevance Feedback
Hung-yi Lee (Graduate Institute of Communication Engineering, National Taiwan University)
Chia-ping Chen (Graduate Institute of Communication Engineering, National Taiwan University)
Ching-feng Yeh (Graduate Institute of Communication Engineering, National Taiwan University)
Lin-shan Lee (Graduate Institute of Communication Engineering, National Taiwan University)
In a previous paper, we proposed a new framework for spoken term detection that exploits user relevance feedback to estimate better acoustic model parameters for rescoring the spoken segments. In this way, the acoustic models can be trained with a criterion of better retrieval performance, and the retrieval performance becomes less dependent on the existence of a set of acoustic models well matched to the corpora to be retrieved. In this paper, a new set of objective functions for acoustic model training in this framework is proposed, taking into account the nature of the retrieval process and its performance measure, and discriminative training algorithms maximizing these objective functions are developed. Significant performance improvements were obtained in preliminary experiments.
#5A lightweight keyword and tag-cloud retrieval algorithm for automatic speech recognition transcripts
Sebastian Tschöpel (Fraunhofer IAIS)
Daniel Schneider (Fraunhofer IAIS)
The Fraunhofer IAIS AudioMining system for vocabulary-independent spoken term detection provides automatic speech recognition (ASR) transcripts for audio-visual data. These transcripts can be used to search for information, e.g., in audio-visual archives. We experienced difficulties when browsing for desired content with only these transcripts, especially since they contain ASR errors. Hence, we propose a lightweight and fast algorithm to retrieve keywords and tag-clouds from ASR transcripts to support content browsing. In contrast to similar algorithms, it calculates keywords ad hoc and query-dependently while searching a corresponding index. The proposed algorithm takes into account the relation between keywords and the search query, text weighting, and linguistic constraints. For visualization we chose a scalable tag-cloud. An evaluation yielded comparable precision-recall scores and promising usability ratings.
#6Lecture subtopic retrieval by retrieval keyword expansion using subordinate concept
Noboru Kanedera (Ishikawa National College of Technology, Japan)
Tetsuo Funada (Kanazawa University, Japan)
Seiichi Nakagawa (Department of Information and Computer Sciences, Toyohashi University of Technology, Japan)
We developed a system for supporting the creation of educational video content. The system automatically segments a lecture video into subtopics based on the speech signal, using a statistical model for text segmentation. In this paper, we report on the results of retrieving lecture subtopics via keyword expansion using dictionary knowledge and other resources. Keyword expansion using subordinate concepts improved the Mean Reciprocal Rank (MRR) from 0.51 to 0.55 when subtopics were retrieved using sets of three search keywords on lecture transcripts produced by automatic speech recognition.
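The MRR metric reported in this abstract averages the reciprocal of the rank at which the first correct subtopic appears, per query; a minimal computation (with made-up ranks, not the paper's data):

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over queries; rank is the 1-based position of
    the first relevant result, or None if it was never retrieved
    (contributing 0 to the sum, as is conventional)."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)

# Three queries whose correct subtopic appeared at ranks 1, 2, and 4:
# MRR = (1 + 1/2 + 1/4) / 3
mrr = mean_reciprocal_rank([1, 2, 4])
```

A jump from 0.51 to 0.55, as reported above, thus roughly corresponds to the correct subtopic moving up in rank for a fraction of the queries.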
#7Spoken Document Retrieval for Oral Presentations Integrating Global Document Similarities into Local Document Similarities
Hiroaki Nanjo (Faculty of Science and Technology, Ryukoku University, Japan)
Yusuke Iyonaga (Faculty of Science and Technology, Ryukoku University, Japan)
Takehiko Yoshimi (Faculty of Science and Technology, Ryukoku University, Japan)
A spoken document retrieval (SDR) method for oral presentations is addressed. We propose a method for integrating global and local information based on a topic hierarchy of presentations. Specifically, to detect a one- to two-minute part of an oral presentation (a local document), we integrate the similarities between a given query and longer units (global documents), for example a whole presentation, into the similarity between the given query and a local document contained in those global documents. On a short speech segment retrieval task over 604 hours of presentation speech, we confirmed a statistically significant improvement in information retrieval performance.
#8Combining word-based features, statistical language models, and parsing for named entity recognition
Joseph Polifroni (Nokia Research Center)
Stephanie Seneff (MIT-CSAIL)
As users become more accustomed to using their mobile devices to organize and schedule their lives, there is growing demand for applications that make that process easier. Automatic speech recognition technology has already been developed to enable an essentially unlimited vocabulary in a mobile setting. Understanding the words that are spoken is the next challenge. In this paper, we describe efforts to develop a dataset and classifier to recognize named entities in speech. Using sets of both real and simulated data, in conjunction with a very large set of real named entities, we created a challenging corpus of training and test data. We developed a multi-stage framework to parse these utterances and simultaneously tag names and locations. Our combined system achieved an f-measure of 0.87 on extracted proper nouns, with 95% accuracy in distinguishing names from locations.
#9Efficient combined approach for named entity recognition in spoken language
Azeddine Zidouni (LSIS-CNRS Lab. Aix-Marseille 2 University)
Sophie Rosset (LIMSI-CNRS Lab.)
Hervé Glotin (LSIS-CNRS Lab. Toulon University)
This paper focuses on the named entity recognition task in spoken data. The proposed approach investigates the use of various word contexts to improve recognition. Experimental results on French broadcast news speech, using conditional random fields (CRFs), show that semantic information generated by a symbolic analyzer outperforms the classical approach on reference transcriptions and is more robust on automatic speech recognition (ASR) output.
#10Prominence based scoring of speech segments for automatic speech-to-speech summarization
Sree Harsha Yella (International Institute of Information Technology, Hyderabad)
Vasudeva Varma (International Institute of Information Technology, Hyderabad)
Kishore Prahallad (International Institute of Information Technology, Hyderabad)
In order to perform speech summarization it is necessary to identify important segments in the speech signal. The importance of a speech segment can be effectively determined using information from lexical and prosodic features. Standard speech summarization systems depend on ASR transcripts or gold-standard human reference summaries to train a supervised system that combines lexical and prosodic features to select segments for the summary. We propose a method that uses the prominence values of syllables in a speech segment to rank the segment for summarization. The proposed method does not depend on ASR transcripts or gold-standard human summaries. Evaluation results showed that summaries generated by the proposed method are as good as those generated using tf*idf scores or a supervised system trained on gold-standard summaries.
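The tf*idf baseline this abstract compares against scores a segment by summing term frequency times inverse document frequency over its words. A minimal sketch with invented toy segments (treating each segment as one "document", which is one common convention, not necessarily the paper's exact setup):

```python
from collections import Counter
from math import log

def tfidf_scores(segments):
    """Score each segment (a list of words) by the sum of tf*idf of
    its words; words occurring in many segments get low idf."""
    n = len(segments)
    df = Counter()                 # document frequency per word
    for seg in segments:
        df.update(set(seg))
    scores = []
    for seg in segments:
        tf = Counter(seg)
        scores.append(sum(c * log(n / df[w]) for w, c in tf.items()))
    return scores

# Segment 0 is full of rare content words, so it scores highest.
segments = [
    ["budget", "deficit", "tax"],
    ["the", "the", "tax"],
    ["the", "the", "sunny"],
]
scores = tfidf_scores(segments)
```

A summarizer of this kind would then pick the top-scoring segments until a length budget is reached.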
#11Maximum Lexical Cohesion for Fine-Grained News Story Segmentation
Zihan Liu (School of Computer Science, Northwestern Polytechnical University, Xi'an, China)
Lei Xie (School of Computer Science, Northwestern Polytechnical University, Xi'an, China)
Wei Feng (School of Creative Media, City University of Hong Kong, Hong Kong SAR)
We propose a maximum lexical cohesion (MLC) approach to news story segmentation. Unlike sentence-dependent lexical methods, our approach is able to detect story boundaries at finer word/subword granularity, and thus is more suitable for speech recognition transcripts which have no sentence delimiters. The proposed segmentation goodness measure takes account of both lexical cohesion and a prior preference of story length. We measure the lexical cohesion of a segment by the KL-divergence from its word distribution to an associated piecewise uniform distribution. Taking account of the uneven contributions of different words to a story, the cohesion measure is further refined by two word weighting schemes, i.e. the inverse document frequency (IDF) and a new weighting method called difference from expectation (DFE). We then propose a dynamic programming solution to exactly maximize the segmentation goodness and efficiently locate story boundaries in polynomial time. Experimental results show that our MLC approach outperforms several state-of-the-art lexical methods.
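The exact dynamic program described in this abstract can be illustrated in miniature. The sketch below maximizes a goodness made of per-segment KL divergence from the document-wide word distribution (a stand-in cohesion term) minus a story-length penalty; it follows the paper's piecewise-uniform formulation and word weighting only loosely, and the data is invented:

```python
from collections import Counter
from math import log

def segment(words, pref_len=4, lam=2.5):
    """Exact O(n^2) DP over boundary positions. best[j] holds the top
    total goodness of words[:j]; each candidate segment scores its KL
    divergence from the document unigram distribution (high when its
    vocabulary is concentrated) minus a length-prior penalty."""
    n = len(words)
    doc = Counter(words)

    def goodness(i, j):
        seg = Counter(words[i:j])
        m = j - i
        kl = sum(c / m * log((c / m) / (doc[w] / n)) for w, c in seg.items())
        return kl - lam * abs(m - pref_len) / pref_len

    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - 2 * pref_len), j):
            s = best[i] + goodness(i, j)
            if s > best[j]:
                best[j], back[j] = s, i
    cuts, j = [], n                # recover boundaries via back-pointers
    while j > 0:
        cuts.append(j)
        j = back[j]
    return sorted(cuts)

# Two topically coherent halves: the DP should cut between them.
words = ["war", "troops", "war", "border", "rain", "storm", "rain", "flood"]
cuts = segment(words)
```

Without the length prior, KL-based cohesion alone would reward ever finer segmentation, which is why the prior is part of the goodness measure.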
#12Phoneme Lattice based TextTiling towards Multilingual Story Segmentation
Wang Xiaoxuan (Northwestern Polytechnical University)
Xie Lei (Northwestern Polytechnical University)
Ma Bin (Institute for Infocomm Research)
Chng Eng Siong (Nanyang Technological University)
Li Haizhou (Institute for Infocomm Research)
This paper proposes a phoneme lattice based TextTiling approach towards multilingual story segmentation. The phoneme is the smallest segmental unit in a language and the number of phonemes in a language is usually far smaller than the number of words. Furthermore, many phonemes are shared by different languages. These properties make phonemes particularly appropriate for representing multilingual speech. As phoneme recognition is far from perfect, phoneme lattices, which carry much richer statistics than the 1-best hypotheses, are adopted in this paper as the input to the TextTiling approach. The term frequencies used in traditional TextTiling are replaced by the expected counts of phoneme n-gram units calculated from phoneme lattices. Experiments on TDT2 English and Mandarin corpora show that the phoneme lattice based TextTiling outperforms the phoneme 1-best based TextTiling and word based TextTiling in broadcast news story segmentation.
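Replacing hard term counts with lattice expected counts changes TextTiling only at the counting step. In the toy sketch below the lattice is simplified to a per-slot dictionary of phoneme posteriors (hypothetical data, far simpler than a real lattice), and boundaries are the low-similarity valleys between adjacent windows:

```python
from math import sqrt

def expected_counts(slots):
    """Sum phoneme posteriors over lattice slots: soft 'term
    frequencies'. Each slot is a dict {phoneme: posterior}."""
    counts = {}
    for slot in slots:
        for ph, p in slot.items():
            counts[ph] = counts.get(ph, 0.0) + p
    return counts

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gap_similarities(slots, w=2):
    """TextTiling-style gap scores: cosine similarity between the
    expected-count vectors of the w slots before and after each gap."""
    sims = []
    for gap in range(w, len(slots) - w + 1):
        left = expected_counts(slots[gap - w:gap])
        right = expected_counts(slots[gap:gap + w])
        sims.append((gap, cosine(left, right)))
    return sims

# Toy lattice: four slots dominated by /a/-like phonemes, then four
# dominated by /s/-like phonemes; the valley falls at the topic shift.
slots = [{"a": 0.7, "o": 0.3}] * 4 + [{"s": 0.8, "z": 0.2}] * 4
sims = gap_similarities(slots)
boundary = min(sims, key=lambda t: t[1])[0]
```

Extending the counts to phoneme n-grams, as the paper does, only changes what `expected_counts` accumulates.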

Special Session: Fact and Replica of Speech Production

Time:Tuesday 13:30 Place:301 Type:Special
Chair:Kiyoshi Honda & Jianwu Dang
13:30Estimation of Glottal Area Function Using Stereo-endoscopic High-Speed Digital Imaging
Hiroshi Imagawa (Department of Otolaryngology, University of Tokyo, Japan)
Ken-Ichi Sakakibara (Department of Communication Disorders, Health Sciences University of Hokkaido, Japan)
Isao T. Tokuda (Japan Advanced Institute of Science and Technology, Japan)
Mamiko Otsuka (Kumada Clinic, Japan)
Niro Tayama (Department of Otolaryngology, Head and Neck Surgery, National Center for Global Health and Medicine, Japan)
In this paper, a novel stereo-endoscopic high-speed digital imaging system and a method to estimate the glottal area function are proposed. Glottal length, width, and area of one female participant were estimated at three different fundamental frequencies (F0s).
13:45Toward Aero-acoustical Analysis of the Sibilant /s/: An Oral Cavity Modeling
Kazunori Nozaki (The Center for Advanced Medical Engineering and Informatics, Osaka University, Japan)
Ohnishi Youhei (Cybermedia Center, Osaka University, Japan)
Takashi Suda (Cybermedia Center, Osaka University, Japan)
Shigeo Wada (The Center for Advanced Medical Engineering and Informatics, Osaka University, Japan)
Shinji Shimojo (National Institute of Information and Communications Technology, Japan)
We analyzed the sibilant /s/, an unvoiced consonant, by considering turbulence and vortex sound. A segmentation technique for four-dimensional magnetic resonance imaging (4D-MRI) data that accounts for the movements of the oral cavity is proposed. Conventional Snakes cannot segment the quick deformation of the tip of the tongue in 4D-MRI images. To segment this quick motion, optical flow computed on the 4D-MRI images provides criteria for deciding the geometry of the control points of the Snakes. Our proposed method utilizes optical flow to modify the conventional Snakes and demonstrates superior segmentation accuracy in representing the movement of the oral cavity. Experiments demonstrated that our proposed method accurately segments 4D-MRI images and can thereby be used to perform numerical analyses of the sibilant /s/.
14:00Effects of Wall Impedance on Transmission and Attenuation of Higher-order Modes in Vocal-tract Model
Kunitoshi Motoki (Department of Electronics and Information Engineering, Hokkai-Gakuen University, Japan)
This paper presents the effects of a wall impedance on the propagation of higher-order modes in a three-dimensional vocal-tract model. This model is constructed using an asymmetrically connected structure of rectangular acoustic tubes, and can parametrically represent acoustic characteristics in higher frequencies where the assumption of the plane wave propagation does not hold. The propagation constants of the higher-order modes are calculated taking account of the wall impedance. The resonance characteristics of the vocal-tract model are evaluated based on the transfer impedance between an input volume velocity and an output sound-pressure.
14:15Articulatory Synthesis and Perception of Plosive-Vowel Syllables with Virtual Consonant Targets
Peter Birkholz (Department of Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen University, Aachen, Germany)
Bernd J. Kröger (Department of Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen University, Aachen, Germany)
Christiane Neuschaefer-Rube (Department of Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen University, Aachen, Germany)
Virtual articulatory targets are a concept to explain the different trajectories of primary and secondary articulators during consonant production, as well as the different places of the tongue-palate contact in [g] depending on the context vowel. The virtual targets for the tongue tip and the tongue body in apical and dorsal plosives are assumed to lie above the palate, and for bilabial consonants, the target is a negative degree of lip opening. In the present study, we discuss the concept of virtual targets and its application to articulatory speech synthesis. In particular, we examined how the location of virtual targets affects the acoustics and intelligibility of synthetic plosive-vowel syllables. It turned out that virtual targets that lie about 10 mm beyond the consonantal closure location allow a more precise reproduction of natural speech signals than virtual targets at a distance of about 1 mm. However, we found no effect on the intelligibility of the consonants.
14:30Speech Robot Mimicking Human Articulatory Motion
Kotaro Fukui (Waseda University)
Toshihiro Kusano (Waseda University)
Mukaeda Yoshikazu (Waseda University)
Yuto Suzuki (Waseda University)
Takanishi Atsuo (Waseda University)
Honda Masaaki (Waseda University)
We have developed a mechanical talking robot, Waseda Talker No. 7 Refined II, to study the human speech mechanism. The conventional control method for this robot is based on a concatenation rule of the phoneme-specific articulatory configurations. With this method, the speech mechanism of the robot is much slower than is required for human speech, because the robot requires momentary movement of motors. To resolve this problem, we have developed a control method that mimics human articulatory trajectory data. The human trajectory data for continuous speech was obtained by using an electromagnetic articulography (EMA) system. The EMA data was converted to the robot control parameters by applying inverse kinematics as well as geometric transformation. Experimental results show that the robot can produce continuous speech with human-like speed and smooth movement.
14:45Mechanical Vocal-tract Models for Speech Dynamics
Takayuki Arai (Sophia University)
Arai has developed several physical models of the human vocal tract for education and has reported that they are intuitive and helpful for students of acoustics and speech science. We first reviewed dynamic models, including the sliding three-tube (S3T) model and the flexible-tongue model. We then developed a head-shaped model with a sliding tongue, which has the advantages of both the S3T and flexible-tongue models. We also developed a computer-controlled version of the Umeda & Teranishi model, as the original model was hard to manipulate precisely by hand. These models are useful when teaching the dynamic aspects of speech.
15:00Prosodic Timing Analysis for Articulatory Re-synthesis Using a Bank of Resonators with an Adaptive Oscillator
Michael Brady (Boston University)
A method for the analysis of prosodic-level temporal structure is introduced. The method is based on measured phase angles of an oscillator as that oscillator is made to synchronize with reference points in a signal. Reference points are the predicted peaks of acoustic change as determined by the output of a bank of tuned resonators. A framework for articulatory re-synthesis is then described. Jaw movements of a robotic vocal tract are made to replicate the mean phase portrait of an utterance with reference to a production oscillator. These jaw movements are modeled to inform the dynamics of within-syllable phonemic articulations.

ASR: Acoustic Models II

Time:Tuesday 16:00 Place:Hall A/B Type:Oral
Chair:Mark J. F. Gales
16:00Boosting Systems for LVCSR
George Saon (IBM T.J. Watson Research Center)
Hagen Soltau (IBM T.J. Watson Research Center)
We employ a variant of the popular AdaBoost algorithm to train multiple acoustic models such that the aggregate system exhibits improved performance over the individual recognizers. Each model is trained sequentially on re-weighted versions of the training data. At each iteration, the weights are decreased for the frames that are correctly decoded by the current system. These weights are then multiplied with the frame-level statistics for the decision trees and Gaussian mixture components of the next-iteration system. The composite system uses a log-linear combination of HMM state observation likelihoods. We report experimental results on several broadcast news transcription setups which differ in the language being spoken (English and Arabic) and the amount of training data. Our findings suggest that significant gains can be obtained for small amounts of training data even after feature- and model-space discriminative training.
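The frame-level reweighting step described above can be sketched as follows. This is a minimal illustration of the boosting idea, not the authors' implementation; the down-weighting factor `alpha` is a hypothetical parameter.

```python
import numpy as np

def reweight_frames(weights, correct, alpha=0.5):
    """Boosting-style update: decrease weights of correctly decoded frames.

    weights: current per-frame weights (a probability distribution)
    correct: boolean array, True where the current system decodes the frame correctly
    alpha:   down-weighting factor for correct frames (illustrative value)
    """
    w = np.where(correct, weights * alpha, weights)
    return w / w.sum()  # renormalize to a distribution

# Example: 4 equally weighted frames, frames 0 and 2 decoded correctly
w = reweight_frames(np.ones(4) / 4, np.array([True, False, True, False]))
```

The renormalized weights would then scale the frame-level statistics used to build the next iteration's decision trees and Gaussian mixtures.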
16:20Incorporating Sparse Representation Phone Identification Features in Automatic Speech Recognition using Exponential Families
Vaibhava Goel (IBM T.J. Watson Research Center)
Tara Sainath (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
Peder Olsen (IBM T.J. Watson Research Center)
David Nahamoo (IBM T.J. Watson Research Center)
Dimitri Kanevsky (IBM T.J. Watson Research Center)
Sparse representation phone identification features (SPIF) is a recently developed technique to obtain an estimate of phone posterior probabilities conditioned on an acoustic feature vector. In this paper, we explore incorporating SPIF phone posterior probability estimates in a large vocabulary continuous speech recognition (LVCSR) task by including them as additional features of the exponential densities that model the HMM state emission likelihoods. We compare our proposed approach to a number of other well-known methods of combining feature streams or multiple LVCSR systems. Our experiments show that using exponential models to combine features results in a word error rate reduction of 0.5% absolute (18.7% down to 18.2%); this is comparable to the best error rate reduction obtained from system combination methods, but without having to build multiple systems or tune the system combination weights.
Xin Chen (Univ. of Missouri)
Yunxin Zhao (Univ. of Missouri)
In this paper, we propose to incorporate the widely used Multiple Layer Perceptron (MLP) features and discriminative training (DT) into our recent data-sampling based ensemble acoustic models to further improve the quality of the individual models as well as the diversity among the models. We also propose applying speaker-model distance based speaker clustering for data sampling to construct ensembles of acoustic models for speaker independent speech recognition. By using these methods on the speaker independent TIMIT phone recognition task, we have obtained a phoneme recognition accuracy of 77.1% on the TIMIT complete test set, an absolute improvement of 5.4% over our conventional HMM baseline system, making this one of the best reported results on the TIMIT continuous phoneme recognition task.
17:00Semi-Supervised Training of Gaussian Mixture Models by Conditional Entropy Minimization
Jui-Ting Huang (University of Illinois at Urbana-Champaign)
Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)
In this paper, we propose a new semi-supervised training method for Gaussian Mixture Models. We add a conditional entropy minimizer to the maximum mutual information criterion, which makes it possible to incorporate unlabeled data in a discriminative training fashion. The training method is simple but surprisingly effective. The preconditioned conjugate gradient method provides a reasonable convergence rate for the parameter update. Phonetic classification experiments on the TIMIT corpus demonstrate significant improvements due to unlabeled data via our training criterion.
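As an illustration of the added term, the conditional entropy over unlabeled data can be computed from class posteriors. This sketch assumes the posteriors have already been obtained from the mixture models; minimizing the term pushes the model toward confident decisions on unlabeled points.

```python
import numpy as np

def conditional_entropy(posteriors):
    """Average entropy of class posteriors p(y|x) over unlabeled points.

    posteriors: (N, C) array of class posterior probabilities, one row
    per unlabeled point.  Returns the mean per-point entropy H(y|x).
    """
    p = np.clip(posteriors, 1e-12, 1.0)  # avoid log(0)
    return float(-(p * np.log(p)).sum(axis=1).mean())
```

Maximally uncertain posteriors give entropy log(C), while confident one-hot posteriors give (near) zero, so the minimizer rewards decisiveness on the unlabeled set.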
17:20A Study of Irrelevant Variability Normalization Based Training and Unsupervised Online Adaptation for LVCSR
Guangchuan Shi (Microsoft Research Asia, and Shanghai Jiao Tong University)
Yu Shi (Microsoft Research Asia)
Qiang Huo (Microsoft Research Asia)
This paper presents an experimental study of a maximum likelihood (ML) approach to irrelevant variability normalization (IVN) based training and unsupervised online adaptation for large vocabulary continuous speech recognition. A moving-window based frame labeling method is used for acoustic sniffing. The IVN-based approach achieves a 10% relative word error rate reduction over an ML-trained baseline system on a Switchboard-1 conversational telephone speech transcription task.
17:40Improvements to Generalized Discriminative Feature Transformation for Speech Recognition
Roger Hsiao (Language Technologies Institute, Carnegie Mellon University)
Florian Metze (Language Technologies Institute, Carnegie Mellon University)
Tanja Schultz (Language Technologies Institute, Carnegie Mellon University)
Generalized Discriminative Feature Transformation (GDFT) is a feature-space discriminative training algorithm for automatic speech recognition (ASR). GDFT uses Lagrange relaxation to transform the constrained maximum likelihood linear regression (CMLLR) algorithm for feature-space discriminative training. This paper presents recent improvements to GDFT, achieved by adding regularization to the optimization problem. The resulting algorithm is called regularized GDFT (rGDFT), and we show that many regularization and smoothing techniques developed for model-space discriminative training are also applicable to feature-space training. We evaluated rGDFT on a real-time Iraqi ASR system and also on a large-scale Arabic ASR task.

Language Processing

Time:Tuesday 16:00 Place:201A Type:Oral
Chair:Dilek Hakkani-Tur
16:00Improving ASR-based topic segmentation of TV programs with confidence measures and semantic relations
Camille Guinaudeau (INRIA/IRISA)
Guillaume Gravier (IRISA/CNRS)
Pascale Sébillot (IRISA/INSA)
The increasing quantity of video material requires methods to help users navigate such data, among them topic segmentation techniques. The goal of this article is to improve ASR-based topic segmentation methods to deal with the peculiarities of professional-video transcripts (transcription errors and lack of repetitions) while remaining generic. To this end, we introduce confidence measures and semantic relations into a segmentation method based on lexical cohesion. We show significant improvements in the F1-measure, +1.7 and +1.9, when integrating confidence measures and semantic relations respectively. Such improvement demonstrates that simple clues can counteract errors in automatic transcripts and the lack of repetitions.
16:20The Relevance of Timing, Pauses and Overlaps in Dialogues: Detecting Topic Changes in Scenario Based Meetings
Saturnino Luz (School of Computer Science and Statistics, Trinity College Dublin)
Jing Su (School of Computer Science and Statistics, Trinity College Dublin)
We present an investigation of the relevance of simple conversational features as indicators of topic shifts in small-group meetings. Three proposals for the representation of dialogue data are described, and their effectiveness at detecting topic boundaries is assessed on a large section of the Augmented Multi-Party Interaction (AMI) corpus. These proposals consist of representing a speech event through combinations of features such as the lengths of vocalisations, pauses and speech overlaps in the immediate temporal context of the event. Results show that the timing of vocalisations alone, within a 7-vocalisation window (3 on each side of the vocalisation under consideration), can be an effective predictor of topic boundaries, outperforming topic segmentation methods based on lexical features. Pause and overlap information on their own also yield comparably good segmentation accuracy, suggesting that simple methods could complement or even serve as alternatives to methods which require more demanding speech processing for meeting browsing.
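A minimal sketch of the windowed timing representation described above, assuming each vocalisation is described only by its duration and that edge positions are zero-padded; the feature sets in the paper are richer (pauses, overlaps), so this is illustrative only.

```python
def window_features(durations, i, context=3):
    """Represent vocalisation i by the durations of the events in a
    (2*context+1)-event window centered on it (7 events for context=3),
    padding with 0.0 where the window runs off the start or end."""
    feats = []
    for j in range(i - context, i + context + 1):
        feats.append(durations[j] if 0 <= j < len(durations) else 0.0)
    return feats

# Example: the window for the middle vocalisation of a 5-event dialogue
feats = window_features([1.0, 2.0, 3.0, 4.0, 5.0], 2)
```

Each such vector would then be fed to a classifier that labels the central event as a topic boundary or not.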
16:40Semi-supervised Part-of-speech Tagging in Speech Applications
Richard Dufour (LIUM - University of Le Mans)
Benoit Favre (LIUM - University of Le Mans)
When no training or adaptation data is available, semi-supervised training is a good alternative for processing new domains. We perform Bayesian training of a part-of-speech (POS) tagger from unannotated text and a dictionary of possible tags for each word. We extend that method with supervised prediction of possible tags for out-of-vocabulary words and study the impact of both semi-supervision and starting dictionary size on three representative downstream tasks (named entity tagging, semantic role labeling, ASR output post-processing) that use POS tags as features. The outcome is no impact or a small decrease in performance compared to using a fully supervised tagger, with even potential gains in case of domain mismatch for the supervised tagger. Tasks that trust the tags completely (like ASR post-processing) are more affected by a reduction of the starting dictionary, but still yield a positive outcome.
17:00Memory-based active learning for French broadcast news
Frédéric Tantini (LORIA-CNRS, Campus Scientifique, 54506 Vandoeuvre-les-Nancy)
Christophe Cerisara (LORIA-CNRS, Campus Scientifique, 54506 Vandoeuvre-les-Nancy)
Claire Gardent (LORIA-CNRS, Campus Scientifique, 54506 Vandoeuvre-les-Nancy)
Stochastic dependency parsers can achieve very good results when they are trained on large corpora that have been manually annotated. Active learning is a procedure that aims at reducing this annotation cost by selecting the fewest possible sentences that will produce the best possible parser. We propose a new selective sampling function for active learning that exploits two memory-based distances to find a good compromise between parser uncertainty and sentence representativeness. The reduced dependency between the parsing and selection models opens interesting perspectives for future model combination. The approach is validated on a French broadcast news corpus creation task dedicated to dependency parsing. It outperforms the baseline entropy-based uncertainty selective sampling on this task. We plan to extend this work with self- and co-training methods in order to enlarge this corpus and produce the first French broadcast news treebank.
17:20Can Conversational Word Usage be Used to Predict Speaker Demographics?
Dan Gillick (University of California, Berkeley)
This work surveys the potential for predicting demographic traits of individual speakers (gender, age, education level, ethnicity, and geographic region) using only word usage features derived from the output of a speech recognition system on conversational American English. Significant differences in word usage patterns among the different classes allow for reasonably high classification accuracy (60%-82%), even without extensive training data.
17:40Prosodic Word-Based Error Correction in Speech Recognition Using Prosodic Word Expansion and Contextual Information
Chao-Hong Liu (Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan)
Chung-Hsien Wu (Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan)
In this study, considering the effect of phrase grouping in spontaneous speech, prosodic words, instead of lexical words, are adopted as the units for error correction of speech recognition results. The prosodic words and the corresponding misrecognized word fragments are obtained from a speech database to construct a misrecognized word fragment table for the extracted prosodic words. For each word fragment in a recognized word sequence, the potential prosodic words which are likely to be misrecognized as that fragment are retrieved from the table for prosodic word candidate expansion. The prosodic word-based contextual information, considering substitution and concatenation scores, is then incorporated into a probabilistic model to find the best word fragment sequence as the corrected output. Experimental results show that the proposed method achieved a 0.32 F1 score, with improvements of 0.18 and 0.10 compared to the SMT-based and lexical word-based approaches, respectively.
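The candidate-expansion step can be sketched as a lookup in the misrecognized-word-fragment table. The table contents and the subsequent probabilistic scoring with substitution and concatenation scores are not reproduced here; the example table is hypothetical.

```python
def expand_candidates(fragments, misrec_table):
    """For each recognized word fragment, retrieve the prosodic words
    that are often misrecognized as that fragment; the fragment itself
    is kept as a fallback candidate."""
    return [misrec_table.get(f, []) + [f] for f in fragments]

# Hypothetical example: "ab" is a known misrecognition of prosodic word "abc"
cands = expand_candidates(["ab", "xy"], {"ab": ["abc"]})
```

A downstream probabilistic model would then search over the candidate lattice for the best corrected fragment sequence.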

Speech and audio segmentation

Time:Tuesday 16:00 Place:201B Type:Oral
Chair:Yasunari Obuchi
16:00Fully Automatic Segmentation for Prosodic Speech Corpora
Sarah Hoffmann (Speech Processing Group, ETH Zurich, Switzerland)
Beat Pfister (Speech Processing Group, ETH Zurich, Switzerland)
While automatic methods for phonetic segmentation of speech can help with rapid annotation of corpora, most methods rely either on manually segmented data to initially train the process or on manual post-processing. This is very time-consuming and slows down the porting of speech systems to new languages. In the context of prosody corpora for text-to-speech (TTS) systems, we investigated methods for fully automatic phoneme segmentation using only the corpora to be segmented and an automatically generated transcription. We present a new method that improves the performance of HMM-based segmentation by correcting the boundaries with high precision between the training stages of the phoneme models. We show that, while initially aimed at single-speaker corpora, it performs equally well for multi-speaker corpora.
16:20A Novel text-independent phonetic segmentation algorithm based on the Microcanonical Multiscale Formalism
Vahid Khanagha (INRIA Bordeaux Sud-Ouest)
Khalid Daoudi (INRIA Bordeaux Sud-Ouest)
Oriol Pont (INRIA Bordeaux Sud-Ouest)
Hussein Yahia (INRIA Bordeaux Sud-Ouest)
We propose a radically novel approach to analyzing speech signals from a statistical physics perspective. Our approach relies on a new framework, the Microcanonical Multiscale Formalism (MMF), based on the computation of singularity exponents defined at each point in the signal domain. The latter allows nonlinear analysis of complex dynamics and, in particular, characterizes the intermittent signature. We study the validity of the MMF for the speech signal and show that singularity exponents indeed convey valuable information about its local dynamics. We define an accumulative measure on the exponents which reveals phoneme boundaries as the breaking points of a piecewise-linear-like curve. We then develop a simple automatic phonetic segmentation algorithm using piecewise linear curve fitting. We present experiments on the full TIMIT database. The results show that our algorithm yields considerably better accuracy than recently published ones.
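One step of the boundary-detection idea can be sketched as locating the point of maximal deviation of the accumulated measure from the straight line joining its endpoints, which a recursive piecewise-linear fit would then refine. This is a simplification of the paper's algorithm, not its actual curve-fitting procedure.

```python
import numpy as np

def largest_break(acc):
    """Find the index where a cumulative measure deviates most from the
    straight line joining its endpoints -- one step of a recursive
    piecewise-linear fit over the accumulated singularity measure."""
    x = np.arange(len(acc))
    line = acc[0] + (acc[-1] - acc[0]) * x / (len(acc) - 1)
    return int(np.argmax(np.abs(acc - line)))

# Example: the cumulative measure changes slope at index 3
b = largest_break(np.array([0.0, 1.0, 2.0, 3.0, 10.0, 17.0, 24.0]))
```

Applying the same test recursively to the sub-curves on either side of each detected break would yield the full set of candidate phoneme boundaries.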
You-Yu Lin (Institute of Communication, National Chiao Tung University, Hsinchu, Taiwan, ROC)
Yih-Ru Wang (Institute of Communication, National Chiao Tung University, Hsinchu, Taiwan, ROC)
Yuan-Fu Liao (Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan, ROC)
A sample-based phone boundary detection algorithm is proposed in this paper. Some sample-based acoustic parameters are first extracted in the proposed method, including six sub-band signal envelopes, sample-based KL distance and spectral entropy. Then, the sample-based KL distance is used for boundary candidates pre-selection. Last, a supervised neural network is employed for final boundary detection. Experimental results using the TIMIT speech corpus showed that EERs of 13.2% and 15.1% were achieved for the training and test data sets, respectively. Moreover, 43.5% and 88.2% of boundaries detected were within 80- and 240-sample error tolerance from manual labeling results at the EER operating point.
17:00HMM-based Automatic Visual Speech Segmentation Using Facial Data
Utpala Musti (Université Nancy 2, LORIA)
Asterios Toutios (Université Nancy 2, LORIA)
Slim Ouni (Université Nancy 2, LORIA)
Vincent Colotte (Université Henri Poincaré Nancy 1, LORIA)
Brigitte Wrobel-Dautcourt (Université Henri Poincaré Nancy 1, LORIA)
Marie-Odile Berger (INRIA, LORIA)
We describe automatic visual speech segmentation using facial data captured by a stereo-vision technique. The segmentation is performed using an HMM-based forced alignment mechanism widely used in automatic speech recognition. The idea is based on the assumption that training on visual speech data alone might capture the uniqueness of the facial component of speech articulation, the asynchrony (time lags) between visual and acoustic speech segments, and significant coarticulation effects. This should provide valuable information on the extent to which a phoneme may affect surrounding phonemes visually, and help in labeling the visual speech segments based on dominant coarticulatory contexts.
17:20Bayes Factor Based Speaker Segmentation for Speaker Diarization
David Wang (Queensland University of Technology)
Robert Vogt (Queensland University of Technology)
Sridha Sridharan (Queensland University of Technology)
This paper proposes the use of the Bayes Factor as a distance metric for speaker segmentation within a speaker diarization system. The proposed approach uses a pair of constant-sized, sliding windows to compute the value of the Bayes Factor between the adjacent windows over the entire audio. Results obtained on the 2002 Rich Transcription Evaluation dataset show an improved segmentation performance compared to previous approaches reported in the literature using the Generalized Likelihood Ratio. When applied in a speaker diarization system, this approach results in a 5.1% relative improvement in the overall Diarization Error Rate compared to the baseline.
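The sliding-pair-of-windows scheme can be sketched as follows. The Bayes Factor itself is not reproduced here; a simple difference-of-window-means distance stands in as a placeholder for any two-window metric.

```python
import numpy as np

def segment_scores(features, win=100, dist=None):
    """Slide a pair of adjacent fixed-size windows over a (T, D) feature
    sequence and score each candidate change point t by the distance
    between the windows [t-win, t) and [t, t+win).  `dist` is pluggable;
    the paper uses the Bayes Factor, here a difference of window means
    serves as a stand-in."""
    if dist is None:
        dist = lambda a, b: float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))
    scores = []
    for t in range(win, len(features) - win):
        scores.append(dist(features[t - win:t], features[t:t + win]))
    return scores

# Example: a clear mean shift at frame 200 should score highest
feats = np.r_[np.zeros((200, 2)), np.ones((200, 2))]
scores = segment_scores(feats, win=100)
```

Peaks in the score sequence would then be taken as speaker change points and passed on to the clustering stage of the diarization system.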
17:40Using High-level Information to Detect Key Audio Events in a Tennis Game
Qiang Huang (University of East Anglia)
Stephen Cox (University of East Anglia)
This paper describes how the detection of key audio events in a sports game (tennis) can be enhanced by the use of high-level information. High-level features are able to provide useful constraints on the detection procedure, and thus to improve detection performance. We define two types of event based information: event dependency and inter-event timing. These respectively characterize the identity of the next event and the time at which the next event will occur. Probabilistic models of high-level constraints are developed, and then integrated into our event detection framework. We test this approach on audio tracks extracted from two different tennis games. The results show that significant improvements in both accuracy and computational efficiency are obtained when applying high-level information.

Prosody: Analysis

Time:Tuesday 16:00 Place:302 Type:Oral
Chair:Hansjoerg Mixdorff
16:00What do you mean, you're uncertain?: The interpretation of cue words and rising intonation in dialogue
Catherine Lai (University of Pennsylvania)
This paper investigates how rising intonation affects the interpretation of cue words in dialogue. Both cue words and rising intonation express a range of speaker attitudes like uncertainty and surprise. However, it is unclear how the perception of these attitudes relates to dialogue structure and belief co-ordination. Perception experiment results suggest that rises reflect difficulty integrating new information rather than signaling a lack of credibility. This leads to a general analysis of rising intonation as signaling that the current question under discussion is unresolved. However, the interaction with cue word semantics restricts how much their interpretation can vary with prosody.
16:20Coping Imbalanced Prosodic Unit Boundary Detection with Linguistically-motivated Prosodic Features
Yi-Fen Liu (Institute of Information Systems and Applications, National Tsing Hua University, Taiwan)
Shu-Chuan Tseng (Institute of Linguistics, Academia Sinica, Taiwan)
J.-S. Roger Jang (Department of Computer Science, National Tsing Hua University, Taiwan)
C.-H. Alvin Chen (Graduate Institute of Linguistics, National Taiwan University, Taiwan)
Continuous speech input for ASR processing is usually pre-segmented into speech stretches by pauses. In this paper, we propose that smaller, prosodically defined units can be identified by tackling the problem of imbalanced prosodic unit boundary detection with five machine learning techniques. A parsimonious set of linguistically motivated prosodic features proves useful for characterizing prosodic boundary information. Furthermore, BMPM tends to achieve a higher true positive rate on the minority class, i.e. the defined prosodic units. As a whole, the decision tree classifier C4.5 reaches a more stable performance than the other algorithms.
16:40Improving Prosodic Phrase Prediction by Unsupervised Adaptation and Syntactic Features Extraction
Zhigang Chen (iFly Speech Lab, University of Science and Technology of China, Hefei, Anhui, 230027)
Guoping Hu (iFly Speech Lab, University of Science and Technology of China, Hefei, Anhui, 230027)
Wei Jiang (iFly Speech Lab, University of Science and Technology of China, Hefei, Anhui, 230027)
In state-of-the-art speech synthesis systems, prosodic phrase prediction is the most serious problem, accounting for about 40% of text analysis errors. Two targeted optimization strategies are proposed in this paper to deal with the two major types of prosodic phrase prediction errors. First, an unsupervised adaptation method is proposed to relieve the mismatch between training and testing; second, syntactic features are extracted from a parser and integrated into the prediction model to keep the predicted prosodic structure broadly consistent with the syntactic structure. We verify our solutions on a mature Mandarin speech synthesis system, and experimental results show that both strategies have a positive influence: the sentence unacceptable rate drops significantly from 15.9% to 8.75%.
17:00Perception-based automatic approximation of F0 contours in Cantonese speech
Yujia Li (The Chinese University of Hong Kong)
Tan Lee (The Chinese University of Hong Kong)
In our previous studies, it was found that F0 variations in Cantonese speech can be adequately represented by linear approximations of the observed F0 contours, in the sense that comparable perception with natural speech can be attained. The approximated contours were determined manually. In this study, a framework is developed for automatic approximation of F0 contours. Based on the knowledge learned from perceptual studies, the approximation process is carried out in three steps: contour smoothing, locating turning points and determining F0 values at turning points. Perceptual evaluation was performed on re-synthesized speech of hundreds of Cantonese polysyllabic words. The results show that the proposed framework produces good approximations for the observed F0 contours. For 93% of the utterances, the re-synthesized speech can attain comparable perception to the natural speech.
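The first steps of the approximation process can be sketched in a simplified form. The moving-average window and the slope-sign test below are illustrative stand-ins for the perceptually informed smoothing and turning-point rules used in the paper.

```python
def smooth(f0, k=3):
    """Moving-average smoothing of an F0 contour (illustrative first
    step; the window size k is a hypothetical choice)."""
    half = k // 2
    out = []
    for i in range(len(f0)):
        seg = f0[max(0, i - half):i + half + 1]
        out.append(sum(seg) / len(seg))
    return out

def turning_points(f0):
    """Indices where the contour changes slope direction -- candidate
    turning points for the piecewise-linear approximation."""
    pts = []
    for i in range(1, len(f0) - 1):
        d1, d2 = f0[i] - f0[i - 1], f0[i + 1] - f0[i]
        if d1 * d2 < 0:
            pts.append(i)
    return pts
```

Connecting the turning points with straight lines, and assigning them appropriate F0 values, yields the linear approximation whose perceptual adequacy the paper evaluates.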
17:20Discriminative Training and Unsupervised Adaptation for Labeling Prosodic Events with Limited Training Data
Raul Fernandez (IBM Research)
Bhuvana Ramabhadran (IBM Research)
Many applications of spoken-language systems can benefit from having access to annotations of prosodic events. Unfortunately, obtaining human annotations of these events, even in amounts sufficient to train a supervised system, can be a laborious and costly effort. In this paper we explore applying conditional random fields to automatically label major and minor break indices and pitch accents from a corpus of recorded and transcribed speech using a large set of fully automatically extracted acoustic and linguistic features. We demonstrate the robustness of these features when used in a discriminative training framework as a function of reducing the amount of training data. We also explore adapting the baseline system in an unsupervised fashion to a target dataset for which no prosodic labels are available, and show how, when operating at a point where only limited amounts of data are available, an unsupervised approach can offer up to an additional 3% improvement.
17:40Prosody for the Eyes: Quantifying Visual Prosody using Guided Principal Components Analysis
Erin Cvejic (MARCS Auditory Laboratories, University of Western Sydney)
Jeesun Kim (MARCS Auditory Laboratories, University of Western Sydney)
Chris Davis (MARCS Auditory Laboratories, University of Western Sydney)
Guillaume Gibert (MARCS Auditory Laboratories, University of Western Sydney)
Although typically studied as an auditory phenomenon, prosody can also be conveyed by the visual speech signal, through increased movements of articulators during speech production, or through eyebrow and rigid head movements. This paper aimed to quantify such visual correlates of prosody. Specifically, the study was concerned with measuring the visual correlates of prosodic focus and prosodic phrasing. In the experiment, four participants’ speech and face movements were recorded while they completed a dialog exchange task with an interlocutor. Acoustic analysis showed that prosodic contrasts differed on duration, pitch and intensity parameters, which is consistent with previous findings in the literature. The visual data was processed using guided principal component analysis. The results showed that compared to the broad focused statement condition, speakers produced greater movement on both articulatory and non-articulatory parameters for prosodically focused and intonated words.

Speaker characterization and recognition III

Time:Tuesday 16:00 Place:International Conference Room A Type:Poster
Chair:Tomi Kinnunen
#1An Investigation into Direct Scoring Methods without SVM Training in Speaker Verification
Ce Zhang (Digital Content Technology Research Center, Institute of Automation,Chinese Academy of Sciences, Beijing 100190, China)
Rong Zheng (Digital Content Technology Research Center, Institute of Automation,Chinese Academy of Sciences, Beijing 100190, China)
Bo Xu (Digital Content Technology Research Center, Institute of Automation,Chinese Academy of Sciences, Beijing 100190, China)
In this paper, we first propose a new method, called Symmetric Scoring (SS), to handle the problem of scoring a test utterance against a speaker model in a JFA speaker verification system. The SS method is derived from the GMM log-likelihood-ratio approximation and is both symmetrical and efficient. We then show that the SS method and the JFA-SVM system using the GMM supervector space as input have nearly the same form in the scoring phase. We also show that the performance of the SS method is better than that of the JFA-SVM system, which indicates that applying the KL kernel function directly to obtain a score in GMM supervector space is as effective as a JFA-SVM trained with the same kernel. Inspired by this direct scoring method, in which the kernel function is used only to calculate a score without any SVM training procedure, we extend the relationship to the speaker factor space and evaluate results based on different kernels.
#2Large Margin Gaussian mixture models for speaker identification
Reda Jourani (SAMoVA Group, IRIT - UMR 5505 du CNRS / Laboratoire LRIT, Unité associée au CNRST, URAC 29)
Khalid Daoudi (INRIA Bordeaux-Sud Ouest)
Régine André-Obrecht (SAMoVA Group, IRIT - UMR 5505 du CNRS)
Driss Aboutajdine (Laboratoire LRIT, Unité associée au CNRST, URAC 29)
Gaussian mixture models (GMMs) have been widely and successfully used in speaker recognition during the last decade. However, they are generally trained using the generative criterion of maximum likelihood estimation. In this paper, we propose a simple and efficient discriminative approach to learning GMMs with a large margin criterion to solve the classification problem. Our approach is based on recent work on the Large Margin GMM (LM-GMM), where each class is modeled by a mixture of ellipsoids and which has shown good results in speech recognition; we propose a simplification of the original algorithm. We carry out preliminary experiments on a speaker identification task using NIST-SRE'2006 data and compare the traditional generative GMM approach, the original LM-GMM approach and our own version. The results suggest that our algorithm outperforms the two others.
#3On the Use of Gaussian Component Information in the Generative Likelihood Ratio Estimation for Speaker Verification
Rong Zheng (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)
Bo Xu (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)
This paper presents an experimental study of exploiting Gaussian component information for speaker verification. The motivation of the proposed algorithm is to examine detailed component information by using each individual Gaussian component’s contribution to the final output score. Analysis of component-specific scores is important for an in-depth understanding of the Gaussian mixture’s impact on performance. We present a new method, called Gaussian component information based likelihood ratio (GCILR), to introduce and weight component-dependent information based on adapted Gaussian mixture models. Performance evaluations comparing our system to the well-known generative likelihood ratio estimation technique are provided. The paper discusses how performance is influenced by the differing significance of the informative component-specific scores. Comparison experiments conducted on the NIST 2006 SRE dataset show the effectiveness of the proposed method.
#4Acoustic Vector Resampling for GMMSVM-Based Speaker Verification
Man-Wai Mak (The Hong Kong Polytechnic University)
Wei Rao (The Hong Kong Polytechnic University)
Using GMM-supervectors as the input to SVM classifiers (namely, GMM-SVM) is one of the promising approaches to text-independent speaker verification. However, one unaddressed issue is the severe imbalance between the numbers of speaker-class utterances and impostor-class utterances available for training a speaker-dependent SVM. This paper proposes a resampling technique -- namely utterance partitioning with acoustic vector resampling (UP-AVR) -- to mitigate this problem. Specifically, the sequence order of acoustic vectors in an enrollment utterance is first randomized; then the randomized sequence is partitioned into a number of segments. Each segment is used to produce a GMM-supervector via MAP adaptation and mean-vector concatenation. A desirable number of speaker-class supervectors can be produced by repeating this process a number of times. Experimental evaluations suggest that UP-AVR can reduce the EER of GMM-SVM systems by about 10%.
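The randomise-partition-resample loop of UP-AVR described above can be sketched as follows. This is a minimal illustration only: the per-segment MAP adaptation and mean-vector concatenation of the paper are replaced by a simple per-segment mean as a placeholder, and all names are illustrative.

```python
import random

def up_avr(frames, n_partitions, n_repeats, seed=0):
    """Utterance partitioning with acoustic vector resampling (sketch).

    `frames` is a list of acoustic feature vectors (tuples of floats).
    Each repeat randomises the frame order, splits the utterance into
    `n_partitions` segments, and turns every segment into a toy
    "supervector" (here just the per-dimension segment means, standing
    in for MAP adaptation followed by mean-vector concatenation).
    """
    rng = random.Random(seed)
    dim = len(frames[0])
    supervectors = []
    for _ in range(n_repeats):
        shuffled = frames[:]
        rng.shuffle(shuffled)                 # randomise the sequence order
        size = len(shuffled) // n_partitions
        for p in range(n_partitions):         # one segment per partition
            seg = shuffled[p * size:(p + 1) * size]
            sv = tuple(sum(f[d] for f in seg) / len(seg) for d in range(dim))
            supervectors.append(sv)
    return supervectors
```

Each repeat thus yields `n_partitions` extra speaker-class supervectors, which is how the imbalance against the impostor class is reduced.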
#5A Fast Speaker Indexing Using Vector Quantization and Second Order Statistics with Adaptive Threshold Computation
Konstantin Biatov (Biatov Lab)
This paper describes an effective unsupervised speaker indexing approach. We suggest a two-stage algorithm to speed up the state-of-the-art algorithm based on the Bayesian Information Criterion (BIC). In the first stage of the merging process, a computationally cheap method based on vector quantization (VQ) is used. In the second stage, a more computationally expensive technique based on the BIC is applied. The speaker indexing task relies on a tuning parameter, or threshold. We suggest an on-line procedure to set the value of this tuning parameter without using development data. The results are evaluated on the ESTER corpus.
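The BIC-based merging decision used in the second stage can be illustrated as follows, for one-dimensional features and ML variance estimates (a minimal sketch, not the authors' implementation; the paper's setting uses full-covariance multivariate Gaussians):

```python
from math import log

def delta_bic(seg1, seg2, lam=1.0):
    """ΔBIC merging criterion for two 1-D feature segments (sketch).

    Large positive values indicate the segments are better modelled
    separately (different speakers); values below the threshold favour
    merging.  `lam` is the usual BIC penalty weight.
    """
    def var(x):
        m = sum(x) / len(x)
        return sum((v - m) ** 2 for v in x) / len(x)

    n1, n2 = len(seg1), len(seg2)
    n = n1 + n2
    penalty = lam * 0.5 * 2 * log(n)   # 2 free parameters (mean, variance) in 1-D
    return (0.5 * n * log(var(seg1 + seg2))
            - 0.5 * n1 * log(var(seg1))
            - 0.5 * n2 * log(var(seg2))
            - penalty)
```

The cheap VQ stage of the two-stage algorithm prunes candidate pairs so that this more expensive statistic only needs to be evaluated for the survivors.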
#6Using Phoneme Recognition and Text-dependent Speaker Verification to Improve Speaker Segmentation for Chinese Speech
Gang Wang (Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology)
Xiaojun Wu (Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology)
Thomas Fang Zheng (Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology)
Speaker segmentation is widely used in many tasks such as multi-speaker detection and speaker tracking. Segmentation performance depends to a large extent on the performance of speaker verification (SV) between two short utterances, so improving SV performance for short utterances greatly helps segmentation. In this paper, a method based on phoneme recognition and text-dependent speaker recognition is proposed. During segmentation, a phoneme sequence is first recognized using a phoneme recognizer, and then text-dependent speaker recognition based on dynamic time warping (DTW) is performed on the same phoneme in two adjacent windows. Experiments on the Chinese Corpus Consortium (CCC) MSS database showed better performance than both the BIC method and the GLR method.
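The DTW matching step at the core of the text-dependent comparison can be sketched as below, assuming per-frame feature tuples and a Euclidean local cost (the actual feature set and path constraints of the paper may differ):

```python
def dtw_distance(seq1, seq2):
    """Dynamic time warping distance between two feature sequences (sketch).

    Frames are tuples of floats; the local cost is Euclidean distance.
    In the paper's setting the two sequences would be realisations of
    the same recognised phoneme taken from two adjacent windows.
    """
    from math import inf, sqrt

    n, m = len(seq1), len(seq2)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sqrt(sum((a - b) ** 2
                            for a, b in zip(seq1[i - 1], seq2[j - 1])))
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]
```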
#7On enhancing feature sequence filtering with filter-bank energy transformation in speaker verification with telephone speech
Claudio Garreton (Universidad de Chile)
Nestor Becerra Yoma (Universidad de Chile)
In this paper a novel feature enhancement method for channel robustness with short utterances is presented. The transform reduces the time-varying component of the channel distortion by applying a band-pass filter along the filter-bank domain on a frame-by-frame basis. This procedure enhances the channel cancelling effect of techniques based on feature trajectory filtering. The transformation parameters are defined using relative importance analysis based on a discriminant function. In text-dependent speaker verification with telephone speech the transform leads to a 10.8% reduction in EER, and to further improvements of 23.5% and 40% when combined with RASTA or CMN, respectively.
#8MAP Estimation of Subspace Transform for Speaker Recognition
Donglai Zhu (Institute for Infocomm Research, A*STAR, Singapore)
Bin Ma (Institute for Infocomm Research, A*STAR, Singapore)
Kong Aik Lee (Institute for Infocomm Research, A*STAR, Singapore)
Cheung-Chi Leung (Institute for Infocomm Research, A*STAR, Singapore)
Haizhou Li (Institute for Infocomm Research, A*STAR, Singapore)
We propose using the maximum-a-posteriori (MAP) estimation of subspace transform for speaker recognition. The transform function is defined on the means of the Gaussian mixture model (GMM), where transform matrices and bias vectors are associated with separate regression classes so that both can be estimated with sufficient statistics given limited training data. The transform matrices are further defined as a linear combination of a set of base transforms so that the linear weights are parameters to be estimated. We characterize the speakers with transform parameters and model them using support vector machine (SVM). Experiments on the 2008 NIST SRE task illustrate the effectiveness of the method.
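As a sketch of the parameterisation described above (the notation is assumed for illustration, not taken from the paper): the transformed mean of Gaussian $m$, belonging to regression class $r(m)$, can be written

```latex
\hat{\mu}_m \;=\; A_{r(m)}\,\mu_m + b_{r(m)},
\qquad
A_r \;=\; \sum_{k=1}^{K} w_{r,k}\, B_k ,
```

so that with the base transforms $B_k$ fixed, only the combination weights $w_{r,k}$ and bias vectors $b_r$ need to be estimated per speaker under the MAP criterion, which keeps the number of free parameters small when training data are limited.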
#9A Longest Matching Segment Approach for Text-Independent Speaker Recognition
Ayeh Jafari (Queen's University Belfast)
Ramji Srinivasan (Queen's University Belfast)
Danny Crookes (Queen's University Belfast)
Ming Ji (Queen's University Belfast)
We describe a new approach for segment-based speaker recognition, given text-independent training and test data. We assume that utterances from the same speaker have more and longer matching acoustic segments, compared to utterances from different speakers. Therefore, we identify the longest matching segments, at each frame location, between the training and test utterances, and base recognition on the similarity of these longest matching segments. The new system scores the speaker higher who has greater number, length and similarity of matching segments. Focusing on long acoustic segments effectively exploits the spectral dynamics. We have compared our new system with the conventional frame-based GMM-UBM system for the NIST 2002 SRE task, and achieved better performance.
#10Approaching Human Listener Accuracy with Modern Speaker Verification
Ville Hautamäki (Institute for Infocomm Research, A*STAR, Singapore)
Tomi Kinnunen (School of Computing, University of Eastern Finland)
Mohaddeseh Nosratighods (School of Electrical Engineering and Telecommunication, University of New South Wales, Australia)
Kong Aik Lee (Institute for Infocomm Research, A*STAR, Singapore)
Bin Ma (Institute for Infocomm Research, A*STAR, Singapore)
Haizhou Li (Institute for Infocomm Research, A*STAR, Singapore)
Being able to recognize people from their voice is a natural ability that we take for granted. Recent advances have shown significant improvement in automatic speaker recognition performance. Besides being able to process large amounts of data in a fraction of the time required by humans, automatic systems are now able to deal with diverse channel effects. The goal of this paper is to examine how a state-of-the-art automatic system performs in comparison with human listeners, and to investigate strategies for a human-assisted form of automatic speaker recognition, which is useful in forensic investigation. We set up an experimental protocol using data from the NIST SRE 2008 core set. A total of 36 listeners from three sites, in Australia, Finland and Singapore, participated in the listening experiments. The state-of-the-art automatic system achieved a 20% error rate, whereas fusion of the human listeners achieved 22%.
#11Extended Weighted Linear Prediction (XLP) Analysis of Speech and its Application to Speaker Verification in Adverse Conditions
Jouni Pohjalainen (Department of Signal Processing and Acoustics, Aalto University School of Science and Technology, Finland)
Rahim Saeidi (School of Computing, University of Eastern Finland, Finland)
Tomi Kinnunen (School of Computing, University of Eastern Finland, Finland)
Paavo Alku (Department of Signal Processing and Acoustics, Aalto University School of Science and Technology, Finland)
This paper introduces a generalized formulation of linear prediction (LP), including both conventional and temporally weighted LP analysis methods as special cases. The temporally weighted methods have recently been successfully applied to noise robust spectrum analysis in speech and speaker recognition applications. In comparison to those earlier methods, the new generalized approach allows more versatility in weighting different parts of the data in the LP analysis. Two such weighted methods are evaluated and compared to the conventional spectrum modeling methods FFT and LP, as well as the temporally weighted methods WLP and SWLP, by substituting each of them in turn as the spectrum estimation method of the MFCC feature extraction stage of a GMM-UBM based speaker verification system. The new methods are shown to lead to performance improvement in several cases involving channel distortion and additive noise mismatch between the training and recognition conditions.
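The temporally weighted LP idea that XLP generalises can be sketched by solving weighted normal equations; this minimal version weights whole frames (samples) only, whereas the paper's extended formulation additionally lets the weight depend on the lag. Names and the solver are illustrative:

```python
def weighted_lp(x, p, w=None):
    """Temporally weighted linear prediction (sketch).

    Minimises sum_n w[n] * (x[n] - sum_k a[k] * x[n-k-1])**2 over
    n = p .. len(x)-1 and returns the coefficients a[0..p-1].  With
    w = None (all ones) this reduces to unweighted covariance-method
    LP; XLP generalises further by making the weight lag-dependent.
    """
    if w is None:
        w = [1.0] * len(x)
    # weighted normal equations R a = r
    R = [[sum(w[n] * x[n - i - 1] * x[n - j - 1] for n in range(p, len(x)))
          for j in range(p)] for i in range(p)]
    r = [sum(w[n] * x[n] * x[n - i - 1] for n in range(p, len(x)))
         for i in range(p)]
    # Gaussian elimination (no pivoting; adequate for this illustration)
    for col in range(p):
        for row in range(col + 1, p):
            f = R[row][col] / R[col][col]
            for k in range(col, p):
                R[row][k] -= f * R[col][k]
            r[row] -= f * r[col]
    a = [0.0] * p
    for i in reversed(range(p)):
        a[i] = (r[i] - sum(R[i][k] * a[k] for k in range(i + 1, p))) / R[i][i]
    return a
```

In the verification system, the resulting all-pole spectrum simply replaces the FFT or conventional LP spectrum in the MFCC front end.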
#12The Use of Subvector Quantization and Discrete Densities for Fast GMM Computation for Speaker Verification
Guoli Ye (The Hong Kong University of Science and Technology)
Brian Mak (The Hong Kong University of Science and Technology)
Last year, we showed that the computation of a GMM-UBM-based speaker verification (SV) system may be sped up by 30 times by using a high-density discrete model (HDDM) on the NIST 2002 evaluation task. The speedup was obtained using a special case of the product-code vector quantization in which each dimension is scalar-quantized in the construction of the discrete model. However, the speedup resulted in a drop of an absolute 1.5% in equal-error rate (EER). In this paper, our previous work is generalized to the use of subvector quantization (SVQ) in the construction of HDDM. For the same NIST 2002 SV task, the use of SVQ leads to an overall speedup by a factor of 8-25 with no significant loss in EER performance.
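The subvector quantization step can be sketched as follows (an illustrative toy, not the authors' code): the feature vector is split into subvectors, each is replaced by its nearest codeword index, and at scoring time the Gaussian log-densities are then read from tables precomputed per (mixture, subvector, codeword), which is where the speedup comes from.

```python
def svq_encode(vec, subvec_dims, codebooks):
    """Subvector quantisation of an acoustic feature vector (sketch).

    `subvec_dims[i]` gives the dimensionality of subvector i and
    `codebooks[i]` its list of codeword tuples; the function returns
    the index of the nearest codeword (squared Euclidean distance)
    for each subvector.
    """
    out, start = [], 0
    for dims, cb in zip(subvec_dims, codebooks):
        sub = vec[start:start + dims]
        best = min(range(len(cb)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(sub, cb[i])))
        out.append(best)
        start += dims
    return out
```

Scalar quantization of each dimension, as in the earlier HDDM work, is the special case where every subvector has dimension one.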

Systems for LVCSR and rich transcription

Time:Tuesday 16:00 Place:International Conference Room B Type:Poster
Chair:Lee Akinobu
#0Parallel Lexical-tree Based LVCSR on Multi-core Processors
Naveen Parihar (Dept. of Electrical and Computer Engineering, Mississippi State University, USA)
Ralf Schlueter (Human Lang. and Pattern Recognition, Comp. Sc. Dept., RWTH Aachen University, Germany)
David Rybach (Human Lang. and Pattern Recognition, Comp. Sc. Dept., RWTH Aachen University, Germany)
Eric Hansen (Dept. of Computer Science and Engineering, Mississippi State University, US)
Exploiting the computational power of multi-core processors for large vocabulary continuous speech recognition (LVCSR) requires changes in the recognizer architecture. In this paper, we consider how to parallelize the search component of a lexical-tree based speech recognizer. We introduce a hybrid-parallel method for dynamically dividing the lexical-tree copies among the cores at each frame. Each core is responsible for graph traversal in the lexical-tree copies allocated to it. This approach is compared to a previously-introduced static method that divides the lexical tree itself, so that each core is responsible for a different subtree of each of the lexical-tree copies. The new method outperforms the previous one when applied to the RWTH TC-STAR EPPS English LVCSR system running on four cores of an Intel Core-i7 processor with varying pruning-beam width settings.
#1Exploring Recognition Network Representations for Efficient Speech Inference on Highly Parallel Platforms
Jike Chong (University of California, Berkeley; Parasians, LLC)
Ekaterina Gonina (University of California, Berkeley)
Kisun You (Seoul National University)
Kurt Keutzer (University of California, Berkeley)
The emergence of highly parallel computing platforms is enabling new trade-offs in algorithm design for automatic speech recognition. It naturally motivates the following investigation: Do the most computationally efficient sequential algorithms lead to the most computationally efficient parallel algorithms? In this paper we explore two contending recognition network representations for speech inference engines: the linear lexical model (LLM) and the weighted finite state transducer (WFST). We demonstrate that while an inference engine using the simpler LLM representation evaluates 22x more transitions per second than the advanced WFST representation, the simple structure of the LLM representation allows 4.7-6.4x faster evaluation and 53-65x faster operand-gathering for each state transition. We use the 5k Wall Street Journal Corpus to experiment on the NVIDIA GTX480 (Fermi) and the NVIDIA GTX285 Graphics Processing Units (GPUs), and illustrate that the performance of a speech inference engine based on the LLM representation is competitive with the WFST representation on highly parallel implementation platforms.
Diamantino Caseiro (AT&T Labs Research)
The large size of Weighted Finite-State Transducers (WFSTs) used in Automatic Speech Recognition (ASR), such as language models or integrated networks, is an important problem for many ASR applications. To address this problem, we present a general-purpose compression technique for WFSTs that is specially designed for the finite-state machines most commonly used in ASR. Experiments run on two large tasks show the method to be very effective, typically reducing memory and disk requirements to less than 35%. By combining it with "on-the-fly" composition, the memory requirements are further reduced to below 14%. These reductions show no negative impact on recognition speed.
#3Speech Recognizer Optimization under Speed Constraints
Ivan Bulyko (Raytheon BBN Technologies)
We present an efficient algorithm for optimizing parameters of a speech recognizer aimed at obtaining maximum accuracy at a specified decoding speed. This algorithm is not tied to any particular decoding architecture or type of tunable parameter being used. It can also be applied to any performance metric (e.g. WER, keyword search or topic ID accuracy) and thus allows tuning to the target application. We demonstrate the effectiveness of this approach by tuning BBN’s Byblos recognizer to run at 15 times faster than real time while maximizing keyword search accuracy.
#6The 2010 CMU GALE Speech-to-Text System
Florian Metze (Carnegie Mellon University)
Roger Hsiao (Carnegie Mellon University)
Qin Jin (Carnegie Mellon University)
Udhyakumar Nallasamy (Carnegie Mellon University)
Tanja Schultz (Karlsruhe Institute of Technology)
This paper describes the latest Speech-to-Text system developed for the Global Autonomous Language Exploitation ("GALE") domain by Carnegie Mellon University (CMU). This system uses discriminative training, bottle-neck features and other techniques that were not used in previous versions of our system, and is trained on 1150 hours of data from a variety of Arabic speech sources. In this paper, we show how different lexica, pre-processing, and system combination techniques can be used to improve the final output, and provide analysis of the improvements achieved by the individual techniques.
#7Speaker Diarization in Meeting Audio for Single Distant Microphone
Tin Lay Nwe (Institute for Infocomm Research)
Hanwu Sun (Institute for Infocomm Research)
Bin Ma (Institute for Infocomm Research)
Haizhou Li (Institute for Infocomm Research)
This paper presents a speaker diarization system evaluated on the NIST Rich Transcription 2009 (RT-09) Meeting Recognition evaluation data set for the Single Distant Microphone (SDM) task. A two-step speaker clustering method is proposed. The first step is speaker cluster initialization using speech segments of the meeting audio, where we randomly pick a small subset of speech segments and merge them iteratively into a number of clusters. The second step is cluster purification, where we introduce a consensus-based speaker segment selection method for efficient speaker cluster modeling that purifies the clusters. The system achieves a promising diarization error rate (DER) of 16.4%.
#8Extending the punctuation module for European Portuguese
Fernando Batista (INESC-ID/ISCTE)
Helena Moniz (INESC-ID/FLUL)
Isabel Trancoso (INESC-ID/IST)
Hugo Meinedo (INESC-ID)
Ana Isabel Mata (FLUL)
Nuno Mamede (INESC-ID/IST)
This paper describes our recent work on extending the punctuation module of automatic subtitles for Portuguese Broadcast News. The main improvement was achieved by the use of prosodic information. This enabled the extension of the previous module which covered only full stops and commas, to cover question marks as well. The approach uses lexical, acoustic and prosodic information. Our results show that the latter is relevant for all types of punctuation. An analysis of the results also shows what type of interrogative is better dealt with by our method, taking into account the specificities of Portuguese. This may lead to different results for different types of corpora, depending on the types of interrogatives that are more frequent.
#9Utilizing a Noisy-Channel Approach for Korean LVCSR
Sakriani Sakti (NICT Spoken Language Communication Research Group)
Ryosuke Isotani (NICT Spoken Language Communication Research Group)
Hisashi Kawai (NICT Spoken Language Communication Research Group)
Satoshi Nakamura (NICT Spoken Language Communication Research Group)
Korean is an agglutinative and highly inflective language with severe phonological phenomena and coarticulation effects, making the development of a large-vocabulary continuous speech recognition (LVCSR) system difficult. Choosing a Korean orthographic word-phrase (eojeol) as the basic recognition unit leads to high out-of-vocabulary (OOV) rates, whereas choosing an orthographic syllable (eumjeol) unit results in high acoustic confusability. To overcome these difficulties, we propose to construct the speech recognition task as a serial architecture composed of two independent parts. The first part performs a standard hidden Markov model (HMM)-based recognition of phonemic syllable units of the actual pronunciation (surface forms). In this way, one phonemic syllable corresponds to only one possible pronunciation; thus, the lexicon and OOV rates can be kept small while avoiding high acoustic confusability. At this stage, the Korean orthography of the written transcription is not yet considered. In the second part, the system transforms the phonemic syllable surface forms into the desired orthographic recognition unit, e.g., eumjeol or eojeol. To solve this task, a noisy-channel model is utilized, wherein the sequence of phonemic syllables is treated as a “noisy” string, and the goal is to recover the “clean” string of Korean orthography. The entire process requires no linguistic knowledge, only annotated texts. The experiments were conducted on a Korean dictation database, where the best system achieved 91.21% eumjeol accuracy and 71.30% eojeol accuracy.
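The noisy-channel recovery step can be sketched as a Viterbi search over orthographic candidates, maximising P(clean) · P(noisy | clean). All tables below are toy stand-ins for the channel and language models that would be estimated from annotated text; the symbol names are illustrative only:

```python
from math import log

def decode(surface, channel, lm, start="<s>"):
    """Noisy-channel decoding of orthography from phonemic syllables (sketch).

    `channel[s]` lists pairs (orthographic candidate c, P(s|c));
    `lm[(prev, c)]` is a bigram probability P(c|prev).  Viterbi search
    maximises P(c_1..c_N) * P(s_1..s_N | c_1..c_N) over candidate
    orthographic sequences.
    """
    beams = {start: (0.0, [])}        # candidate -> (log-prob, best path)
    for s in surface:
        new = {}
        for c, p_ch in channel[s]:
            best = None
            for prev, (lp, path) in beams.items():
                p_lm = lm.get((prev, c), 1e-6)   # floor for unseen bigrams
                score = lp + log(p_lm) + log(p_ch)
                if best is None or score > best[0]:
                    best = (score, path + [c])
            new[c] = best
        beams = new
    return max(beams.values())[1]
```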
#10The RWTH 2009 Quaero ASR Evaluation System for English and German
Markus Nußbaum-Thom (RWTH Aachen University)
Simon Wiesler (RWTH Aachen University)
Martin Sundermeyer (RWTH Aachen University)
Christian Plahl (RWTH Aachen University)
Stefan Hahn (RWTH Aachen University)
Ralf Schlüter (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)
In this work, the RWTH automatic speech recognition systems for English and German for the second Quaero evaluation campaign in 2009 are presented. The systems are designed to transcribe web data, European Parliament plenary sessions and broadcast news data. A further challenge in the 2009 evaluation is that almost no in-domain training data is provided and the test data contains a large variety of speech types. RWTH participated for the English and German languages, obtaining the best results for German and competitive results for English. Contributing to the enhancements are the systematic use of hierarchical neural network based posterior features, system combination, speaker adaptation, cross-speaker adaptation, domain-dependent modeling and the use of additional training data.


Time:Tuesday 16:00 Place:International Conference Room C Type:Poster
Chair:Takayuki Arai
#1When is Indexical Information about Speech Activated? Evidence from a Cross-Modal Priming Experiment
Benjamin Munson (University of Minnesota)
Renata Solum (University of Minnesota)
Listeners were asked to judge talkers' sex from audio samples. Pictures of men, women, or a neutral visual stimulus were presented concurrent with, 150 ms before, or 150 ms after the spoken stimulus. Listeners' identification of sex for men's voices was most strongly affected by the visual stimulus when it was presented 150 ms after the stimulus. Voice-picture mismatches affected recognition of women's voices earlier than recognition of men's. Thus, while indexical information might most typically be activated late in processing, some socioindexical categories like sex can be activated early and remain active throughout processing.
Benjamin Munson (University of Minnesota)
Numerous studies have documented distinctive patterns of phonetic variation associated with actual and perceived sexual orientation. This investigation tested the hypothesis that these are the consequence of variation in speech-motor fluency. Gay, lesbian/bisexual (GLB), and heterosexual men and women participated in a diadochokinetic rate task. No consistent differences between GLB and heterosexual people in DDK rate were found, and DDK rate did not correlate directly with independently made listener judgments of sex typicality in speech. Results do not support the hypothesis that GLB speech styles are the consequence of motor control differences between GLB and heterosexual people.
#3Laryngealization and features for Chinese tonal recognition
Kristine Yu (University of California, Los Angeles)
It is well known that the lowest tone in Mandarin, a language without contrastive phonation, often co-occurs with laryngealization/creaky voice quality, and we provide evidence that this is also the case for the lowest tone in Cantonese. However, the effects of laryngealization on f0 feature extraction for tonal recognition, as well as the potential of laryngealization as a feature for improving tonal recognition, have not been well discussed in the literature. We give evidence from corpora of tonal production data for Cantonese and Mandarin that laryngealization is prevalent and significantly disturbs the extraction of f0 features, and suggest that laryngealization may in fact be a feature that could improve tonal recognition.
#4Production and perception of Vietnamese short vowels in V1V2 context
Viet Son Nguyen (MICA Center, HUT - CNRS/UMI2954 INP Grenoble, Hanoi university of Technology, 1 Dai Co Viet street - Hanoi, Vietnam)
Eric Castelli (MICA Center, HUT - CNRS/UMI2954 INP Grenoble, Hanoi university of Technology, 1 Dai Co Viet street - Hanoi, Vietnam)
René Carré (Laboratoire Dynamique du Langage, UMR 5596, CNRS, Université Lyon 2, 14 Avenue Marcelin Berthelot, 69363 Lyon cedex 07, France)
The paper analyses Vietnamese vowel-semivowel productions, including classical and shorter vowels, in terms of the vowel duration in relation to the final semivowel duration, the formant transition duration, and the formant transition slopes. The results show that in the vowel-semivowel context there is a relationship between the vowel duration and the final semivowel duration. Moreover, in the same final-semivowel context, classical and shorter vowels can be differentiated by at least one of the formant transition slopes. This allows estimating the role of the final part in its articulation with shorter vowels in Vietnamese.
#5Measuring Basic Tempo across Languages and some Implications for Speech Rhythm
Gertraud Fenk-Oczlon (Department of Linguistics and Computational Linguistics, University of Klagenfurt, Austria)
August Fenk (Department of Media and Communication Studies, University of Klagenfurt, Austria)
Basic language-inherent tempo cannot be isolated by the current metrics of speech rhythm. Here we propose the number of syllables per intonation unit as an appropriate measure, also for large-scale comparisons between languages. Applying it to an extended sample that has meanwhile grown to 51 languages has not only corroborated our previously reported negative cross-linguistic correlation of this metric with syllable complexity, but has moreover revealed significant correlations with several, in part directly time-dependent, rhythm measures proposed by other authors. We discuss relations between intrinsic tempo and (a) other facets of rhythm and (b) rhythm classifications of languages.
#6Durational structure of Japanese single/geminate stops in three- and four-mora words spoken at varied rates
Yukari Hirata (Colgate University)
Shigeaki Amano (Aichi Shukutoku University)
To distinguish Japanese single and geminate stops in two- and three-mora words spoken at varied speaking rates, the ratio of stop closure duration to word duration in native speakers’ production was previously found to be a reliable measure. It was not clear, however, whether the stop closure relates more stably (1) to the entire “word” of any length than just (2) to the moras preceding and following the contrasting stop. This study examined this question with three- and four-mora nonsense words in Japanese. Results indicate that the stop closure duration relative to either of the units (1) and (2) was equally useful in accurately classifying single and geminate stops. This implies that the anchor to which the contrasting stop duration is normalized across rates does not have to be the entire “word”, although the word is also a stable anchor.
#7Distribution and Trichotomic Realization of Voiced Velars in Japanese – An Experimental Study
Shin-ichiro Sano (Division of Arts and Sciences, International Christian University, Japan)
Tomohiko Ooigawa (Phonetics Laboratory, Graduate School of Foreign Studies, Sophia University, Japan)
In this paper, we demonstrate the trichotomic realization of voiced velars in Japanese, challenging the traditional plosive/nasal dichotomy of velar allophones, and examine the distribution of these allophones taking phonetic/phonological factors into account. We conducted a quantitative analysis based on speech production experiments. The results show that voiced velars are more likely to be realized as plosives in word-initial position, as nasals in post-nasal position, and as fricatives in sequential contexts; that velars in word-initial position can be realized as fricatives; that the decline of velar nasalization has been accelerating; and that following vowels and dialectal differences can affect the distribution.
Jagoda Sieczkowska (Universität Stuttgart)
Bernd Möbius (Universität Bonn, Universität Stuttgart)
Grzegorz Dogil (Universität Stuttgart)
This study investigates voicing properties of Polish, French, American English and German sonorant consonants, particularly rhotics. The analysis was conducted on four speech databases recorded by professional speakers. The term voicing profiles used in this article refers to the frame by frame voicing status of the sonorants, which was obtained by automatic measurements of fundamental frequency values and extraction of consonantal features. Results show resyllabification processes in Polish and French obstruent liquid clusters in word final positions, as well as contextual effects on devoicing in word initial and word medial American English and German obstruent sonorant clusters.
#9Phonetic imitation of Japanese vowel devoicing
Kuniko Nielsen (Linguistics Department, Oakland University)
Recent studies have shown that talkers implicitly imitate/accommodate the phonetic properties of recently heard speech (e.g., Goldinger, 1998; Pardo, 2006). However, it has also been shown that this phonetic imitation effect is not an automatic process: in Nielsen (2008), artificially lengthened VOT on /p/ was imitated in a non-shadowing task, while shortened VOT (which could jeopardize the phonemic contrast) was not. The current study explores the extent to which phonological factors unrelated to contrast preservation also affect imitation of phonetic detail, specifically Japanese vowel devoicing. The results revealed significant imitation of Japanese devoicing, indicating that even in phonologically constrained environments, perceived fine phonetic details are imitable and can subsequently affect speech production.
#10Post-aspiration in standard Italian: some first cross-regional acoustic evidence
Mary Stevens (Institut für Phonetik und Sprachverarbeitung, Ludwig-Maximilians-Universität, Munich, Germany)
John Hajek (School of Languages & Linguistics, University of Melbourne, Australia)
Voiceless geminate stops in Italian are typically described as unaspirated in all positions (e.g. [1, 2]). However, recent acoustic phonetic analysis of part of a corpus of standard Italian speech data has shown that the geminate voiceless stops /pp tt kk/ are frequently realized with both preaspiration i.e. [hC] (cf. [3]) and post-aspiration. This paper focuses on the latter phenomenon, presenting acoustic phonetic evidence in the form of VOT duration values for /pp tt kk/ tokens recorded in 15 Italian cities (based on the CLIPS corpus of spoken Italian [4, 5]). The co-occurrence of post-aspiration with preaspiration is considered and results are discussed with a focus on regional patterns.
#11Articulatory Grounding of Southern Salentino Harmony Processes
Mirko Grimaldi (Centro di Ricerca Interdisciplinare sul Linguaggio (CRIL), University of Salento, Lecce (Italy))
Andrea Calabrese (Department of Linguistics—University of Connecticut (USA))
Francesco Sigona (Centro di Ricerca Interdisciplinare sul Linguaggio (CRIL), University of Salento, Lecce (Italy))
Luigina Garrapa (Centro di Ricerca Interdisciplinare sul Linguaggio (CRIL), University of Salento, Lecce (Italy) & Department of Linguistics—University of Padova (Italy))
Bianca Sisinni (Centro di Ricerca Interdisciplinare sul Linguaggio (CRIL), University of Salento, Lecce (Italy))
Southern Salentino has a harmony process, where the stressed mid vowels /E, O/ are raised to the mid-high vowels /e, o/ when followed by -i or -u. We studied this process by combining acoustic analyses with ultrasound tongue imaging. The main result of our study is that the Southern Salentino harmonic adjustments in height, which are acoustically manifested in the differentiation of F1, are articulatorily correlated with tongue root advancement when the process is triggered by -i and with tongue body raising when the process is triggered by -u. We propose a phonological analysis of the process based on these findings.
#12Effects of accent typicality and phonotactic frequency on nonword immediate serial recall performance in Japanese
Yuuki Tanida (Graduate School of Education, Kyoto University, Japan)
Taiji Ueno (School of Psychological Sciences, University of Manchester, UK)
Satoru Saito (Graduate School of Education, Kyoto University, Japan)
Matthew Lambon Ralph (School of Psychological Sciences, University of Manchester, UK)
In a nonword serial recall experiment we found the following results: (1) nonwords of high phonotactic frequency were recalled better than those of low frequency in terms of phoneme accuracy; (2) but this phonotactic frequency effect was not observed in accent accuracy; (3) accent typicality did not have the expected effect on phoneme recall accuracy; (4) but it did have an effect on accent accuracy. These results suggest that long-term knowledge about both phoneme sequences and accent patterns strongly influences verbal short-term memory performance, but that each influence may be limited to its own domain.
#13How abstract is phonetics?
Osamu Fujimura (The Ohio State University, Department of Speech and Hearing Science)
Assuming a generative principle of description for speech utterances, in particular a syllable-based phonological representation and the C/D model of phonetic implementation of utterances, the basic question is discussed: how abstract should phonetic representations, as an optimal symbolic description of speech utterances, be? Several examples from a variety of the world's languages are discussed, including Kaingang (a Jê language spoken by some indigenous Brazilians), which requires a phonological specification of oral syllables, as opposed to the default nasal syllables. The advantage of using syllabic features and their implementations for concise phonetic description of particular languages is advocated. Index Terms: syllable features, C/D model

Speech Production II: Vocal Tract Modeling and Imaging

Time:Tuesday 16:00 Place:International Conference Room D Type:Poster
Chair:Shinobu Masaki
#1Data-Driven Analysis of Realtime Vocal Tract MRI using Correlated Image Regions
Adam Lammert (Department of Computer Science, University of Southern California, USA)
Michael Proctor (Department of Linguistics, University of Southern California, USA)
Shrikanth Narayanan (Department of Electrical Engineering, University of Southern California, USA)
Realtime MRI provides useful data about the human vocal tract, but also introduces many of the challenges of processing high-dimensional image data. Intuitively, data reduction would proceed by finding the air-tissue boundaries in the images and tracing an outline of the vocal tract. This approach is anatomically well-founded. We explore an alternative, data-driven approach with a complementary set of advantages. Our method directly examines pixel intensities. By analyzing how the pixels co-vary over time, we segment the image into spatially localized regions in which the pixels are highly correlated with each other. Intensity variations in these correlated regions correspond to vocal tract constrictions, which are meaningful units of speech production. We show how these regions can be extracted entirely automatically, or with manual guidance. We present two examples and discuss the method's merits, including the opportunity to do direct data-driven time-series modeling.
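The pixel-correlation idea can be sketched in a few lines. The toy below uses synthetic signals and a greedy thresholded grouping chosen purely for illustration (not the paper's actual extraction procedure): pixels whose intensity time series co-vary strongly fall into the same region.

```python
import numpy as np

# Toy sketch: group "pixels" by temporal correlation of their intensities.
# Two hidden driving signals stand in for two vocal tract constrictions.
rng = np.random.default_rng(3)
T = 300
s1 = rng.normal(size=T)          # driving signal of "constriction" 1
s2 = rng.normal(size=T)          # driving signal of "constriction" 2
# Six pixels: the first three follow s1, the last three follow s2 (plus noise).
pixels = np.vstack([s1, s1, s1, s2, s2, s2]) + 0.1 * rng.normal(size=(6, T))

C = np.corrcoef(pixels)          # pixel-by-pixel correlation matrix

# Greedy grouping: pixels correlated above a threshold share one region label.
threshold = 0.8
labels = -np.ones(6, dtype=int)
next_label = 0
for i in range(6):
    if labels[i] == -1:
        labels[i] = next_label
        labels[C[i] > threshold] = next_label
        next_label += 1
```

The correlation matrix alone separates the two synthetic "constrictions"; the paper's contribution is doing this robustly on real image data, automatically or with manual guidance.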
#2Rapid Semi-automatic Segmentation of Real-time Magnetic Resonance Images for Parametric Vocal Tract Analysis
Michael Ian Proctor (University of Southern California)
Danny Bone (University of Southern California)
Nassos Katsamanis (University of Southern California)
Shrikanth Narayanan (University of Southern California)
A method of rapid semi-automatic segmentation of real-time magnetic resonance image (rMRI) data for parametric analysis of vocal tract shaping is described. Tissue boundaries are identified by finding pixel intensity thresholds along tract-normal gridlines. Airway contours are constrained with respect to a tract centerline defined as an optimal path over the graph of all intensity minima identified between the glottis and lips. The method allows for superimposition of reference boundaries to guide automatic segmentation of anatomical features which are poorly imaged using magnetic resonance -- dentition and the hard palate -- resulting in more accurate sagittal sections than those produced by fully automatic segmentation. We demonstrate the utility of the technique in the dynamic analysis of tongue shaping in Tamil liquid consonants.
#3Improved Real-time MRI of Oral-Velar Coordination Using a Golden-ratio Spiral View Order
Yoon-Chul Kim (University of Southern California)
Shrikanth S. Narayanan (University of Southern California)
Krishna S. Nayak (University of Southern California)
In speech research using real-time magnetic resonance imaging (RT-MRI), frame reconstruction is typically performed with a constant temporal resolution. However, a flexible selection of temporal resolution is desirable because of natural variations in speaking rate and variations in the speed of different articulators. In this work, a novel spiral golden-ratio temporal view order was applied to nasal RT-MRI studies imaging a mid-sagittal slice of the upper airway. Compared to the conventional spiral bit-reversed temporal view order scheme, the proposed golden-ratio scheme provides less temporal blurring in the depiction of rapid tongue tip motion when a high temporal resolution is selected. It also provides a higher signal-to-noise ratio (SNR) in the depiction of relatively slow velar motion when a low temporal resolution is selected.
#4Statistical multi-stream modeling of real-time MRI articulatory speech data
Erik Bresch (University of Southern California)
Athanasios Katsamanis (University of Southern California)
Louis Goldstein (University of Southern California)
Shrikanth Narayanan (University of Southern California)
This paper investigates different statistical modeling frameworks for articulatory speech data obtained using real-time (RT) magnetic resonance imaging (MRI). To quantitatively capture the spatio-temporal shaping process of the human vocal tract during speech production, a multi-dimensional stream of image features is derived from the MRI recordings. The features are closely related, though not identical, to the tract variables commonly defined in articulatory phonology theory. The modeling of the shaping process aims at decomposing the articulatory data streams into primitives by segmentation, and the segmentation task is carried out using vector quantizers, Gaussian Mixture Models, Hidden Markov Models, and a coupled Hidden Markov Model. We evaluate the performance of the different segmentation schemes qualitatively with the help of a well-understood data set which was used in an earlier study of inter-articulatory timing phenomena of American English nasal sounds.
#5Predicting Unseen Articulations from Multi-speaker Articulatory Models
Ananthakrishnan G (Centre for Speech Technology, KTH, Stockholm, Sweden)
Pierre Badin (GIPSA-Lab (Departement Parole & Cognition / ICP), UMR 5216, CNRS - Grenoble University, France)
Julián Andrés Valdés Vargas (GIPSA-Lab (Departement Parole & Cognition / ICP), UMR 5216, CNRS - Grenoble University, France)
Olov Engwall (Centre for Speech Technology, KTH, Stockholm, Sweden)
In order to study inter-speaker variability, this work assesses the generalization capabilities of data-based multi-speaker articulatory models. We use various three-mode factor analysis techniques to model the variations of midsagittal vocal tract contours obtained from MRI images for three French speakers articulating 73 vowels and consonants. Articulations of a given speaker for phonemes not present in the training set are then predicted by inversion of the models from measurements of these phonemes articulated by the other subjects. On average, the prediction RMSE was 5.25 mm for tongue contours and 3.3 mm for 2D midsagittal vocal tract distances. In addition, this study establishes a methodology for determining the optimal number of factors for such models.
#6Estimating missing data sequences in X-ray microbeam recordings
Chao Qin (University of California, Merced)
Miguel Carreira-Perpiñán (University of California, Merced)
Techniques for recording the vocal tract shape during speech such as X-ray microbeam or EMA track the spatial location of pellets attached to several articulators. Limitations of the recording technology result in most utterances having sequences of frames where one or more pellets are missing. Rather than discarding such sequences, we seek to reconstruct them. We use an algorithm for recovering missing data based on learning a density model of the vocal tract shapes, and predicting missing articulator values using conditional distributions derived from this density. Our results with the Wisconsin X-ray microbeam database show we can recover long, heavily oscillatory trajectories with errors of 1 to 1.5 mm for all articulators.
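The recovery step described above rests on a standard property of Gaussian densities: conditioning on the observed dimensions yields a closed-form prediction for the missing ones. A single-Gaussian toy (the paper's density model is richer) illustrates the imputation:

```python
import numpy as np

# Sketch: recover a missing articulator channel from the observed channels
# using the conditional mean of a Gaussian fitted to the data. Synthetic,
# strongly correlated "pellet" channels stand in for real recordings.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z, 0.8 * z, -0.5 * z]) + 0.05 * rng.normal(size=(500, 3))

mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)

def impute(x_obs, obs_idx, mis_idx):
    """Conditional mean of the missing dims given the observed dims:
    mu_m + S_mo S_oo^{-1} (x_o - mu_o)."""
    S_mo = S[np.ix_(mis_idx, obs_idx)]
    S_oo = S[np.ix_(obs_idx, obs_idx)]
    return mu[mis_idx] + S_mo @ np.linalg.solve(S_oo, x_obs - mu[obs_idx])

frame = X[0]
estimate = impute(frame[[0, 1]], [0, 1], [2])  # pretend channel 2 is missing
```

Applied frame by frame (or, as in the paper, with temporal smoothing over whole missing sequences), this turns the fitted density into a predictor for the unrecorded pellets.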
#7Adaptation of a tongue shape model by local feature transformations
Chao Qin (EECS, School of Engineering, University of California, Merced, USA)
Miguel A. Carreira-Perpinan (EECS, School of Engineering, University of California, Merced, USA)
Mohsen Farhadloo (EECS, School of Engineering, University of California, Merced, USA)
Reconstructing the full contour of the tongue from the position of 3 to 4 landmarks on it is useful in articulatory speech work. This can be done with submillimetric accuracy using nonlinear predictive mappings trained on hundreds or thousands of contours extracted from ultrasound images. Collecting and segmenting this amount of data from a speaker is difficult, so a more practical solution is to adapt a well-trained model from a reference speaker to a new speaker using a small amount of data from the latter. Previous work proposed an adaptation model with only 6 parameters and demonstrated fast, accurate results using data from one speaker only. However, the estimates of this model are biased, and we show that, when adapting to a different speaker, its performance stagnates quickly with the amount of adaptation data. We then propose an unbiased adaptation approach, based on local transformations at each contour point, that achieves a significantly lower reconstruction error with a moderate amount of adaptation data.
#8Vocal tract contour analysis of emotional speech by the functional data curve representation
Sungbok Lee (University of Southern California)
Shrikanth Narayanan (University of Southern California)
Midsagittal vocal tract contours are analyzed using the functional data analysis (FDA) technique with which a vocal tract contour (VT) can be parameterized by a set of coefficients. Such a parametric representation of the dynamic vocal tract profiles provides a means for normalizing VT contours across speakers and offers interpretability of coefficient variability as the degree of contribution from specific vocal tract regions. It also enables us to examine the differences in VT behaviors as well as inter- and intra-speaker differences across different speech production styles including emotion expression. A set of FDA coefficients can be used as a feature vector of a given VT contour for further modeling. The efficacy of such feature vectors is tested using the Fisher linear discriminant analysis. A cross-validation accuracy of 65.0% was obtained in the task of discriminating four different emotions with combined data points from two speakers.
#9Locally-Weighted Regression for Estimating the Forward Kinematics of a Geometric Vocal Tract Model
Adam Lammert (Department of Computer Science, University of Southern California, California, USA)
Louis Goldstein (Department of Linguistics, University of Southern California, California, USA)
Khalil Iskarous (Haskins Laboratories, New Haven, Connecticut, USA)
Task-space control is well studied in modeling speech production. Implementing control of this kind requires an accurate kinematic forward model. Despite debate about how to define the tasks for speech (i.e., acoustical vs. articulatory), a faithful forward model will be complex and infeasible to express analytically. Thus, it is necessary to learn the forward model from data. Artificial Neural Networks (ANNs) have previously been suggested for this. We argue for the use of locally-linear methods, such as Locally-Weighted Regression (LWR). While ANNs are capable of learning complex forward maps, LWR is more appropriate. Common formulations of control assume locally-linearity, whereas ANNs fit a nonlinear model to the entire map. Likewise, training LWR is simple compared to the complex optimization for ANNs. We provide an empirical comparison of these methods for learning a vocal tract forward model, discussing theoretical and practical aspects of each.
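A minimal locally-weighted regression sketch follows, with a made-up 1-D "forward map" standing in for articulator-to-task kinematics (none of the paper's data or vocal tract geometry is used):

```python
import numpy as np

# Locally-weighted regression: fit a local linear model around each query
# point using Gaussian kernel weights. The sine function below is only a
# stand-in for a nonlinear forward map.
x_train = np.linspace(-2, 2, 200)
y_train = np.sin(x_train)

def lwr_predict(xq, tau=0.3):
    """Weighted least-squares fit of an affine model centered on query xq."""
    w = np.exp(-0.5 * ((x_train - xq) / tau) ** 2)   # kernel weights
    A = np.column_stack([np.ones_like(x_train), x_train])
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y_train)
    return beta[0] + beta[1] * xq

pred = lwr_predict(0.5)   # close to sin(0.5), via a local tangent model
```

The fitted slope `beta[1]` is exactly the local Jacobian that locally-linear task-space control formulations assume, which is the paper's argument for preferring LWR over a global nonlinear fit.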
#10Identifying articulatory goals from kinematic data using principal differential analysis
Michael Reimer (Department of Computer Science, University of Toronto)
Frank Rudzicz (Department of Computer Science, University of Toronto)
Articulatory goals can be highly indicative of lexical intentions, but are rarely used in speech classification tasks. In this paper we show that principal differential analysis can be used to learn the behaviours of articulatory motions associated with certain high-level articulatory goals. This method accurately learns the parameters of second-order differential systems applied to data derived by electromagnetic articulography. On average, this approach is between 4.4% and 21.3% more accurate than an HMM and a neural network baseline.
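The core fitting step of such an analysis (estimating the coefficients of a second-order differential system from a sampled trajectory) can be sketched with finite differences and least squares; the damped oscillator below is synthetic, not EMA data, and the paper's exact procedure differs.

```python
import numpy as np

# Synthetic second-order system x'' = -b*x' - c*x (a damped oscillator).
dt = 0.001
t = np.arange(0.0, 2.0, dt)
b_true, c_true = 2.0, 100.0

# Integrate with semi-implicit Euler to get a sampled trajectory.
x = np.empty_like(t)
v = np.empty_like(t)
x[0], v[0] = 1.0, 0.0
for i in range(len(t) - 1):
    a = -b_true * v[i] - c_true * x[i]
    v[i + 1] = v[i] + a * dt
    x[i + 1] = x[i] + v[i + 1] * dt

# Recover (b, c) by least squares on finite-difference derivatives:
# x'' ~= [-x', -x] @ [b, c].
xd = np.gradient(x, dt)
xdd = np.gradient(xd, dt)
A = np.column_stack([-xd, -x])
b_est, c_est = np.linalg.lstsq(A, xdd, rcond=None)[0]
```

The recovered stiffness and damping coefficients are the kind of low-dimensional description of articulatory motion that can then feed a classifier of articulatory goals.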
#11Estimation of Speech Lip Features From Discrete Cosine Transform
Zuheng Ming (Laboratoire Langage et Audition)
Denis Beautemps (GIPSA-lab)
Gang Feng (GIPSA-lab)
Sébastien Schmerber (Laboratoire Langage et Audition)
This study is a contribution to the field of visual speech processing. It focuses on the automatic extraction of speech lip features from natural lips. The method is based on the direct prediction of these features from predictors derived from an adequate transformation of the pixels of the lip region of interest. The transformation is a 2-D Discrete Cosine Transform combined with a Principal Component Analysis applied to a subset of the DCT coefficients corresponding to about 1% of the total DCTs. The results show the possibility of estimating the geometric lip features with good accuracy (a root mean square error of 1 to 1.4 mm for the lip aperture and the lip width) using a reduced set of predictors derived from the PCA.
#12Autoregressive Modelling for Linear Prediction of Ultrasonic Speech
Farzaneh Ahmadi (School of Computer Engineering, Nanyang Technological University, Singapore)
Ian V. McLoughlin (School of Computer Engineering, Nanyang Technological University, Singapore)
Hamid R. Sharifzadeh (School of Computer Engineering, Nanyang Technological University, Singapore)
Ultrasonic speech is a novel technology in which the human vocal tract (VT) is excited with an ultrasonic signal to provide a speech mode in the ultrasonic frequency range. This has several applications, including speech-aid prostheses for voice-loss patients, silent speech interfaces, a secure mode of communication in mobile phones, and speech therapy. Linear prediction has recently been proven applicable for feature extraction of ultrasonic propagation inside the VT. The authors have proposed that averaging the predictor coefficients obtained from multiple receiving points is a viable approach for autoregressive (AR) modelling of ultrasonic speech. In support of the previous theoretical work, this paper presents experimental results of implementing the averaging method, using finite element analysis of ultrasonic propagation inside the VT configuration for nine English vowels. A comparison of the results with the conventional least squares error (LSE) method used in room acoustics shows that averaging outperforms LSE in determining the location of poles in the AR modelling of ultrasonic speech, and demonstrates higher robustness to variations of the LPC order.

Special Session: Quality of Experiencing Speech Services

Time:Tuesday 16:00 Place:301 Type:Special
Chair:Sebastian Möller & Marcel Waeltermann
16:00The characterization of the relative information content by spectral features for the objective intelligibility assessment of nonlinearly processed speech
Anton Schlesinger (TU Delft)
Marinus M. Boone (TU Delft)
The objective intelligibility assessment of nonlinearly enhanced speech is a widely experienced problem. Nonlinear processors operate primarily on the low-level and transient components of speech. As these sections contain important acoustic cues as well as context-constitutive information, they dominate speech intelligibility. For that reason, short-time intelligibility measures at low-level and transient components are weighted with their contribution to the overall intelligibility. In this report, spectral features are calculated from auditory sub-bands and are utilized to label these sections of high information content. A genetic optimization is performed to adapt the spectral feature measures to psychoacoustical data. No improvement is found over existing methods of objective speech intelligibility assessment that use short-time intelligibility calculation and level-dependent weighting. These results nonetheless help pinpoint practicable solutions to the problem.
16:15Analytical Assessment and Distance Modeling of Speech Transmission Quality
Marcel Wältermann (Deutsche Telekom Laboratories, TU Berlin)
Alexander Raake (Deutsche Telekom Laboratories, TU Berlin)
Sebastian Möller (Deutsche Telekom Laboratories, TU Berlin)
The quality of transmitted speech is based on the auditory characteristics the degraded signal provokes. In past studies, it has been shown that the main features of speech transmission can be subsumed under the orthogonal perceptual dimensions "discontinuity", "noisiness", and "coloration". In order to gain more insight into the dimensional composition for arbitrary transmission conditions, an auditory method is described in this paper which allows for assessing these dimensions efficiently. The results can be used to model the total impairment, a measure of the reduction of integral quality which is compliant with the E-model, a parametric tool for speech quality prediction. The model derived in this paper is based on a distance function and yields a correlation of r=0.97 between subjective scores and model predictions for the Euclidean case.
16:30An Intrusive Super-Wideband Speech Quality Model: DIAL
Nicolas Côté (LISyC EA 3883, UBO/ENIB, Brest, France)
Vincent Koehl (LISyC EA 3883, UBO/ENIB, Brest, France)
Valérie Gautier-Turbin (France Télécom R&D, Lannion, France)
Alexander Raake (Deutsche Telekom Laboratories, TU Berlin, Germany)
Sebastian Möller (Deutsche Telekom Laboratories, TU Berlin, Germany)
The intrusive speech quality model standardized by the ITU-T shows some limits in its quality predictions, especially in a wideband transmission context. They are mainly caused by strong differences in perceived quality when speech is transmitted over different telephone networks. Instrumental methods should provide reliable estimations of the integral speech quality over the entire perceptual speech quality space. This paper presents a new model, called Diagnostic Instrumental Assessment of Listening quality (DIAL). It combines a core model, four dimension estimators and a cognitive model, providing integral quality estimations as well as diagnostic information in a super-wideband context.
16:45It Takes Two to Tango - Assessing The Impact of Delay on Conversational Interactivity on Perceived Speech Quality
Sebastian Egger (Telecommunications Research Center (FTW), Vienna, Austria)
Raimund Schatz (Telecommunications Research Center (FTW), Vienna, Austria)
Stefan Scherer (Institute of Neural Information Processing, University of Ulm, Germany)
This paper analyzes the relationship between transmission delay, conversational interactivity and perceived quality of bi-directional speech. Our work is grounded on the results of subjective speech quality tests conducted in our lab and recent studies in this field. The test experiments do not only quantify the impact of network delay on speech quality as perceived by untrained subjects. They also assess the mutual influences between conversational interactivity (CI) and delay using three different conversation scenarios. Our results show a clear positive correlation between the level of conversational interactivity and interlocutors' delay sensitivity. Another key finding is that even in contexts of high interactivity, one-way delay values up to 400 ms did not have any significant impact on untrained participants' perception of overall speech quality. Furthermore, we examine the surface structure of participants' conversations across a wide range of delay conditions (up to 1600 ms). Our analysis demonstrates how additional metrics such as unintended interruption rate (UIR) can be successfully used to determine the surface structure and delay sensitivity of a conversation.
17:00Comparison of Approaches for Instrumentally Predicting the Quality of Text-To-Speech Systems
Sebastian Möller (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)
Florian Hinterleitner (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)
Tiago H. Falk (Bloorview Research Institute, Toronto)
Tim Polzehl (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)
In this paper, we compare and combine different approaches for instrumentally predicting the perceived quality of Text-to-Speech systems. First, a log-likelihood is determined by comparing features extracted from the synthesized speech signal with features trained on natural speech. Second, parameters are extracted which capture quality-relevant degradations of the synthesized speech signal. Both approaches are combined and evaluated on three auditory test databases. The results show that auditory quality judgments can in many cases be predicted with sufficiently high accuracy and reliability, but that there are considerable differences, mainly between male and female speech samples.
17:15A Hybrid Architecture for Mobile Voice User Interfaces
Imre Kiss (Nokia Research Center)
Joseph Polifroni (Nokia Research Center)
Chao Wang (Vlingo Corporation)
Ghinwa Choueiter (Vlingo Corporation)
Mike Phillips (Vlingo Corporation)
This paper describes our initial experiments with a hybrid voice recognition architecture for mobile voice user interfaces. Our system consists of a large vocabulary continuous speech recognizer on the network side, and a compact embedded recognizer on the handset. The two components are seamlessly integrated to provide a uniform user experience at all times. The hybrid system is able to handle unconstrained voice input from the user at any time, while it also features significantly improved response times and availability of service when compared to a network-only configuration. We have tested the hybrid architecture in an experimental setup with real user data in six different languages. Our results show that depending on the availability of prior usage information from the users, 28-55% of voice queries can be handled locally, with virtually instantaneous recognition, at the cost of less than 5% relative increase in the overall word error-rate of the system.
17:30Assessment of Spoken and Multimodal Applications: Lessons Learned from Laboratory and Field Studies
Markku Turunen (University of Tampere)
Jaakko Hakulinen (University of Tampere)
Tomi Heimonen (University of Tampere)
In this paper, we present the key lessons learnt from numerous evaluations conducted to measure the quality of spoken and multimodal applications. The issues we address include the relation of laboratory and field studies, long-term and pilot evaluations, unimodality and multimodality, objective and subjective metrics, and user expectations and experiences. We present concrete case studies to discuss the above issues. For example, there are major differences in evaluating speech-only and multimodal systems. Similarly, there are major differences between laboratory and field studies, which need to be considered in successful evaluations.
17:45Improving Cross Database Prediction of Dialogue Quality Using Mixture of Experts
Klaus-Peter Engelbrecht (Deutsche Telekom Laboratories, QU-Lab, TU Berlin)
Hamed Ketabdar (Deutsche Telekom Laboratories, QU-Lab, TU Berlin)
Sebastian Möller (Deutsche Telekom Laboratories, QU-Lab, TU Berlin)
Models for the prediction of user judgments from interaction data can be used in different contexts, such as system quality assessment, monitoring of deployed systems, or as a reward function in learned dialog managers. Such models still show a considerable lack of generalizability [6]. This paper specifically addresses this issue. We propose to use a Mixture of Experts approach for cross-database predictions. In Mixture of Experts, several classifiers are trained on subsets of the data showing specific characteristics. Predictions of each expert model are combined for the overall prediction result. We show that such an approach can improve cross-database prediction accuracy.

Keynote 3: Chiu-yu Tseng - Beyond Sentence Prosody

Time:Wednesday 08:30 Place:Hall A/B Type:Keynote
Chair:Julia Hirschberg
08:30Beyond Sentence Prosody
Chiu-yu Tseng (Institute of Linguistics, Academia Sinica, Taipei, Taiwan)
The prosody of a sentence (utterance) in a discourse context differs substantially from its prosody when uttered in isolation. This paper addresses why the paragraph is a discourse unit and why discourse prosody is an intrinsic part of naturally occurring speech. Higher-level discourse information treats sentences, phrases and their lower-level units as sub-units, layering over them; it is realized in patterns of global prosody. A perception-based multi-phrase discourse prosody hierarchy and a parallel multi-phrase associative template are proposed to test discourse prosodic modulations. Results from quantitative modeling of speech data show that output discourse prosody can be derived through multiple layers of higher-level modulations. The seemingly random occurrence of lower-level prosodic units such as intonation variations is, in fact, systematic. In summary, abundant traces of global prosody can be recovered from the speech signal and accounted for; their patterns could help facilitate a better understanding of spoken language processing.

ASR: Acoustic Model Adaptation

Time:Wednesday 10:00 Place:Hall A/B Type:Oral
Chair:Koichi Shinoda
10:00Prior Information for Rapid Speaker Adaptation
Catherine Breslin (Toshiba Research Europe Ltd, Cambridge, UK)
Haitian Xu (Toshiba Research Europe Ltd, Cambridge, UK)
KK Chin (Toshiba Research Europe Ltd, Cambridge, UK)
Mark Gales (Toshiba Research Europe Ltd, Cambridge, UK)
Kate Knill (Toshiba Research Europe Ltd, Cambridge, UK)
Rapidly adapting a speech recognition system to new speakers using a small amount of adaptation data is important to improve initial user experience. In this paper, a count-smoothing framework for incorporating prior information is extended to allow for the use of different forms of dynamic prior and improve the robustness of transform estimation on small amounts of data. Prior information is obtained from existing rapid adaptation techniques like VTLN and PCMLLR. Results using VTLN as a dynamic prior for CMLLR estimation show that transforms estimated on just one utterance can yield relative gains of 15% and 46% over a baseline gender independent model on two tasks.
10:20Discriminative Adaptation for Log-linear Acoustic Models
Jonas Lööf (RWTH Aachen University)
Ralf Schlüter (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)
Log-linear models have recently been used in acoustic modeling for speech recognition systems. This has been motivated by competitive results compared to systems based on Gaussian models, and a more direct parametrisation of the posterior model. To competitively use log-linear models for speech recognition, important methods, such as speaker adaptation, have to be reformulated in a log-linear framework. In this work, an approach to log-linear affine feature transforms for speaker adaptation is described. Experiments for both supervised and unsupervised adaptation are presented, showing improvements over a maximum likelihood baseline in the form of feature space maximum likelihood linear regression for the case of supervised adaptation.
10:40Automatic Speech Recognition of Multiple Accented Speech Data
Dimitra Vergyri (SRI International)
Lori Lamel (LIMSI-CNRS)
Jean-Luc Gauvain (LIMSI-CNRS)
Accent variability is an important factor in speech that can significantly degrade automatic speech recognition performance. We investigate the effect of multiple accents on an English broadcast news recognition system. A multi-accented English corpus is used for the task, including broadcast news segments from 6 different geographic regions: US, Great Britain, Australia, North Africa, Middle East and India. There is significant performance degradation of a baseline system trained on only US data when confronted with shows from other regions. The results improve significantly when data from all the regions are included for accent-independent acoustic model training. Further improvements are achieved when MAP-adapted accent-dependent models are used in conjunction with a GMM accent classifier.
11:00Shrinkage Model Adaptation in Automatic Speech Recognition
Jinyu Li (Microsoft Corporation, USA)
Yu Tsao (National Institute of Information and Communications Technology, Japan)
Chin-Hui Lee (Georgia Institute of Technology, USA)
We propose a parameter shrinkage adaptation framework that estimates models from only a limited set of adaptation data to improve automatic speech recognition accuracy, by regularizing an objective function with a sum of parameter-wise power-q constraints. As a first attempt, we formulate ridge maximum likelihood linear regression (MLLR) and ridge constrained MLLR (CMLLR) with an element-wise square-sum constraint to regularize the objective functions of conventional MLLR and CMLLR, respectively. Tested on the 5k-WSJ0 task, the proposed ridge MLLR and ridge CMLLR algorithms give significant word error rate reductions over the errors obtained with standard MLLR and CMLLR in an utterance-by-utterance unsupervised adaptation scenario.
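The effect of a squared-sum (q = 2) penalty on transform estimation can be illustrated with ordinary ridge regression. This is an analogy constructed for illustration, not the paper's actual MLLR/CMLLR update:

```python
import numpy as np

# Estimating a linear transform W from very few observation pairs: with
# fewer samples than dimensions the normal equations are singular, while a
# ridge (element-wise squared-sum) penalty keeps them well-conditioned and
# shrinks W -- the same motivation as in data-scarce speaker adaptation.
rng = np.random.default_rng(2)
d = 10
W_true = rng.normal(size=(d, d))
X = rng.normal(size=(8, d))      # only 8 "adaptation" vectors for a 10-dim transform
Y = X @ W_true.T

lam = 0.1
# Ridge solution of min ||Y - X W^T||^2 + lam ||W||^2:
# W = Y^T X (X^T X + lam I)^{-1}
W_ridge = Y.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(d))
```

Without the `lam * np.eye(d)` term, `X.T @ X` has rank 8 < 10 and cannot be inverted; the regularizer is what makes estimation from one utterance feasible at all.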
11:20Unscented Transform with Online Distortion Estimation for HMM Adaptation
Jinyu Li (Microsoft Corporation, USA)
Dong Yu (Microsoft Corporation, USA)
Yifan Gong (Microsoft Corporation, USA)
Li Deng (Microsoft Corporation, USA)
In this paper, we propose to improve our previously developed method for joint compensation of additive and convolutive distortions (JAC) applied to model adaptation. The improvement entails replacing the vector Taylor series (VTS) approximation with unscented transform (UT) in formulating both the static and dynamic model parameter adaptation. Our new JAC-UT method differentiates itself from other UT-based approaches in that it combines the online noise and channel distortion estimation and model parameter adaptation in a unified UT framework. Experimental results on the standard Aurora 2 task show that the new algorithm enjoys 20.0% and 16.9% relative word error rate reductions over the previous JAC-VTS algorithm when using the simple and complex backend models, respectively.
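The unscented transform itself is easy to sketch in one dimension: propagate deterministically chosen sigma points through the nonlinearity and recombine them with fixed weights. The log nonlinearity below is only loosely analogous to the log-spectral distortion functions used in model adaptation; the paper's JAC-UT formulation is not reproduced.

```python
import numpy as np

def unscented_mean(mu, var, f, kappa=2.0):
    """Approximate E[f(x)] for scalar x ~ N(mu, var) with 3 sigma points."""
    n = 1
    spread = np.sqrt((n + kappa) * var)
    points = np.array([mu, mu + spread, mu - spread])
    w0 = kappa / (n + kappa)
    wi = 1.0 / (2 * (n + kappa))
    weights = np.array([w0, wi, wi])
    return np.sum(weights * f(points))

# E[log x] for x ~ N(1, 0.04): second-order analysis gives about -0.02,
# and the unscented estimate lands close to that without any derivatives.
mu_y = unscented_mean(1.0, 0.04, np.log)
```

Unlike a vector Taylor series expansion, no Jacobian of the distortion function is needed, which is the practical appeal of replacing VTS with UT.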
11:40HMM Adaptation Using Linear Spline Interpolation with Integrated Spline Parameter Training for Robust Speech Recognition
Michael Seltzer (Microsoft Research)
Alex Acero (Microsoft Research)
We recently proposed a method for HMM adaptation to noisy environments called Linear Spline Interpolation (LSI). LSI uses linear spline regression to model the relationship between clean and noisy speech features. In the original algorithm, stereo training data was used to learn the spline parameters that minimize the error between the predicted and actual noisy speech features. The estimated splines are then used at runtime to adapt the clean HMMs to the current environment. While good results can be obtained with this approach, the performance is limited by the fact that the splines are trained independently from the speech recognizer and as such, they may actually be suboptimal for adaptation. In this work, we introduce a new Generalized EM algorithm for estimating the spline parameters using the speech recognizer itself. Experiments on the Aurora 2 task show that using LSI adaptation with splines trained in this manner results in a 20% improvement over the original LSI algorithm that used splines estimated from stereo data and a 28% improvement over VTS adaptation.

SLP systems for information extraction/retrieval

Time:Wednesday 10:00 Place:201A Type:Oral
Chair:Chiori Hori
10:00CRF-based Stochastic Pronunciation Modeling for Out-of-Vocabulary Spoken Term Detection
Dong Wang (EURECOM, France)
Simon King (CSTR, University of Edinburgh, UK)
Nicholas Evans (EURECOM, France)
Raphael Troncy (EURECOM, France)
Out-of-vocabulary (OOV) terms present a significant challenge to spoken term detection (STD). This challenge lies, to a large extent, in the high degree of uncertainty in the pronunciations of OOV terms. In previous work, we presented a stochastic pronunciation modeling (SPM) approach to compensate for this uncertainty. A shortcoming of our original work, however, is that the SPM was based on a joint-multigram model (JMM), which is suboptimal. In this paper, we propose to use conditional random fields (CRFs) for letter-to-sound conversion, which significantly improves the quality of the predicted pronunciations. When applied to OOV STD, we achieve considerable performance improvements with both a 1-best system and an SPM-based system.
10:20Improved Spoken Term Detection by Feature Space Pseudo-Relevance Feedback
Chia-Ping Chen (Graduate Institute of Communication Engineering, National Taiwan University)
Hung-yi Lee (Graduate Institute of Communication Engineering, National Taiwan University)
Ching-feng Yeh (Graduate Institute of Communication Engineering, National Taiwan University)
Lin-shan Lee (Graduate Institute of Communication Engineering, National Taiwan University)
In this paper, we propose an improved approach for spoken term detection using pseudo-relevance feedback. To remedy the problem of acoustic models that are poorly matched to spoken utterances produced under different acoustic conditions, which may yield relatively poor recognition output, we integrate the relevance scores derived from the lattices with DTW distances derived from the feature space of MFCC parameters or phonetic posteriorgrams. These DTW distances are evaluated over a carefully selected set of pseudo-relevant utterances, which are obtained from the first-pass list returned by the search engine. The utterances on the first-pass list are then reranked accordingly and finally shown to the user. Very encouraging performance improvements were obtained in the preliminary experiments, especially when the acoustic models are poorly matched to the spoken utterances.
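The DTW distances used above can be sketched minimally as follows. This is a generic dynamic time warping implementation on plain feature vectors, not the authors' system; in the paper the frames would be MFCCs or phonetic posteriorgrams, and the distances would be combined with lattice-derived relevance scores.

```python
# Minimal dynamic time warping (DTW) distance between two feature
# sequences, as could be used to compare an utterance against
# pseudo-relevant examples.

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def dtw(seq_a, seq_b, dist=euclidean):
    """Return the minimum cumulative frame-alignment cost."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best alignment cost of seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in seq_a
                                 cost[i][j - 1],      # advance in seq_b
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

# Toy 1-D "frames": b repeats one frame, which DTW absorbs at zero cost.
a = [[0.0], [1.0], [2.0], [3.0]]
b = [[0.0], [1.0], [1.0], [2.0], [3.0]]
d = dtw(a, b)
```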
10:40Towards Spoken Term Discovery At Scale With Zero Resources
Aren Jansen (Johns Hopkins University)
Kenneth Church (Johns Hopkins University)
Hynek Hermansky (Johns Hopkins University)
The spoken term discovery task takes speech as input and identifies terms of possible interest. The challenge is to perform this task efficiently on large amounts of speech with zero resources (no training data and no dictionaries), where we must fall back to more basic properties of language. We find that long (~1 s) repetitions tend to be contentful phrases (e.g. University of Pennsylvania) and propose an algorithm to search for these long repetitions without first recognizing the speech. To address efficiency concerns, we take advantage of (i) sparse feature representations and (ii) inherent low occurrence frequency of long content terms to achieve orders-of-magnitude speedup relative to the prior art. We frame our evaluation in the context of spoken document information retrieval, and demonstrate our method's competence at identifying repeated terms in conversational telephone speech.
11:00Vocabulary Independent Spoken Query: a Case for Subword Units
Evandro Gouvêa (MERL)
Tony Ezzat (MERL)
In this work, we describe a subword unit approach for information retrieval of items by voice. An algorithm based on the minimum description length (MDL) principle converts an index written in terms of words into an index written in terms of phonetic subword units. A speech recognition engine that uses a language model and pronunciation dictionary built from such an inventory of subword units is completely independent from the information retrieval task. The recognition engine can remain fixed, making this approach ideal for resource constrained systems. In addition, we demonstrate that recall results at higher out of vocabulary (OOV) rates are much superior for the subword unit system. On a music lyrics task at 80% OOV, the subword-based recall is 75.2%, compared to 47.4% for a word system.
11:20Extractive Speech Summarization - From the View of Decision Theory
Shih-Hsiang Lin (National Taiwan Normal University)
Yao-Ming Yeh (National Taiwan Normal University)
Berlin Chen (National Taiwan Normal University)
Extractive speech summarization can be thought of as a decision-making process in which the summarizer attempts to select a subset of informative sentences from the original document. A sentence being selected as part of a summary is typically determined by three primary factors: significance, relevance and redundancy. To meet these specifications, we recently presented a novel probabilistic framework stemming from Bayes decision theory for extractive speech summarization. It not only inherits the merits of several existing summarization techniques but also provides a flexible mechanism for modeling the redundancy relationships among sentences and the coherence relationships between sentences and the whole document. In this paper, we propose several new approaches to the ranking strategy and modeling paradigm involved in this framework. All experiments were carried out on a broadcast news speech summarization task; very promising results were demonstrated.
11:40The Impact of ASR on Abstractive vs. Extractive Meeting Summaries
Gabriel Murray (University of British Columbia)
Giuseppe Carenini (University of British Columbia)
Raymond Ng (University of British Columbia)
In this paper we describe a complete abstractive summarizer for meeting conversations, and evaluate the usefulness of the automatically generated abstracts in a browsing task. We contrast these abstracts with extracts for use in a meeting browser and investigate the effects of manual versus ASR transcripts on both summary types.

Speech representation

Time:Wednesday 10:00 Place:201B Type:Oral
Chair:Thomas F. Quatieri
10:00Binary Coding of Speech Spectrograms Using a Deep Auto-encoder
Li Deng (Microsoft Research)
M Seltzer (Microsoft Research)
Dong Yu (Microsoft Research)
Alex Acero (Microsoft Research)
A. Mohamed (University of Toronto)
Geoff Hinton (University of Toronto)
This paper reports our recent exploration of the layer-by-layer learning strategy for training a multi-layer generative model of patches of speech spectrograms. The top layer of the generative model learns binary codes that can be used for efficient compression of speech and could also be used for scalable speech recognition or rapid speech content retrieval. Each layer of the generative model is fully connected to the layer below and the weights on these connections are pre-trained efficiently by using the contrastive divergence approximation to the log likelihood gradient. After layer-by-layer pre-training we “unroll” the generative model to form a deep auto-encoder, whose parameters are then fine-tuned using back-propagation. To reconstruct the full-length speech spectrogram, individual spectrogram segments predicted by their respective binary codes are combined using an overlap-and-add method. Experimental results on speech spectrogram coding demonstrate that the binary codes produce a log-spectral distortion that is approximately 2 dB lower than a sub-band vector quantization technique over the entire frequency range of wide-band speech.
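The overlap-and-add recombination step mentioned above can be sketched in a toy 1-D form (not the authors' code; the segment length, hop, and rectangular weighting are assumptions, and the paper operates on spectrogram segments rather than raw samples):

```python
# Overlap-and-add reconstruction of a full-length sequence from
# fixed-length overlapping segments, normalising each output sample by
# its accumulated window weight.

def overlap_add(segments, hop):
    seg_len = len(segments[0])
    out_len = hop * (len(segments) - 1) + seg_len
    acc = [0.0] * out_len
    wsum = [0.0] * out_len
    for k, seg in enumerate(segments):
        for i, v in enumerate(seg):
            acc[k * hop + i] += v
            wsum[k * hop + i] += 1.0  # rectangular window weight
    return [a / w for a, w in zip(acc, wsum)]

# Cut a toy signal into 50%-overlapping segments, then reconstruct it.
signal = [float(i) for i in range(8)]
seg_len, hop = 4, 2
segs = [signal[s:s + seg_len]
        for s in range(0, len(signal) - seg_len + 1, hop)]
rec = overlap_add(segs, hop)
```

With a rectangular window and exact segments, the reconstruction is lossless; in the paper the segments are predictions decoded from binary codes, so overlapping regions are averaged.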
10:20A Super-Resolution Spectrogram Using Coupled PLCA
Juhan Nam (Stanford University)
Gautham J. Mysore (Stanford University)
Joachim Ganseman (University of Antwerp)
Kyogu Lee (Seoul National University)
Jonathan Abel (Stanford University)
The short-time Fourier transform (STFT) based spectrogram is commonly used to analyze the time-frequency content of a signal. Through the choice of window length, the STFT provides a trade-off between time and frequency resolution. This paper presents a novel method that achieves high resolution simultaneously in both time and frequency. We extend Probabilistic Latent Component Analysis (PLCA) to jointly decompose two spectrograms, one with a high time resolution and one with a high frequency resolution. Using this decomposition, a new spectrogram, maintaining high resolution in both time and frequency, is constructed. Termed the "super-resolution spectrogram", it can be particularly useful for speech as it can simultaneously resolve both glottal pulses and individual harmonics.
10:40Fast Least-Squares Solution for Sinusoidal, Harmonic and Quasi-Harmonic Models
Georgios Tzedakis (Institute of Computer Science, FORTH, and Multimedia Informatics Lab, CSD, UoC, Greece)
Yannis Pantazis (Institute of Computer Science, FORTH, and Multimedia Informatics Lab, CSD, UoC, Greece)
Olivier Rosec (Orange Labs TECH/ASAP/VOICE, Lannion, France)
Yannis Stylianou (Institute of Computer Science, FORTH, and Multimedia Informatics Lab, CSD, UoC, Greece)
The sinusoidal model and its variants are commonly used in speech processing. In the literature, there are various methods for estimating the unknown parameters of the sinusoidal model, such as the Fourier transform computed with the FFT algorithm and the Least Squares (LS) method. The LS method is more accurate, and in fact optimal for Gaussian noise, and is thus more appropriate for high-quality signal processing; however, it is slower than FFT-based algorithms. In this paper, we study the sources of computational load in the LS solution and propose various computational improvements. We show that both the complexity and the execution time of the LS solution are greatly reduced.
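The LS estimation discussed above can be illustrated on the smallest possible instance: a single sinusoid of known frequency, x[n] = a·cos(wn) + b·sin(wn), solved via the 2x2 normal equations. This is a toy sketch under those assumptions; real sinusoidal models fit many harmonics jointly, which is where the computational load the paper studies comes from.

```python
import math

# Least-squares fit of a single sinusoid of known frequency w:
# x[n] ~ a*cos(w n) + b*sin(w n), via the 2x2 normal equations.

def fit_sinusoid(x, w):
    c = [math.cos(w * n) for n in range(len(x))]
    s = [math.sin(w * n) for n in range(len(x))]
    cc = sum(v * v for v in c)
    ss = sum(v * v for v in s)
    cs = sum(u * v for u, v in zip(c, s))
    cx = sum(u * v for u, v in zip(c, x))
    sx = sum(u * v for u, v in zip(s, x))
    det = cc * ss - cs * cs
    a = (ss * cx - cs * sx) / det
    b = (cc * sx - cs * cx) / det
    return a, b

# Noiseless synthetic signal with known coefficients.
w = 2 * math.pi * 0.1
x = [1.5 * math.cos(w * n) - 0.7 * math.sin(w * n) for n in range(100)]
a, b = fit_sinusoid(x, w)   # recovers a ~ 1.5, b ~ -0.7
```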
11:00Sparse Component Analysis for Speech Recognition in Multi-Speaker Environment
Afsaneh Asaei (Idiap Research Institute/Ecole Polytechnique Fédérale de Lausanne (EPFL))
Hervé Bourlard (Idiap Research Institute/Ecole Polytechnique Fédérale de Lausanne (EPFL))
Philip N. Garner (Idiap Research Institute)
Sparse Component Analysis is a relatively young technique that relies upon representation of a signal occupying only a small part of a larger space. Mixtures of sparse components are disjoint in that space. As a particular application of sparsity of speech signals, we investigate the DUET blind source separation algorithm in the context of speech recognition for multi-party recordings. We show how DUET can be tuned to the particular case of speech recognition with interfering sources, and evaluate the limits of performance as the number of sources increases. We show that the separated speech fits a common metric for sparsity, and conclude that sparsity assumptions lead to good performance in speech separation and hence ought to benefit other aspects of the speech recognition chain.
11:20Intra-Frame Variability as a Predictor of Frame Classifiability
Trond Skogstad (NTNU)
Torbjørn Svendsen (NTNU)
This paper examines the association between the variability of the speech signal inside an analysis frame and the relative difficulty of classifying that frame. We introduce a novel measure of speech frame variability and show through classification experiments that this measure is a strong predictor of classifiability, even when conditioning on the distance to segment boundaries. Finally, we show how to incorporate the measure as weights in the discriminant function of a GMM-HMM recognizer, thereby increasing the relative importance of low variability frames in both decoding and training. This is shown to give a reduction in error rates.
11:40Autocorrelation and Double Autocorrelation Based Spectral Representations for Noisy Word Recognition System
Tetsuya Shimamura (Saitama University)
Dinh Nguen Doc (Saitama University)
Two methods of spectral analysis for noisy speech recognition are proposed and tested in a speaker independent word recognition experiment under an additive white Gaussian noise environment. One is Mel-frequency cepstral coefficients (MFCC) spectral analysis on the autocorrelation sequence of the speech signal and the other is MFCC spectral analysis on its double autocorrelation sequence. The word recognition experiment shows that both of the proposed methods achieve better results than the conventional MFCC spectral analysis on the input speech signal.
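The autocorrelation operations underlying both proposed representations can be sketched as follows (not the full MFCC pipeline). The intuition behind the robustness is that additive white noise mainly perturbs lag 0 of the autocorrelation, so spectral analysis of the autocorrelation sequence, or of its own autocorrelation in the "double" variant, is less affected. Frame values here are invented.

```python
# Biased autocorrelation of a frame, and its autocorrelation
# ("double autocorrelation"). In the paper, MFCC analysis is then
# applied to these sequences instead of the raw frame.

def autocorr(x):
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(n)]

frame = [1.0, 2.0, 3.0, 4.0]
r = autocorr(frame)    # single autocorrelation
rr = autocorr(r)       # double autocorrelation
```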

Voice Conversion

Time:Wednesday 10:00 Place:302 Type:Oral
Chair:Tomoki Toda
10:00Maximum a posteriori voice conversion using sequential Monte Carlo methods
Elina Helander (Tampere University of Technology)
Hanna Silen (Tampere University of Technology)
Joaquin Miguez (Departamento de Teoria de la Senal y Comunicaciones, Universidad Carlos III de Madrid)
Moncef Gabbouj (Tampere University of Technology)
Many voice conversion algorithms are based on frame-wise mapping from source features into target features. This ignores the inherent temporal continuity that is present in speech and can degrade the subjective quality. In this paper, we propose to optimize the speech feature sequence after a frame-based conversion algorithm has been applied. In particular, we select the sequence of speech features through the minimization of a cost function that involves both the conversion error and the smoothness of the sequence. The estimation problem is solved using sequential Monte Carlo methods. Both subjective and objective results show the effectiveness of the method.
10:20Dynamic Model Selection for Spectral Voice Conversion
Pierre Lanchantin (IRCAM)
Xavier Rodet (IRCAM)
Statistical methods for voice conversion are usually based on a single model selected to represent a tradeoff between goodness of fit and complexity. In this paper we assume that the best model may change over time, depending on the source acoustic features. We present a new method for spectral voice conversion called Dynamic Model Selection (DMS), in which a set of potential best models of increasing complexity - including mixtures of Gaussians and probabilistic principal component analyzers - is considered during the conversion of a source speech signal into a target speech signal. This set is built during the learning phase according to the Bayesian information criterion (BIC). During conversion, the best model is dynamically selected from the set according to the acoustic features of each source frame. Subjective tests show that the method improves the conversion in terms of proximity to the target and quality.
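The BIC-based model-set construction above amounts to scoring each candidate model with BIC = -2 log L + p log n (lower is better) and keeping the best trade-offs. A minimal sketch, with invented likelihoods and parameter counts for three hypothetical GMM sizes:

```python
import math

# Bayesian information criterion: penalised log-likelihood used to
# compare candidate models of different complexity. Lower is better.

def bic(log_likelihood, n_params, n_obs):
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Invented scores for three candidate mixture sizes on 500 frames.
candidates = {
    "gmm_4":  bic(-1250.0, 36,  500),
    "gmm_8":  bic(-1210.0, 72,  500),
    "gmm_16": bic(-1205.0, 144, 500),
}
best = min(candidates, key=candidates.get)
```

Here the better likelihoods of the larger mixtures do not pay for their extra parameters, so BIC prefers the smallest model.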
10:40Speaker-independent HMM-based Voice Conversion Using Quantized Fundamental Frequency
Takashi Nose (Tokyo Institute of Technology)
Takao Kobayashi (Tokyo Institute of Technology)
This paper proposes a segment-based voice conversion technique between arbitrary speakers that requires only a small amount of training data. In the proposed technique, an input speech utterance of the source speaker is decoded into phonetic and prosodic symbol sequences, and the converted speech is then generated from the pre-trained target speaker's HMM using the decoded information. To reduce the required amount of training data, we use a speaker-independent model in the decoding of the input speech, and model adaptation in the training of the target speaker's model. Experimental results show that there is no need to prepare the source speaker's training data, and that the proposed technique with only ten sentences of the target speaker's adaptation data outperforms the conventional GMM-based technique using parallel data of 200 sentences.
11:00Probabilistic Integration of Joint Density Model and Speaker Model for Voice Conversion
Daisuke Saito (Graduate School of Engineering, The University of Tokyo / NTT Communication Science Laboratories, NTT Corporation)
Shinji Watanabe (NTT Communication Science Laboratories, NTT Corporation)
Atsushi Nakamura (NTT Communication Science Laboratories, NTT Corporation)
Nobuaki Minematsu (Graduate School of Information Science and Technology, The University of Tokyo)
This paper describes a novel approach to voice conversion that uses both a joint density model and a speaker model. In voice conversion studies, approaches based on a Gaussian Mixture Model (GMM) of the joint density are widely used to estimate a transformation. However, to achieve sufficient quality, they require a parallel corpus containing many utterances with the same linguistic content spoken by both speakers. In addition, joint density GMM methods often suffer from over-training when the amount of training data is small. To compensate for these problems, we propose a novel approach that integrates the speaker GMM of the target with the joint density model in a probabilistic formulation. The proposed method trains the joint density model and the speaker model independently, which eases the burden on the source speaker. Experiments demonstrate the effectiveness of the proposed method, especially when the parallel corpus is small.
11:20Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion
Zhi-Zheng Wu (School of Computer Engineering, Nanyang Technological University, Singapore)
Tomi Kinnunen (School of Computing, University of Eastern Finland, Joensuu, Finland)
Eng Siong Chng (School of Computer Engineering, Nanyang Technological University, Singapore)
Haizhou Li (School of Computer Engineering, Nanyang Technological University, Singapore; Human Language Technology Department, Institute for Infocomm Research (I2R), Singapore; School of Computing, University of Eastern Finland, Joensuu, Finland)
In voice conversion, frame-level mean and variance normalization is typically used for fundamental frequency (F0) transformation, which is text-independent and requires no parallel training data. Some advanced methods transform pitch contours instead, but require either parallel training data or syllabic annotations. We propose a method which retains the simplicity and text-independence of the frame-level conversion while yielding high-quality conversion. We achieve these goals by (1) introducing a text-independent tri-frame alignment method, (2) including delta features of F0 into Gaussian mixture model (GMM) conversion and (3) reducing the well-known GMM oversmoothing effect by F0 histogram equalization. Our objective and subjective experiments on the CMU Arctic corpus indicate improvements over both the mean/variance normalization and the baseline GMM conversion.
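The frame-level mean/variance normalization that serves as the baseline above is a one-line formula: rescale the source F0 statistics to match the target speaker's. A minimal sketch with invented contours and assumed target-speaker statistics (real systems apply this to log-F0 of voiced frames only):

```python
import statistics

# Frame-level mean/variance normalisation of F0: map each source frame
# so the converted contour has the target speaker's mean and spread.

def mvn_convert(src_f0, src_mean, src_std, tgt_mean, tgt_std):
    return [(f - src_mean) / src_std * tgt_std + tgt_mean for f in src_f0]

src = [100.0, 110.0, 120.0, 130.0, 140.0]          # toy source contour (Hz)
mu_s, sd_s = statistics.mean(src), statistics.pstdev(src)
mu_t, sd_t = 200.0, 28.0                           # assumed target statistics
conv = mvn_convert(src, mu_s, sd_s, mu_t, sd_t)
```

The paper's contribution is what this baseline misses: alignment, delta-F0 modeling, and histogram equalization against oversmoothing.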
11:40A Minimum Converted Trajectory Error (MCTE) Approach to High Quality Speech-to-Lips Conversion
Xiaodan Zhuang (Microsoft Research Asia & University of Illinois at Urbana-Champaign)
Lijuan Wang (Microsoft Research Asia)
Frank Soong (Microsoft Research Asia)
Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)
High quality speech-to-lips conversion, investigated in this work, renders realistic lip movements (video) consistent with input speech (audio) without knowing its linguistic content. Instead of memoryless frame-based conversion, we adopt maximum likelihood estimation of the visual parameter trajectories using an audio-visual joint Gaussian Mixture Model (GMM). We propose a minimum converted trajectory error (MCTE) approach to further refine the converted visual parameters. First, we reduce the conversion error by training the joint audio-visual GMM with weighted audio and visual likelihoods. MCTE then uses the generalized probabilistic descent algorithm to minimize a conversion error of the visual parameter trajectories defined over the optimal Gaussian kernel sequence for the input speech. We demonstrate the effectiveness of the proposed methods on the LIPS 2009 Visual Speech Synthesis Challenge dataset, without knowing the linguistic (phonetic) content of the input speech.

Prosody: Language-specific models

Time:Wednesday 10:00 Place:International Conference Room A Type:Poster
Chair:Daniel Hirst
#1Influence of lexical tones on intonation in Kammu
Anastasia Karlsson (Dept of Linguistics and Phonetics, Centre for Languages and Literature, Lund University, Sweden)
David House (Dept of Speech, Music and Hearing, School of Computer Science and Communication, KTH, Stockholm, Sweden)
Jan-Olof Svantesson (Dept of Linguistics and Phonetics, Centre for Languages and Literature, Lund University, Sweden)
Damrong Tayanin (Dept of Linguistics and Phonetics, Centre for Languages and Literature, Lund University, Sweden)
The aim of this study is to investigate how the presence of lexical tones influences the realization of focal accent and sentence intonation. The language studied is Kammu, a language particularly well suited for the study as it has both tonal and non-tonal dialects. The main finding is that lexical tone exerts an influence on both sentence and focal accent in the tonal dialect to such a strong degree that we can postulate a hierarchy where lexical tone is strongest followed by sentence accent, with focal accent exerting the weakest influence on the F0 contour.
#2Phonetic Realization of Second Occurrence Focus in Japanese
Satoshi Nambu (University of Pennsylvania)
Yong-cheol Lee (University of Pennsylvania)
Recent studies have agreed that second occurrence focus is phonetically realized as prosodic prominence. What has been missing in previous studies, however, is a comparison with neutral focus, in addition to main focus and pre-/post-focus, which is necessary to elucidate the precise phonetic status of second occurrence focus. Using evidence from Japanese, this study shows that second occurrence focus in the pre-/post-focus position is realized with high pitch that is less salient than main focus but more salient than pre-/post-focus. Compared with neutral focus, the pitch of second occurrence focus is higher in the pre-focus position but lower in the post-focus position due to post-focus compression. Furthermore, this study provides cross-linguistic insight into focus realization. The results suggest that Japanese focus undergoes pre-focus compression, in addition to post-focus compression, which differs from Korean, English, and Mandarin.
#3Prosodic Grouping and Relative Clause Disambiguation in Mandarin
Jianjing Kuang (UCLA Linguistics)
This study discusses the role of prosodic grouping in the disambiguation of Mandarin relative clause attachment. The grouping effect is explored under the Implicit Prosody Hypothesis (IPH) through four kinds of sentence processing experiments: default production, contrast production, online processing, and offline processing. It is found that (1) the length of the relative clause greatly impacts ambiguity resolution offline; (2) prosodic grouping reflects the different attachment readings well and is consciously used to produce contrastive meanings; and (3) online processing can be affected by manipulating the grouping cues, prominence and pause. The findings support the IPH and contribute to our understanding of prosodic grouping in Mandarin, which can be applied in spoken language processing. Index Terms: prosodic grouping, disambiguation, Mandarin
#4Text-based Unstressed Syllable Prediction in Mandarin
Ya Li (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Jianhua Tao (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Meng Zhang (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Shifeng Pan (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Xiaoying Xu (Beijing Normal University, Beijing, China)
Recently, increasing attention has been paid to Mandarin word stress, which is important for improving the naturalness of speech synthesis. Most research on Mandarin speech synthesis focuses on three stress levels: stressed, regular and unstressed. This paper emphasizes unstressed syllable prediction, because unstressed syllables are also important to the intelligibility of synthetic speech. As with prosodic structure, stress is not easy to detect from text analysis due to the complicated context information. A method based on the Classification and Regression Tree (CART) model is proposed to predict unstressed syllables, with a high accuracy of 85%. The method was then applied in a TTS system. Experiments show that the MOS score of the synthetic speech improved by 0.35, and that the pitch contour of the newly synthesized speech is also closer to natural speech.
#5“Flat pitch accents” in Czech
Tomáš Duběda (Institute of Translation Studies, Charles University in Prague)
In this paper we investigate a particular type of stress marking in Czech, in which the syllable perceived as prominent is not accompanied by any clearly audible change in the overall pitch course. The paper gives a perceptual, phonotactic and acoustic account of these “flat pitch accents”. No positional effects or semantic correlates of words bearing this type of accent were found. Flat accents have significantly reduced intonational variability, as expected, and their durational and dynamic correlates are partly different from other accent types. However, none of these findings speaks in favour of compensation between prosodic parameters.
#7Positional variability of pitch accents in Czech
Tomáš Duběda (Institute of Translation Studies, Charles University in Prague)
An analysis of prenuclear accents in read speech is carried out with the aim of finding instances of regularity in their distribution. Significant differences are identified with respect to position within the phrase and phrase length, some of which are correlated with declination and pitch span narrowing. Only a weak interaction is found between nuclear and prenuclear pitch accents. No tendency to use only one type of pitch accent within a phrase was found. The autosegmental approach seems to be a viable means of analyzing prenuclear intonation in Czech.
#8Modeling of Sentence-medial Pauses in Bangla Readout Speech: Occurrence and Duration
Shyamal Kr Dasmandal (Centre for Development of Advanced Computing, Kolkata)
Arup Saha (Centre for Development of Advanced Computing, Kolkata)
Tulika Basu (Centre for Development of Advanced Computing, Kolkata)
Keikichi Hirose (Department of Information and Communication Engineering, University of Tokyo)
Hiroya Fujisaki (Professor Emeritus, University of Tokyo)
Control of pause occurrence and duration is an important issue for text-to-speech synthesis systems. In text-readout speech, pauses occur unconditionally at sentence boundaries and with high probability at major syntactic boundaries such as clause boundaries, but more or less arbitrarily at minor syntactic boundaries. Pause duration tends to be longer at the end of a longer syntactic unit. A detailed analysis is conducted for sentence-medial pauses for readout speech of Bangla. Based on the results, linear models (with variables of syntactic unit length and distance to directly modifying word) are constructed for pause occurrence and duration. The models are evaluated using the test data not included in the analyzed data (open-test condition). The results show that the proposed models can predict occurrence probability for 87% of phrase boundaries correctly, and pause duration within ±100 ms for 80% of the cases.
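The linear duration model above can be sketched in its simplest one-variable form: pause duration regressed on the length of the preceding syntactic unit. The data points are invented for illustration, and the paper's models also use the distance to the directly modifying word as a predictor.

```python
# One-variable linear model of pause duration versus the length of the
# preceding syntactic unit, fitted by closed-form least squares.

def linreg(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

unit_len = [2, 4, 6, 8, 10]           # words in preceding unit (invented)
pause_ms = [80, 140, 200, 260, 320]   # observed pause durations (invented)
a, b = linreg(unit_len, pause_ms)
predict = lambda length: a * length + b
```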
#9Declarative sentence intonation patterns in 8 Swiss German dialects
Adrian Leemann (Department of Linguistics, University of Bern)
Lucy Zuberbuehler (Department of Linguistics, University of Bern)
This study examines declarative sentence intonation contours in 8 vastly different Swiss German dialects through application of the Command-Response model. Fundamental frequency patterns of a controlled declarative sentence are analyzed at the global and local levels of intonation. The results provide evidence that the dialects differ in how global- and local-level F0 is modulated. Findings of previous studies on natural Swiss German speech are essentially confirmed; at the same time, however, new trends emerge.
#10Syllable-Level Prominence Detection with Acoustic Evidence
Je Hun Jeon (The University of Texas at Dallas)
Yang Liu (The University of Texas at Dallas)
In this work, we conduct a thorough study using acoustic prosodic cues for prominence detection in speech. This study is different from previous work in several aspects. In addition to the widely used prosodic features, such as pitch, energy, and duration, we introduce the use of cepstral features. Furthermore, we evaluate the effect of different features, speaker dependency and variation, different classifiers, and contextual information. Our experiments on the Boston University Radio News Corpus show that although the cepstral features alone do not perform well, when combined with prosodic features they yield some performance gain and, more importantly, can reduce much of the speaker variation in this task. We find that the previous context is more informative than the following context, and their combination achieves the best performance. The final result using selected features with context information is significantly better than that in previous work.
#11Prosody Cues For Classification of the Discourse Particle "hã" in Hindi
Sankalan Prasad (Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India)
Kalika Bali (Microsoft Research Labs India Pvt. Ltd. Sadashivnagar, Bangalore 560080, India)
In Hindi, the affirmative particle "hã" carries out a variety of discourse functions. Preliminary investigation has shown that although it is difficult to disambiguate these functions from prosody alone, there seems to be a distinct prosodic pattern associated with each of them. In this paper, we present a corpus study of spoken utterances of the Hindi word "hã". We identify these prosodic patterns and capture the specific pitch variations associated with each of the various functions. We also examine the use of prosodic cues for classifying the utterances into different functions using k-means clustering. Although a certain amount of speaker dependency, as well as the lack of contextual and lexical information, resulted in high classification entropy, the results were consistent with comparable studies in other languages.
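The k-means clustering used above can be sketched on toy pitch-contour features. This is a bare-bones stand-in, not the paper's setup: contours are reduced to invented (start Hz, end Hz) pairs, and initialization is made deterministic for clarity.

```python
# Bare-bones k-means over pitch-contour feature vectors.

def kmeans(points, k, iters=20):
    # Deterministic init: spread the initial centers across the data.
    centers = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((x - y) ** 2
                                      for x, y in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, clusters

# Two synthetic contour shapes: falling vs rising (start, end) pitch.
falling = [[220.0 + d, 140.0 + d] for d in (-2.0, 0.0, 2.0)]
rising = [[150.0 + d, 230.0 + d] for d in (-2.0, 0.0, 2.0)]
centers, clusters = kmeans(falling + rising, k=2)
```

On this separable toy data the two contour shapes land in separate clusters, which is the behaviour one hopes for when the discourse functions of "hã" really do carry distinct pitch patterns.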
#12Interaction of Syntax-marked Focus and Wh-question Induced focus in Standard Chinese
Yuan Jia (Phonetics Lab, Institute of Linguistics, Chinese Academy of Social Sciences, China)
Aijun Li (Phonetics Lab, Institute of Linguistics, Chinese Academy of Social Sciences, China)
The present study investigates the interaction of syntax-marked focus and wh-question-induced focus in the formation of F0 patterns in Standard Chinese (hereinafter, SC). An acoustic experiment demonstrates that syntax-marked (lian or shi) focus can co-exist with wh-question-induced focus. The results are twofold: (i) the two kinds of focus can combine to trigger more pronounced F0 prominence on the under-focus constituents and F0 compression on the post-focus constituents; (ii) they can realize prominences simultaneously on different constituents in one sentence. The F0 pattern of SC thus exhibits the distinction between nuclear and pre-nuclear prominence found in English: a single focus induces the nuclear prominence, while a dual focus triggers both nuclear and pre-nuclear prominence.
#13Prominence Detection in Swedish Using Syllable Correlates
Samer Al Moubayed (KTH, Center for Speech Technology, Stockholm, Sweden)
Jonas Beskow (KTH, Center for Speech Technology, Stockholm, Sweden)
This paper presents an approach to estimating word-level prominence in Swedish using syllable-level features. The paper discusses the mismatch between word-level perceptual prominence annotations and their acoustic correlates, context, and data scarcity. 200 sentences are annotated by 4 speech experts with prominence on 3 levels. A linear model over syllable-level features is proposed, and the weights of these features are optimized to match the word-level annotations. We show that using syllable-level features, with weights for the acoustic correlates estimated to minimize the word-level estimation error, gives better detection accuracy than word-level features, and that both feature sets exceed the baseline accuracy.
#14Automatic analysis of the intonation of a tone language. Applying the Momel algorithm to spontaneous Standard Chinese (Beijing).
Na Zhi (Laboratorio di Linguistica, Scuola Normale Superiore, Pisa, Italy)
Daniel Hirst (Laboratoire Parole et Langage, CNRS & Université de Provence, France)
Pier Marco Bertinetto (Laboratorio di Linguistica, Scuola Normale Superiore, Pisa, Italy)
This paper describes the application of the Momel algorithm to a corpus of spontaneous speech in Standard (Beijing) Chinese. A selection of utterances by four speakers was analysed automatically, and the resynthesised utterances were evaluated subjectively with two categories of errors: lexical tone errors and intonation errors. The target points determining the pitch contours of the synthetic utterances were then corrected manually in order to obtain a set of acceptable utterances for the entire corpus. An attempt to optimise the window size of the Momel algorithm showed no overall improvement with respect to the manually corrected data. This annotated data will nevertheless constitute a useful yardstick for evaluating improvements to the automatic algorithm, and is expected to be a far more demanding benchmark than data annotated for languages with no lexical tone.
#15Towards long-range prosodic attribute modeling for language recognition
Raymond W. M. Ng (Department of Electronic Engineering, The Chinese University of Hong Kong)
Cheung-Chi Leung (Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore 138632)
Ville Hautamäki (Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore 138632)
Tan Lee (The Chinese University of Hong Kong, Hong Kong)
Bin Ma (Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore 138632)
Haizhou Li (Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore 138632)
As a high-level feature, prosody may be more effective when modeled over ranges longer than the typical span of a syllable. This paper addresses language recognition with high-level prosodic attributes. It studies two important issues in long-range modeling, namely the handling of data scarcity and the proper modeling of prosodic boundary events. On the NIST language recognition evaluation (LRE) 2009 task, long-range modeling is shown to bring a 7.2% relative improvement to a prosodic language detector. Score fusion between the long-range prosodic system and a phonotactic system gives an EER of 3.07%. Exploiting boundary N-grams is the main contributor to the global EER reduction, while different long-range prosodic modeling factors benefit the detection of different languages. Analysis reveals evidence of language-specific long-range prosodic attributes, which sheds light on robust long-range modeling methods for language recognition.
#16A Modified Parameterization of the Fujisaki Model
Robert Schubert (Dresden University of Technology, Institute of Acoustics and Speech Communication)
Oliver Jokisch (Dresden University of Technology, Institute of Acoustics and Speech Communication)
Diane Hirschfeld (voice INTER connect GmbH)
Fujisaki’s command-response model has proven suitable for analysis and synthesis of intonation contours in several languages. Although widely used in synthesis, it is subject to certain limitations, including mathematical over-determinacy and insufficiency for some naturally occurring contour forms. We propose an alternative parameterization which separates declination and phrasal height, thereby making the mathematical properties of phrase control symmetric to those of accent control. The modification improves the model’s utility for analysis, predictive synthesis, and rule-based synthesis, especially when command-dependent attenuation factors are used. An evaluation of the modified F0 generation on a speech corpus, based on experiments with the DRESS synthesizer, shows lower RMSE values and similar correlations between natural contours and their synthesized counterparts.
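For reference, the standard Fujisaki command-response model that this work reparameterizes superposes phrase and accent responses on a log-F0 baseline (this is the textbook formulation, not the authors' modified parameterization, which is given in the paper):

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p,i}\, G_p(t - T_{0,i})
  + \sum_{j=1}^{J} A_{a,j}\,\bigl[ G_a(t - T_{1,j}) - G_a(t - T_{2,j}) \bigr]
```

where \(F_b\) is the baseline frequency, \(G_p(t) = \alpha^2 t e^{-\alpha t}\) for \(t \ge 0\) (0 otherwise) is the phrase-command response, and \(G_a(t) = \min\{1 - (1 + \beta t)e^{-\beta t}, \gamma\}\) is the accent-command response. The declination implicit in the phrase terms is what the proposed parameterization factors out from phrasal height.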

ASR: Language Modeling and Speech Understanding I

Time:Wednesday 10:00 Place:International Conference Room B Type:Poster
Chair:Atsushi Nakamura
#1Within and Across Sentence Boundary Language Model
Saeedeh Momtazi (Saarland University)
Friedrich Faubel (Saarland University)
Dietrich Klakow (Saarland University)
In this paper, we propose two different language modeling approaches, namely skip trigram and across-sentence-boundary models, to capture long-range dependencies. The skip trigram model covers more predecessor words of the current word than the normal trigram while requiring the same memory. The across-sentence-boundary model uses the word distribution of the previous sentences to calculate the unigram probability, which is applied as the emission probability in the word-based and class-based model frameworks. Our experiments on the Penn Treebank show that each of our proposed models, and also their combination, significantly outperforms the baseline for both the word and the class models and their linear interpolation. The linear interpolation of the word and class models with the proposed skip trigram and across-sentence-boundary models achieves a perplexity of 118.4, while the best state-of-the-art language model has a perplexity of 137.2 on the same dataset.
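The skip-trigram idea can be illustrated roughly as follows (a toy sketch with invented data, not the authors' implementation): the context pattern is generalized to arbitrary predecessor offsets, and the skip estimate is interpolated with the normal trigram.

```python
from collections import Counter

def ngram_counts(tokens, offsets):
    """Count context/word pairs for a context given by predecessor offsets
    relative to the predicted word: (-2, -1) is the normal trigram context,
    (-3, -1) a skip trigram that reaches one word further back."""
    ctx, joint = Counter(), Counter()
    for i in range(-min(offsets), len(tokens)):
        c = tuple(tokens[i + o] for o in offsets)
        ctx[c] += 1
        joint[c + (tokens[i],)] += 1
    return ctx, joint

def prob(w, context, ctx, joint):
    c = tuple(context)
    return joint[c + (w,)] / ctx[c] if ctx[c] else 0.0

corpus = "the cat sat on the mat the cat lay on the mat".split()
tri_ctx, tri_joint = ngram_counts(corpus, (-2, -1))    # normal trigram
skip_ctx, skip_joint = ngram_counts(corpus, (-3, -1))  # skip trigram

# Linear interpolation of the two estimates (lambda is a free parameter here);
# predicting "mat" after "... sat on the" uses contexts (on, the) and (sat, the).
lam = 0.5
p = lam * prob("mat", ("on", "the"), tri_ctx, tri_joint) \
    + (1 - lam) * prob("mat", ("sat", "the"), skip_ctx, skip_joint)
print(p)  # 1.0 in this tiny corpus
```

Both context patterns store one count table of the same shape, which is the sense in which the skip model needs no extra memory over the normal trigram.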
Ruhi Sarikaya (IBM T.J. Watson Research Center)
Stanley F. Chen (IBM T.J. Watson Research Center)
Abhinav Sethy (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
This paper investigates the impact of word classing on a recently proposed shrinkage-based language model, Model M. Model M, a class-based n-gram model, has been shown to significantly outperform word-based n-gram models on a variety of domains. In past work, word classes for Model M were induced automatically from unlabeled text using the algorithm of Brown et al. We take a closer look at the classing and attempt to find out whether improved classing would also translate to improved performance. In particular, we explore the use of manually-assigned classes, part-of-speech (POS) tags, and dialog state information, considering both hard classing and soft classing. In experiments with a conversational dialog system (human--machine dialog) and a speech-to-speech translation system (human--human dialog), we find that better classing can improve Model M performance by up to 3% absolute in word error rate.
#3Combination of Probabilistic and Possibilistic Language Models
Stanislas Oger (LIA - University of Avignon)
Vladimir Popescu (LIA - University of Avignon)
Georges Linarès (LIA - University of Avignon)
In a previous paper we proposed Web-based language models relying on possibility theory. These models explicitly represent the possibility of word sequences. In this paper we seek the best way of combining this kind of model with classical probabilistic models in the context of automatic speech recognition. We propose several combination approaches, depending on the nature of the combined models. Compared with the baseline, the best combination provides an absolute word error rate reduction of about 1% on broadcast news transcription, and of 3.5% on domain-specific multimedia document transcription.
#4On-Demand Language Model Interpolation for Mobile Speech Input
Brandon Ballinger (Google)
Cyril Allauzen (Google)
Alexander Gruenstein (Google)
Johan Schalkwyk (Google)
Google offers several speech features on the Android mobile operating system: search by voice, voice input to any text field, and an API for application developers. As a result, our speech recognition service must support a wide range of usage scenarios and speaking styles: relatively short search queries, addresses, business names, dictated SMS and e-mail messages, and a long tail of spoken input to any of the applications users may install. We present a method of on-demand language model interpolation in which contextual information about each utterance determines interpolation weights among a number of n-gram language models. On-demand interpolation results in an 11.2% relative reduction in WER compared to using a single language model to handle all traffic.
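The interpolation step itself can be sketched as follows (a minimal illustration with invented context names, component models, and weight values; the paper determines the weights on demand from contextual information about each utterance):

```python
def interpolate(component_probs, weights):
    """P(w | h) = sum_k weights[k] * P_k(w | h); weights sum to one."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(lam * p for lam, p in zip(weights, component_probs))

# Hypothetical per-context weights over three component n-gram LMs
# [search LM, SMS LM, maps LM], chosen per utterance at request time.
CONTEXT_WEIGHTS = {
    "web_search": [0.7, 0.2, 0.1],
    "sms_field":  [0.1, 0.8, 0.1],
}

# P_k(w | h) for one word under each component model (made-up numbers)
probs = [0.02, 0.30, 0.05]
p_sms = interpolate(probs, CONTEXT_WEIGHTS["sms_field"])
print(p_sms)  # ≈ 0.247
```

Selecting the weight vector by context, rather than training one model per context, keeps the number of stored n-gram models fixed while still specializing to each traffic type.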
#5Text Normalization based on Statistical Machine Translation and Internet User Support
Tim Schlippe (Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany)
Chenfei Zhu (Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany)
Jan Gebhardt (Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany)
Tanja Schultz (Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany)
In this paper, we describe and compare systems for text normalization based on SMT methods and constructed with the support of internet users. By normalizing text displayed in a web interface, internet users provide a parallel corpus of normalized and non-normalized text. From this corpus, SMT models are generated to translate non-normalized into normalized text. Building traditional language-specific text normalization systems requires knowledge of linguistics as well as established computer skills to implement normalization rules. Our systems are built without profound computer knowledge, thanks to a simple, self-explanatory user interface and the automatic generation of the SMT models. Additionally, no in-house knowledge of the language to normalize is required, due to the multilingual expertise of the internet community. All techniques are applied to French texts crawled with our Rapid Language Adaptation Toolkit [1] and compared using Levenshtein edit distance, BLEU score and perplexity. [1] Tanja Schultz and Alan Black. Rapid Language Adaptation Tools and Technologies for Multilingual Speech Processing. In: Proc. ICASSP, Las Vegas, NV, 2008.
#6Efficient Estimation of Maximum Entropy Language Models with N-gram features: an SRILM extension
Tanel Alumäe (Institute of Cybernetics, Tallinn University of Technology, Estonia)
Mikko Kurimo (Adaptive Informatics Research Centre, Aalto University, Finland)
We present an extension to the SRILM toolkit for training maximum entropy language models with N-gram features. The extension uses a hierarchical parameter estimation procedure to make the training time and memory consumption feasible for moderately large training data (hundreds of millions of words). Experiments on two speech recognition tasks indicate that models trained with our implementation perform as well as or better than N-gram models built with interpolated Kneser-Ney discounting.
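The maximum entropy language models referred to here have the standard log-linear form (this is the generic definition, not an SRILM-specific detail):

```latex
P_\Lambda(w \mid h) = \frac{\exp\!\left( \sum_i \lambda_i f_i(h, w) \right)}
                           {\sum_{w'} \exp\!\left( \sum_i \lambda_i f_i(h, w') \right)}
```

where \(h\) is the history, the \(f_i(h, w)\) are binary N-gram indicator features that fire when a particular suffix of \(h\) followed by \(w\) matches a given N-gram, and the \(\lambda_i\) are the trained parameters. These features nest across N-gram orders, a structure a hierarchical estimation procedure can exploit when computing the normalization sums.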
#7Similar N-gram Language Model
Christian Gillot (LORIA)
Christophe Cerisara (LORIA)
David Langlois (LORIA)
Jean-Paul Haton (LORIA)
This paper describes an extension of the n-gram language model: the similar n-gram language model. The classical model of order n estimates the probability P(s) of a string s using occurrence statistics of the last n words of the string in the corpus, whereas the proposed model further uses all strings s' whose Levenshtein distance to s is smaller than a given threshold. The similarity between s and each string s' is estimated using co-occurrence statistics. The new P(s) is approximated by smoothing all the similar n-gram probabilities with a regression technique. A slight but statistically significant decrease in word error rate is obtained on a state-of-the-art automatic speech recognition system when the similar n-gram language model is interpolated linearly with the n-gram model.
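The neighbourhood retrieval can be sketched as below (a toy version; the co-occurrence-based similarity weighting and the regression smoothing from the paper are omitted):

```python
def levenshtein(a, b):
    """Word-level edit distance between two token sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def similar_ngrams(query, corpus_ngrams, threshold=1):
    """All corpus n-grams within the given edit distance of the query."""
    return [g for g in corpus_ngrams if levenshtein(query, g) <= threshold]

corpus = [("the", "red", "car"), ("the", "blue", "car"), ("a", "fast", "bike")]
neighbours = similar_ngrams(("the", "red", "car"), corpus)
print(neighbours)  # keeps both "car" trigrams, drops the distant one
```

In the actual model, each retrieved neighbour would contribute its probability, weighted by a co-occurrence-based similarity, before the smoothed estimate is interpolated with the plain n-gram probability.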
#8Topic and style-adapted language modeling for Thai broadcast news ASR
Markpong Jongtaveesataporn (Department of Computer Science, Tokyo Institute of Technology)
Sadaoki Furui (Department of Computer Science, Tokyo Institute of Technology)
The amount of transcribed Thai broadcast news text available for training a language model is still very limited compared to other major languages. Since the construction of a broadcast news corpus is very costly and time-consuming, newspaper text is often used to increase the size of the training data. This paper proposes a topic and style adaptation approach for the language model of a Thai broadcast news ASR system, using broadcast news and newspaper text. A rule-based speaking-style classification method, based on the presence of certain specific words, is applied to classify the training text. Various kinds of language models adapted to topics and styles are studied and shown to successfully reduce test-set perplexity and recognition error rate. The results also show that written-style text from newspapers can be employed to alleviate the sparseness of the broadcast news corpus, while spoken-style text from the broadcast news corpus remains essential for building a reliable language model.
#9Augmented Context Features for Arabic Speech Recognition
Ahmad Emami (IBM T. J. Watson Research Center)
Hong-Kwang J. Kuo (IBM T. J. Watson Research Center)
Imed Zitouni (IBM T. J. Watson Research Center)
Lidia Mangu (IBM T. J. Watson Research Center)
We investigate different types of features for language modeling in Arabic automatic speech recognition. While much effort in language modeling research has been directed at designing better models or smoothing techniques for n-gram language models, in this paper we take the approach of augmenting the context in the n-gram model with different sources of information. We start by adding word class labels to the context. The word classes are automatically derived from un-annotated training data. As a contrast, we also experiment with POS tags which require a tagger trained on annotated data. An amalgam of these two methods uses class labels defined on word and POS tag combinations. Other context features include super-tags derived from the syntactic tree structure as well as semantic features derived from PropBank. Experiments on the DARPA GALE Arabic speech recognition task show that augmented context features often improve both perplexity and word error rate.
#10A Statistical Segment-Based Approach for Spoken Language Understanding
Lucia Ortega (Universitat Politècnica de València)
Isabel Galiano (Universitat Politècnica de València)
Lluís-F Hurtado (Universitat Politècnica de València)
Emilio Sanchis (Universitat Politècnica de València)
Encarna Segarra (Universitat Politècnica de València)
In this paper we propose an algorithm for learning statistical language understanding models from a corpus of unaligned pairs of word sentences and their corresponding semantic frames. Specifically, it automatically maps variable-length word segments to their corresponding semantic labels, and thus allows user utterances to be decoded into their corresponding meanings. In this way we avoid the time-consuming work of manually associating semantic tags with words. We use the algorithm to learn the understanding component of a Spoken Dialog System for railway information retrieval in Spanish. Experiments show that the results obtained with the proposed method are very promising, while the effort required to obtain the models is far smaller than that of manually segmenting the training corpus.

First and second language acquisition

Time:Wednesday 10:00 Place:International Conference Room C Type:Poster
Chair:Benjamin Munson
#1Cantonese tone word learning by tone and non-tone language speakers
Angela Cooper (Simon Fraser University)
Yue Wang (Simon Fraser University)
Adult non-native perception is subject to influence from a variety of factors, including native language experience. The present research examines the effect of linguistic experience on non-native tone perception and tone word learning. Native Thai and English-speaking participants completed seven sessions of lexical identification training on words distinguished by Cantonese tones. A tone identification task was administered before and after training. Both groups had comparable tone identification accuracy; however, Thai listeners obtained greater tone word learning proficiency. The findings suggest that native language experience with employing pitch lexically facilitates the incorporation of non-native tones into novel lexical representations.
#2Validation of a training method for L2 continuous-speech segmentation
Anne Cutler (MARCS Auditory Laboratories, University of Western Sydney, NSW 1797, Australia)
Janise Shanley (MARCS Auditory Laboratories, University of Western Sydney, NSW 1797, Australia)
Recognising continuous speech in a second language is often unexpectedly difficult, as the operation of segmenting speech is so attuned to native-language structure. We report the initial steps in development of a novel training method for second-language listening, focusing on speech segmentation and employing a task designed for studying this: word-spotting. Listeners detect real words in sequences consisting of a word plus a minimal context. The present validation study shows that learners from varying non-English backgrounds successfully perform a version of this task in English, and display appropriate sensitivity to structural factors that also affect segmentation by native English listeners.
#3Linguistic Rhythm in Foreign Accent
Jiahong Yuan (University of Pennsylvania)
This study investigates the influence of L1 on L2 with respect to linguistic rhythm. The L2 English of French, German, Italian, Russian, and Spanish speakers is compared with L1 English. The results show that the linguistic rhythm of L1 transfers to L2. Compared to L1 English, L2 English has shorter stressed vowels but longer reduced vowels. Stressed vowels in the L2 English of speakers of stress-timed languages have a higher pitch contour than those in the L2 English of speakers of syllable-timed languages. Index Terms: foreign accent, rhythm, duration, pitch
#4The effect of a word embedded in a sentence and speaking-rate variation on the perceptual training of geminate and singleton consonant distinction
Mee Sonu (Waseda University)
Keiichi Tajima (Hosei University)
Hiroaki Kato (NICT/ ATR)
Yoshinori Sagisaka (Waseda University)
Aiming at effective perceptual training for second-language learning, we carried out training experiments on Japanese geminate consonants. Native Korean learners were trained to identify geminate and singleton stops in Japanese. Since Korean has no phonemic contrast between long and short consonants, learners must acquire the distinction through categorical perception training. To test training efficiency and the generalization of temporal discrimination, we investigated perceptual training with words embedded in sentences and with single or multiple speaking rates. The experiments showed the superiority of training with words embedded in sentences and with multiple speaking rates. These results suggest that perceptual training with multiple speaking rates is effective for perceiving the temporal length contrast of Japanese, whereas training with stimuli at a single speaking rate generalizes only to a limited extent. They also suggest that context factors, including speaking rate, affect how L2 learners identify the Japanese length contrast.
#5Foreign accent matters most when timing is wrong
Chiharu Tsurutani (Griffith University)
This study investigates native speakers’ perception of prosodic variation in Japanese utterances. The pitch contour above the word level is hard to determine due to individual variation and pragmatic and para-linguistic factors. Nevertheless, native speakers’ intonation is relatively consistent as long as the context and intention of the utterance are predetermined. L2 speakers’ intonation, on the other hand, contains prosodic deviations from the native model, some of which are treated as non-native production while others are not. By identifying the prosodic deviations that are tolerated by native listeners, we can better understand the points crucial to improving Japanese pronunciation and provide a reference for computer-based assessment tools. The study suggests that pitch errors affect the performance score, but not as significantly as timing errors do.
#6Effects of Korean Learners’ Consonant Cluster Reduction Strategies on English Speech Recognition Performance
Hyejin Hong (Department of Linguistics, Seoul National University, Seoul, Korea)
Jina Kim (Interdisciplinary Program in Cognitive Science, Seoul National University, Seoul, Korea)
Minhwa Chung (Department of Linguistics, Seoul National University, Seoul, Korea)
This paper examines how the L2 production strategies of foreign language learners affect the performance of non-native speech recognition. English consonant clusters are among the most problematic sounds for Korean learners of English because of differences between Korean and English phonotactics, and the strategies Korean learners use to produce them cause a large number of speech recognition errors. We analyzed these problems based on phonetic and phonological knowledge of both languages and propose two models, focusing on vowel epenthesis and consonant deletion respectively, that reflect Korean learners’ cluster reduction strategies. Experimental results show that the vowel epenthesis model improves speech recognition performance over the baseline, whereas the consonant deletion model degrades it. Notably, these results are consistent with previous linguistic studies, which have claimed that L2 learners are more likely to avoid producing clusters by vowel epenthesis than by consonant deletion.
#7The effects of EMA-based augmented visual feedback on the English speakers' acquisition of the Japanese flap: a perceptual study
June S. Levitt (Texas Woman's University)
William F. Katz (The University of Texas at Dallas)
Electromagnetic Articulography (EMA) was used to provide augmented visual feedback in the learning of non-native speech sounds. Eight adult native speakers of English were randomly assigned to one of two training conditions: (1) conventional L2 speech production training or (2) conventional L2 speech production training with EMA-based kinematic feedback. The participants’ speech was perceptually judged by six native speakers of Japanese. The results indicate that EMA-based kinematic feedback facilitates the acquisition and maintenance of the Japanese flap consonant. The findings suggest augmented visual feedback may play an important role in adults’ L2 learning.
#8Perception of voiceless fricatives by Japanese listeners of advanced and intermediate level English proficiency
Hinako Masuda (Graduate School of Science and Technology, Sophia University, Tokyo, Japan)
Takayuki Arai (Graduate School of Science and Technology, Sophia University, Tokyo, Japan)
Numerous studies have investigated how the first language influences the perception of foreign sounds. The present study focuses on the perception of voiceless English fricatives by Japanese listeners with advanced and intermediate English proficiency, and compares their results with those of native English listeners. Listeners identified consonants embedded in /a __ a/ in quiet, multi-speaker babble, and white noise (SNR = 0 dB). Results revealed that intermediate-level learners scored lowest among all listener groups and that /th/-/s/ confusions were unique to Japanese listeners. Confusions of /th/-/f/ were observed in all listener groups, which suggests that these phoneme confusions may be universal.
#9Perception of Estonian vowel category boundaries by native and non-native speakers
Lya Meister (Institute of Cybernetics at Tallinn University of Technology)
Einar Meister (Institute of Cybernetics at Tallinn University of Technology)
The aim of the paper is to study the perception of Estonian vowel categories by L2 learners of Estonian whose L1 is Russian. The Estonian vowel system includes nine vowels, whereas Russian has six. Five Estonian vowels have counterparts in Russian: /i/, /e/, /u/, /o/ and /a/; the new vowel categories for L2 speakers are /ü/, /ö/, /ä/, and partly /õ/. For the perceptual experiments, four-formant vowel stimuli were synthesized, including the nine Estonian prototype vowels and intermediate steps between prototypes; the stimulus set covered 14 vowel category boundaries. The experiments involving native Estonian and non-native (Russian as L1) subjects showed that (1) the Estonian vowels /i/, /e/, /u/ and /o/ assimilate well with their Russian counterparts; (2) Estonian /a/ and /ä/ assimilate with the allophones of Russian /a/; (3) Estonian /ü/, /ö/ and /õ/ assimilate partly with Russian /ɨ/; due to the close phonetic distance, L2 subjects' ability to discriminate these categories is poor.
Qin Shi (IBM China Research Center)
Kun Li (Tsinghua University)
Shi Lei Zhang (IBM China Research Center)
Stephen M. Chu (IBM T. J. Watson Research Center)
Zhi Jian Ou (Tsinghua University)
The absence of real-time, targeted feedback is often a critical problem in spoken foreign language learning, and computer-assisted language assessment systems are playing an ever more important role in this domain. This work considers the idiosyncratic pronunciation patterns of Chinese speakers of English and uses both acoustic and prosodic features to capture pronunciation, word stress, and rhythm information. The proposed system uses (a) automatic speech recognition and alignment for pronunciation assessment, (b) a set of special features with appropriate normalization for word stress detection, and (c) a prosodic phrase prediction model for rhythm assessment; it is shown to give speakers immediate and accurate analyses that improve learning efficiency.
#11Russian Infants and Children’s Sounds and Speech Corpuses for Language Acquisition Studies
Elena Lyakso (Saint-Petersburg State University)
Olga Frolova (Saint-Petersburg State University)
Anna Kurazhova (Saint-Petersburg State University)
Julia Gaikova (Saint-Petersburg State University)
«INFANTRU» and «CHILDRU» are the first Russian child speech databases. The corpus «INFANTRU» contains longitudinal vocalization and speech recordings (n=2967) of 99 children from 3 to 36 months of age, comprising long utterance sequences and separate utterances in different psycho-emotional states of the child. The database «CHILDRU» contains recordings (n=28079, 13956 MB) of the speech of 150 children aged 4 to 7 years. The speech material covers the following situations: spontaneous speech, answers to questions, reading, reciting poetry or retelling a tale, counting and the alphabet, and play. The speech file format is Windows PCM, 22050 Hz, 16 bit.
Julia Monnin (1. CNEP, Université de la Nouvelle-Calédonie, 2. Département Parole et Cognition, GIPSA-lab)
Hélène Loevenbruck (Département Parole et Cognition, GIPSA-lab)
This study extends a cross-linguistic collaboration on phonological development which aims at comparing the production of word-initial consonant-vowel sequences (CVs) across sets of languages that have comparable phonemes differing in overall frequency. By comparing across languages, the influence of language-specific distributional patterns on phoneme mastery can be disentangled from the effects of more general phonetic constraints on development. We conducted word and non-word repetition experiments with French- and Drehu-acquiring children aged 2 to 5 years. We first analysed production in words according to frequency data in French and Drehu. Results show that the production of word-initial consonants is correlated with frequency, especially in younger children. We then compared the non-word production scores of French- and Drehu-acquiring children. French and Drehu learners have similar mean scores but show different patterns for specific phonemes that differ in frequency.
#13Did you say susi or shushi? Measuring the emergence of robust fricative contrasts in English- and Japanese-acquiring children
Jeffrey Holliday (Ohio State University)
Mary Beckman (Ohio State University)
Chanelle Mays (Ohio State University)
While the English sibilant fricatives can be well-differentiated by the centroid frequency of the frication noise alone, the Japanese sibilant fricatives cannot be. Measures of perceived spectral peak frequency and shape developed for stop bursts were adapted to describe sibilant fricative contrasts in English- and Japanese-speaking adults and children. These measures captured both the cross-language differences and more subtle inter-individual differences related to language-specific marking of gender. They could also be used in deriving a measure of robustness of contrast that captured cross-language differences in fricative development.

Spoken language resources, systems and evaluation I

Time:Wednesday 10:00 Place:International Conference Room D Type:Poster
Chair:Bhiksha Raj
#1An Empirical Comparison of the T3, Juicer, HDecode and Sphinx3 Decoders
Josef R. Novak (Tokyo Institute of Technology)
Paul R. Dixon (National Institute of Information and Communications Technology)
Sadaoki Furui (Tokyo Institute of Technology)
In this paper we perform a cross-comparison of the T3 WFST decoder against three different speech recognition decoders on three separate tasks of variable difficulty. We show that the T3 decoder performs favorably against several established veterans in the field, including the Juicer WFST decoder, Sphinx3, and HDecode in terms of RTF versus Word Accuracy. In addition to comparing decoder performance, we evaluate both Sphinx and HTK acoustic models on a common footing inside T3, and show that the speed benefits that typically accompany the WFST approach increase with the size of the vocabulary and other input knowledge sources. In the case of T3, we also show that GPU acceleration can significantly extend these gains.
#2Tracter: A lightweight Dataflow Framework
Philip N. Garner (Idiap Research Institute)
John Dines (Idiap Research Institute)
Tracter is introduced as a dataflow framework particularly useful for speech recognition. It is designed to work on-line in real time as well as off-line, and is the feature extraction means for the Juicer transducer-based decoder. This paper places Tracter in context amongst the dataflow literature and other commercial and open-source packages. Some design aspects and capabilities are discussed. Finally, a fairly large processing graph incorporating voice activity detection and feature extraction is presented as an example of Tracter's capabilities.
#3Verifying Pronunciation Dictionaries using Conflict Analysis
Marelie H. Davel (CSIR South Africa)
Febe de Wet (CSIR South Africa)
We describe a new technique for automatically identifying errors in an electronic pronunciation dictionary by analyzing the source of conflicting patterns directly. We evaluate the effectiveness of this technique in two ways: we perform a controlled experiment using artificially corrupted data (allowing us to measure precision and recall exactly), and then apply the technique to a real-world pronunciation dictionary, demonstrating its effectiveness in practice. We also introduce a new freely available pronunciation resource (the RCRL Afrikaans Pronunciation Dictionary), the largest such dictionary currently available.
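A much-simplified version of the conflict idea can be sketched as follows (illustrative only: it assumes one phoneme per letter and a trivial grapheme context, whereas the paper works with real grapheme-to-phoneme alignments and a proper pattern analysis):

```python
from collections import defaultdict

def find_conflicts(dictionary, width=0):
    """Flag grapheme contexts that map to more than one phoneme across
    entries; conflicting entries are candidates for manual review.
    dictionary: word -> aligned phoneme list (one phoneme per letter here)."""
    ctx_map = defaultdict(set)
    for word, phones in dictionary.items():
        padded = "#" * width + word + "#" * width
        for i, phone in enumerate(phones):
            ctx = padded[i:i + 2 * width + 1]
            ctx_map[ctx].add(phone)
    return {ctx: ps for ctx, ps in ctx_map.items() if len(ps) > 1}

# Toy lexicon with a planted error: "cab" transcribes its 'a' as "e"
lex = {"cat": ["k", "a", "t"], "cap": ["k", "a", "p"], "cab": ["k", "e", "b"]}
print(find_conflicts(lex))  # flags 'a' -> {'a', 'e'}
```

Widening the context reduces spurious conflicts (genuinely context-dependent pronunciations) at the cost of missing errors that only surface in narrow contexts, which is why analyzing the source of each conflict matters.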
#4Automatic Estimation of Transcription Accuracy and Difficulty
Brandon Roy (MIT)
Soroush Vosoughi (MIT)
Deb Roy (MIT)
Managing a large-scale speech transcription task with a team of human transcribers requires effective quality control and workload distribution. As it becomes easier and cheaper to collect massive audio corpora, the problem is magnified. Relying on expert review or transcribing all speech multiple times is impractical. Furthermore, speech that is difficult to transcribe may be better handled by a more experienced transcriber or skipped entirely. We present a fully automatic system to address these issues. First, we use the system to estimate transcription accuracy from a single transcript and show that it correlates well with inter-transcriber agreement. Second, we use the system to estimate the transcription "difficulty" of a speech segment and show that it is strongly correlated with transcriber effort. This system can help a transcription manager determine when speech segments may require review, track transcriber performance, and efficiently manage the transcription process.
#5Creating a semantic coherence dataset with non-expert annotators
Benjamin Lambert (Language Technologies Institute, Carnegie Mellon University)
Rita Singh (Language Technologies Institute, Carnegie Mellon University)
Bhiksha Raj (Language Technologies Institute, Carnegie Mellon University)
We describe the creation of a linguistic plausibility dataset that contains annotated examples of language judged to be linguistically plausible, implausible, and everything in between. To create the dataset we randomly generate sentences and have them annotated by crowdsourcing via Amazon Mechanical Turk. Obtaining inter-annotator agreement is a difficult problem because linguistic plausibility is highly subjective. The annotations obtained depend, among other factors, on the manner in which annotators are questioned about the plausibility of sentences. We describe our experiments on posing a number of different questions to the annotators, in order to elicit the responses with greatest agreement, and present several methods for analyzing the resulting responses. The generated dataset and annotations are being made available to the public.
#6Construction and Evaluations of an Annotated Chinese Conversational Corpus in Travel Domain for Language Model of Speech Recognition
Xinhui Hu (National Institute of Information and Communications Technology, Japan)
Ryosuke Isotani (National Institute of Information and Communications Technology, Japan)
Hisashi Kawai (National Institute of Information and Communications Technology, Japan)
Satoshi Nakamura (National Institute of Information and Communications Technology, Japan)
In this paper we describe the development of an annotated Chinese conversational textual corpus for speech recognition in a speech-to-speech translation system in the travel domain. A total of 515,000 manually checked utterances were constructed, which provided a 3.5 million word Chinese corpus with word segmentation and part-of-speech tagging. The annotation is conducted with careful manual checking. The specifications on word segmentation and POS-tagging are designed to follow the main existing Chinese corpora that are widely accepted by researchers of Chinese natural language processing. Many particular features of conversational texts are also taken into account. With this corpus, parallel corpora are obtained together with the corresponding pairs of Japanese and English texts from which the Chinese was translated. To evaluate the corpus, the language models built from it are evaluated using perplexity and speech recognition accuracy as criteria. The perplexity of the Chinese language model is verified as having reached a reasonably low level. Recognition performance is also found to be comparable to the other two languages, even though the quantity of training data for Chinese is only half that of the other two languages.
#7Building transcribed speech corpora quickly and cheaply for many languages
Thad Hughes (Google)
Kaisuke Nakajima (Google)
Linne Ha (Google)
Atul Vasu (Google)
Pedro Moreno (Google)
Mike LeBeau (Google)
We present a system for quickly and cheaply building transcribed speech corpora containing utterances from many speakers in a variety of acoustic conditions. The system consists of a client application running on an Android mobile device with an intermittent Internet connection to a server. The client application collects demographic information about the speaker, fetches textual prompts from the server for the speaker to read, records the speaker’s voice, and uploads the audio and associated metadata to the server. The system has so far been used to collect over 3000 hours of transcribed audio in 17 languages around the world.
#8The CHiME corpus: a resource and a challenge for Computational Hearing in Multisource Environments
Heidi Christensen (University of Sheffield)
Jon Barker (University of Sheffield)
Ning Ma (University of Sheffield)
Phil Green (University of Sheffield)
We present a new corpus designed for noise-robust speech processing research, CHiME. Our goal was to produce material which is both natural (derived from reverberant domestic environments with many simultaneous and unpredictable sound sources) and controlled (providing an enumerated range of SNRs spanning 20 dB). The corpus includes around 40 hours of background recordings from a head and torso simulator positioned in a domestic setting, and a comprehensive set of binaural impulse responses collected in the same environment. These have been used to add target utterances from the Grid speech recognition corpus into the CHiME domestic setting. Data has been mixed in a manner that produces a controlled and yet natural range of SNRs over which speech separation, enhancement and recognition algorithms can be evaluated. The paper motivates the design of the corpus, and describes the collection and post-processing of the data. We also present a set of baseline recognition results.
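The abstract above describes mixing target utterances into background recordings so that a controlled range of SNRs is obtained. A minimal sketch of that scaling step, in Python with NumPy (the function name `mix_at_snr` and the gain derivation are ours for illustration, not the CHiME pipeline itself):

```python
import numpy as np

def mix_at_snr(target, background, snr_db):
    """Scale `target` so the target-to-background power ratio equals
    `snr_db` (in dB), then add it to the background."""
    p_t = np.mean(target ** 2)          # target power
    p_b = np.mean(background ** 2)      # background power
    # gain g such that 10*log10((g**2 * p_t) / p_b) == snr_db
    gain = np.sqrt(p_b / p_t * 10 ** (snr_db / 10.0))
    return gain * target + background

# illustrative one-second signals at 16 kHz
rng = np.random.default_rng(0)
t = rng.standard_normal(16000)
b = rng.standard_normal(16000)
mixed = mix_at_snr(t, b, 0.0)           # mix at 0 dB SNR
```

Because only the target is rescaled, the background stays natural while the SNR sweeps over the desired range.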
#9Developing A Chinese L2 Speech Database of Japanese Learners With Narrow-Phonetic Labels For Computer Assisted Pronunciation Training
Wen Cao (Center of Studies of Chinese as a Second Language, Beijing Language and Culture University, P. R. China)
Dongning Wang (Center of Studies of Chinese as a Second Language, Beijing Language and Culture University, P. R. China)
Jin-Song Zhang (Center of Studies of Chinese as a Second Language, College of Information Science, Beijing Language and Culture University, P. R. China)
Ziyu Xiong (Institute of Linguistics, Chinese Academy of Social Sciences)
For the purpose of developing Computer Assisted Pronunciation Training (CAPT) technology with more informative feedback, we propose to use a set of narrow-phonetic labels to annotate a Chinese L2 speech database of Japanese learners. The labels include basic units of “Initials” and “Finals” for Chinese phonemes, and diacritics for erroneous articulation tendencies. Pilot investigations were made of the annotation consistency of two sets of phonetic transcriptions in 17 speakers’ data. The results indicate that the consistency is moderately good, suggesting that the annotation procedure is practical, with room for further improvement.
#10How Children Acquire Situation Understanding Skills?: A Developmental Analysis Utilizing Multimodal Speech Behavior Corpus
Ishikawa Shogo (Shizuoka University)
Kiriyama Shinya (Shizuoka University)
Takebayashi Yoichi (Shizuoka University)
Kitazawa Shigeyoshi (Shizuoka University)
We have developed a multimodal speech behavior corpus which includes metadata annotated from various viewpoints such as utterances, actions, emotions and intentions, for analyzing behavioral factors of thinking processes from various perspectives in everyday life. Utilizing the corpus, we analyzed children's development of situation understanding skills, focusing on "attention-catching", which serves as a signal when communicating with other people. We formulated a hypothesis about the developmental process: that there is a connection between physical expression skills and mental conditions such as utterances, gestures and the ability to attend. The analysis results showed that situation understanding skills follow a similar developmental path, shifting from object-centric to person-centric, although the age at which this change occurs differs. Furthermore, the analysis results informed a more in-depth construction of the corpus.
#11The Influence of Expertise and Efficiency on Modality Selection Strategies and Perceived Mental Effort
Ina Wechsung (Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin, Germany)
Stefan Schaffer (Research training group prometei, TU Berlin, Germany)
Robert Schleicher (Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin, Germany)
Anja Naumann (Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin, Germany)
Sebastian Möller (Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin, Germany)
This paper describes a user study investigating the influence of expertise and efficiency on modality selection (speech vs. virtual keyboard) and perceived mental effort. Efficiency was varied in terms of interaction steps. The goal was to investigate whether the number of necessary interaction steps determines the preference for a specific modality. It is shown that the threshold for changing the modality selection strategy is at three interaction steps for experts and four for novices.
#12Parameters Describing Multimodal Interaction - Definitions and Three Usage Scenarios
Christine Kühnel (Quality and Usability Lab, Technische Universität Berlin, Germany)
Benjamin Weiss (Quality and Usability Lab, Technische Universität Berlin, Germany)
Sebastian Möller (Quality and Usability Lab, Technische Universität Berlin, Germany)
While multimodal systems are an active research field, there is no agreed-upon set of multimodal interaction parameters that would allow quantifying the performance of such systems and their underlying modules, and that would therefore be necessary for a systematic evaluation. In this paper we propose an extension to established parameters describing the interaction with spoken dialog systems [Möller 2005] so that they can be used for multimodal systems. Focusing on the evaluation of a multimodal system, three usage scenarios for these parameters are given.
#13Repair Strategies on Trial: Which Error Recovery Do Users Like Best?
Alexander Zgorzelski (University of Ulm)
Alexander Schmitt (University of Ulm)
Tobias Heinroth (University of Ulm)
Wolfgang Minker (University of Ulm)
Extensive research about recovery strategies for misunderstandings and non-understandings within the context of spoken dialogue systems (SDS) has been undertaken in the past and is still going on. Many scientists focus on optimizing the recovery rate using various strategies. It is still not sufficiently explored how different strategies relate to user satisfaction, and how confused users become with simple strategies such as a reprompt. We carried out an empirical analysis with some of the most promising strategies. In addition to the two common strategies, help and reprompt, we also evaluated an adapted version of the promising MoveOn strategy. We found that the reactions to our different mockup dialogues vary considerably, especially between computer experts and novices.

Special Session: Speech Intelligibility Enhancement for All Ages, Health Conditions, and Environments

Time:Wednesday 10:00 Place:301 Type:Special
Chair:Qian-Jie Fu & Junfeng Li
10:00Enhanced Speech Yielding Higher Intelligibility for All Listeners and Environments
Takayuki Arai (Sophia University)
Nao Hodoshima (Tokai University)
The current paper discusses two approaches to enhanced speech in reverberation/noise: machine signal processing and human speech production. We reviewed speech enhancement techniques, including steady-state suppression, and compared the modulation spectra of speech signals before and after processing. We also introduced the Lombard-like effect of speech in reverberation, and compared the characteristics of speech signals, including the modulation spectra, between speech uttered in quiet and in reverberation. We found that the enhanced speech signals have distinct characteristics that yield higher speech intelligibility.
10:20Quality Conversion of Non-Acoustic Signals for Facilitating Human-to-Human Speech Communication under Harsh Acoustic Conditions
Seyed Omid Sadjadi (The University of Texas at Dallas)
Sanjay A. Patil (The University of Texas at Dallas)
John H.L. Hansen (The University of Texas at Dallas)
Harsh acoustic conditions limit the effectiveness of human speech communication to a great extent. There is a consensus that even at moderate SNR levels, traditional speech enhancement techniques tend to improve the perceptual quality of speech rather than its intelligibility. As an alternative, non-acoustic contact sensors have recently been developed for noise-robust signal capture. Although relatively immune to ambient noise, due to the alternative pickup location and non-acoustic principle of operation, signals measured from these sensors are of lower speech quality and intelligibility when compared to those obtained from a conventional microphone in clean conditions. To facilitate human-to-human speech communication in acoustically adverse environments, in this study we present and evaluate a probabilistic transformation framework to improve the perceptual quality and intelligibility of signals acquired from one such sensor, the physiological microphone (PMIC). Results from both objective and subjective tests confirm that incorporating this framework as a post-processing stage yields significant improvement in the overall quality and intelligibility of PMIC signals.
10:40The Use of Air-Pressure Sensor in Electrolaryngeal Speech Enhancement Based on Statistical Voice Conversion
Keigo Nakamura (Graduate School of Information Science, Nara Institute of Science and Technology, Japan)
Tomoki Toda (Graduate School of Information Science, Nara Institute of Science and Technology, Japan)
Hiroshi Saruwatari (Graduate School of Information Science, Nara Institute of Science and Technology, Japan)
Kiyohiro Shikano (Graduate School of Information Science, Nara Institute of Science and Technology, Japan)
In our previous work, we proposed a speaking-aid system converting electrolaryngeal speech (EL speech) to normal speech using a statistical voice conversion technique. The main weakness of our system is the difficulty of estimating natural contours of the fundamental frequency (F0) from EL speech, which contains only built-in F0 contours. This paper proposes another speaking-aid system with an air-pressure sensor that enables laryngectomees to control the F0 contours of EL speech using their breathing air. The experimental results demonstrate that 1) the correlation coefficient of F0 contours between the converted and the target speech is improved from 0.58 to 0.78 by the use of the air-pressure sensor, and 2) the synthetic speech converted by the proposed system sounds more natural and is preferred over that of our conventional aid system.
11:00A new binary mask based on noise constraints for improved speech intelligibility
Gibak Kim (University of Texas at Dallas)
Philipos C. Loizou (University of Texas at Dallas)
It has been shown that large gains in speech intelligibility can be obtained by using the binary mask approach which retains the time-frequency (T-F) units of the mixture signal that are stronger than the interfering noise (masker) (i.e., SNR>0 dB), and removes the T-F units where the interfering noise dominates. In this paper, we introduce a new binary mask for improving speech intelligibility based on noise distortion constraints. A binary mask is designed to retain noise overestimated T-F units while discarding noise underestimated T-F units. Listening tests were conducted to evaluate the new binary mask in terms of intelligibility. Results from the listening tests indicated that large gains in intelligibility can be achieved by the application of the proposed binary mask to noise-corrupted speech even at extremely low SNR levels (-10 dB).
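The classical binary mask the abstract builds on keeps T-F units with local SNR above 0 dB and discards the rest. A minimal sketch of that baseline mask in Python/NumPy (the function name and magnitude-spectrogram interface are our illustration; the paper's noise-constraint variant differs in how it selects units):

```python
import numpy as np

def ideal_binary_mask(target_tf, noise_tf, threshold_db=0.0):
    """Return a 0/1 mask over time-frequency units: 1 where the
    local target-to-noise ratio exceeds `threshold_db` (SNR > 0 dB
    by default), 0 elsewhere. Inputs are magnitude spectrograms."""
    eps = 1e-12                          # avoid log of zero
    local_snr_db = 20.0 * np.log10((target_tf + eps) / (noise_tf + eps))
    return (local_snr_db > threshold_db).astype(float)

# the enhanced signal is resynthesized from mask * mixture_tf
```

Multiplying the mask against the noisy mixture spectrogram and resynthesizing yields the intelligibility gains referred to above.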
11:20Energy reallocation strategies for speech enhancement in known noise conditions
Yan Tang (Language and Speech Laboratory, Faculty of Letters, Universidad del Pais Vasco, Spain)
Martin Cooke (Language and Speech Laboratory, Faculty of Letters, Universidad del Pais Vasco, Spain; Ikerbasque (Basque Science Foundation))
Speech output, whether live, recorded or synthetic, is often employed in difficult listening conditions. Context-sensitive speech modifications aim to promote intelligibility while maintaining quality and listener comfort. The current study used objective measures of intelligibility and quality to compare five energy reallocation strategies operating under equal energy and preserved duration constraints. Results in both stationary and highly-nonstationary backgrounds suggest that time-varying modifications lead to large increases in objective intelligibility, but that speech quality is best preserved by time-invariant modifications. Selective amplification of time-frequency regions with low a priori SNR produced the highest objective intelligibility without severe disruption to quality.
11:40Effects of Enhancement of Spectral Changes on Speech Quality and Subjective Speech Intelligibility
Jing Chen (Department of Experimental Psychology, University of Cambridge)
Thomas Baer (Department of Experimental Psychology, University of Cambridge)
Brian Moore (Department of Experimental Psychology, University of Cambridge)
Most information in speech is carried in the spectral changes over time, rather than in static spectral shape per se. The present study presents a preliminary assessment of the possible benefits for speech intelligibility of enhancing spectral changes across time, using behavioral tests with both normal-hearing and hearing-impaired subjects. Ratings of speech quality and intelligibility were obtained using several variants of the processing scheme, with different processing parameters. The results suggest that the processing strategy may have advantages for hearing-impaired people, for certain sets of parameter values.

ASR: Search, Decoding and Confidence Measures II

Time:Wednesday 13:30 Place:Hall A/B Type:Oral
Chair:Michael Riley
13:30CRF-based Combination of Contextual Features to Improve A Posteriori Word-level Confidence Measures
Julien Fayolle (IRISA/INRIA Rennes, France)
Fabienne Moreau (University of Rennes 2/IRISA Rennes, France)
Christian Raymond (IRISA/INSA Rennes, France)
Guillaume Gravier (IRISA/CNRS Rennes, France)
Patrick Gros (IRISA/INRIA Rennes, France)
This paper addresses the issue of the reliability of confidence measures provided by automatic speech recognition systems for use in various spoken language processing applications. We propose a method based on conditional random fields to combine contextual features to improve word-level confidence measures. The method consists of combining various knowledge sources (acoustic, lexical, linguistic, phonetic and morphosyntactic) to enhance confidence measures, explicitly exploiting context information. Experiments were conducted on a large French broadcast news corpus from the ESTER benchmark. Results demonstrate the added value of our method, with a significant improvement of the normalized cross entropy and of the equal error rate.
13:50Recognition of Spontaneous Conversational Speech using Long Short-Term Memory Phoneme Predictions
Martin Woellmer (Technische Universitaet Muenchen)
Florian Eyben (Technische Universitaet Muenchen)
Bjoern Schuller (Technische Universitaet Muenchen)
Gerhard Rigoll (Technische Universitaet Muenchen)
We present a novel continuous speech recognition framework designed to unite the principles of triphone and Long Short-Term Memory (LSTM) modeling. The LSTM principle allows a recurrent neural network to store and to retrieve information over long time periods, which was shown to be well-suited for the modeling of co-articulation effects in human speech. Our system uses a bidirectional LSTM network to generate a phoneme prediction feature that is observed by a triphone-based large-vocabulary continuous speech recognition (LVCSR) decoder, together with conventional MFCC features. We evaluate both the phoneme prediction error rates of various network architectures and the word recognition performance of our Tandem approach using the COSINE database - a large corpus of conversational and noisy speech - and show that incorporating LSTM phoneme predictions into an LVCSR system leads to significantly higher word accuracies.
14:10Improving ASR error detection with non-decoder based features
Thomas Pellegrini (INESC-ID)
Isabel Trancoso (INESC-ID IST)
This study reports error detection experiments in large vocabulary automatic speech recognition (ASR) systems, using statistical classifiers. We explored new features gathered from knowledge sources other than the decoder itself: a binary feature that compares outputs from two different ASR systems (word by word), a feature based on the number of hits of the hypothesized bigrams, obtained by queries entered into a very popular Web search engine, and finally a feature related to automatically inferred topics at sentence and word levels. Experiments were conducted on a European Portuguese broadcast news corpus. The combination of baseline decoder-based features and two of these additional features led to significant improvements, from 13.87% to 12.16% classification error rate (CER) with a maximum entropy model, and from 14.01% to 12.39% CER with linear-chain conditional random fields, compared to a baseline using only decoder-based features.
14:30Phoneme Classification and Lattice Rescoring Based on a k-NN Approach
Ladan Golipour (INRS)
Douglas O'Shaughnessy (INRS)
In this paper we propose a k-NN/SASH phoneme classification algorithm that competes favourably with state-of-the-art methods. We apply a similarity search algorithm (SASH) that has been used successfully for classification of high-dimensional texts and images. Unlike other search algorithms, the computational time of SASH is not affected by the dimensionality of the data. Therefore, we generate fixed-length but high-dimensional feature vectors for phonemes using their underlying frames and those of the boundaries. The k-NN/SASH phoneme classifier is fast and efficient, and achieves a classification rate of 79.2% on the TIMIT test set. Finally, we apply this algorithm to rescore phoneme lattices generated by a GMM-HMM monophone recognizer for both context-independent and context-dependent tasks. In both cases, the k-NN/SASH classifier leads to improvements in the recognition rate.
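The voting step of a k-NN classifier over fixed-length phoneme feature vectors can be sketched as below (a plain brute-force search for illustration; in the paper the exhaustive distance computation is replaced by the approximate SASH index, and the function name `knn_classify` is ours):

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=5):
    """Classify vector `x` by majority vote among its k nearest
    training vectors under Euclidean distance."""
    d = np.linalg.norm(train_X - x, axis=1)   # distance to every training vector
    nearest = np.argsort(d)[:k]               # indices of the k closest
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]         # most frequent label wins
```

The appeal of swapping in SASH is that this lookup no longer scales with the dimensionality of the feature vectors, which is what makes high-dimensional fixed-length phoneme representations practical.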
14:50Online Adaptive Learning for Speech Recognition Decoding
Jeff Bilmes (University of Washington)
Hui Lin (University of Washington)
We describe a new method for pruning in dynamic models based on running an adaptive filtering algorithm online during decoding to predict aspects of the scores in the near future. These predictions are used to make well-informed pruning decisions during model expansion. We apply this idea to the case of dynamic graphical models and test it on a speech recognition database derived from Switchboard. Results show that significant (approximately factor of 2) speedups can be obtained without any decrease in word error rate or increase in memory usage.
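The core idea above is to run an adaptive filter online so that upcoming scores can be predicted before expansion. One standard adaptive filter that fits this description is LMS; the sketch below shows a one-step-ahead LMS score predictor (the class, its parameters, and the choice of LMS itself are our assumptions for illustration, not the authors' implementation):

```python
import numpy as np

class LMSPredictor:
    """One-step-ahead predictor using the least-mean-squares (LMS)
    adaptive filter: predicts the next score from a short history of
    recent scores, updating its weights online after each observation."""
    def __init__(self, order=4, mu=0.05):
        self.w = np.zeros(order)      # filter weights, adapted online
        self.hist = np.zeros(order)   # most recent observations
        self.mu = mu                  # LMS step size

    def predict(self):
        return float(self.w @ self.hist)

    def update(self, observed):
        err = observed - self.predict()
        self.w += self.mu * err * self.hist   # LMS weight update
        self.hist = np.roll(self.hist, 1)     # shift history
        self.hist[0] = observed
        return err
```

A pruning decision could then compare a hypothesis score against the predicted near-future score rather than against a fixed beam width.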
15:10Improvements of Search Error Risk Minimization in Viterbi Beam Search for Speech Recognition
Takaaki Hori (NTT Corporation)
Shinji Watanabe (NTT Corporation)
Atsushi Nakamura (NTT Corporation)
This paper describes improvements in a search error risk minimization approach to fast beam search for speech recognition. In our previous work, we proposed this approach to reduce search errors by optimizing the pruning criterion. While conventional methods use heuristic criteria to prune hypotheses, our proposed method employs a pruning function that makes a more precise decision using rich features extracted from each hypothesis. The parameters of the function can be estimated to minimize a loss function based on the search error risk. In this paper, we improve this method by introducing a modified loss function, arc-averaged risk, which potentially has a higher correlation with actual error rate than the original one. We also investigate various combinations of features. Experimental results show that further search error reduction over the original method is obtained in a 100K-word vocabulary lecture speech transcription task.

Spoken language resources, systems and evaluation II

Time:Wednesday 13:30 Place:201A Type:Oral
Chair:Norihide Kitaoka
13:30Say What? Why users choose to speak their web queries
Maryam Kamvar (Google)
Doug Beeferman (Google)
The context in which a speech-driven application is used (or conversely not used) can be an important signal for recognition engines, and for spoken interface design. Using large-scale logs from a widely deployed spoken system, we analyze on an aggregate level the factors that are correlated with a decision to speak a query rather than type it. We find the factors most predictive of spoken queries are whether a query is made from an unconventional keyboard, for a search topic relating to a user's location, or for a search topic that can be answered in a “hands-free” fashion. We also find, contrary to our intuition, that longer queries have a higher probability of being typed than shorter queries.
13:50The Effect of Audience Familiarity on the Perception of Modified Accent
Jonathan Teutenberg (Teesside University)
Catherine Watson (University of Auckland)
Evaluating the efficacy of accent transformation is important when localising speech-enabled software. However, perceived accent is an attribute assigned by a listener, and the apparent success of accent transformation will vary with the audience. Here we show the extent to which evaluations can be affected by audience familiarity with an accent. A perceptual study comparing two approaches to accent transformation is presented to two audiences with differing familiarity with the target accents. For mean opinion score style evaluations, we quantify the approximate change in perception, and show that this can be sufficient to alter the relative success of such systems.
14:10On Generating Combilex Pronunciations via Morphological Analysis
Korin Richmond (Centre for Speech Technology Research, Edinburgh University)
Robert Clark (Centre for Speech Technology Research)
Sue Fitt (Centre for Speech Technology Research)
Combilex is a high-quality lexicon that has been developed specifically for speech technology purposes and recently released by CSTR. Combilex benefits from many advanced features. This paper explores one of these: the ability to generate fully-specified transcriptions for morphologically derived words automatically. This functionality was originally implemented to encode the pronunciations of derived words in terms of their constituent morphemes, thus accelerating lexicon development and ensuring a high level of consistency. In this paper, we propose that this method of modelling pronunciations can be exploited further by combining it with a morphological parser, thus yielding a method to generate full transcriptions for unknown derived words. Not only could this accelerate adding new derived words to Combilex, but it could also serve as an alternative to conventional letter-to-sound rules. This paper presents preliminary work indicating that this is a promising direction.
14:30Say It As You Mean It – Analyzing Free User Comments in the VOICE Awards Corpus
Florian Gödde (Quality and Usability Lab, Deutsche Telekom Labs, Technische Universität Berlin)
Sebastian Möller (Quality and Usability Lab, Deutsche Telekom Labs, Technische Universität Berlin)
Usability questionnaires usually contain scales related to effectiveness, efficiency and overall satisfaction which provide a quantitative value for the user’s opinion. However, analyzing quantitative data often does not reveal the reasons underlying a good or bad opinion. Simple questions like “What did you like about the system?” and “What did you not like about the system?” can shed light on the underlying reasons, but a lot of effort is needed for the analysis of such data. Nevertheless, the answers to these questions contain the users’ opinion in their own words and hence often show high correlation with the overall rating of the system. Within the framework of the SpeechEval project we analyzed the German VOICE Awards corpus over three consecutive years, categorizing the answers to these two free-text questions and analyzing correlations between the categories and the overall rating of the systems.
14:50A new multichannel multimodal dyadic interaction database
Viktor Rozgic (USC)
Bo Xiao (USC)
Nassos Katsamanis (USC)
Brian Baucom (USC)
Panayiotis Georgiou (USC)
Shrikanth Narayanan (USC)
In this work we present a new multi-modal database for the analysis of participant behaviors in dyadic interactions. This database contains multiple channels with close- and far-field audio, a high-definition camera array and motion capture data. The presence of motion capture allows precise analysis of low-level body language descriptors and their comparison with similar descriptors derived from video data. The data are manually labeled by multiple human annotators using psychology-informed guides. This work also presents an initial analysis of approach-avoidance (A-A) behavior. Two sets of annotations are provided, one based on video only and the other obtained by using both the audio and video channels. Additionally, we describe the statistics of interaction descriptors and A-A labels with respect to participants' roles. Finally we provide an analysis of relations between various non-verbal features and approach/avoidance labels.
15:10SEAME: a Mandarin-English Code-switching Speech Corpus in South-East Asia
Dau-Cheng Lyu (School of Computer Engineering, Nanyang Technological University, Singapore)
Tien-Ping Tan (School of Computer Sciences, Universiti Sains Malaysia, 11800 USM, Penang, Malaysia)
Eng-Siong Chng (School of Computer Engineering, Nanyang Technological University, Singapore 639798)
Haizhou Li (Institute for Infocomm Research, 1 Fusionopolis Way, Singapore 138632)
In Singapore and Malaysia, people often speak a mix of Mandarin and English within a single sentence, which we call an intra-sentential code-switched sentence. In this paper, we report the development of a Mandarin-English code-switching spontaneous speech corpus: SEAME. As part of a multilingual speech recognition project, the design of such a corpus allows the study of how Mandarin-English code-switched speech occurs in the spoken language of South-East Asia, and provides insights into the development of large vocabulary continuous speech recognition (LVCSR) covering code-switched speech. We develop a speech corpus of intra-sentential code-switching utterances that are recorded under both interview and conversational settings. The paper describes the corpus design and the analysis of the collected corpus.

Speech Production III: Analysis

Time:Wednesday 13:30 Place:201B Type:Oral
Chair:Shrikanth Narayanan
13:30Relying on critical articulators to estimate vocal tract spectra in an articulatory-acoustic database
Daniel Felps (Department of Computer Science and Engineering, Texas A&M University)
Christian Geng (Department of Linguistics and English Language, University of Edinburgh)
Michael Berger (Centre for Speech Technology Research, University of Edinburgh)
Korin Richmond (Centre for Speech Technology Research, University of Edinburgh)
Ricardo Gutierrez-Osuna (Department of Computer Science and Engineering, Texas A&M University)
We present a new phone-dependent feature weighting scheme that can be used to map articulatory configurations (e.g. EMA) onto vocal tract spectra (e.g. MFCC) through table lookup. The approach consists of assigning feature weights according to a feature’s ability to predict the acoustic distance between frames. Since an articulator’s predictive accuracy is phone-dependent (e.g., lip location is a better predictor for bilabial sounds than for palatal sounds), a unique weight vector is found for each phone. Inspection of the weights reveals a correspondence with the expected critical articulators for many phones. The proposed method reduces overall cepstral error by 6% when compared to a uniform weighting scheme. Vowels show the greatest benefit, though improvements occur for 80% of the tested phones.
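The table-lookup mapping described above can be sketched in a few lines: each articulatory query frame is compared to the stored articulatory frames under a weighted distance, and the spectral frame of the closest entry is returned. The function name `weighted_lookup` and the flat-array interface are our illustration; in the paper the weight vector is selected per phone.

```python
import numpy as np

def weighted_lookup(query, table_feats, table_specs, weights):
    """Map an articulatory frame `query` to a spectral frame by
    weighted nearest-neighbour lookup. `weights` plays the role of
    the (phone-dependent) feature weights; `table_feats[i]` pairs
    with `table_specs[i]`."""
    d = np.sqrt((((table_feats - query) ** 2) * weights).sum(axis=1))
    return table_specs[np.argmin(d)]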
Vikram Ramanarayanan (University of Southern California)
Dani Byrd (University of Southern California)
Louis Goldstein (University of Southern California)
Shrikanth Narayanan (University of Southern California)
We present a novel automatic procedure to analyze 'articulatory setting (AS)' or 'basis of articulation' using real-time magnetic resonance images (rt-MRI) of the human vocal tract recorded for read and spontaneously spoken speech. We extract relevant frames of inter-speech pauses (ISPs) and rest positions from MRI sequences of read and spontaneous speech and use automatically-extracted features to quantify areas of different regions of the vocal tract as well as the angle of the jaw. Significant differences were found between the ASs adopted for ISPs in read and spontaneous speech, as well as those between ISPs and absolute rest positions. We further contrast differences between ASs adopted when the person is ready to speak as opposed to an absolute rest position.
14:10Articulatory inversion of American English /r/ by conditional density modes
Chao Qin (University of California, Merced)
Miguel Carreira-Perpiñán (University of California, Merced)
Although many algorithms have been proposed for articulatory inversion, they are often tested on synthetic models, or on real data that shows very small proportions of nonuniqueness. We focus on data from the Wisconsin X-ray microbeam database for the American English /r/, which displays multiple, very different articulations (retroflex and bunched). We propose a method based on recovering the set of all possible vocal tract shapes as the modes of a conditional density of articulators given acoustics, and then selecting feasible trajectories from this set. This method accurately recovers the correct /r/ shape, while a neural network produces errors twice as large.
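The mode-recovery step can be illustrated on a toy one-dimensional conditional density (the two components standing in for the retroflex and bunched articulations, and all numeric values, are invented for illustration):

```python
import numpy as np

# Toy 1-D conditional density p(x | a): a two-component Gaussian mixture
# standing in for two very different articulations (retroflex, bunched)
# that map to similar acoustics.
means = np.array([-1.5, 1.5])
stds = np.array([0.4, 0.4])
weights = np.array([0.5, 0.5])

def cond_density(x):
    comps = weights / (stds * np.sqrt(2 * np.pi)) \
        * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2)
    return comps.sum(axis=1)

# Recover all candidate articulations as the local maxima (modes) of
# p(x | a) on a grid; trajectory selection would then pick a smooth path
# through these per-frame mode sets.
grid = np.linspace(-4, 4, 2001)
p = cond_density(grid)
is_mode = (p[1:-1] > p[:-2]) & (p[1:-1] > p[2:])
modes = grid[1:-1][is_mode]
```

Both articulations survive as separate hypotheses, instead of being averaged into an infeasible intermediate shape as a unimodal regressor would do.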
14:30Can tongue be recovered from face? The answer of data-driven statistical models
Atef Ben Youssef (GIPSA-lab (Dept Parole & Cognition / ICP), UMR 5216, CNRS – Grenoble University, France)
Pierre Badin (GIPSA-lab (Dept Parole & Cognition / ICP), UMR 5216, CNRS – Grenoble University, France)
Gérard Bailly (GIPSA-lab (Dept Parole & Cognition / ICP), UMR 5216, CNRS – Grenoble University, France)
This study revisits the face-to-tongue articulatory inversion problem in speech. We compare the Multi Linear Regression method (MLR) with two more sophisticated methods based on Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs), using the same French corpus of articulatory data acquired by electromagnetic articulography. GMMs give better overall results than HMMs, while MLR does poorly. GMMs and HMMs maintain the original phonetic class distribution, though with some centralisation effects; these effects are much stronger with MLR. A detailed analysis shows that, while the jaw / lips / tongue tip synergy helps recover front high vowels and coronal consonants, the velars are not recovered at all. It is therefore not possible to reliably recover the tongue from the face.
14:50Phrase-medial vowel devoicing in spontaneous French
Francisco Torreira (Radboud Universiteit Nijmegen & Max Planck Institute for Psycholinguistics)
Mirjam Ernestus (Radboud Universiteit Nijmegen & Max Planck Institute for Psycholinguistics)
This study investigates phrase-medial vowel devoicing in European French (e.g. /ty po/ [typo] 'you can'). Our spontaneous speech data confirm that French phrase-medial devoicing is a frequent phenomenon affecting high vowels preceded by voiceless consonants. We also found that devoicing is more frequent in temporally reduced and coarticulated vowels. Complete and partial devoicing were conditioned by the same variables (speech rate, consonant type, and distance from the end of the accentual phrase (AP)). Given these results, we propose that phrase-medial vowel devoicing in French arises mainly from the temporal compression of vocalic gestures and the aerodynamic conditions imposed by high vowels.
15:10Exploring the Mechanism of Tonal Contraction in Taiwan Mandarin
Chierh Cheng (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)
Yi Xu (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)
Michele Gubian (Centre for Language & Speech Technology, Radboud University, Nijmegen, NL)
This study investigates the mechanism of tonal contraction when a disyllabic unit is merged into a monosyllable at fast speech rate in Taiwan Mandarin. Various degrees of contraction of bi-tonal sequences were elicited by manipulating speech rates. Functional Data Analysis was performed to compare trajectories of F0 and velocity in the contracted and non-contracted syllables. Preliminary results show that speakers always make an effort to produce the original tones, even in cases of extreme degrees of reduction. This finding militates against phonology-based accounts like the Edge-in model, according to which contraction is a process of deleting adjacent tonemes while leaving the non-adjacent tonemes intact.

Paralanguage & Cognition

Time:Wednesday 13:30 Place:302 Type:Oral
Chair:Julia Hirschberg
13:30Voice Attributes Affecting Likability Perception
Benjamin Weiss (Quality & Usability Lab, Deutsche Telekom Laboratories, TU Berlin)
Felix Burkhardt (Deutsche Telekom Laboratories)
Ratings of voice likability were collected in two successive studies. A single scale appears sufficient for assessing such ratings; as a simple method of collecting more data for further studies, the scale was validated within the bounds of the small data set. Based on limited but controlled data, spectral parameters as well as f0 and articulation rate correlate with the ratings obtained. An automatic classification confirms the relevance of spectral features for the perception of likability. Both the spectral parameters and items from a comprehensive questionnaire indicate the relevance of timbre for likability perception.
13:50Turn alignment using eye-gaze and speech in conversational interaction
Kristiina Jokinen (University of Helsinki)
Kazuaki Harada (Doshisha University)
Masafumi Nishida (Doshisha University)
Seiichi Yamamoto (Doshisha University)
Spoken interactions are known for accurate timing and alignment between interlocutors: turn-taking and topic flow are managed in a manner that provides conversational fluency and smooth progress of the task. This paper studies the relation between the interlocutors' eye-gaze and spoken utterances, and describes our experiments on turn alignment. We conducted Support Vector Machine classification experiments on turn-taking, using dialogue act, eye-gaze, and speech prosody features from conversation data. The results demonstrate that eye-gaze features are important signals in turn management, and appear even more important than speech features when the intention of the utterance is clear.
14:10An Investigation of Formant Frequencies for Cognitive Load Classification
Tet Fei Yap (School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, Australia)
Julien Epps (School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, Australia)
Eliathamby Ambikairajah (School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, Australia)
Eric H. C. Choi (ATP Research Laboratory, National ICT Australia, Australia)
The cognitive load experienced by a person can be used as an index to monitor task performance. Hence, the ability to measure the cognitive load of a person using speech can potentially be very useful, especially in areas such as air traffic control systems. Current research on cognitive load does not provide enough insight into how cognitive load affects the speech spectrum, or the speech production system. Since formants are closely related to the underlying vocal tract configuration, this work studies the effect of cognitive load on vowel formant frequencies, and hence proposes the application of formant features to cognitive load classification. Results from classification performed on the Stroop test database show that formant features not only have lower dimensionality than conventionally used MFCC-based features, but that dynamic formant features can also outperform them by a relative improvement of 12%.
14:30Language specific effects of emotion on phoneme duration
Martijn Goudbeek (Tilburg University)
Mirjam Broersma (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
This paper presents an analysis of phoneme durations of emotional speech in two languages: Dutch and Korean. The analyzed corpus of emotional speech has been specifically developed for the purpose of cross-linguistic comparison, and is more balanced than any similar corpus available so far: a) it contains expressions by both Dutch and Korean actors and is based on judgments by both Dutch and Korean listeners; b) the same elicitation technique and recording procedure were used for recordings of both languages; and c) the phonetics of the carrier phrase were constructed to be permissible in both languages. The carefully controlled phonetic content of the carrier phrase allows for analysis of the role of specific phonetic features, such as phoneme duration, in emotional expression in Dutch and Korean. In this study the mutual effect of language and emotion on phoneme duration is presented.
14:50Automatic Classification of Married Couples' Behavior using Audio Features
Matthew Black (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Athanasios Katsamanis (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Chi-Chun Lee (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Adam Lammert (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Brian Baucom (Department of Psychology, University of Southern California, Los Angeles, CA, USA)
Andrew Christensen (Department of Psychology, University of California, Los Angeles, Los Angeles, CA, USA)
Panayiotis Georgiou (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
In this work, we analyzed a 96-hour corpus of married couples spontaneously interacting about a problem in their relationship. Each spouse was manually coded with relevant session-level perceptual observations (e.g., level of blame toward other spouse, global positive affect), and our goal was to classify the spouses' behavior using features derived from the audio signal. Based on automatic segmentation, we extracted prosodic/spectral features to capture global acoustic properties for each spouse. We then trained gender-specific classifiers to predict the behavior of each spouse for six codes. We compare performance for the various factors (across codes, gender, classifier type, and feature type) and discuss future work for this novel and challenging corpus.
15:10Influence of Gestural Salience on the Interpretation of Spoken Requests
Gideon Kowadlo (Monash University)
Patrick Ye (Monash University)
Ingrid Zukerman (Monash University)
We present a probabilistic, salience-based mechanism for the interpretation of pointing gestures together with spoken utterances. Our formulation models dependencies between spatial and temporal aspects of gestures and features of objects. The results from our corpus-based evaluation show that the incorporation of pointing information improves interpretation accuracy.

Robust ASR Against Noise

Time:Wednesday 13:30 Place:International Conference Room A Type:Poster
Chair:Reinhold Haeb-Umbach
#1Robust Word Recognition using Articulatory trajectories and Gestures
Vikramjit Mitra (Institute for Systems Research, Department of Electrical and Computer Engineering, University of Maryland, College Park)
Hosung Nam (Haskins Laboratories)
Carol Espy-Wilson (Institute for Systems Research, Department of Electrical and Computer Engineering, University of Maryland, College Park)
Elliot Saltzman (Haskins Laboratories & Department of Physical Therapy and Athletic Training, Boston University)
Louis Goldstein (Haskins Laboratories & Department of Linguistics, University of Southern California)
Articulatory Phonology views speech as an ensemble of constricting events along the vocal tract. This study shows that articulatory information in the form of gestures and their output trajectories can help to improve the performance of automatic speech recognition systems. The lack of any natural speech database containing such articulatory information prompted us to use a synthetic speech dataset that contains gestures and their output trajectories. We propose neural network based models to obtain articulatory information from the speech signal, and show that such estimated articulatory information helps to improve the noise robustness of a word recognition system.
#2Performance Estimation of Noisy Speech Recognition Considering Recognition Task Complexity
Takeshi Yamada (University of Tsukuba)
Tomohiro Nakajima (University of Tsukuba)
Nobuhiko Kitawaki (University of Tsukuba)
Shoji Makino (University of Tsukuba)
To ensure a satisfactory QoE (Quality of Experience) and facilitate system design in speech recognition services, it is essential to establish a method that can be used to efficiently investigate recognition performance in different noise environments. Previously, we proposed a performance estimation method using a spectral distortion measure. However, there is the problem that recognition task complexity affects the relationship between the recognition performance and the distortion value. To solve this problem, this paper proposes a novel performance estimation method considering the recognition task complexity. We confirmed that the proposed method gives accurate estimates of the recognition performance for various recognition tasks by an experiment using noisy speech data recorded in a real room.
#3Estimating Noise from Noisy Speech Features with a Monte Carlo Variant of the Expectation Maximization Algorithm
Friedrich Faubel (Spoken Language Systems, Saarland University)
Dietrich Klakow (Spoken Language Systems, Saarland University)
In this work, we derive a Monte Carlo expectation maximization algorithm for estimating noise from a noisy utterance. In contrast to earlier approaches, where the distribution of noise was estimated based on a vector Taylor series expansion, we use a combination of importance sampling and Parzen-window density estimation to numerically approximate the occurring integrals with the Monte Carlo method. Experimental results show that the proposed algorithm has superior convergence properties, compared to previous implementations of the EM algorithm. Its application to speech feature enhancement reduced the word error rate by over 30%, on a phone number recognition task recorded in a (real) noisy car environment.
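One Monte Carlo E-step by importance sampling can be sketched on an invented scalar model (the paper works with full noisy-speech feature vectors and adds Parzen-window smoothing; all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy scalar model: noisy feature y = x + n with clean speech x ~ N(0, 1)
# and noise n ~ N(mu_n, 1).  One Monte Carlo E-step approximates the
# posterior mean E[n | y] by importance sampling.
y = 2.3                      # observed noisy feature
mu_n = 0.0                   # current estimate of the noise mean

samples = rng.normal(loc=y, scale=2.0, size=20000)   # proposal q(n)
# Unnormalised weights p(y | n) p(n) / q(n); shared normalising
# constants cancel once the weights are normalised.
lik = np.exp(-0.5 * (y - samples) ** 2)              # y | n ~ N(n, 1)
prior = np.exp(-0.5 * (samples - mu_n) ** 2)         # n ~ N(mu_n, 1)
q = np.exp(-0.5 * ((samples - y) / 2.0) ** 2)
w = lik * prior / q
w /= w.sum()

# M-step for this toy model: update the noise mean to the posterior mean.
mu_n_new = (w * samples).sum()
```

For this linear-Gaussian toy case the exact posterior is N(y/2, 1/2), so the sampled estimate can be checked against y/2 = 1.15.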
#4Template-based Spectral Estimation Using Microphone Array for Speech Recognition
Satoshi Tamura (Gifu University)
Eriko Hishikawa (Gifu University)
Wataru Taguchi (Gifu University)
Satoru Hayamizu (Gifu University)
This paper proposes a Template-based Spectral Estimation (TSE) method for noise reduction of microphone array processing aiming at speech recognition enhancement. In the proposed method, a noise template in a complex plane is calculated for each frequency bin using non-speech audio signals observed at microphones. Then for every noise-overlapped speech signals, a speech signal can be reformed by applying the template and the gradient descent method. Experiments were conducted to evaluate not only performance of noise reduction but also improvement of speech recognition. Then NRR 16.7dB improvement was achieved by combining TSE and Spectral Subtraction (SS) methods. For speech recognition, 44% relative recognition error reduction was obtained comparing with the conventional SS method.
#5A Particle Filter Feature Compensation Approach to Robust Speech Recognition
Aleem Mushtaq (School of ECE, Georgia Institute of Technology, Atlanta, GA, 30332-0250, USA)
Yu Tsao (National Institute of Information and Communications Technology, Kyoto, Japan)
Chin-Hui Lee (School of ECE, Georgia Institute of Technology, Atlanta, GA, 30332-0250, USA)
We propose a novel particle filter approach to enhancing speech features for robust speech recognition. We use particle filters to compensate the corrupted features according to an additive noise distortion model, incorporating both the statistics of the clean-speech hidden Markov models and those of the observed background noise to map the noisy features back to clean speech features. We report on experimental results obtained with the Aurora-2 connected digit recognition task, and show that a large digit error reduction of 67% from multi-condition training is attainable if the missing side information needed for particle filter based compensation were available. When such nuisance parameters are estimated in actual operational conditions, an error reduction of only 13% is currently achievable. We anticipate further improvements as better estimation algorithms are explored.
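The compensation step can be sketched on an invented scalar log-spectral model (the clean-speech prior stands in for the HMM statistics, the noise level is assumed known, and all numeric values are illustrative, not the authors' configuration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy scalar log-spectral model: the noisy feature follows the additive
# distortion y = log(exp(x) + exp(n)), with a clean-speech prior
# x ~ N(mu_x, 1) (standing in for the HMM statistics) and a known noise
# level n (standing in for the observed background-noise statistics).
mu_x, n_level, sigma_obs = 3.0, 1.0, 0.1
x_true = 3.2
y = np.log(np.exp(x_true) + np.exp(n_level))   # observed noisy feature

# Particle-based compensation: draw particles from the clean-speech
# prior, weight them by the observation likelihood under the distortion
# model, and take the weighted mean as the compensated feature.
particles = rng.normal(loc=mu_x, scale=1.0, size=5000)
y_pred = np.log(np.exp(particles) + np.exp(n_level))
w = np.exp(-0.5 * ((y - y_pred) / sigma_obs) ** 2)
w /= w.sum()
x_hat = (w * particles).sum()
```

The compensated estimate x_hat lies below the raw noisy value y, i.e. the additive noise contribution has been removed.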
#6Nonlinear Enhancement of Onset for Robust Speech Recognition
Chanwoo Kim (Carnegie Mellon University)
Richard Stern (Carnegie Mellon University)
In this paper, we present a novel algorithm called Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF) to enhance spectral features for robust speech recognition, especially in reverberant environments. This algorithm is motivated by the precedence effect and by the modulation frequency characteristics of the human auditory system. We describe two slightly different types of processing that differ in whether or not the falling edges of power trajectories are suppressed using a lowpassed power envelope signal. The SSF algorithms can be implemented for on-line processing. Speech recognition results show that this algorithm provides especially good robustness in reverberant environments.
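A minimal single-channel sketch of SSF-style processing (the lowpass constant, floor factor, and toy power trajectory are illustrative values of our own, not the authors' settings):

```python
import numpy as np

# Subtract a first-order lowpassed (slowly varying) estimate from each
# channel's power trajectory, flooring the result, so that onsets are
# kept while the slowly varying part and the reverberant tail are
# suppressed.  lam and c are illustrative values.
def ssf(power, lam=0.9, c=0.01):
    m = np.empty_like(power)
    m[0] = power[0]
    for t in range(1, len(power)):
        m[t] = lam * m[t - 1] + (1 - lam) * power[t]  # lowpassed envelope
    return np.maximum(power - m, c * power)

# A toy power trajectory: silence, a burst onset, then a slow decay tail
# imitating reverberation.
p = np.concatenate([np.full(20, 0.1),
                    np.full(5, 10.0),
                    10.0 * 0.8 ** np.arange(20)])
out = ssf(p)
```

The onset frame passes through almost unattenuated, while the decaying tail falls below the lowpassed envelope and is reduced to the small floor value.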
#7Mask Estimation in Non-stationary Noise Environments for Missing Feature Based Robust Speech Recognition
Shirin Badiezadegan (Department of Electrical and Computer Engineering, McGill University, Canada)
Richard C. Rose (Department of Electrical and Computer Engineering, McGill University, Canada)
This paper demonstrates the importance of accurate characterization of instantaneous acoustic noise for mask estimation in data imputation approaches to missing feature based ASR, especially in the presence of non-stationary background noise. Mask estimation relies on a hypothesis test designed to detect the presence of speech in time-frequency spectral bins under rapidly varying noise conditions. Masked mel-frequency filter bank energies are reconstructed using a MMSE based data imputation procedure. The impact of this mask estimation approach is evaluated in the context of MMSE based data imputation under multiple background conditions over a range of SNRs using the Aurora 2 speech corpus.
#8Robust Automatic Speech Recognition with Decoder Oriented Ideal Binary Mask Estimation
Lae-Hoon Kim (University of Illinois at Urbana-Champaign)
Kyung-Tae Kim (University of Illinois at Urbana-Champaign)
Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)
In this paper, we propose a jointly optimal method for automatic speech recognition (ASR) and ideal binary mask (IBM) estimation in the cepstral domain, based on a newly derived generalized expectation maximization algorithm. First, cepstral domain missing feature marginalization is established using a linear transformation, after tying the mean and variance of the non-existing cepstral coefficients. Second, IBM estimation is formulated using a generalized expectation maximization algorithm that directly optimizes ASR performance. Experimental results show that even in a highly non-stationary mismatch condition (dance music as background noise), the proposed method achieves absolute ASR accuracy improvements ranging from 14.69% at 0 dB SNR to 40.10% at 15 dB SNR over the conventional noise suppression method.
#9A Robust Speech Recognition System against the Ego Noise of a Robot
Gokhan Ince (Honda Research Institute Japan Co., Ltd., Japan)
Kazuhiro Nakadai (Honda Research Institute Japan Co., Ltd., Japan)
Tobias Rodemann (Honda Research Institute Europe GmbH, Germany)
Hiroshi Tsujino (Honda Research Institute Japan Co., Ltd., Japan)
Jun-ichi Imura (Dept. of Mechanical and Environmental Informatics, Tokyo Institute of Technology, Japan)
This paper presents a speech recognition system for a mobile robot that attains high recognition performance even when the robot generates ego-motion noise. We investigate noise suppression and speech enhancement methods that are based on prediction of ego-motion and its noise. The estimate of ego-motion is used for superimposing white noise in a selective manner based on the ego-motion type. Moreover, instantaneous prediction of ego-motion noise is the core concept behind the following techniques: ego-motion noise suppression by template subtraction, and missing feature theory based masking of noisy speech features. We evaluate the proposed technique on a robot using speech recognition results. Adaptive superimposition of white noise achieves up to 20% improvement in word correct rates (WCR), and the spectrographic mask attains an additional improvement of up to 10% compared to single channel recognition.
#10Empirical Mode Decomposition For Noise-Robust Automatic Speech Recognition
Kuo-Hao Wu (National Sun Yat-Sen University)
Chia-Ping Chen (National Sun Yat-Sen University)
In this paper, a novel technique based on the empirical mode decomposition (EMD) methodology is proposed and examined for improving the noise robustness of automatic speech recognition systems. The EMD analysis is a generalization of the Fourier analysis for processing non-linear and non-stationary time functions, in our case the speech feature sequences. We use the first and second intrinsic mode functions (IMF), which include the sinusoidal functions as special cases, obtained from the EMD analysis in the post-processing of the log energy feature. Experimental results on the noisy-digit Aurora 2.0 database show that our proposed method leads to significant improvement on the mismatched (clean-training) tasks.
#11An Effective Feature Compensation Scheme Tightly Matched with Speech Recognizer Employing SVM-based GMM Generation
Wooil Kim (Center for Robust Speech Systems, University of Texas at Dallas)
Jun-Won Suh (Center for Robust Speech Systems, University of Texas at Dallas)
John H. L. Hansen (Center for Robust Speech Systems, University of Texas at Dallas)
This paper proposes an effective feature compensation scheme to address the real-life situation where a clean speech database is not available for Gaussian Mixture Model (GMM) training in a model-based feature compensation method. The proposed scheme employs a Support Vector Machine (SVM)-based model selection method to generate the GMM for feature compensation directly from the Hidden Markov Model (HMM) of the speech recognizer. We also present a strategy for combination with Cepstral Mean Normalization (CMN), where the recognizer's HMM is trained on a CMN-processed speech database. Experimental results demonstrate that the proposed method provides speech recognition performance comparable to the matched condition, in which a clean speech database is available both for GMM training and for training the recognizer's HMM. This shows that the SVM-based model selection method can effectively generate Gaussian components from the pre-trained HMM parameters, so that the GMM used for feature compensation is tightly matched to the speech recognizer.
#12Artificial and online acquired noise dictionaries for noise robust ASR
Jort F. Gemmeke (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Tuomas Virtanen (Department of Signal Processing, Tampere University of Technology, Tampere, Finland)
Recent research has shown that speech can be sparsely represented using a dictionary of speech segments spanning multiple frames (exemplars), and that such a sparse representation can be recovered using Compressed Sensing techniques. In previous work we proposed a novel method for noise robust automatic speech recognition in which we modelled noisy speech as a sparse linear combination of speech and noise exemplars extracted from the training data. The weights of the speech exemplars were then used to provide noise robust HMM-state likelihoods. In this work we propose to acquire additional noise exemplars during decoding, and the use of an artificially constructed noise dictionary. Experiments on AURORA-2 show that the artificial noise dictionary works better for noises not seen during training, and that acquiring additional exemplars can improve recognition accuracy.
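The exemplar decomposition can be sketched with tiny invented dictionaries and nonnegative sparse coding by multiplicative updates (a simple stand-in for the sparse solver used in the paper; dictionary sizes and the penalty value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy exemplar dictionaries (columns are nonnegative spectral exemplars):
# S holds speech exemplars, N noise exemplars; the noise part could
# equally be an artificial dictionary or exemplars acquired during
# decoding.
S = np.abs(rng.normal(size=(20, 5))) + 0.1
N = np.abs(rng.normal(size=(20, 3))) + 0.1
A = np.hstack([S, N])

# A noisy observation built from one speech and one noise exemplar.
y = 1.0 * S[:, 2] + 0.5 * N[:, 1]

# Nonnegative sparse coding by multiplicative updates (Euclidean NMF
# step); the penalty lam discourages using many exemplars.
h = np.full(A.shape[1], 0.1)
lam = 0.01
for _ in range(500):
    h *= (A.T @ y) / (A.T @ (A @ h) + lam + 1e-12)

speech_part = S @ h[:5]   # reconstructed clean-speech estimate
```

The recovered activations identify the correct speech and noise exemplars; in the paper, the speech-exemplar weights are then turned into noise robust HMM-state likelihoods.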
#13Voice activity detection based on conditional random fields using multiple features
Akira Saito (Nagoya Institute of Technology)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Akinobu Lee (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)
This paper proposes a Voice Activity Detection (VAD) algorithm based on Conditional Random Fields (CRF) using multiple features. VAD is a technique to distinguish speech from non-speech in noisy environments, and is an important component in many real-world speech applications. In the proposed method, the posterior probability of the output labels is directly modeled by the weighted sum of feature functions. By estimating appropriate weight parameters, effective features are automatically selected, improving VAD performance. Experimental results on the CENSREC-1-C database show that the proposed CRF-based method decreases error rates.
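The decoding side of such a model can be sketched as a two-state Viterbi search over weighted per-frame feature functions (the weights here are fixed by hand, whereas the paper estimates them; the features, penalty, and toy signal are invented):

```python
import numpy as np

def viterbi_vad(unary, switch_penalty=1.0):
    """Best label sequence (0 = non-speech, 1 = speech) under per-frame
    scores plus a penalty for switching labels between frames."""
    T = unary.shape[0]
    score = np.zeros((T, 2))
    back = np.zeros((T, 2), dtype=int)
    score[0] = unary[0]
    for t in range(1, T):
        for s in (0, 1):
            trans = score[t - 1] - switch_penalty * (np.array([0, 1]) != s)
            back[t, s] = int(np.argmax(trans))
            score[t, s] = unary[t, s] + trans[back[t, s]]
    labels = np.zeros(T, dtype=int)
    labels[-1] = int(np.argmax(score[-1]))
    for t in range(T - 2, -1, -1):
        labels[t] = back[t + 1, labels[t + 1]]
    return labels

# Two correlated per-frame feature functions combined by weights w
# (in the paper the weights are estimated, so effective features are
# selected automatically).
w = np.array([1.0, 0.5])
energy = np.concatenate([np.full(10, -1.0), np.full(10, 1.0), np.full(10, -1.0)])
energy[12] = -0.5               # one noisy frame inside the speech region
second_cue = energy.copy()
speech_score = w[0] * energy + w[1] * second_cue
unary = np.column_stack([-speech_score, speech_score])
labels = viterbi_vad(unary)
```

Frame 12 would be misclassified by a frame-wise decision, but the sequence model keeps it labelled as speech because switching labels twice costs more than the single bad frame.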
#14A Comparative Study of Noise Estimation Algorithms for VTS-Based Robust Speech Recognition
Yong Zhao (Georgia Institute of Technology)
Biing-Hwang (Fred) Juang (Georgia Institute of Technology)
We conduct a comparative study to investigate two noise estimation approaches for robust speech recognition using vector Taylor series (VTS) developed in the past few years. The first approach, iterative root finding (IRF), directly differentiates the EM auxiliary function and approximates the root of the derivative function through recursive refinements. The second approach, twofold expectation maximization (TEM), estimates noise distributions by regarding them as hidden variables in a modified EM fashion. Mathematical derivations reveal the substantial connection between the two approaches. Two experiments are performed in evaluating the performance and convergence rate of the algorithms. The first is to fit a GMM model to artificially corrupted samples that are generated through Monte Carlo simulation. The second is to perform speech recognition on the Aurora 2 database.
#15On Using Missing-Feature Theory with Cepstral Features--Approximations to the Multivariate Integral
Frank Seide (Microsoft Research Asia)
Pei Zhao (Department of Machine Intelligence, Peking University)
Missing Feature Theory (MFT), a powerful systematic framework for robust speech recognition, to date has not been optimally applied to linear-transform based features like MFCC or HLDA, which are necessary for state-of-the-art recognition accuracy, due to the intractable multivariate integral in bounded marginalization. This paper seeks to enable more optimal use of MFT with MFCC features through two approximations of this integral: Numeric integration by linear sampling, and approximation by the integrand's maximum. The former is made feasible through a "tridiagonal" approximation of MFCC, based on interpreting MFCC as bandpass-filtering the filterbank vector. The latter is solved through quadratic programming. Their effectiveness is shown for recognizing reverberated TIMIT speech utilizing temporal auditory masking.
#16Using a DBN to integrate Sparse Classification and GMM-based ASR
Yang Sun (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Jort F. Gemmeke (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Bert Cranen (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Louis ten Bosch (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Lou Boves (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
The performance of an HMM-based speech recognizer using MFCCs as input is known to degrade dramatically in noisy conditions. Recently, an exemplar-based noise robust ASR approach, called sparse classification (SC), was introduced. While very successful at lower SNRs, its performance at high SNRs suffered when compared to HMM-based systems. In this work, we propose to use a Dynamic Bayesian Network (DBN) to implement an HMM model that uses both MFCCs and phone predictions extracted from the SC system as input. Through experiments on the AURORA-2 connected digit recognition task, we show that our approach successfully combines the strengths of both systems, resulting in competitive recognition accuracies at both high and low SNRs.

Speaker Characterization and Recognition IV

Time:Wednesday 13:30 Place:International Conference Room B Type:Poster
Chair:Bin Ma
#1Transcript-Dependent Speaker Recognition using Mixer 1 and 2
Fred Richardson (MIT Lincoln Laboratory)
Joseph Campbell (MIT Lincoln Laboratory)
Transcript-dependent speaker-recognition experiments are performed with the Mixer 1 and 2 read-transcription corpus using the Lincoln Laboratory speaker recognition system. Our analysis shows how widely speaker-recognition performance can vary on transcript-dependent data compared to conversational data of the same durations, given enrollment data from the same spontaneous conversational speech. A description of the techniques used to deal with the unaudited data in order to create 171 male and 198 female text-dependent experiments from the Mixer 1 and 2 read transcription corpus is given.
#2On the Potential of Glottal Signatures for Speaker Recognition
Thomas Drugman (University of Mons)
Thierry Dutoit (University of Mons)
Most current speaker recognition systems are based on features extracted from the magnitude spectrum of speech. However, the excitation signal produced by the glottis is expected to convey complementary information relevant to the speaker's identity. This paper explores the use of two proposed glottal signatures, derived from the residual signal, for speaker identification. Experiments using these signatures are performed on both the TIMIT and YOHO databases. Promising results outperform other approaches based on glottal features. It is also highlighted that the signatures can be used for text-independent speaker recognition, and that only a few seconds of voiced speech are sufficient to estimate them reliably.
#3Acoustic Feature Diversity and Speaker Verification
Padmanabhan Rajan (Indian Institute of Technology Madras)
Hema A. Murthy (Indian Institute of Technology Madras)
We present a new method for speaker verification that uses the diversity of information from multiple feature representations. The principle behind the method is that certain features are better at recognising certain speakers. Thus, rather than using the same feature representation for all speakers, we use different features for different speakers. During training, we determine the optimal feature for each speaker from candidate features, by measuring information-theoretic criteria. During evaluation, verification is performed using the optimal feature of the claimed speaker. Experimental results with four candidate features show that the proposed system outperforms conventional systems that use a single feature or a combination of features.
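The per-speaker selection step can be sketched with one possible information-theoretic criterion, a symmetric KL divergence between Gaussian fits of the speaker and background distributions (the paper's actual criterion and features may differ; all data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)

def sym_kl_gauss(m1, v1, m2, v2):
    """Symmetric KL divergence between two 1-D Gaussians."""
    kl = lambda ma, va, mb, vb: 0.5 * (np.log(vb / va)
                                       + (va + (ma - mb) ** 2) / vb - 1)
    return kl(m1, v1, m2, v2) + kl(m2, v2, m1, v1)

# Two candidate scalar feature streams; for this toy speaker, feature 1
# separates them from the background population, feature 0 does not.
background = rng.normal(0.0, 1.0, size=(1000, 2))
speaker = np.column_stack([rng.normal(0.0, 1.0, 500),    # uninformative
                           rng.normal(2.0, 1.0, 500)])   # informative

scores = [sym_kl_gauss(speaker[:, j].mean(), speaker[:, j].var(),
                       background[:, j].mean(), background[:, j].var())
          for j in range(2)]
best_feature = int(np.argmax(scores))  # stored per speaker at enrolment
```

At verification time, the claimed speaker's stored `best_feature` index determines which representation is scored.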
#4A Discriminative Performance Metric for GMM-UBM Speaker Identification
Omid Dehzangi (School of Computer Engineering, Nanyang Technological University, Singapore)
Bin Ma (Institute for Infocomm Research, A*STAR, Singapore 138632)
Eng Siong Chng (School of Computer Engineering, Nanyang Technological University, Singapore)
Haizhou Li (Institute for Infocomm Research, A*STAR, Singapore 138632)
The universal background model based Gaussian mixture modeling (GMM-UBM) approach is a widely used method for speaker identification, in which a GMM is used to characterize a specific speaker's voice. The estimation of model parameters is generally performed based on the maximum likelihood (ML) or maximum a posteriori (MAP) criteria. However, interspeaker information that discriminates between different speakers is not taken into account in ML and MAP parameter estimation. To overcome this limitation, we design a discriminative performance metric that captures interspeaker variabilities in order to improve the classification performance of the GMM-UBM system. A learning algorithm is presented to tune the Gaussian mixture weights by optimizing the detection performance of the GMM classifiers, using an objective function that directly relates the model parameters to the performance metric. A comparative study with the GMM-UBM system is conducted on the 2001 NIST SRE corpus. Experimental results demonstrate that the proposed learning algorithm considerably improves the GMM-UBM system on speaker identification.
#5Novel binary key representation for biometric speaker recognition
Xavier Anguera (Telefonica Research)
Jean-François Bonastre (University of Avignon, LIA)
The approach presented in this paper represents voice recordings by a novel acoustic key composed only of binary values. Apart from the process used to extract such keys, no acoustic modeling or processing is needed, as all other elements of the system operate on the binary vectors. We show that this binary key is able to effectively model a speaker's voice and to distinguish it from other speakers. Its main properties are its small size compared to current speaker modeling techniques and its low computational cost when comparing different speakers, since comparison reduces to computing a similarity metric between two binary vectors. Furthermore, the binary key extraction process does not need any threshold and offers the opportunity to set the decision steps in a well-defined binary domain where scores and decisions are easy to interpret and implement.
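The low-cost comparison the abstract mentions reduces to a similarity metric between binary vectors. The abstract leaves the exact metric open; the sketch below uses Jaccard similarity as one plausible, illustrative choice:

```python
import numpy as np

def binary_key_similarity(key_a, key_b):
    """Similarity between two binary speaker keys: active positions shared
    by both keys, relative to positions active in either (Jaccard).
    The choice of metric is illustrative, not taken from the paper."""
    a = np.asarray(key_a, dtype=bool)
    b = np.asarray(key_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)
```

Because the keys are bit vectors, such comparisons cost a handful of bitwise operations per key pair, which is the source of the speed advantage claimed above.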
#6Variant Time-Frequency Cepstral Features for Speaker Recognition
Wei-Qiang Zhang (Tsinghua University)
Yan Deng (Tsinghua University)
Liang He (Tsinghua University)
Jia Liu (Tsinghua University)
In speaker recognition (SRE), the commonly used feature vector consists of basic cepstral coefficients concatenated with their delta and double-delta cepstral features. This configuration is borrowed from speech recognition and may not be optimal for SRE. In this paper, we propose variant time-frequency cepstral (TFC) features, based on our previous work on language recognition. The feature vector is obtained by performing a temporal discrete cosine transform (DCT) on the cepstrum matrix and selecting the transformed elements in a specific area with large variances. Different shapes and parameters are tested and the optimal configuration is obtained. Experimental results on the 2008 NIST speaker recognition evaluation short2 telephone-short3 telephone test set show that the proposed variant TFC features are more effective than the conventional feature vectors.
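The TFC construction, a temporal DCT over a block of cepstral frames followed by element selection, can be sketched as follows. The DCT-II basis is written out in plain NumPy, and the selected positions are passed in rather than chosen by variance, so this is an illustration rather than the authors' pipeline:

```python
import numpy as np

def tfc_features(cepstra, keep):
    """Sketch of variant TFC extraction.

    cepstra: (T, d) block of cepstral frames (T context frames, d cepstra).
    keep:    list of (dct_index, cepstrum_index) positions to retain --
             in the paper these come from a high-variance region; here
             they are supplied explicitly for illustration.
    """
    T, _ = cepstra.shape
    n = np.arange(T)
    # orthonormal DCT-II basis applied along the time axis
    basis = np.sqrt(2.0 / T) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * T))
    basis[0] /= np.sqrt(2.0)
    tf = basis @ cepstra                     # (T, d) time-frequency matrix
    return np.array([tf[i, j] for i, j in keep])
```

For a block that is constant in time, all temporal-DCT energy lands in index 0, which is why the low-order DCT rows capture slow (speaker-relevant) cepstral trajectories.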
#7Exploitation of Phase Information for Speaker Recognition
Ning Wang (Department of Electronic Engineering, The Chinese University of Hong Kong)
P. C. Ching (Department of Electronic Engineering, The Chinese University of Hong Kong)
Tan Lee (Department of Electronic Engineering, The Chinese University of Hong Kong)
Auditory experiments show the insensitivity of human ears to phase information when perceiving the phonetic content of a speech signal. However, the discarded phase information may provide useful acoustic cues for identifying individual speakers; this is especially useful for speaker recognition systems operating with degraded magnitude information in adverse conditions. This paper is therefore motivated to derive phase-related features for reliable speaker recognition performance. A pertinent representation of the most dominant primary frequencies present in the speech signal is first built. It is then applied to frames of the speech signal to derive effective speaker-discriminative features. Through a set of specifically designed experiments on synthetic vowels, it is observed that the proposed features are capable of differentiating the inclusive formants and pitch harmonics from other components, and of expressing vocal particularities in variously shaped formants. When combined with standard cepstral parameters, these phase-related features are shown to clearly reduce the identification error rate and equal error rate of a Gaussian mixture model-based speaker recognition system.
#8Effects of the Phonological Relevance in Speaker Verification
Yanhua Long (iFly Speech Lab, EEIS, University of Science and Technology of China (USTC), China)
Lirong Dai (iFly Speech Lab, EEIS, University of Science and Technology of China (USTC), China)
Bin Ma (Institute for Infocomm Research (I2R), Singapore)
Wu Guo (iFly Speech Lab, EEIS, University of Science and Technology of China (USTC), China)
This paper presents an experimental evaluation and analysis of the effects of phonological relevance between the training and test utterances in speaker verification, where test utterances with short durations of 3, 5 and 10 seconds are used in the experiments. We quantify the phonological relevance as the occurrence ratio of the subword units of a test utterance in the training utterance. We find that higher phonological relevance yields a large reduction in the miss detection error rate without much increase in false acceptance in text-independent speaker verification.
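The occurrence-ratio definition of phonological relevance can be written directly; the sketch below assumes the subword units are given as plain phone-label sequences:

```python
def phonological_relevance(train_units, test_units):
    """Fraction of the test utterance's subword units that also occur in
    the training utterance -- the occurrence-ratio measure described in
    the abstract (a sketch; units here are plain phone labels)."""
    train_set = set(train_units)
    if not test_units:
        return 0.0
    return sum(u in train_set for u in test_units) / len(test_units)
```

A ratio of 1.0 means every subword unit of the test utterance was seen in training, the regime where the abstract reports the largest miss-rate reduction.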
#9Topological representation of speech for speaker recognition
Gabriel Hernandez (Advanced Technologies Application Center, Havana, Cuba)
Jean F. Bonastre (Laboratoire Informatique d’Avignon, Avignon, France)
Driss Matrouf (Laboratoire Informatique d’Avignon, Avignon, France)
Jose R. Calvo (Advanced Technologies Application Center, Havana, Cuba)
During the last decade, researchers in speaker recognition have been working in the same acoustic space, regardless of whether the data lie on a linear space or not. Our proposal is to take the inner geometric structure of speech into account in order to obtain a new space with a better representation of the speech data. A topological approach based on manifolds obtained with the Laplacian Eigenmaps and Isomap algorithms is proposed. In this first work, the proposal is evaluated in terms of dimension reduction of the supervector space, which is known to be highly redundant. The experiments are done in the NIST-SRE framework. The proposed approach allows the dimension of the supervector space to be reduced by a factor of four without loss in terms of EER. This first result highlights the potential of topological approaches for speaker recognition.
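As an illustration of the manifold techniques named above, here is a minimal Laplacian-eigenmaps embedding in plain NumPy (unit-weight kNN graph, unnormalised graph Laplacian). It is a toy sketch of the general idea, not the authors' implementation:

```python
import numpy as np

def laplacian_eigenmaps(X, n_components=2, n_neighbors=3):
    """Embed the rows of X (e.g. supervectors) into n_components dimensions
    by spectral decomposition of a k-nearest-neighbour graph Laplacian."""
    n = X.shape[0]
    # pairwise squared distances and a symmetric unit-weight kNN graph
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:n_neighbors + 1]:   # skip self (distance 0)
            W[i, j] = W[j, i] = 1.0
    L = np.diag(W.sum(1)) - W                            # unnormalised Laplacian
    vals, vecs = np.linalg.eigh(L)
    # drop the trivial constant eigenvector, keep the next n_components
    return vecs[:, 1:n_components + 1]
```

In the paper's setting the rows of `X` would be high-dimensional GMM supervectors, and the embedding dimension would be roughly a quarter of the original.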
#10Assessment of Single-Channel Speech Enhancement Techniques for Speaker Identification under Mismatched Conditions
Seyed Omid Sadjadi (The University of Texas at Dallas)
John H.L. Hansen (The University of Texas at Dallas)
It is well known that MFCC-based speaker identification (SID) systems easily break down under mismatched training and test conditions. In this paper, we report on a study that considers four different single-channel speech enhancement front-ends for robust SID under such conditions. Speech files from the YOHO database are corrupted with four types of noise (babble, car, factory, and white Gaussian) at five SNR levels (0–20 dB), and processed using four speech enhancement techniques representing distinct classes of algorithms: spectral subtraction, statistical model-based, subspace, and Wiener filtering. Both processed and unprocessed files are submitted to a SID system trained on clean data. In addition, a new set of acoustic feature parameters based on the Hilbert envelope of gammatone filterbank outputs is proposed and evaluated for the SID task. Experimental results indicate that: (i) depending on the noise type and SNR level, the enhancement front-ends may help or hurt SID performance; (ii) the proposed features achieve significantly higher SID accuracy than MFCCs under mismatched conditions.
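The Hilbert envelope underlying the proposed features can be computed from the analytic signal. The sketch below does this with a plain FFT (equivalent to taking the magnitude of `scipy.signal.hilbert(x)`); in the paper's setting it would be applied to each gammatone channel output:

```python
import numpy as np

def hilbert_envelope(x):
    """Hilbert envelope of a real signal: magnitude of the analytic signal,
    obtained by zeroing negative frequencies and doubling positive ones."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0          # Nyquist bin kept once for even lengths
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(X * h)
    return np.abs(analytic)
```

For a pure tone the envelope is flat, while for a modulated carrier it tracks the slow amplitude modulation, which is the information these features aim to retain under noise.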
#11Speaker Recognition Using the Resynthesized Speech via Spectrum Modeling
Xiang Zhang (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences)
Chuan Cao (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences)
Lin Yang (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences)
Hongbin Suo (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences)
Jianping Zhang (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences)
Yonghong Yan (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences)
In this paper, we present a new approach for speaker recognition, which uses the prosodic information calculated on the original speech to resynthesize the new speech data utilizing the spectrum modeling technique. The resynthesized data are modeled with sinusoids based on pitch, vibration amplitude and phase bias. We use the resynthesized speech data to extract cepstral features for speaker modeling and scoring in the same way as in traditional speaker recognition approaches. We then model these features using GMMs and compensate for speaker and channel variability effects using joint factor analysis. The experiments are carried out on the core condition of NIST 2008 speaker recognition evaluation data. The experimental results show that our proposed system achieves comparable performance to the state-of-the-art cepstral-based joint factor analysis system which uses the original data for speaker recognition.

Voice Conversion and Speech Synthesis

Time:Wednesday 13:30 Place:International Conference Room C Type:Poster
Chair:Junichi Yamagishi
#1Shape-invariant speech transformation with the phase vocoder
Axel Röbel (IRCAM)
This paper proposes a new phase vocoder based method for shape-invariant real-time modification of speech signals. The performance of the method with respect to voiced and unvoiced signal components, as well as the control of the voiced/unvoiced balance of the transformed speech signals, is discussed. The algorithm has been compared in perceptual tests with implementations of the PSOLA and HNM algorithms, demonstrating very satisfying performance. Because the quality of the transformed signals remains acceptable over a wide range of transformation parameters, the algorithm is especially suited for real-time gender and age transformations.
#2A Phonetic Alternative to Cross-language Voice Conversion in a Text-dependent Context: Evaluation of Speaker Identity
Kayoko Yanagisawa (UCL Department of Speech Hearing and Phonetic Sciences)
Mark Huckvale (UCL Department of Speech Hearing and Phonetic Sciences)
Spoken language conversion (SLC) aims to generate utterances in the voice of a speaker but in a language unknown to them, using speech synthesis systems and speech processing techniques. Previous approaches to SLC have been based on cross-language voice conversion (VC), which rests on assumptions that ignore phonetic and phonological differences between languages, leading to reduced intelligibility of the output. Accent morphing (AM) was proposed as an alternative approach, and its intelligibility was investigated in a previous study. AM attempts to preserve the voice characteristics of the target speaker whilst modifying their accent, using phonetic knowledge obtained from a native speaker of the target language. This paper examines AM and VC in terms of how similar the output sounds to the target speaker. AM achieved similarity ratings at least equivalent to VC, but the study highlighted various difficulties in evaluating speaker identity in an SLC context.
#3Evaluation of speaker mimic technology for personalizing SGD voices
Esther Klabbers (Biospeech, Inc.)
Alexander Kain (Biospeech, Inc.)
Jan P.H. van Santen (Center for Spoken Language Understanding, OHSU)
In this paper, we demonstrate the use of state-of-the-art speech technology to transform speech from a source speaker to mimic a particular target speaker, with the intention of providing personalized voices to users of Speech Generating Devices (SGDs). This speaker mimicry (SM) capability allows us to use high-quality acoustic inventories from professional speakers and transform them to a different target speaker using a very limited set of sentences from that speaker. This technology targets future SGD users who still have a limited vocabulary or previous recordings available. The results of a perceptual study show that listeners can identify which SM voices most resemble their respective target voices.
#4Adaptive Voice-Quality Control Based on One-to-Many Eigenvoice Conversion
Kumi Ohta (Nara Institute of Science and Technology)
Tomoki Toda (Nara Institute of Science and Technology)
Yamato Ohtani (Nara Institute of Science and Technology)
Hiroshi Saruwatari (Nara Institute of Science and Technology)
Kiyohiro Shikano (Nara Institute of Science and Technology)
This paper presents adaptive voice-quality control methods based on one-to-many eigenvoice conversion. To intuitively control the converted voice quality by manipulating a small number of control parameters, a multiple regression Gaussian mixture model (MR-GMM) has been proposed. The MR-GMM also allows us to estimate the optimum control parameters if target speech samples are available. However, its adaptation performance is limited because the number of control parameters is too small to model the voice quality of a wide range of target speakers. To improve the adaptation performance while retaining the capability of voice-quality control, this paper proposes an extended MR-GMM (EMR-GMM) with additional adaptive parameters that extend the subspace modeling the target voice quality. Experimental results demonstrate that the EMR-GMM yields significant improvements in adaptation performance while still allowing us to intuitively control the converted voice quality.
#5Applying Voice Conversion To Concatenative Singing-Voice Synthesis
Fernando Villavicencio (YAMAHA Corporation)
Jordi Bonada (Universitat Pompeu Fabra)
This work addresses the application of voice conversion to singing voice. The GMM-based approach was applied to VOCALOID, a concatenative singing synthesizer, to perform singer timbre conversion. The conversion framework was applied to full-quality singing databases, achieving a satisfactory conversion effect on the synthesized utterances produced by VOCALOID. We report a description of our implementation as well as the results of our experiments, focused on studying the spectral conversion performance when applied to specific pitch-range data.
Miaomiao Wang (Department of Electrical Engineering and Information Systems, the University of Tokyo)
Miaomiao Wen (Department of Electrical Engineering and Information Systems, the University of Tokyo)
Keikichi Hirose (Department of Information and Communication Engineering, the University of Tokyo)
Nobuaki Minematsu (Department of Information and Communication Engineering, the University of Tokyo)
The HMM-based text-to-speech system can produce high-quality synthetic speech with flexible modeling of spectral and prosodic parameters. However, the quality of synthetic speech degrades when the feature vectors used in training are noisy. Among all noisy features, pitch tracking errors and the corresponding flawed voiced/unvoiced (VU) decisions are the two key factors in voice quality problems. In this paper, an F0 generation process model is used to re-estimate F0 values in regions with pitch tracking errors, as well as in unvoiced regions. Prior knowledge of VU status is imposed on each Mandarin phoneme and used for the VU decision. The F0 can then be modeled within the standard HMM framework.
#7A Hierarchical F0 Modeling Method for HMM-based Speech Synthesis
Ming Lei (iFLYTEK Speech Lab, University of Science and Technology of China, Hefei, China)
Yi-Jian Wu (Microsoft China, Beijing, China)
Frank K. Soong (Microsoft Research Asia, Beijing, China)
Zhen-Hua Ling (iFLYTEK Speech Lab, University of Science and Technology of China, Hefei, China)
Li-Rong Dai (iFLYTEK Speech Lab, University of Science and Technology of China, Hefei, China)
The conventional state-based F0 modeling in HMM-based speech synthesis systems is good at capturing micro-prosodic features, but finds it difficult to characterize long-term pitch patterns directly. This paper presents a hierarchical F0 modeling method to address this issue. In this method, different F0 models are used to model the pitch patterns of different prosodic layers (including state, phone, syllable, word, etc.), and are combined in an additive structure. In model training, the F0 model for each layer is first initialized using the residual between the original F0s and the F0s generated by the other layers as training data, and then the F0 models of all layers are re-estimated simultaneously under a minimum generation error (MGE) training framework. We investigate the effectiveness of hierarchical F0 modeling with different layer settings. Experimental results show that the proposed hierarchical F0 modeling method significantly outperforms the conventional state-based F0 modeling method.
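The residual-based layer initialization described in the abstract can be sketched as follows. The segmentation of each prosodic layer is supplied explicitly, and the per-segment mean model is a deliberate simplification of the paper's F0 models:

```python
import numpy as np

def init_additive_layers(f0, layer_segmentations):
    """Initialise an additive hierarchical F0 model: each layer stores the
    mean of the current residual over each of its segments, then passes
    the residual down to the next (finer) layer.

    f0: 1-D F0 contour.
    layer_segmentations: list (coarse -> fine, e.g. word, syllable, state)
        of lists of (start, end) index pairs.
    """
    residual = np.asarray(f0, dtype=float).copy()
    layers = []
    for segments in layer_segmentations:
        means = []
        for start, end in segments:
            m = residual[start:end].mean()
            residual[start:end] -= m      # remove this layer's contribution
            means.append(m)
        layers.append(means)
    return layers, residual
```

Summing the layer contributions back reconstructs the contour minus the final residual, which is the additive structure the MGE re-estimation then refines jointly.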
#8Training a Parametric-Based LogF0 Model with the Minimum Generation Error Criterion
Javier Latorre (Toshiba Research Europe Ltd. Cambridge Research Laboratory, Cambridge, UK)
M.J.F. Gales (Toshiba Research Europe Ltd. Cambridge Research Laboratory, Cambridge, UK)
Heiga Zen (Toshiba Research Europe Ltd. Cambridge Research Laboratory, Cambridge, UK)
This paper describes an approach for improving a statistical parametric-based logF0 model using minimum-generation error (MGE) training. Compared with the previous scheme based on decision tree clustering, MGE allows the minimisation of the error in the generated logF0 to take into account not only each cluster by itself, but also the way in which the clusters interact with each other in the generation of the F0 over the whole sentence. Moreover, the “weights” of each component of the model, which previously were adjusted manually, are optimized automatically by the MGE training during the re-estimation of the model covariances. Objective evaluation indicated that, although the logF0 contours generated by the models trained with MGE have approximately the same root mean square error and correlation factor as those generated with the baseline models, they present a higher dynamic range. The subjective evaluation shows a small but significant preference for the system trained with MGE.
#9Improving Mandarin Segmental Duration Prediction with Automatically Extracted Syntax Features
Miaomiao Wen (Department of Electrical Engineering and Information Systems, the University of Tokyo, Japan)
Miaomiao Wang (Department of Electrical Engineering and Information Systems, the University of Tokyo, Japan)
Keikichi Hirose (Department of Information and Communication Engineering, the University of Tokyo, Japan)
Nobuaki Minematsu (Department of Information and Communication Engineering, the University of Tokyo, Japan)
Previous research has indicated the relevance of syntax information to segmental duration, but the usefulness of syntax features has not been thoroughly studied for predicting segmental duration. In this paper, we design two sets of syntax features to improve Mandarin phone and pause duration prediction, respectively. Instead of using manually extracted syntax information as previous studies do, we acquire these syntax features from an automatic Chinese syntax parser. Results show that even though the automatically extracted syntax information has limited precision, it can still improve Mandarin segmental duration prediction.
#10An intonation model for TTS in Sepedi
Daniel R. Van Niekerk (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)
Etienne Barnard (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)
We present an initial investigation into the acoustic realisation of tone in continuous utterances in Sepedi (a language in the Southern Bantu family). An analytic model for the generation of appropriate pitch contours given an utterance with linguistic tone specification is presented and evaluated. By comparing the model output to speech data from a small tone-marked corpus we conclude that the initial implementation presented here is capable of generating pitch contours exhibiting some realistic properties and identify a number of aspects that require further attention. Lastly, we present some initial perceptual results when integrating the proposed model into a Hidden Markov Model-based speech synthesis system.
#11Synthesis of fast speech with interpolation of adapted HSMMs and its evaluation by blind and sighted listeners
Michael Pucher (Telecommunications Research Center Vienna (FTW), Austria)
Dietmar Schabus (Telecommunications Research Center Vienna (FTW), Austria)
Junichi Yamagishi (The Centre for Speech Technology Research (CSTR), UK)
In this paper we evaluate a method for generating synthetic speech at high speaking rates based on the interpolation of hidden semi-Markov models (HSMMs) trained on speech data recorded at normal and fast speaking rates. The subjective evaluation was carried out with both blind listeners, who are used to very fast speaking rates, and sighted listeners. We show that we can achieve a better intelligibility rate and higher voice quality with this method compared to standard HSMM-based duration modeling. We also evaluate duration modeling with the interpolation of all the acoustic features including not only duration but also spectral and F0 models. An analysis of the mean squared error (MSE) of standard HSMM-based duration modeling for fast speech identifies problematic linguistic contexts for duration modeling.
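The interpolation idea above, blending models trained at normal and fast speaking rates, can be sketched as a simple linear mix of parameter vectors. This toy version interpolates, say, state-duration means, and omits the full HSMM machinery:

```python
import numpy as np

def interpolate_models(params_normal, params_fast, alpha):
    """Linear interpolation between two sets of model parameters
    (e.g. HSMM state-duration means, or spectral/F0 mean vectors).
    alpha = 0 gives the normal-rate model, alpha = 1 the fast-rate model;
    intermediate alpha yields speaking rates between (or, extrapolated,
    beyond) the two training conditions."""
    a = np.asarray(params_normal, dtype=float)
    b = np.asarray(params_fast, dtype=float)
    return (1.0 - alpha) * a + alpha * b
```

The paper compares interpolating durations alone against interpolating all acoustic streams; in this sketch that choice corresponds simply to which parameter vectors are passed in.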
#12A comparison of pronunciation modeling approaches for HMM TTS
Gabriel Webster (Toshiba Research Europe, Ltd.)
Sacha Krstulović (Toshiba Research Europe, Ltd.)
Kate Knill (Toshiba Research Europe, Ltd.)
Hidden Markov model-based text-to-speech (HMM-TTS) systems are often trained on manual voice corpus phonetic transcriptions, even though these manual pronunciations cannot be predicted with complete accuracy at synthesis time, resulting in training/synthesis mismatch. In this paper, an alternative approach is proposed in which a set of manually written post-lexical effects (PLE) rules modeling a range of continuous speech effects is applied to canonical lexicon pronunciations, and the resulting "matched PLE" phone sequences are used both in the voice corpus markup and at synthesis time. For a US English system, a subjective evaluation showed that a system trained on matched PLE markup and a system trained on manual phone markup were equally preferred, suggesting that it may be possible to replace manual pronunciations with matched PLE pronunciations, dramatically decreasing the time and cost required to produce an HMM-TTS voice.
#13HMM-based Text-to-Articulatory-Movement Prediction and Analysis of Critical Articulators
Zhen-Hua Ling (iFLYTEK Speech Lab, University of Science and Technology of China)
Korin Richmond (CSTR, University of Edinburgh)
Junichi Yamagishi (CSTR, University of Edinburgh)
In this paper we present a method to predict the movement of a speaker's mouth from text input using hidden Markov models (HMM). We have used a corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), to train HMMs. To predict articulatory movements from text, a suitable model sequence is selected and the maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. In our experiments, we find that fully context-dependent models outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945mm when state durations are predicted from text, and 0.872mm when natural state durations are used. Finally, we go on to analyze the prediction error for different EMA dimensions and phone types. We find a clear pattern emerges that the movements of so-called critical articulators can be predicted more accurately than the average performance.

Detection, classification, and segmentation

Time:Wednesday 13:30 Place:International Conference Room D Type:Poster
Chair:Giuseppe Riccardi
#1Audio-based Sports Highlight Detection by Fourier Local Auto-Correlations
Jiaxing Ye (Department of Computer Science, University of Tsukuba, Japan)
Takumi Kobayashi (National Institute of Advanced Industrial Science and Technology)
Tetsuya Higuchi (National Institute of Advanced Industrial Science and Technology)
In this paper, we present a novel methodology for sports highlight detection based on audio information. For processing the sounds of sports events, we propose a time-frequency feature extraction method that computes local auto-correlations on complex Fourier values (FLAC). For highlight detection, we apply a (complex) subspace method to the extracted FLAC features to detect the “exciting” scenes, which occur sparsely against a background of “ordinary” periods. As an unsupervised learning algorithm, the subspace method has the advantage of requiring neither prior knowledge nor expensive computation. To evaluate the proposed method, we conducted experiments on a soccer match. The experimental results show the effectiveness of the proposed approach, including robustness to environmental noise, low computational burden and promising performance.
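The FLAC idea, local auto-correlation over a complex spectrogram, can be sketched as follows; the displacement set here is an illustrative assumption, not the paper's configuration:

```python
import numpy as np

def flac_features(spec, shifts=((0, 1), (1, 0))):
    """Local auto-correlation on a complex spectrogram spec (time x freq):
    for each displacement (dt, df), accumulate sum over bins of
    F[t, f] * conj(F[t + dt, f + df]).  A minimal sketch of the FLAC
    feature; real systems would use a richer displacement set."""
    T, F = spec.shape
    feats = []
    for dt, df in shifts:
        prod = spec[:T - dt, :F - df] * np.conj(spec[dt:, df:])
        feats.append(prod.sum())
    return np.array(feats)
```

Working on complex values rather than magnitudes lets the correlation retain relative phase between neighbouring time-frequency bins, which is the distinguishing point of FLAC over ordinary spectrogram auto-correlation.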
#2Automatic Excitement-Level Detection for Sports Highlights Generation
Hynek Boril (Center for Robust Speech Systems (CRSS), University of Texas at Dallas (UTD))
Abhijeet Sangwan (Center for Robust Speech Systems (CRSS), University of Texas at Dallas (UTD))
Taufiq Hasan (Center for Robust Speech Systems (CRSS), University of Texas at Dallas (UTD))
John Hansen (Center for Robust Speech Systems (CRSS), University of Texas at Dallas (UTD))
The problem of automatic excitement detection in baseball videos is considered and applied to highlights generation. This paper focuses on detecting exciting events in the video using complementary information from the audio and video domains. First, a new measure for non-stationarity which is extremely effective in separating background from speech is proposed. This new feature is employed in a unsupervised GMM-based segmentation algorithm that identifies the commentators speech in the crowd background. Thereafter, the ``level-of-excitement'' is measured using features such as pitch, F1-F3 center frequencies, and spectral center of gravity extracted from the commentators speech. Our experiments show that these features are well correlated with human assessment of excitability. Furthermore, slow-motion replay and pitching-scenes from the video are also detected to estimate scene end-points. Finally, audio/video information is fused to rank-order scenes by ``excitability'' and generate highlights of user-defined time-lengths. The techniques described in this paper are generic and applicable to a variety of domains.
#3Detecting novel objects in acoustic scenes through classifier incongruence
Jörg-Hendrik Bach (University of Oldenburg)
Jörn Anemüller (University of Oldenburg)
In this study, a new generic framework for the detection and interpretation of disagreement (“incongruence”) between different classifiers [15] is applied to the problem of detecting novel acoustic objects in an office environment. Using a general model that detects generic acoustic objects (standing out from a stationary background) and specific models tuned to particular sounds expected in the office, a novel object is detected as an incongruence between the models: the general model detects it as a generic object, but the specific models cannot identify it as any of the known office-related sources. The detectors are realized using amplitude modulation spectrogram and RASTA-PLP features with support vector machine classification. Data considered are speech and non-speech sounds embedded in real office background at signal-to-noise ratios (SNR) from +20 dB to -20 dB. Our approach yields approximately 90% hit rate for novel events at 20 dB SNR, 75% at 0 dB and reaches chance level below -10 dB.
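The incongruence rule itself is simple to state in code: flag novelty when the general detector fires but no specific detector does. The threshold and score conventions below are illustrative assumptions, not the paper's SVM setup:

```python
def is_novel(general_score, specific_scores, threshold=0.5):
    """Incongruence-based novelty flag.

    general_score:   score of the generic 'some acoustic object' detector.
    specific_scores: scores of the detectors for known sound classes.
    Returns True when an object is detected but no known class claims it.
    """
    object_present = general_score > threshold
    known = any(s > threshold for s in specific_scores)
    return object_present and not known
```

The interesting cases are exactly the disagreements: a confident general detection with uniformly low specific scores signals a sound the system has never modeled.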
#4A Multidomain Approach for Automatic Home Environmental Sound Classification
Stavros Ntalampiras (University of Patras)
Ilyas Potamitis (Technological Educational Institute of Crete)
Nikos Fakotakis (University of Patras)
This article presents a multidomain approach which addresses the problem of automatic home environmental sound recognition. The proposed system will be part of a human activity monitoring system which will be based on heterogeneous sensors. This work concerns the audio classification component and its primary role is to detect anomalous sound events. We compare the discriminative capabilities of three feature sets (MFCC, MPEG-7 low level descriptors and a novel set based on wavelet packets) with respect to the classification of ten sound classes. These are combined with state of the art generative techniques (GMM and HMM) for estimating the density function of each class. The highest average recognition rate is 95.7% and is achieved by the vector formed by all the feature sets juxtaposed.
#5Content-Based Advertisement Detection
Patrick Cardinal (CRIM)
Vishwa Gupta (CRIM)
Gilles Boulianne (CRIM)
Television advertising is widely used by companies to promote their products to the public, but it is hard for an advertiser to know whether its advertisements are broadcast as they should be. For this reason, some companies specialize in monitoring audio/video streams to validate that ads are broadcast according to what was requested and paid for by the advertiser. The procedure for searching for specific ads in an audio stream is very similar to the copy detection task, for which we have developed very efficient algorithms. This work reports the results of applying our copy detection algorithms to the advertisement detection task. Compared to a commercial product, we detected 18% more advertisements, and the system runs at 0.003x real-time.
#6Identification of Abnormal Audio Events Based on Probabilistic Novelty Detection
Stavros Ntalampiras (University of Patras)
Ilyas Potamitis (Technological Educational Institute of Crete)
Nikos Fakotakis (University of Patras)
This paper applies novelty detection to the identification of hazardous situations. The proposed system elaborates on the audio part of the PROMETHEUS database, which includes heterogeneous recordings captured under real-world conditions. Three types of environments were used: smart-home, indoor public space and outdoor public space. The multidomain set of descriptors was formed from the following features: MFCCs, MPEG-7 descriptors, Teager energy operator parameters and wavelet packets. We report detection results using three types of probabilistic novelty detection algorithms: universal GMM, universal HMM and GMM clustering. We conclude that the results are encouraging and demonstrate the superiority of the novelty detection approach over a classification approach.
#7Lightly supervised recognition for automatic alignment of large coherent speech recordings
Norbert Braunschweiler (Toshiba Research Europe Ltd., Cambridge Research Laboratory, United Kingdom)
Mark J.F. Gales (Toshiba Research Europe Ltd., Cambridge Research Laboratory, United Kingdom)
Sabine Buchholz (Toshiba Research Europe Ltd., Cambridge Research Laboratory, United Kingdom)
Large quantities of audio data with associated text such as audiobooks are nowadays available. These data are attractive for a range of research areas as they include features that go beyond the level of single sentences. The proposed approach allows high quality transcriptions and associated alignments of this form of data to be automatically generated. It combines information from lightly supervised recognition and the original text to yield the final transcription. The scheme is fully automatic and has been successfully applied to a number of audiobooks. Performance measurements show low word/sentence error rates as well as high sentence boundary accuracy.
#8Incremental Diarization of Telephone Conversations
Oshry Ben-Harush (Department of Electrical and Computers Engineering Ben-Gurion University of the Negev, Beer-Sheva, Israel)
Itshak Lapidot (Department of Electrical and Electronics Engineering Sami Shamoon College of Engineering, Ashdod, Israel)
Hugo Guterman (Department of Electrical and Computers Engineering Ben-Gurion University of the Negev, Beer-Sheva, Israel)
Speaker diarization systems attempt segmentation and labeling of a conversation between R speakers when no prior information about the conversation is given. Most state-of-the-art diarization systems require the full body of the conversation data before applying a diarization approach. However, for some applications such as forensics, which handle vast amounts of data, on-line or incremental diarization is of high importance. For that purpose, a two-stage algorithm for incremental diarization of telephone conversations is suggested. In the first stage, a fully unsupervised diarization algorithm is applied to an initial training segment of the conversation. The second stage consists of time-series clustering of increments of the conversation. Applying incremental diarization to 1802 telephone conversations from the NIST 2005 SRE generated an increase in diarization error of approximately 2% compared to the diarization error of an off-line diarization system.
#9Audio analytics by template modeling and 1-pass DP based decoding
Srikanth Cherla (Siemens Corporate Research & Technologies - India)
V Ramasubramanian (Siemens Corporate Research & Technologies - India)
We propose a novel technique for audio analytics and audio indexing using template-based modeling of audio classes, set in a one-pass dynamic programming (DP) continuous decoding framework. We propose the use of concatenation costs in the one-pass DP recursions to reduce so-called incursion errors, and the selection of variable-length templates for modeling indefinite-duration audio classes using the segmental K-means (SKM) algorithm. Based on detailed decoding results with long audio streams, we demonstrate the effectiveness of template-based modeling, SKM-based template selection, one-pass DP based decoding and the use of concatenation constraints therein. We show that an average (%Hit, %False-alarm) of (66%, 4.9%) is achievable with the proposed decoding technique.
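The frame-synchronous one-pass DP recursion with a template-switching penalty can be sketched as follows. This is a toy decoder with 1-D features, absolute-difference local costs and a single concatenation cost; all of these choices are illustrative assumptions, not the paper's actual configuration.

```python
import math

def one_pass_dp(stream, templates, concat_cost=0.5):
    """Frame-synchronous one-pass DP decoding of a feature stream against
    per-class templates. Entering a new template after another one ends
    pays `concat_cost`, mimicking the concatenation constraints the
    abstract uses to suppress incursion errors. Toy 1-D features only."""
    INF = math.inf
    names = list(templates)
    # cost[k][j] / labs[k][j]: best path ending in frame j of template k,
    # plus the completed-template labels accumulated before entering k
    cost = {k: [INF] * len(templates[k]) for k in names}
    labs = {k: [()] * len(templates[k]) for k in names}
    for t, x in enumerate(stream):
        # best complete template ending at the previous frame
        end = min(((cost[k][-1], labs[k][-1] + (k,)) for k in names),
                  key=lambda c: c[0])
        ncost = {k: [INF] * len(templates[k]) for k in names}
        nlabs = {k: [()] * len(templates[k]) for k in names}
        for k in names:
            for j, r in enumerate(templates[k]):
                cands = [(cost[k][j], labs[k][j])]                 # stay
                if j > 0:
                    cands.append((cost[k][j - 1], labs[k][j - 1]))  # advance
                elif t == 0:
                    cands.append((0.0, ()))                        # fresh start
                else:
                    cands.append((end[0] + concat_cost, end[1]))   # switch
                c, l = min(cands, key=lambda c: c[0])
                ncost[k][j] = c + abs(x - r)
                nlabs[k][j] = l
        cost, labs = ncost, nlabs
    k = min(names, key=lambda n: cost[n][-1])
    return labs[k][-1] + (k,), cost[k][-1]
```

On a stream `[0, 0, 1, 1]` with templates `A = [0, 0]` and `B = [1, 1]`, the decoder recovers the label sequence `("A", "B")` at the cost of a single concatenation penalty.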
#10Perceptual Wavelet Decomposition for Speech Segmentation
Mariusz Ziolko (Department of Electronics, AGH University of Science and Technology, Krakow)
Jakub Galka (Department of Electronics, AGH University of Science and Technology, Krakow)
Bartosz Ziolko (Department of Electronics, AGH University of Science and Technology, Krakow)
Tomasz Drwiega (Faculty of Applied Mathematics, AGH University of Science and Technology, Krakow)
A non-uniform speech segmentation method based on wavelet packet transform is used for the localisation of phoneme boundaries. Eleven subbands are chosen by applying the mean best basis algorithm. Perceptual scale is used for decomposition of speech via Meyer wavelet in the wavelet packet structure. A real valued vector representing the digital speech signal is decomposed into phone-like units by placing segment borders according to the result of the multiresolution analysis. The final decision on localisation of the boundaries is made by analysis of the energy flows among the decomposition levels.
#11A comparative study of constrained and unconstrained approaches for segmentation of speech signal
Venkatesh Keri (International Institute of Information Technology, Hyderabad, India.)
Kishore Prahallad (International Institute of Information Technology, Hyderabad, India.)
In this work, we compare different approaches to speech segmentation, some of which are constrained and the rest unconstrained by a phone transcript. Highly accurate speech segmentation can be obtained with approaches constrained by a phone transcript, such as HMM forced alignment, when the exact phone transcript is known. But such approaches must make do with the canonical phone transcript, as the exact phone transcript is difficult to obtain. Our experiments on the TIMIT corpus demonstrate that ANN and HMM phone-loop based unconstrained approaches perform better than the HMM forced-alignment approach constrained by the canonical phone transcript. Finally, a detailed error analysis of these approaches is reported.
#12Automatic discriminative measurement of voice onset time
Morgan Sonderegger (University of Chicago)
Joseph Keshet (Toyota Technological Institute at Chicago)
We describe a discriminative algorithm for automatic VOT measurement, considered as an application of predicting structured output from speech. In contrast to previous studies which use customized rules, in our approach a function is trained on manually labeled examples, using an online algorithm to predict the burst and voicing onsets (and hence VOT). The feature set used is customized for detecting the burst and voicing onsets, and the loss function used in training is the difference between predicted and actual VOT. Applied to initial voiceless stops from two corpora, the algorithm compares favorably to previous work, and the agreement between automatic and manual measurements is near human inter-judge reliability.
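The training loop the abstract describes (score every candidate burst/voicing onset pair, predict the best one, and update when the predicted VOT is wrong) can be sketched roughly as follows. The two-feature map, the plain perceptron-style step and all numbers are invented for illustration; the paper's customized feature set and large-margin online update are more elaborate.

```python
import numpy as np

def predict(w, feats, pairs):
    """Pick the (burst_onset, voicing_onset) pair scoring highest under w."""
    return pairs[int(np.argmax([w @ feats[p] for p in pairs]))]

def online_train(examples, n_feats, epochs=10, lr=0.5):
    """Each example is (feats, pairs, true_pair), where feats maps a
    candidate onset pair to a feature vector. The loss is the absolute
    difference between predicted and true VOT, as in the abstract; the
    update itself is a structured-perceptron step (an assumption)."""
    w = np.zeros(n_feats)
    for _ in range(epochs):
        for feats, pairs, true_pair in examples:
            pred = predict(w, feats, pairs)
            vot = lambda p: p[1] - p[0]       # VOT = voicing onset - burst
            if abs(vot(pred) - vot(true_pair)) > 0:
                w += lr * (feats[true_pair] - feats[pred])
    return w

# Toy usage: two candidate pairs, one manual annotation.
feats = {(10, 15): np.array([1.0, 0.0]), (10, 30): np.array([0.0, 1.0])}
pairs = [(10, 15), (10, 30)]
w = online_train([(feats, pairs, (10, 30))], n_feats=2)
```

After one update the learner scores the manually labeled pair highest.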
#13Selective Gammatone Filterbank Feature for Robust Sound Event Recognition
Yiren Leng (Institute for Infocomm Research, A*STAR, Singapore)
Huy Dat Tran (Institute for Infocomm Research, A*STAR, Singapore)
Norihide Kitaoka (Nagoya University, Japan)
Haizhou Li (Institute for Infocomm Research, A*STAR, Singapore)
This paper introduces a novel feature based on the raw output of the gammatone filterbank. Channel selection is used to enhance robustness over a range of signal-to-noise ratios (SNR) of additive noise. The recognition accuracy of the proposed feature is tested on a sound event database using a Hidden Markov Model (HMM) recogniser. A comparison with a series of similar features and the conventional Mel-Frequency Cepstral Coefficients (MFCC) shows that the proposed feature offers significant improvement in low SNR conditions.

ASR: Lexical and Pronunciation Modeling

Time:Wednesday 16:00 Place:Hall A/B Type:Oral
Chair:Mikko Kurimo
16:00FSM-Based Pronunciation Modeling using Articulatory Phonological Code
Chi Hu (University of Illinois, Urbana Champaign)
Xiaodan Zhuang (University of Illinois, Urbana Champaign)
Mark Hasegawa-Johnson (University of Illinois, Urbana Champaign)
According to articulatory phonology, the gestural score is an invariant speech representation. Though the timing schemes, i.e., the onsets and offsets, of the gestural activations may vary, the ensemble of these activations tends to remain unchanged, informing the speech content. In this work, we propose a pronunciation modeling method that uses a finite state machine to represent the invariance of a gestural score. Given the "canonical'' gestural score of a word with a known activation timing scheme, the plausible activation onsets and offsets are recursively generated and encoded as a weighted FSM. Speech recognition is achieved by matching the recovered gestural activations to the FSM-encoded gestural scores of different speech contents. We carry out pilot word classification experiments using synthesized data from one speaker. The proposed pronunciation modeling achieves over 90% accuracy for a vocabulary of 139 words with no training observations, outperforming direct use of the "canonical'' gestural scores.
16:20Detailed pronunciation variant modeling for speech transcription
Denis Jouvet (LORIA-INRIA, Speech Group, 54602 Villers les Nancy, France)
Dominique Fohr (LORIA-INRIA, Speech Group, 54602 Villers les Nancy, France)
Irina Illina (LORIA-INRIA, Speech Group, 54602 Villers les Nancy, France)
Modeling pronunciation variants is an important topic for automatic speech recognition. This paper investigates the pronunciation modeling at the lexical level, and presents a detailed modeling of the probabilities of the pronunciation variants. The approach is evaluated on the French ESTER2 corpus, and a significant word error rate reduction is achieved through the use of context and speaking rate dependent modeling of these pronunciation probabilities. A rule-based approach makes it possible to derive a priori probabilities for the pronunciation of words that are not present in the training corpus, and a MAP estimation process yields reliable estimates of the pronunciation variant probabilities.
16:40A Minimum Classification Error approach to pronunciation variation modeling of non-native proper names
Line Adde (Department of Electronics and Telecommunications, NTNU, Norway)
Bert Réveil (ELIS, Ghent University, Belgium)
Jean-Pierre Martens (ELIS, Ghent University, Belgium)
Torbjørn Svendsen (Department of Electronics and Telecommunications, NTNU, Norway)
In automatic recognition of non-native proper names, it is critical to be able to handle a variety of different pronunciations. Traditionally, this has been solved by including alternative pronunciation variants in the recognition lexicon at the risk of introducing unwanted confusion between different name entries. In this paper we propose a pronunciation variant selection criterion that aims to avoid this risk by basing its decisions on scores which are calculated according to the minimum classification error (MCE) framework. By comparing the error rate before and after a lexicon change, the selection criterion chooses only the candidates that actually decrease the error rate. Selecting pronunciation candidates in this manner substantially reduces both the error rate and the required number of variants per name compared to a probability-based baseline selection method.
17:00Acoustics-Based Phonetic Transcription Method for Proper Nouns
Antoine Laurent (LIUM - University of Le Mans)
Sylvain Meignier (LIUM - University of Le Mans)
Teva Merlin (LIUM - University of Le Mans)
Paul Deléglise (LIUM - University of Le Mans)
This paper focuses on an approach to improve the automatic phonetic transcription of proper nouns. The method is based on a two-level iterative process that extracts phonetic variants from the audio signal before filtering out the irrelevant variants. Evaluation of the method shows a decrease in Word Error Rate (WER) on speech segments containing proper nouns, without negatively affecting the WER on the rest of the corpus (the ESTER corpus of French broadcast news).
17:20Wiktionary as a Source for Automatic Pronunciation Extraction
Tim Schlippe (Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany)
Sebastian Ochs (Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany)
Tanja Schultz (Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany)
In this paper, we analyze whether dictionaries from the World Wide Web that contain phonetic notations may support the rapid creation of pronunciation dictionaries within the speech recognition and speech synthesis system building process. As a representative dictionary, we selected Wiktionary [1], since it is available in multiple languages and, in addition to word definitions, provides many phonetic notations in terms of the International Phonetic Alphabet (IPA). Given word lists in four languages (English, French, German, and Spanish), we calculated the percentage of words with phonetic notations in Wiktionary. Furthermore, two quality checks were performed: First, we compared pronunciations from Wiktionary to pronunciations from dictionaries based on the GlobalPhone project, which had been created in a rule-based fashion and were manually cross-checked [2]. Second, we analyzed the impact of Wiktionary pronunciations on automatic speech recognition (ASR) systems. French Wiktionary achieved the best pronunciation coverage, containing 92.58% phonetic notations for the French GlobalPhone word list as well as 76.12% and 30.16% for country and international city names. In our ASR systems evaluation, the Spanish system gained the most improvement from Wiktionary pronunciations, with 7.22% relative word error rate reduction. [1] “Wiktionary - a wiki-based open content dictionary.” [Online]. Available: [2] Tanja Schultz. GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University. In: Proc. ICSLP, Denver, CO, 2002.
17:40Learning New Word Pronunciations from Spoken Examples
Ibrahim Badr (Spoken Language Systems Group, CSAIL, MIT)
Ian McGraw (Spoken Language Systems Group, CSAIL, MIT)
James Glass (Spoken Language Systems Group, CSAIL, MIT)
A lexicon containing explicit mappings between words and pronunciations is an integral part of most automatic speech recognizers (ASRs). While many ASR components can be trained or adapted using data, the lexicon is one of the few that typically remains static until experts make manual changes. This work takes a step towards alleviating the need for manual intervention by integrating a popular grapheme-to-phoneme conversion technique with acoustic examples to automatically learn high-quality baseform pronunciations for unknown words. We explore two models in a Bayesian framework, and discuss their individual advantages and shortcomings. We show that both are able to generate better-than-expert pronunciations with respect to word error rate on an isolated word recognition task.

Speaker recognition and diarization

Time:Wednesday 16:00 Place:201A Type:Oral
Chair:Sadaoki Furui
16:00Phonetic Subspace Mixture Model for Speaker Diarization
I-Fan Chen (Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan)
Shih-Sian Cheng (Institute of Information Science, Academia Sinica, Taipei, Taiwan)
Hsin-Min Wang (Institute of Information Science, Academia Sinica, Taipei, Taiwan)
This paper presents an improved distance measure for speaker clustering in speaker diarization systems. The proposed phonetic subspace mixture (PSM) model introduces phonetic information into the ΔBIC distance measure. The new PSM model-based ΔBIC distance measure can therefore remove the effect of phonetic content on the diarization results, and the typical ΔBIC distance measure can be seen as a special case of it. Our experimental results show that the new distance measure consistently improves speaker diarization performance on three datasets.
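For reference, the conventional ΔBIC distance that the PSM model extends can be sketched as follows, using one full-covariance Gaussian per segment. The penalty weight λ and the test setup are placeholders, sign conventions vary across papers, and the PSM variant itself is not reproduced here.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Standard full-covariance Gaussian ΔBIC between two segments of
    feature vectors (rows = frames). With this sign convention, larger
    (positive) values favour treating the segments as different speakers."""
    n1, d = x.shape
    n2 = y.shape[0]
    n = n1 + n2
    # log-determinant of the ML (biased) covariance estimate of a segment
    ld = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False, bias=True))[1]
    # model-complexity penalty: d mean + d(d+1)/2 covariance parameters
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * ld(np.vstack([x, y]))
            - 0.5 * n1 * ld(x) - 0.5 * n2 * ld(y) - penalty)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, (200, 2))    # segment from "speaker 1"
b = rng.normal(10.0, 1.0, (200, 2))   # segment from a very different source
```

Two well-separated segments score much higher than two halves of the same segment, which is what drives agglomerative clustering decisions.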
16:20Overlap Detection for Speaker Diarization by Fusing Spectral and Spatial Features
Martin Zelenák (Universitat Politècnica de Catalunya, Barcelona, Spain)
Carlos Segura (Universitat Politècnica de Catalunya, Barcelona, Spain)
Javier Hernando (Universitat Politècnica de Catalunya, Barcelona, Spain)
A substantial portion of the errors of conventional speaker diarization systems on meeting data can be attributed to overlapped speech. This paper proposes the use of several spatial features to improve speech overlap detection on distant-channel microphones. These spatial features are integrated into a spectral-based system by using principal component analysis and neural networks. Different overlap detection hypotheses are used to improve diarization performance with both overlap exclusion and overlap labeling. In experiments conducted on the AMI Meeting Corpus we demonstrate a relative DER improvement of 11.6% and 14.6% for single- and multi-site data, respectively.
16:40Floor Holder Detection and End of Speaker Turn Prediction in Meetings
Alfred Dielmann (Idiap Research Institute - Rue Marconi 19 - 1920 Martigny, Switzerland)
Giulia Garau (Idiap Research Institute - Rue Marconi 19 - 1920 Martigny, Switzerland)
Hervé Bourlard (Idiap Research Institute - Rue Marconi 19 - 1920 Martigny, Switzerland)
We propose a novel fully automatic framework to detect which meeting participant is currently holding the conversational floor and when the current speaker turn is going to finish. Two sets of experiments were conducted on a large collection of multiparty conversations: the AMI meeting corpus. Unsupervised speaker turn detection was performed by post-processing the speaker diarization and the speech activity detection outputs. A supervised end-of-speaker-turn prediction framework, based on Dynamic Bayesian Networks and automatically extracted multimodal features (related to prosody, overlapping speech, and visual motion), was also investigated. These novel approaches resulted in good floor holder detection rates (13.2% Floor Error Rate), attaining state-of-the-art end-of-speaker-turn prediction performance.
17:00Confidence Measures for Speaker Segmentation and their Relation to Speaker Verification
Carlos Vaquero (University of Zaragoza, Zaragoza, Spain)
Alfonso Ortega (University of Zaragoza, Zaragoza, Spain)
Jesús Villalba (University of Zaragoza, Zaragoza, Spain)
Antonio Miguel (University of Zaragoza, Zaragoza, Spain)
Eduardo Lleida (University of Zaragoza, Zaragoza, Spain)
This paper addresses the problem of speaker verification in two-speaker conversations, proposing a set of confidence measures to assess the quality of a given speaker segmentation. In addition, we study how these measures can be used to estimate the performance of a state-of-the-art speaker verification system. Our approach to speaker segmentation is based on the eigenvoice paradigm. We present a novel PCA-based initialization in the speaker factor space, along with a modification of the speaker turn duration distribution, that improves the performance of previously reported Joint Factor Analysis based speaker segmentation systems. Three confidence measures are analyzed on the output of the proposed segmentation system for the summed-channel telephone data of the NIST Speaker Recognition Evaluation 2008, showing that they constitute a good measure for estimating not only the segmentation accuracy but also the performance of a speaker verification system when it faces two-speaker conversations.
17:20Decoupling session variability modelling and speaker characterisation
Anthony Larcher (University of Avignon - LIA)
Christophe Lévy (University of Avignon - LIA)
Driss Matrouf (University of Avignon - LIA)
Jean-Francois Bonastre (University of Avignon - LIA)
The Factor Analysis (FA) framework has demonstrated its power to model session variability over the past years. However, training the FA parameters requires a large amount of training data. When the size of the available database is limited, the number of components of the core statistical model, the UBM, is also limited, as the UBM drives the dimension of the main FA matrix. Since the size of the UBM directly determines the size of the speaker supervector (the concatenation of the GMM mean parameters), it also limits the intrinsic capacity of the recognition system, reducing the expected performance. This paper aims to remove this limitation by breaking the intrinsic link between the FA dimensionality and the UBM dimensionality: session variability is modelled in a smaller dimension than that of the UBM, which drives the discriminative power of the system. The first experimental results in this paper, obtained on the NIST-SRE 2008 framework, are encouraging, with a relative EER improvement of about 18% when a 512-component UBM is combined with a 32-component session variability model, compared with a 32-component UBM and the same variability modelling.
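The coupling the abstract breaks is easy to see numerically: the supervector is the concatenation of the GMM component means, so its dimension is the component count times the feature dimension (39 below is an assumed acoustic feature dimensionality, not taken from the paper).

```python
import numpy as np

def supervector(means):
    """Stack a (C, F) matrix of GMM component means into a C*F-dim vector."""
    return np.asarray(means).reshape(-1)

F = 39                                   # assumed acoustic feature dimension
big = supervector(np.zeros((512, F)))    # 512-component UBM supervector
small = supervector(np.zeros((32, F)))   # 32-component UBM supervector
# Decoupling lets the session-variability matrix be trained in the small
# 32*F space while the speaker model keeps the large 512*F supervector.
```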
17:40Incorporating MAP Estimation and Covariance Transform for SVM based Speaker Recognition
Cheung-Chi Leung (Institute for Infocomm Research, A*STAR, Singapore)
Donglai Zhu (Institute for Infocomm Research, A*STAR, Singapore)
Kong Aik Lee (Institute for Infocomm Research, A*STAR, Singapore)
Bin Ma (Institute for Infocomm Research, A*STAR, Singapore)
Haizhou Li (Institute for Infocomm Research, A*STAR, Singapore)
In this paper, we apply Constrained Maximum a Posteriori Linear Regression (CMAPLR) transformation to the Universal Background Model (UBM) when characterizing each speaker with a supervector. We incorporate the covariance transformation parameters into the supervector in addition to the mean transformation parameters. Maximum Likelihood Linear Regression (MLLR) covariance transformation is adopted. The auxiliary function maximization involved in Maximum Likelihood (ML) and Maximum a Posteriori (MAP) estimation is also presented. Our experiments on the 2006 NIST Speaker Recognition Evaluation (SRE) corpus show that the two proposed techniques provide substantial performance improvement.

Speech and audio classification

Time:Wednesday 16:00 Place:201B Type:Oral
Chair:Kazunori Mano
16:00Single-speaker/multi-speaker co-channel speech classification
Stéphane Rossignol (IMS research group, SUPELEC -- Metz Campus)
Olivier Pietquin (IMS research group, SUPELEC -- Metz Campus)
The demand for content-based management and real-time manipulation of audio data is constantly increasing. This paper presents a method to identify temporal regions, in a segment of co-channel speech, as being either single-speaker or multi-speaker speech. The state-of-the-art approach for this purpose is the kurtosis. In this paper, a set of complementary time-domain and frequency-domain features is studied. The employed classification scheme is the one-class SVM classifier. A recognition rate of 94.75% is reached, and the set of features providing the best performance is determined.
Oriol Vinyals (University of California at Berkeley)
Gerald Friedland (International Computer Science Institute)
Nelson Morgan (International Computer Science Institute)
In this paper, we propose a discriminative extension to agglomerative hierarchical clustering, a typical technique for speaker diarization, that fits seamlessly with most state-of-the-art diarization algorithms. We propose to use maximum mutual information with bootstrapping, i.e., initial predictions are used as input for retraining the models in an unsupervised fashion. This article describes this new approach, analyzes its behavior, and presents results on the official NIST Rich Transcription datasets. We show an absolute improvement of 4% DER with respect to the generative baseline. We also observe a strong correlation between the original error and the amount of improvement, that is, the better our predicted labels are, the more gain we obtain from discriminative training, which we interpret as a strong indication of the high potential of the extension.
Jürgen Geiger (Institute for Human-Machine Communication, Technische Universität München, 80290 Munich, Germany)
Frank Wallhoff (Institute for Human-Machine Communication, Technische Universität München, 80290 Munich, Germany)
Gerhard Rigoll (Institute for Human-Machine Communication, Technische Universität München, 80290 Munich, Germany)
In this paper, we present an open-set online speaker diarization system. The system is based on Gaussian mixture models (GMMs), which are used as speaker models. The system starts with just three such models (one for each gender and one for non-speech) and creates models for individual speakers only as those speakers occur. As more and more speakers appear, more models are created. Our system implicitly performs audio segmentation, speech/non-speech classification, gender recognition and speaker identification. The system is tested on the HUB4-1996 radio broadcast news database.
17:00A Segment-Based Non-Parametric Approach for Monophone Recognition
Ladan Golipour (INRS)
Douglas O'Shaughnessy (INRS)
In this paper, we propose a segment-based non-parametric method of monophone recognition. We pre-segment the speech utterance into its underlying phonemes using a group-delay-based algorithm. Then, we apply the k-NN/SASH phoneme classification technique to classify the hypothesized phonemes. Since phoneme boundaries are already known during decoding, the search space is very limited and recognition is fast. However, such hard decisions lead to missed boundaries and over-segmentation. Therefore, while constructing the graph for an utterance, we use phoneme duration constraints and broad-class similarity information to merge or split the segments and create new branches. We perform a simplified acoustic-level monophone recognition task on the TIMIT test database. Since phoneme transitional probabilities are not included, only one (most likely) hypothesis and score is provided for each segment, and a simple shortest-path search algorithm, rather than Viterbi search, is applied to find the best phoneme sequence. This simplified evaluation achieves 58.5% accuracy and 67.8% correctness.
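Because each segment carries a single best hypothesis and score, decoding reduces to a shortest path over the graph of hypothesized boundaries, which might look like the sketch below. The labels, costs and the parallel merge/split edges in the example are invented, not taken from the paper.

```python
def best_phone_sequence(n_bounds, edges):
    """Shortest-path decoding over a segment graph: nodes are hypothesised
    phoneme boundaries in time order, and each edge (start, end, label,
    cost) spans a segment with its single best phoneme label and acoustic
    cost. Merged or split segments simply add parallel edges. Returns the
    (total_cost, label_sequence) of the cheapest path to the last node."""
    INF = float("inf")
    best = [(INF, [])] * n_bounds
    best[0] = (0.0, [])
    for start in range(n_bounds):      # nodes in time order, so a plain DAG pass
        c0, seq = best[start]
        if c0 == INF:
            continue
        for s, e, label, cost in edges:
            if s == start and c0 + cost < best[e][0]:
                best[e] = (c0 + cost, seq + [label])
    return best[-1]

# Toy lattice for a 3-segment utterance with one alternative merged segment.
edges = [(0, 1, "s", 1.0), (0, 2, "sh", 2.0),   # (0,2) is a merged alternative
         (1, 2, "ih", 0.5), (2, 3, "t", 0.25)]
```

On this lattice the cheapest path keeps the split segments and reads "s ih t".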
17:20A Fast One-Pass-Training Feature Selection Technique for GMM-based Acoustic Event Detection with Audio-Visual Data
Taras Butko (Universitat Politècnica de Catalunya)
Climent Nadeu (Universitat Politècnica de Catalunya)
Acoustic event detection becomes a difficult task, even for a small number of events, in scenarios where events are produced rather spontaneously and often overlap in time. In this work, we aim to improve the detection rate by means of feature selection. Using a one-against-all detection approach, a new fast one-pass-training algorithm and an associated highly precise metric are developed. By choosing a different subset of multimodal features for each acoustic event class, the results obtained from audiovisual data collected in the UPC multimodal room show an improvement in average detection rate with respect to using the whole feature set.
17:40Effects of modelling within- and between-frame temporal variations in power spectra on non-verbal sound recognition
Nobuhide Yamakawa (Graduate School of Informatics, Kyoto University)
Tetsuro Kitahara (College of Humanities and Sciences, Nihon University)
Toru Takahashi (Graduate School of Informatics, Kyoto University)
Kazunori Komatani (Graduate School of Informatics, Kyoto University)
Tetsuya Ogata (Graduate School of Informatics, Kyoto University)
Hiroshi G. Okuno (Graduate School of Informatics, Kyoto University)
Research on environmental sound recognition has not advanced as far as that on speech and musical signals. One reason is that the category of environmental sounds covers a broad range of acoustic characteristics. We classified them in order to explore suitable recognition techniques for each characteristic. We focus on impulsive sounds and their non-stationary behaviour within and between analysis frames. We used matching pursuit as a framework for applying wavelet analysis to extract the temporal variation of audio features inside a frame. We also investigated the validity of modeling the decay patterns of sounds using Hidden Markov Models. Experimental results indicate that sounds with multiple impulsive signals are recognized better by using time-frequency analysis bases than by frequency-domain analysis. Classification of sound classes with a long, clear decay pattern improves when multiple HMMs are applied.

Emotion Recognition

Time:Wednesday 16:00 Place:302 Type:Oral
Chair:Marc Swerts
16:00On the Importance of Glottal Flow Spectral Energy for the Recognition of Emotions in Speech
Ling He (School of Electrical and Computer Engineering, RMIT University)
Margaret Lech (School of Electrical and Computer Engineering, RMIT University)
Nicholas Allen (Department of Psychology, University of Melbourne)
Two new approaches to feature extraction for automatic emotion classification in speech are described and tested. The methods are based on recent laryngological experiments measuring the glottal air flow during phonation. The proposed approach calculates the area under the spectral energy envelope of the speech signal (AUSEES) and of the glottal waveform (AUSEEG). The new methods provided very high recognition rates for seven emotions (contempt, angry, anxious, dysphoric, pleasant, neutral and happy). The speech data included 170 adult speakers (95 female and 75 male). The classification results showed that the new features provided significantly higher classification rates (89.95% for AUSEEG, 76.07% for AUSEES) compared to the baseline MFCC approach (37.81%). The glottal waveform based AUSEEG features provided better results than the speech based AUSEES features, indicating that the majority of the emotion information is likely to be added to speech during glottal wave formation.
16:20Real-life emotion-related states detection in call centers
Laurence Devillers (Department of Human-Machine Interaction, LIMSI-CNRS, France ; Department of Computer Sciences, University of Orsay PXI, France)
Christophe Vaudable (Department of Human-Machine Interaction, LIMSI-CNRS, France)
Clement Chastagnol (Department of Human-Machine Interaction, LIMSI-CNRS, France)
In this article, we describe experiments on the detection of three emotional states (Anger, Positive and Neutral) for two French corpora collected in call centers in different contexts (service complaints and medical emergency). These corpora are subject to strict privacy constraints. In order to be comparable with results obtained in the community, we used the openEAR acoustic feature extraction platform instead of our own library. As one of our aims is the comparison of anger and positive emotions across corpora, we train models on one corpus and test them on the other to compare their similarities, then conversely. We discuss the possible gain in generalization power.
16:40Multi-class and hierarchical SVMs for emotion recognition
Ali Hassan (University of Southampton, UK)
Robert Damper (University of Southampton, UK)
This paper extends binary support vector machines to multiclass classification for recognising emotions from speech. We apply two standard schemes (one-versus-one and one-versus-rest) and two schemes that form a hierarchy of classifiers, each making a distinct binary decision about class membership, on three publicly available databases. Using the OpenEAR toolkit to extract more than 6000 features per speech sample, we have been able to outperform the state-of-the-art classification methods on all three databases.
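The one-versus-one scheme combines its pairwise decisions by max-wins voting, which can be sketched as follows. The toy threshold-free classifiers in the usage example are stand-ins, not the paper's trained SVMs.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(sample, classifiers, classes):
    """Max-wins voting over one-versus-one binary classifiers: every
    pairwise classifier casts one vote for the class it picks, and the
    class with the most votes wins (ties broken by listed class order)."""
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[classifiers[pair](sample)] += 1
    return max(classes, key=lambda c: votes[c])

# Toy usage: dummy pairwise classifiers that always pick the first class
# of their pair, so "anger" wins both of its pairwise contests.
classes = ["anger", "joy", "neutral"]
clf = {p: (lambda x, p=p: p[0]) for p in combinations(classes, 2)}
```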
17:00Determining Optimal Features for Emotion Recognition from Speech by applying an Evolutionary Algorithm
David Hübner (Department of Electrical Engineering and Information Technology, Otto von Guericke University Magdeburg, Germany)
Bogdan Vlasenko (Department of Electrical Engineering and Information Technology, Otto von Guericke University Magdeburg, Germany)
Tobias Grosser (Department of Electrical Engineering and Information Technology, Otto von Guericke University Magdeburg, Germany)
Andreas Wendemuth (Department of Electrical Engineering and Information Technology, Otto von Guericke University Magdeburg, Germany)
The automated recognition of emotions from speech is a challenging issue. In order to build an emotion recognizer, well-defined features and optimized parameter sets are essential. This paper shows how an optimal parameter set for HMM-based recognizers can be found by applying an evolutionary algorithm to standard features used in automated speech recognition. For this, we compared different signal features as well as several HMM architectures. The system was evaluated on a non-acted database and its performance was compared to a baseline system. We present an optimal feature set for the public part of the SmartKom database.
17:20Context-Sensitive Multimodal Emotion Recognition from Speech and Facial Expression using Bidirectional LSTM Modeling
Martin Woellmer (Technische Universitaet Muenchen)
Angeliki Metallinou (University of Southern California)
Florian Eyben (Technische Universitaet Muenchen)
Bjoern Schuller (Technische Universitaet Muenchen)
Shrikanth Narayanan (University of Southern California)
In this paper, we apply a context-sensitive technique for multimodal emotion recognition based on feature-level fusion of acoustic and visual cues. We use bidirectional Long Short-Term Memory (BLSTM) networks which, unlike most other emotion recognition approaches, exploit long-range contextual information for modeling the evolution of emotion within a conversation. We focus on recognizing dimensional emotional labels, which enables us to classify both prototypical and non-prototypical emotional expressions contained in a large audio-visual database. Subject-independent experiments on various classification tasks reveal that the BLSTM network approach generally prevails over standard classification techniques such as Hidden Markov Models or Support Vector Machines, and achieves F1-measures of the order of 72%, 65%, and 55% for the discrimination of three clusters in emotional space and the distinction between three levels of valence and activation, respectively.
17:40Data-dependent evaluator modeling and its application to emotional valence classification from speech
Kartik Audhkhasi (University of Southern California)
Shrikanth Narayanan (University of Southern California)
Practical supervised learning scenarios involving subjectively evaluated data have multiple evaluators, each giving their noisy version of the hidden ground truth. Majority logic combination of labels assumes equally skilled evaluators, and is generally suboptimal. Previously proposed models have assumed data independent evaluator behavior. This paper presents a data dependent evaluator model, and an algorithm to jointly learn evaluator behavior and a classifier. This model is based on the intuition that real world evaluators have varying performance depending on the data. Experiments on an emotional valence classification task show modest performance improvements of the proposed algorithm as compared to the majority logic baseline and a data independent evaluator model. But more critically, the algorithm also provides accurate estimates of individual evaluator performance, thus paving the way for incorporating active learning, evaluator feedback and unreliable data detection.
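The majority-logic baseline the paper compares against is simply a vote count over the evaluators' labels:

```python
from collections import Counter

def majority_label(labels):
    """Majority-logic combination of several evaluators' labels for one
    sample. This treats every evaluator as equally skilled, which is
    exactly the assumption the paper's data-dependent model relaxes
    (ties fall to whichever label Counter counted first)."""
    return Counter(labels).most_common(1)[0][0]
```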

Speech coding, modeling, and transmission

Time:Wednesday 16:00 Place:International Conference Room A Type:Poster
Chair:Masami Akamine
#1Modelling Speech Line Spectral Frequencies with Dirichlet Mixture Models
Zhanyu Ma (KTH Royal Institute of Technology, Sound and Image Processing Lab)
Arne Leijon (KTH Royal Institute of Technology, Sound and Image Processing Lab)
In this paper, we model the underlying probability density function (PDF) of the speech line spectral frequencies (LSF) parameters with a Dirichlet mixture model (DMM). The LSF parameters have two special features: 1) the LSF parameters have a bounded range; 2) the LSF parameters are in increasing order. By transforming the LSF parameters to the ΔLSF parameters, the DMM can be used to model the ΔLSF parameters and take advantage of the features mentioned above. The distortion-rate (D-R) relation is derived for the Dirichlet distribution under the high-rate assumption. A bit allocation strategy for the DMM is also proposed. In modelling the LSF parameters extracted from the TIMIT database, the DMM shows better performance than the Gaussian mixture model, in terms of the D-R relation, likelihood and model complexity. Since modelling is the essential and prerequisite step in PDF-optimized vector quantizer design, better modelling results indicate superior quantization performance.
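The bounded-and-ordered property that motivates the ΔLSF transform can be sketched in a few lines of Python (the numbers are made up and the exact normalization in the paper may differ): consecutive differences of an ordered LSF vector in (0, π) are positive and sum to π, so after scaling they lie on the probability simplex, which is the support of a Dirichlet model.

```python
import numpy as np

def lsf_to_delta(lsf):
    """Transform an ordered LSF vector with values in (0, pi) into
    positive differences that sum to one. Ordering guarantees
    positivity; boundedness guarantees the sum, so the result lies
    on the probability simplex, the support of a Dirichlet model."""
    lsf = np.asarray(lsf, dtype=float)
    padded = np.concatenate(([0.0], lsf, [np.pi]))  # add the boundaries
    delta = np.diff(padded)                         # all positive
    return delta / np.pi                            # sums to 1

lsf = np.array([0.3, 0.9, 1.5, 2.2, 2.8])   # made-up 5th-order LSFs
d = lsf_to_delta(lsf)
print(d.sum())        # 1.0 (up to float rounding)
print((d > 0).all())  # True
```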
#2PDF-optimized LSF Vector Quantization Based on Beta Mixture Models
Zhanyu Ma (KTH Royal Institute of Technology, Sound and Image Processing Lab)
Arne Leijon (KTH Royal Institute of Technology, Sound and Image Processing Lab)
The line spectral frequencies (LSF) are known to be the most efficient representation of the linear predictive coding (LPC) parameters from both the distortion and perceptual points of view. By considering the bounded property of the LSF parameters, we apply a beta mixture model (BMM) to the distribution of the LSF parameters. Meanwhile, by following the principles of probability density function (PDF) optimized vector quantization (VQ), we derive the bit allocation strategy for the BMM. The LSF parameters are obtained from the TIMIT database and a practical VQ is designed. Taking the Bayesian information criterion (BIC), the square error (SE) and the spectral distortion (SD) as the criteria, the BMM-based VQ outperforms the Gaussian mixture model based VQ with uncorrelated Gaussian components (UGMVQ) by about 1~2 bits/vector.
#3Non-Linear Predictive Vector Quantization of Feature Vectors for Distributed Speech Recognition
Jose Enrique Garcia (University of Zaragoza)
Alfonso Ortega (University of Zaragoza)
Antonio Miguel (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)
In this paper, we present a non-linear prediction scheme based on a Multi-Layer Perceptron for Predictive Vector Quantization (PVQ-MLP) of MFCCs for very low bit-rate coding of acoustic features in distributed speech recognition (DSR). Applications such as voice-enabled web browsing or speech-controlled processes in large industrial plants, where hundreds of users simultaneously access the same ASR server, can benefit from this substantial bit-rate reduction. Experimental results obtained on a large vocabulary task show an improved performance of PVQ-MLP in terms of prediction gain and WER compared to a linear prediction scheme, especially at low bit-rates. Using PVQ-MLP, the bit-rate can be reduced to 1.8 kbps, a reduction of 66% with respect to the ETSI standards (4.4 kbps), with a WER degradation lower than 5% compared to a system without quantization.
#4Superwideband Extension of G.718 and G.729.1 Speech Codecs
Lasse Laaksonen (Nokia Research Center)
Mikko Tammi (Nokia Research Center)
Vladimir Malenovsky (VoiceAge)
Tommy Vaillancourt (VoiceAge)
Mi Suk Lee (ETRI)
Tomofumi Yamanashi (Panasonic Corp.)
Masahiro Oshikiri (Panasonic Corp.)
Claude Lamblin (France Telecom)
Balazs Kovesi (France Telecom)
Miao Lei (Huawei Technologies)
Deming Zhang (Huawei Technologies)
Jon Gibbs (Motorola)
Holly Francois (Motorola)
This communication presents the recently standardized superwideband (SWB) extensions of ITU-T G.718 and G.729.1. These extensions were standardized as G.718 annex B and G.729.1 annex E. The SWB functionality is implemented using embedded scalable layers on top of the wideband (WB) core codecs, and it extends the bit rate of the codecs to 48 and 64 kbit/s for the G.718 and G.729.1, respectively. The main technology is a two-mode SWB coding method of the high frequencies. In addition, the G.729.1 SWB extension enhances the lower frequency range. The codec performance is illustrated with some listening test results extracted from the ITU-T Characterization phase.
#5A multipulse FEC scheme based on amplitude estimation for CELP codecs over packet networks
José L. Carmona (Universidad de Granada)
Angel M. Gómez (Universidad de Granada)
Antonio M. Peinado (Universidad de Granada)
José L. Pérez-Córdoba (Universidad de Granada)
José A. González (Universidad de Granada)
This paper presents a forward error correction (FEC) technique based on a multipulse representation of the excitation for code-excited linear prediction (CELP) speech transmission under packet loss conditions. In this approach, the encoder sends the position of a pulse that is used for the resynchronization of the adaptive codebook, so that error propagation can be prevented. At the decoder, the amplitude of the resynchronization pulse is estimated by means of minimum mean square error (MMSE) estimation based on Gaussian mixture models (GMMs) of the received parameters and the pulse amplitude. The proposal is tested using PESQ scores and AMR 12.2 kbps, a well-known CELP codec. The results show that, with very little additional information (350 bps), this technique achieves a noticeable improvement over the results obtained by the packet loss concealment included in the legacy codec.
#6Voice Quality Evaluation of Recent Open Source Codecs
Anssi Rämö (Nokia Research Center)
Henri Toukomaa (Nokia Research Center)
This paper introduces Silk, CELT, and BroadVoice, three voice codecs available on the internet as open source. Their voice quality is evaluated with a subjective listening test. AMR, AMR-WB, G.718, G.718B, and G.722.1C were used as standardized references. In addition, the peculiar bandwidth and bitrate characteristics of Skype's Silk codec are examined in more detail.
#7Efficient HMM-Based Estimation of Missing Features, with Applications to Packet Loss Concealment
Bengt Jonas Borgstrom (University of California, Los Angeles)
Abeer Alwan (University of California, Los Angeles)
In this paper, we present efficient HMM-based techniques for estimating missing features. By assuming speech features to be observations of hidden Markov processes, we derive a minimum mean-square error (MMSE) solution. We increase the computational efficiency of HMM-based methods by downsampling the underlying Markov models, and by enforcing symmetry in the transition probability matrices. When applied to features generally utilized in parametric speech coding, namely line spectral frequencies (LSFs), the proposed methods provide significant improvement over the baseline repetition scheme, in terms of weighted spectral distortion and peak SNR.
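The MMSE idea can be sketched briefly: if the received frames determine a state posterior, the estimate of a missing frame is the posterior-weighted state mean after propagating that posterior through the transition matrix. The Python toy below uses made-up numbers and is not the paper's actual model; the paper additionally downsamples the Markov models and enforces symmetric transition matrices for efficiency.

```python
import numpy as np

def mmse_estimate(posterior, A, state_means, gap):
    """MMSE estimate of a frame lost `gap` steps after the last
    received one.
    posterior:   state distribution at the last received frame
    A:           state transition matrix (rows sum to 1)
    state_means: (n_states, dim) mean feature vector per state"""
    p = np.asarray(posterior, dtype=float)
    for _ in range(gap):
        p = p @ A                # predict the state distribution forward
    return p @ state_means       # E[x | received frames]

# Two-state toy model; A is symmetric, as the paper enforces.
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
means = np.array([[0.0],
                  [1.0]])
est = mmse_estimate([1.0, 0.0], A, means, gap=3)
print(est)   # drifts from 0.0 toward the stationary mean 0.5
```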
#8Speech Inventory Based Discriminative Training for Joint Speech Enhancement and Low-Rate Speech Coding
Xiaoqiang Xiao (Dept. of Electrical Engineering, Penn State University, University Park, PA, USA)
Robert Nickel (Dept. of Electrical Engineering, Bucknell University, Lewisburg, PA, USA)
A significant extension to a novel inventory based speech processing procedure published by the authors in 2009 and 2010 is presented. The method is based on a speech analysis and re-synthesis scheme for scenarios in which speaker enrollment and noise enrollment are feasible. The procedure jointly provides speech enhancement and high-quality low-rate speech encoding at a flexible rate of just below 1.5 kbits/sec on average. In this paper we present a significant improvement of the original approach that fosters intelligibility in lower SNR environments. We propose to augment the originally solely HMM-based analysis stage with a discriminative training algorithm that dramatically improves the accuracy of the employed inventory frame selection process. A comparison mean opinion score (CMOS) study shows that the new method leads to a significant gain in overall perceptual quality between the encoder input and the decoder output.
#9Quality-Based Playout Buffering with FEC for Conversational VoIP
Qipeng Gong (McGill University)
Peter Kabal (McGill University)
In Voice-over-IP, buffer delay and packet loss are the two main factors affecting perceived conversational quality. A quality-based algorithm seeks an optimal balance of delay versus loss. To further improve perceived quality, steps should be taken to mitigate the effect of losses due to the network (missing packets) and buffer underflow (late packets) without increasing buffer delays. In this paper, we propose a quality-based playout algorithm with an FEC design based on conversational quality, including calling quality and interactivity. The simulation results show our algorithm's effectiveness in correcting losses (isolated and burst) and improving perceived conversational quality.
#10Sub-band basis spectrum model for pitch-synchronous log spectrum and phase based on approximation of sparse coding
Masatsune Tamura (Knowledge Media Laboratory, Corporate Research and Development Center, Toshiba Corporation)
Takehiko Kagoshima (Knowledge Media Laboratory, Corporate Research and Development Center, Toshiba Corporation)
Masami Akamine (Knowledge Media Laboratory, Corporate Research and Development Center, Toshiba Corporation)
In this paper, we propose a sub-band basis spectrum model (SBM), a new spectrum representation model that uses a linear combination of sub-band basis functions. We first apply sparse coding to the pitch-synchronously analyzed log-spectra. Based on an approximation of the resulting basis, we set a sub-band basis of 1-cycle sinusoidal shapes that are mel-scaled at lower frequencies and equally spaced at higher frequencies. The SBM parameters of the log spectrum and the phase spectrum are calculated by fitting the basis to the spectrum. Since the parameters represent the shape of the spectrum, they can be used for frequency warping and filtering-based voice adaptation for unit-fusion based TTS. Experimental results show that the analysis-synthesis speech is close to the original speech and that there is no significant difference between synthetic speech generated with the analysis-synthesis database and with the original database for unit-fusion based TTS.
Harshavardhan Sundar (Indian Institute of Science)
Chandra Sekhar Seelamantula (Indian Institute of Science)
Thippur Sreenivas (Indian Institute of Science)
We address the problem of robust formant tracking in continuous speech. We propose a robust statistical model, a t-distribution mixture model (tMM), operating on the "pyknogram" obtained through a multiband AM-FM demodulation technique. The statistical model of the pyknogram is shown to handle the variability in the signal processing stage more effectively. The t-mixture density estimation is shown to be more effective than Gaussian mixture density estimation because of outlier data in the pyknogram. For formant tracking, we show that the tMM is better in terms of parameter selection, accuracy, and smoothness of the estimate. We present experimental results on simulated data and real speech sentences, and test the robustness of the proposed MDA-tMM method to additive noise. Comparisons with the PRAAT software and a recently developed adaptive filterbank technique show that the proposed MDA-tMM method is superior in several aspects.
#12Estimation studies of vocal tract shape trajectory using a variable length and lossy Kelly-Lochbaum model
Heikki Ville Tapani Rasilo (Department of Signal Processing and Acoustics, Aalto University School of Science and Technology, Finland)
Unto Kalervo Laine (Department of Signal Processing and Acoustics, Aalto University School of Science and Technology, Finland)
Okko Johannes Räsänen (Department of Signal Processing and Acoustics, Aalto University School of Science and Technology, Finland)
This work demonstrates the use of a modified Kelly-Lochbaum (KL) vocal tract (VT) model in dynamic mapping from speech signals to articulatory configurations. The sixteen-section KL model is equipped with a variable-length segment for lip rounding and an accurate model of the lip radiation impedance. Profiles for the eight Finnish vowels are used to form so-called anchor points in the articulatory and spectral domains. These profiles are modulated by cosine functions to produce clusters of vowel variants around the anchor points. The resulting profile and formant frequency data are stored in a codebook that is used in the trajectory estimation task, proposing a number of profile candidates for each speech frame based on the observed formant frequencies. The final trajectory is estimated by minimizing the articulatory distance across all frames. The first trajectory estimation results are promising and in good agreement with the present phonetic literature.

ASR: Language Modeling and Speech Understanding II

Time:Wednesday 16:00 Place:International Conference Room B Type:Poster
Chair:Jerome Bellegarda
#1Improving backoff models with bag of words and hollow-grams
Benjamin Lecouteux (Laboratoire Informatique d'Avignon)
Raphaël Rubino (Laboratoire Informatique d'Avignon)
Georges Linares (Laboratoire Informatique d'Avignon)
Classical n-gram models lack robustness on unseen events. The literature suggests several smoothing methods; empirically, the most effective of these is the modified Kneser-Ney approach. We propose to improve this back-off model: our method boils down to back-off value reordering, according to the mutual information of the words, and to a new hollow-gram model. Results show that our back-off model yields significant improvements over the baseline based on the modified Kneser-Ney back-off. We obtain a 0.6% absolute word error rate improvement without acoustic adaptation, and 0.4% after adaptation, with a 3xRT ASR system.
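For readers unfamiliar with the Kneser-Ney baseline the paper improves on, its key ingredient is a continuation probability: a word's back-off weight counts the distinct contexts it follows, not its raw frequency. A minimal Python sketch on a toy bigram list (illustrative only; the paper's reordering and hollow-gram model are not shown here):

```python
from collections import defaultdict

def kn_continuation(bigrams):
    """Kneser-Ney continuation probability on a toy bigram list:
    a word's unigram weight counts the number of DISTINCT contexts
    it follows, not its raw frequency."""
    contexts = defaultdict(set)
    for w1, w2 in bigrams:
        contexts[w2].add(w1)
    total = sum(len(s) for s in contexts.values())
    return {w: len(s) / total for w, s in contexts.items()}

bigrams = [("san", "francisco"), ("san", "francisco"),
           ("new", "york"), ("old", "york")]
p = kn_continuation(bigrams)
# "francisco" is more frequent but follows only one context,
# so its continuation probability is lower than that of "york".
print(p["york"] > p["francisco"])   # True
```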
#2Study on Interaction between Entropy Pruning and Kneser-Ney Smoothing
Ciprian Chelba (Google)
Thorsten Brants (Google)
Will Neveitt (Google)
Peng Xu (Google)
The paper presents an in-depth analysis of a lesser-known interaction between Kneser-Ney smoothing and entropy pruning that leads to severe degradation in language model performance under aggressive pruning regimes. Experiments in a data-rich setup such as voice search show a significant impact on WER as well: pruning Kneser-Ney and Katz models to 0.1% of their original size impacts speech recognition accuracy significantly, approx. 10% relative.
#3Dynamic Language Model Adaptation Using Keyword Category Classification
Hitoshi Yamamoto (NEC Corporation)
Ken Hanazawa (NEC Corporation)
Kiyokazu Miki (NEC Corporation)
Koichi Shinoda (Tokyo Institute of Technology)
This paper describes a language model adaptation method for improving speech recognition of keywords in spoken queries occurring in information retrieval tasks. The method dynamically adapts language models to keyword categories within a single utterance; it first estimates keyword categories and their positions in an input query utterance and then dynamically changes the weights for language models designed for individual keyword categories on the basis of the estimation results. The method has been evaluated in speech recognition experiments on television program retrieval tasks and has demonstrated a 22.0% reduction in keyword error rates.
#4Integration of a Cache-based Model and Topic Dependent Class Model with Soft Clustering and Soft Voting
Welly Naptali (Toyohashi University of Technology)
Masatoshi Tsuchiya (Toyohashi University of Technology)
Seiichi Nakagawa (Toyohashi University of Technology)
A topic dependent class (TDC) language model (LM) is a topic-based LM that uses a semantic extraction method to reveal latent topic information from the relation of nouns. Previously, we have shown that TDC models outperform several state-of-the-art baseline models. There are two separate points that we would like to introduce in this paper. First, we improve the TDC model further by incorporating a cache-based LM through unigram scaling. Experiments on the Wall Street Journal (WSJ) and Japanese newspaper (Mainichi Shimbun) corpora show that this combination improves the model significantly in terms of perplexity. Second, a TDC stand-alone model suffers from a shrinking training corpus as the number of topics increases. We solve this problem by performing soft-clustering and soft-voting in the training and test phases. Experimental results using the WSJ corpus show that the TDC model outperforms the baseline without interpolation with a word-based n-gram.
#5Conditional models for detecting lambda-functions in a Spoken Language Understanding System
Frederic Duvert (Laboratoire d’Informatique d’Avignon (LIA), France)
Renato De Mori (Laboratoire d’Informatique d’Avignon (LIA), France;Department of Computer Science, Mc Gill University, Canada)
In this paper, methods are proposed for hypothesizing lambda-expressions of referred objects in telephone dialogues. Relations between words, semantic constituents and their composition into lambda-expressions are modeled by conditional random fields (CRFs), in which feature functions integrate manually derived template patterns with words and concept hypotheses. Substantial error reductions are obtained using these functions instead of just words and concept n-grams. Manually derived patterns and models appear to be very useful and not difficult to obtain by generalizing significant examples identified through the presence of specific semantic constituents.
#6Novel Weighting Scheme for Unsupervised Language Model Adaptation Using Latent Dirichlet Allocation
Md. Akmal Haidar (INRS-Énergie-Matériaux-Télécommunications, Montréal, Canada)
Douglas O'Shaughnessy (INRS-Énergie-Matériaux-Télécommunications, Montréal, Canada)
A new approach for computing weights of topic models in language model (LM) adaptation is introduced. We form topic clusters with a hard-clustering method that assigns each document to the single topic contributing the largest number of its words in a Latent Dirichlet Allocation (LDA) analysis. The new weighting idea is that the unigram count of the topic cluster generated by hard clustering is used to compute the mixture weights, instead of the LDA latent-topic word counts used in the literature. Our approach shows significant perplexity and word error rate (WER) reductions compared to the existing approach.
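The hard-clustering and weighting steps might be sketched as follows (a toy Python illustration with invented counts; the paper derives the counts from an actual LDA analysis):

```python
import numpy as np

def hard_cluster_weights(doc_topic_counts):
    """Assign each document to the topic contributing most of its
    words (hard clustering), then turn the per-cluster word totals
    into LM mixture weights."""
    counts = np.asarray(doc_topic_counts, dtype=float)  # (docs, topics)
    assignment = counts.argmax(axis=1)                  # one topic per doc
    weights = np.zeros(counts.shape[1])
    for doc, topic in enumerate(assignment):
        weights[topic] += counts[doc].sum()             # all words of doc
    return weights / weights.sum()

# Invented per-document topic word counts: 3 documents, 2 topics.
counts = [[8, 2], [1, 9], [6, 4]]
w = hard_cluster_weights(counts)
print(w)
```

Note that the weights come from the word totals of the hard-assigned clusters, not from the soft per-topic word counts, which is the distinction the abstract draws.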
#7Automatic Speech Recognition System Channel Modeling
Qun Feng Tan (University of Southern California)
Kartik Audhkhasi (University of Southern California)
Panayiotis Georgiou (University of Southern California)
Emil Ettelaie (University of Southern California)
Shrikanth Narayanan (University of Southern California)
In this paper, we present a systems approach for channel modeling of an Automatic Speech Recognition (ASR) system. This can have implications in improving speech recognition components, such as through discriminative language modeling. We simulate the ASR corruption using a phrase-based machine translation system trained between the reference phoneme and output phoneme sequences of a real ASR. We demonstrate that local optimization on the quality of phoneme-to-phoneme mappings does not directly translate to overall improvement of the entire model. However, we are still able to capitalize on contextual information of the phonemes which a simple acoustic distance model is not able to accomplish. Hence we show that the use of longer context results in a significantly improved model of the ASR channel.
#8Round-Robin Discrimination Model for Reranking ASR Hypotheses
Takanobu Oba (NTT Communication Science Laboratories, NTT Corporation.)
Takaaki Hori (NTT Communication Science Laboratories, NTT Corporation.)
Atsushi Nakamura (NTT Communication Science Laboratories, NTT Corporation.)
We propose a novel model training method for reranking problems. In our proposed approach, named round-robin duel discrimination (R2D2), model training is done so that all pairs of samples can be distinguished from each other. The loss function of R2D2 for a log-linear model is concave, so we can easily find the global optimum using a simple parameter estimation method such as gradient descent. We also describe the relationship between the global conditional log-linear model (GCLM) and R2D2: R2D2 can be regarded as an extension of GCLM. We evaluate R2D2 on an error correction language model for speech recognition. Our experimental results using the Corpus of Spontaneous Japanese show that R2D2 provides an accurate model with high generalization ability.
#9On-the-fly Lattice Rescoring for Real-time Automatic Speech Recognition
Haşim Sak (Boğaziçi University)
Murat Saraçlar (Boğaziçi University)
Tunga Güngör (Boğaziçi University)
This paper presents a method for rescoring the speech recognition lattices on-the-fly to increase the word accuracy while preserving low latency of a real-time speech recognition system. In large vocabulary speech recognition systems, pruned and/or lower order n-gram language models are often used in the first-pass of the speech decoder due to the computational complexity. The output word lattices are rescored offline with a better language model to improve the accuracy. For real-time speech recognition systems, offline lattice rescoring increases the latency of the system and may not be appropriate. We propose a method for on-the-fly lattice rescoring and generation, and evaluate it on a broadcast speech recognition task. This first-pass lattice rescoring method can generate rescored lattices with less than 20% increased computation over standard lattice generation without increasing the latency of the system.
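A much-simplified picture of lattice rescoring: replace each arc's first-pass LM cost with one from a better LM and re-run the best-path search over the lattice. The Python sketch below is a toy stand-in (context-free word costs, hypothetical numbers), not the paper's on-the-fly algorithm, which interleaves rescoring with lattice generation to keep latency low.

```python
def rescore_lattice(arcs, start, end, lm_cost):
    """Swap each arc's LM score for a better LM's cost and find the
    new best path by dynamic programming over the lattice (a DAG).
    arcs: (src, dst, word, acoustic_cost) tuples; node ids are
    assumed to be in topological order."""
    best = {start: (0.0, [])}
    for src, dst, word, ac in sorted(arcs):
        if src not in best:
            continue
        cost = best[src][0] + ac + lm_cost(word)
        if dst not in best or cost < best[dst][0]:
            best[dst] = (cost, best[src][1] + [word])
    return best[end]

# Toy lattice with two competing words per slot (made-up -log costs).
lm = {"recognize": 1.0, "wreck": 2.5, "speech": 0.5, "beach": 2.0}
arcs = [(0, 1, "recognize", 1.0), (0, 1, "wreck", 0.5),
        (1, 2, "speech", 1.0), (1, 2, "beach", 0.8)]
cost, words = rescore_lattice(arcs, 0, 2, lambda w: lm.get(w, 5.0))
print(words, cost)   # ['recognize', 'speech'] 3.5
```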

Speech Perception III: Processing and Intelligibility

Time:Wednesday 16:00 Place:International Conference Room C Type:Poster
Chair:Toshio Irino
#1A Feature Extraction Method for Automatic Speech Recognition Based on the Cochlear Nucleus
Serajul Haque (University of Western Australia)
Roberto Togneri (University of Western Australia)
Motivated by the human auditory system, a feature extraction method for automatic speech recognition (ASR) based on the differential processing strategy of the AVCN, PVCN and DCN of the cochlear nucleus is proposed. The method utilizes a zero-crossing with peak amplitudes (ZCPA) auditory model as a synchrony detector to discriminate the low-frequency formants. It utilizes the mean-rate information in the synapse processing to capture the very rapidly changing dynamic nature of speech. Additionally, a temporal companding method is utilized for spectral enhancement through two-tone suppression. We propose to separate synchrony detection from synaptic processing, as observed in the parallel processing methodology of the cochlear nucleus. HMM recognition of isolated digits showed improved recognition rates in clean and non-stationary noise conditions compared to the existing auditory model.
#2A Phoneme Recognition Framework based on Auditory Spectro-Temporal Receptive Fields
Samuel Thomas (Johns Hopkins University)
Kailash Patil (Johns Hopkins University)
Sriram Ganapathy (Johns Hopkins University)
Nima Mesgarani (Johns Hopkins University)
Hynek Hermansky (Johns Hopkins University)
In this paper we propose to incorporate features derived using spectro-temporal receptive fields (STRFs) of neurons in the auditory cortex for the task of phoneme recognition. Each of these STRFs is tuned to different auditory frequencies, scales and modulation rates. We select different sets of STRFs which are specific for phonemes in different broad phonetic classes (BPC) of sounds. These STRFs are then used as spectro-temporal filters on spectrograms of speech to extract features for phoneme recognition. For the phoneme recognition task on the TIMIT database, the proposed features show an improvement of about 5% over conventional feature extraction techniques.
#3Perceptual compensation for effects of reverberation in speech identification: A computer model based on auditory efferent processing
Amy V. Beeston (Department of Computer Science, University of Sheffield, UK)
Guy J. Brown (Department of Computer Science, University of Sheffield, UK)
Human speech perception is remarkably robust to the effects of reverberation, due in part to mechanisms of perceptual constancy that compensate for the characteristics of the acoustic environment. A computer model of this phenomenon is described, which shows compensation for the effects of reverberation in a word identification task. The presence of reverberation is detected as a change in the mean-to-peak ratio of the simulated auditory nerve response. In turn, this leads to attenuation of peripheral auditory activity, which is achieved through an efferent feedback loop. The computer model provides a qualitative match to a range of perceptual data, suggesting that auditory mechanisms under efferent control might contribute to compensation for reverberation in particular speech identification tasks.
#4Predicting human perception and ASR classification of word-final [t] by its acoustic sub-segmental properties
Barbara Schuppler (Center for Language and Speech Technology)
Mirjam Ernestus (Max Planck Institute for Psycholinguistics, Center for Language and Speech Technology)
Wim van Dommelen (Department of Language and Communication Studies, NTNU)
Jacques Koreman (Department of Language and Communication Studies, NTNU)
This paper presents a study on the acoustic sub-segmental properties of word-final /t/ in conversational standard Dutch, and how these properties contribute to whether humans and an ASR system classify the /t/ as acoustically present or absent. In general, humans and the ASR system use the same cues (presence of a constriction, a burst, and alveolar frication), but the ASR system is less sensitive than human listeners to fine cues (weak bursts, smoothly starting friction) and is misled by the presence of glottal vibration. These data inform the further development of models of human and automatic speech processing.
#5A speech in noise test based on spoken digits: Comparison of normal and impaired listeners using a computer model
Matthew Robertson (Department of Computer Science, University of Sheffield, UK)
Guy J. Brown (Department of Computer Science, University of Sheffield, UK)
Wendy Lecluyse (Department of Psychology, Essex University, UK)
Manasa Panda (Department of Psychology, Essex University, UK)
Christine M. Tan (Department of Psychology, Essex University, UK)
This paper describes a speech-in-noise test which is suitable for testing both human and machine speech recognition in noise. The test uses spoken digit triplets, presented in a range of babble backgrounds and signal-to-noise ratios (SNRs). The performance of a normal hearing (NH) and a hearing impaired (HI) listener has been assessed using the test. Both listeners show a fall in performance with decreasing SNR, as well as a decrease in performance with an increase in the number of talkers in the babble background. A physiologically accurate computational auditory model has been tuned to match the NH and HI listeners, allowing their performance in the test to be modelled using a missing-data-based automatic speech recognition (ASR) system. For the NH model we show a good match to the behaviour of the human listener. However, the computer model underestimates the digit test performance of the specific HI listener considered here.
#6Evaluation of bone-conducted ultrasonic hearing-aid regarding transmission of paralinguistic information: A comparison with cochlear implant simulator
Takayuki Kagomiya (Health Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan)
Seiji Nakagawa (Health Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan)
Human listeners can perceive speech signals in a voice-modulated ultrasonic carrier from a bone-conduction stimulator, even if the listeners are patients with sensorineural hearing loss. In light of this fact, we have been developing a bone-conducted ultrasonic hearing aid (BCUHA). The purpose of this study is to evaluate the usability of the BCUHA regarding transmission of paralinguistic information. For this purpose, two series of listening experiments were conducted: a speaker-intention identification experiment and a speaker discrimination experiment. To compare the performance of the BCUHA to that of air conduction (AC) and a cochlear implant, both experiments were conducted under three conditions: BCUHA, AC, and a cochlear implant simulator (CIsim). The results show that the BCUHA can transmit speaker intentions as well as the CIsim, and can transmit speaker information better than the CIsim.
#7Challenging the Speech Intelligibility Index: Macroscopic vs. Microscopic Prediction of Sentence Recognition in Normal and Hearing-impaired Listeners
Tim Jürgens (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg, Germany)
Stefan Fredelake (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg, Germany)
Ralf M. Meyer (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg, Germany)
Birger Kollmeier (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg, Germany)
Thomas Brand (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg, Germany)
A “microscopic” model of phoneme recognition, which includes an auditory model and a simple speech recognizer, is adapted to model the recognition of single words within whole German sentences. “Microscopic” is defined twofold for this model: first, as analyzing the particular spectro-temporal structure of the speech waveforms, and second, as basing the recognition of whole sentences on the recognition of single words. This approach is evaluated on a large database of speech recognition results from normal-hearing and sensorineurally hearing-impaired listeners. Individual audiometric thresholds are accounted for by implementing a spectrally-shaped noise simulating the hearing threshold. Furthermore, a comparative challenge between the microscopic model and the “macroscopic” Speech Intelligibility Index (SII) is performed using the same listeners’ data. The results show that both models have similar correlations between modeled Speech Reception Thresholds (SRTs) and observed SRTs.
#8Does sentence complexity interfere with intelligibility in noise? Evaluation of the Oldenburg Linguistically and Audiologically Controlled Sentence Test (OLACS)
Verena Nicole Uslar (Institute of Physics, CvO University of Oldenburg, Germany)
Thomas Brand (Institute of Physics, CvO University of Oldenburg, Germany)
Mirko Hanke (Department of Modern Languages , CvO University of Oldenburg, Germany)
Rebecca Carroll (Department of Modern Languages , CvO University of Oldenburg, Germany)
Esther Ruigendijk (Department of Modern Languages , CvO University of Oldenburg, Germany)
Cornelia Hamann (Department of Modern Languages , CvO University of Oldenburg, Germany)
Birger Kollmeier (Institute of Physics, CvO University of Oldenburg, Germany)
The Oldenburg Linguistically and Audiologically Controlled Sentence Test (OLACS), which contains sentences with seven different grades of linguistic complexity, is introduced. The evaluation of this new German speech material was performed by presenting each sentence at three different SNRs to 36 normally hearing listeners. Sentence-specific discrimination functions were calculated and, for each sentence type, 40 sentences were selected. Differences of up to 3 dB occurred for the different grades of linguistic complexity. Interindividual differences in speech recognition rates of up to 30% occurred. Thus, on the one hand, the material seems appropriate for examining the influence of sentence complexity on speech recognition both qualitatively and quantitatively. On the other hand, the OLACS might be used for diagnostic purposes, e.g. to differentiate across individual listeners.
#9Intelligibility Predictions for Speech against Fluctuating Masker
Juan-Pablo Ramirez (AIPA and Quality and Usability, Deutsche Telekom Laboratories, Berlin Institute of Technology, Germany)
Alexander Raake (AIPA, Deutsche Telekom Laboratories, Berlin Institute of Technology, Germany)
Hamed Ketabdar (Quality and Usability Lab, Deutsche Telekom Laboratories, Berlin Institute of Technology, Germany)
The effect of masking by fluctuating sources on speech intelligibility is difficult to predict. Intelligibility scores vary with the efficiency of the energetic masking, while the linguistic content of the message and the listener’s cognitive performance add to the overall uncertainty, which peaks in the case of a speech masker. The present contribution proposes a signal-based assessment of the energetic masking at the sentence level. A mapping onto the scale of the Speech Intelligibility Index is established for stationary noise. Predictions are quantitatively compared with the results of an intelligibility test for speech-modulated noise. The model is independent of voice similarity and semantic features, two important sources of informational masking.
#10An Effect of Formant Amplitude in Vowel Perception
Masashi Ito (Department of Electrical and Intelligent Systems, Tohoku Institute of Technology, Japan)
Keiji Ohara (Research Institute of Electrical Communication, Tohoku University, Japan)
Akinori Ito (Graduate School of Engineering, Tohoku University, Japan)
Masafumi Yano (Research Institute of Electrical Communication, Tohoku University, Japan)
A psycho-acoustical experiment was conducted using synthetic vowel-like stimuli to examine the effect of formant amplitude on vowel perception. Nine combinations of formant frequencies were examined. For each combination, the relative amplitude of the third formant to the second was modified in seven steps. In eight of the nine combinations, the perceived vowel changed with the formant amplitude even though all formant frequencies were kept constant. Furthermore, this amplitude effect was observed even when the frequency separation of neighboring formants was greater than 3.5 Bark. The results suggest that formant amplitude, like formant frequency, is an effective cue for vowel perception.
#12Functional neuroimaging of brain regions sensitive to communication sounds in primates: A comparative summary
Christopher Petkov (Newcastle University)
Benjamin Wilson (Newcastle University)
There is considerable brain imaging evidence on the neural substrates of speech in humans, but only recently has data for comparison become available on the brain regions that process communication signals in other primates. To obtain insights into the relationship between the substrates for communication in primates, we compared the results from several neuroimaging studies in humans with those that have recently been obtained from macaque monkeys and chimpanzees. We note a striking general correspondence between the primates on the pattern of brain regions that process species-specific vocalizations and the acoustics related to voice identity.

Spoken Language Understanding and Spoken Language Translation I

Time:Wednesday 16:00 Place:International Conference Room D Type:Poster
Chair:Tatsuya Kawahara
#1Strategies for Statistical Spoken Language Understanding with Small Amount of Data -- an Empirical Study
Ye-Yi Wang (Microsoft Research)
Semantic frame based spoken language understanding involves two decisions: frame classification and slot filling. The two decisions can be made either separately or jointly. This paper compares the different strategies and presents empirical results in the conditional model framework when only a small amount of training data is available. It is found that while the two-pass classification/slot-filling solution results in much better frame classification accuracy, the joint model yields better results for slot filling. Application developers need to choose carefully the strategy appropriate to their application scenario.
#2Investigating multiple approaches for SLU portability to a new language
Bassam Jabaian (LIG, University Joseph Fourier, Grenoble - France)
Laurent Besacier (LIG, University Joseph Fourier, Grenoble - France)
Fabrice Lefevre (LIA, University of Avignon, Avignon - France)
The challenge in language portability of a spoken language understanding module is to reuse the knowledge and the data available in a source language to produce knowledge in the target language. In this paper several approaches are proposed, motivated by the availability of the MEDIA French dialogue corpus and its manual translation into Italian. The three portability methods investigated are based on statistical machine translation or automatic word alignment techniques and differ in the level of system development at which the translation is performed. Initial experimental results show the efficiency of the proposed methods for fast and low-cost SLU porting from French to Italian; the best performance is obtained by using translation only at the test level.
#3Learning Naturally Spoken Commands for a Robot
Anja Austermann (The Graduate University for Advanced Studies (SOKENDAI))
Seiji Yamada (The Graduate University for Advanced Studies (SOKENDAI), National Institute of Informatics)
Kotaro Funakoshi (Honda Research Institute Japan)
Mikio Nakano (Honda Research Institute Japan)
Enabling a robot to understand natural commands for Human-Robot-Interaction is a challenge that needs to be solved to enable novice users to interact with robots smoothly and intuitively. We propose a method to enable a robot to learn how its user utters commands in order to adapt to individual differences in speech usage. The learning method combines a stimulus encoding phase based on Hidden Markov models to encode speech sounds into units, modeling similar utterances, and a stimulus association phase based on classical conditioning to associate these models with their symbolic representations. Using this method, the robot is able to learn how its user utters parameterized commands, such as "Please put the book in the bookshelf" or "Can you clean the table for me?" through situated interaction with its user.
#4A semi-supervised cluster-and-label approach for utterance classification
Amparo Albalate (University of Ulm)
Aparna Suchindranath (University of Bremen)
Wolfgang Minker (University of Ulm)
In this paper we propose a novel cluster-and-label semi-supervised algorithm for utterance classification. The approach assumes that the underlying class distribution is roughly captured through fully unsupervised clustering. Then, a minimal set of labeled examples is used to automatically label the extracted clusters, so that the initial label set is "augmented" to the whole clustered data. The optimum cluster labeling is achieved by means of the Hungarian algorithm, a classical method for solving the optimal assignment problem. Finally, the augmented labeled set is used to train an SVM classifier. This semi-supervised approach has been compared to a fully supervised version, in which the initial labeled sets are directly used to train the SVM model.
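The cluster-labeling step can be sketched concretely. The following minimal illustration (not the authors' code) assigns one class label per cluster so that the overlap with a few labeled seeds is maximised, using SciPy's `linear_sum_assignment` implementation of the Hungarian algorithm; the toy data and the helper name `label_clusters` are invented for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def label_clusters(cluster_ids, seed_labels):
    """Assign one class label per cluster by maximising the overlap with
    the few labeled seed examples (Hungarian method); seed_labels uses -1
    for unlabeled examples."""
    clusters = np.unique(cluster_ids)
    classes = np.unique(seed_labels[seed_labels >= 0])
    # Overlap matrix: number of seeds of class c falling into cluster k.
    overlap = np.zeros((len(clusters), len(classes)))
    for i, k in enumerate(clusters):
        for j, c in enumerate(classes):
            overlap[i, j] = np.sum((cluster_ids == k) & (seed_labels == c))
    # linear_sum_assignment minimises cost, so negate to maximise overlap.
    rows, cols = linear_sum_assignment(-overlap)
    return {int(clusters[i]): int(classes[j]) for i, j in zip(rows, cols)}

cluster_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2])
seed_labels = np.array([-1, 1, -1, 0, -1, 2, -1, 2])   # four labeled seeds
print(label_clusters(cluster_ids, seed_labels))        # {0: 1, 1: 0, 2: 2}
```

The augmented labels produced this way would then feed the SVM training stage described in the abstract.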
#5Classifying Dialog Acts in Human-Human and Human-Machine Spoken Conversations
Silvia Quarteroni (DISI - University of Trento, Italy)
Giuseppe Riccardi (DISI - University of Trento, Italy)
Dialog acts represent the illocutionary aspect of communication conveyed in utterances; depending on the nature of the dialog and its participants, different types of dialog act occur and an accurate classification of these is essential to support the understanding of human conversations. We learn effective discriminative dialog act classifiers by studying the most predictive classification features on Human-Human and Human-Machine corpora such as LUNA and SWITCHBOARD; additionally, we assess classifier robustness to speech errors. Our results exceed the state of the art on dialog act classification from reference transcriptions on SWITCHBOARD and allow us to reach a satisfying performance on ASR transcriptions.
#7Exploring Speaker Characteristics for Meeting Summarization
Fei Liu (The University of Texas at Dallas)
Yang Liu (The University of Texas at Dallas)
In this paper, we investigate using meeting-specific characteristics to improve extractive meeting summarization, in particular, speaker-related attributes (such as verboseness, gender, native language, role in the meeting). A rich set of speaker-sensitive features is developed in the supervised learning framework. We perform experiments on the ICSI meeting corpus. Results are evaluated using multiple criteria, including ROUGE, a sentence-level F-measure, and an approximated Pyramid approach. We show that incorporating speaker characteristics can consistently improve summarization performance under various testing conditions.
#8Semi-Supervised Extractive Speech Summarization via Co-Training Algorithm
Shasha Xie (The University of Texas at Dallas)
Hui Lin (University of Washington)
Yang Liu (The University of Texas at Dallas)
Supervised methods for extractive speech summarization require a large training set. Summary annotation is often expensive and time consuming. In this paper, we exploit semi-supervised approaches to leverage unlabeled data. In particular, we investigate co-training for extractive meeting summarization. Compared with text summarization, speech summarization has the unique characteristic that its features naturally split into two sets: textual features and prosodic/acoustic features. This characteristic makes co-training an appropriate approach for semi-supervised speech summarization. Our experiments on the ICSI meeting corpus show that by utilizing the unlabeled data, co-training significantly improves summarization performance when only a small amount of labeled data is available.
#9Extractive Summarization using A Latent Variable Model
Asli Celikyilmaz (University of California, Berkeley)
Dilek Hakkani-Tur (ICSI)
Extractive multi-document summarization is the task of choosing the sentences from documents to compose a summary text in response to a user query. We propose a generative approach to explicitly identify summary and non-summary topic distributions in document cluster sentences. Using these approximate summary topic probabilities as latent output variables, we build a discriminative classifier model. The sentences in new document clusters are inferred using the trained model. In our experiments we find that the proposed summarization approach is effective in comparison to the state-of-the-art methods.
#10Hierarchical Classification for Speech-to-Speech Translation
Emil Ettelaie (University of Southern California)
Panayiotis G. Georgiou (University of Southern California)
Shrikanth S. Narayanan (University of Southern California)
Concept classifiers have been used in speech-to-speech translation systems. Their effectiveness, however, depends on the size of the domain that they cover. The main bottleneck in expanding the classifier domain is the degradation in accuracy as the number of classes increases. Here we introduce a hierarchical classification process that aims to scale up the domain without compromising accuracy. We propose to exploit the categorical associations that naturally appear in the training data to split the domain into sub-domains with fewer classes. We use two methods, language model based classification and topic modeling with latent Dirichlet allocation, to exploit discourse information for sub-domain detection. The classification task is performed in two steps: first, the best category for the discourse is detected using one of the above methods; then a sub-domain classifier limited to that category is deployed. Empirical results from our experiments show higher accuracy for the proposed method compared to a single-layered classifier.
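The two-step scheme can be illustrated with the language-model-based variant. Below is a toy sketch (not the paper's system): add-one-smoothed unigram language models score the utterance first at the category level, then within the winning category. The domain data and helper names are invented for the example.

```python
import math
from collections import Counter

def train_lm(texts):
    """Unigram LM with add-one smoothing; returns a log-likelihood scorer."""
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values())
    return lambda t, V: sum(
        math.log((counts[w] + 1) / (total + V)) for w in t.split())

# Toy domain split into two categories with two concept classes each.
data = {
    "medical": {"pain": ["my head hurts", "it hurts badly"],
                "meds": ["i need my pills", "more pills please"]},
    "travel":  {"route": ["which road goes north", "the road is long"],
                "lodging": ["i need a room", "a room for tonight"]},
}
vocab = {w for cat in data.values() for cls in cat.values()
         for t in cls for w in t.split()}
V = len(vocab)
cat_lms = {c: train_lm([t for cls in d.values() for t in cls])
           for c, d in data.items()}
cls_lms = {c: {k: train_lm(ts) for k, ts in d.items()}
           for c, d in data.items()}

def classify(utt):
    cat = max(cat_lms, key=lambda c: cat_lms[c](utt, V))             # step 1
    cls = max(cls_lms[cat], key=lambda k: cls_lms[cat][k](utt, V))   # step 2
    return cat, cls

print(classify("my head hurts badly"))   # ('medical', 'pain')
```

Because the second-stage classifier only competes among the classes of one category, each stage faces far fewer classes than a flat classifier over the whole domain.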
#11Rapid Development of Speech Translation using Consecutive Interpretation
Matthias Paulik (Carnegie Mellon & Karlsruhe Institute of Technology)
Alexander Waibel (Carnegie Mellon & Karlsruhe Institute of Technology)
The development of a speech translation (ST) system is costly, largely because it is expensive to collect parallel data. A new language pair is typically only considered in the aftermath of an international crisis that incurs a major need of cross-lingual communication. Urgency justifies the deployment of interpreters while data is being collected. In recent work, we have shown that audio recordings of interpreter-mediated communication can present a low-cost data resource for the rapid development of automatic text and speech translation. However, our previous experiments remain limited to English/Spanish simultaneous interpretation. In this work, we examine our approaches for exploiting interpretation audio as translation model training data in the context of English/Pashto consecutive interpretation. We show that our previously made findings remain valid, despite the more complex language pair and the additional challenges introduced by the strong resource-limitations of Pashto.
#12Combining Many Alignments for Speech to Speech Translation
Sameer Maskey (IBM)
Steven Rennie (IBM)
Bowen Zhou (IBM)
Alignment combination (symmetrization) has been shown to be useful for improving Machine Translation (MT) models. Most existing alignment combination techniques are based on heuristics, and can combine only two sets of alignments at a time. Recently, we proposed a power mean based algorithm that can be optimized to combine an arbitrary number of alignment tables simultaneously. In this paper we present an empirical investigation of the merits of the approach for combining a large number of alignments (more than 200 in total before pruning). The results of the study suggest that the algorithm can often improve the performance of speech to speech translation systems for low resource languages.
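The power-mean combination itself is easy to state. The sketch below is an interpretation of the idea, not the authors' optimized algorithm: an element-wise power mean is applied across soft alignment tables, and the exponent interpolates between intersection-like (p → −∞, the minimum) and union-like (p → +∞, the maximum) combination, with p = 1 the arithmetic mean. The clipping constant is an implementation convenience added here to guard against zero raised to a negative power.

```python
import numpy as np

def power_mean(tables, p, eps=1e-3):
    """Element-wise power mean of K soft alignment tables.
    p -> -inf approaches the intersection (min), p -> +inf the union
    (max), and p = 1 is the arithmetic mean."""
    A = np.stack(tables).clip(eps)       # eps guards 0 ** negative
    return np.mean(A ** p, axis=0) ** (1.0 / p)

a1 = np.array([[1.0, 0.0], [0.0, 1.0]])   # two toy 2x2 alignment tables
a2 = np.array([[1.0, 1.0], [0.0, 1.0]])
print(power_mean([a1, a2], 1))            # arithmetic mean
print(power_mean([a1, a2], -20))          # close to the element-wise min
```

Unlike pairwise heuristic symmetrization, the same expression combines any number of tables at once, and p can be tuned on held-out data.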

Special Session: Compressive Sensing for Speech and Language Processing

Time:Wednesday 16:00 Place:301 Type:Special
Chair:Tara Sainath & Bhuvana Ramabhadran
16:00Towards a robust face recognition system using compressive sensing
Allen Yang (University of California, Berkeley)
Zihan Zhou (University of Illinois)
Yi Ma (University of Illinois)
Shankar Sastry (University of California, Berkeley)
An application of compressive sensing (CS) theory to image-based robust face recognition is considered. Motivated by CS, the problem has recently been cast in a sparse representation framework: the sparsest linear combination representing a query image is sought using all prior training images as an overcomplete dictionary, and the dominant sparse coefficients reveal the identity of the query image. The ability to perform dense error correction directly in the image space also provides an intriguing way to compensate for pixel corruption, with recognition accuracy exceeding most existing solutions. Furthermore, a local iterative process can be applied to solve for an image transformation of the face region when the query image is misaligned. Finally, we discuss the state of the art in fast algorithms to improve the speed of the system. The paper also provides useful guidelines to practitioners working in similar fields, such as acoustic/speech recognition.
16:20Exemplar-Based Sparse Representation Features for Speech Recognition
Tara Sainath (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
David Nahamoo (IBM T.J. Watson Research Center)
Dimitri Kanevsky (IBM T.J. Watson Research Center)
Abhinav Sethy (IBM T.J. Watson Research Center)
In this paper, we explore the use of exemplar-based sparse representations (SRs) to map test features into the linear span of training examples. We show that the frame classification accuracy with these new features is 1.3% higher than with a Gaussian Mixture Model (GMM), showing that SRs not only move test features closer to training, but also move them closer to the correct class. Given these new SR features, we train a Hidden Markov Model (HMM) on them and perform recognition. On the TIMIT corpus, we find that applying the SR features on top of our best discriminatively trained system allows for a 0.7% absolute reduction in phonetic error rate (PER), from 19.9% to 19.2%. In fact, after applying model adaptation we reduce the PER to 19.0%, the best result on TIMIT to date. Furthermore, on a large vocabulary 50-hour broadcast news task, we achieve a reduction in word error rate (WER) of 0.3% absolute, demonstrating the benefit of these SR features for large vocabulary tasks.
16:40Data Selection for Language Modeling Using Sparse Representations
Abhinav Sethy (IBM TJ Watson Research Center)
Tara Sainath (IBM TJ Watson Research Center)
Bhuvana Ramabhadran (IBM TJ Watson Research Center)
Dimitri Kanevsky (IBM TJ Watson Research Center)
The ability to adapt language models to specific domains from large generic text corpora is of considerable interest to the language modeling community. One of the key challenges is to identify the text material relevant to a domain in the generic text collection. The text selection problem can be cast in a semi-supervised learning framework where the initial hypothesis from a speech recognition system is used to identify relevant training material. We present a novel sparse representation formulation which selects a sparse set of relevant sentences from the training data which match the test set distribution. In this formulation, the training sentences are treated as the columns of the sparse representation matrix and the n-gram counts as the rows. The target vector is the n-gram probability distribution for the test data. A sparse solution to this problem formulation identifies a few columns which can best represent the target test vector, thus identifying the relevant set of sentences from the training data. Rescoring results with the language model built from the data selected using the proposed method yields modest gains on the English broadcast news RT-04 task, reducing the word error rate from 14.6% to 14.4%.
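The formulation lends itself to a compact sketch. Below, each training sentence contributes one column of normalised n-gram (here unigram) counts, the target is the test-set distribution, and an L1-regularised fit selects a sparse set of sentences; the toy corpus and the use of scikit-learn's `Lasso` as the sparse solver are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy setup: each column of A is one training sentence's normalised
# n-gram count vector (unigrams for brevity); the target is the test-set
# n-gram distribution.
vocab = ["stocks", "fell", "rain", "tomorrow", "goal", "scored"]
train_sents = [
    "stocks fell stocks fell",      # finance
    "rain tomorrow rain rain",      # weather
    "goal scored goal goal",        # sports
]

def ngram_dist(text):
    v = np.array([text.split().count(w) for w in vocab], dtype=float)
    return v / v.sum()

A = np.stack([ngram_dist(s) for s in train_sents], axis=1)
target = ngram_dist("stocks fell fell stocks")   # e.g. first-pass ASR output

# An L1-regularised fit yields a sparse weight vector over sentences.
w = Lasso(alpha=0.01, positive=True, fit_intercept=False).fit(A, target).coef_
selected = [s for s, wi in zip(train_sents, w) if wi > 1e-6]
print(selected)
```

The sentences with nonzero weight are the ones whose n-gram profile best reconstructs the test distribution, mirroring the column-selection view in the abstract.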
17:00Observation uncertainty measures for sparse imputation
Jort F. Gemmeke (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Ulpu Remes (Adaptive Informatics Research Centre, Aalto University, Finland)
Kalle J. Palomäki (Adaptive Informatics Research Centre, Aalto University, Finland)
Missing data imputation estimates the clean speech features for automatic speech recognition in noisy environments. The estimates are usually considered equally reliable, while in reality the estimation accuracy varies from feature to feature. In this work, we propose uncertainty measures to characterise the expected accuracy of a sparse imputation (SI) based missing data method. In experiments on noisy large vocabulary speech data, using observation uncertainties derived from the proposed measures improved the speech recognition performance on features estimated with SI. Relative error reductions of up to 15% compared to the baseline system using SI without uncertainties were achieved with the best measures.
17:20Sparse Representations for Text Categorization
Tara Sainath (IBM T.J. Watson Research Center)
Sameer Maskey (IBM T.J. Watson Research Center)
Dimitri Kanevsky (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
David Nahamoo (IBM T.J. Watson Research Center)
Julia Hirschberg (Department of Computer Science, Columbia University)
Sparse representations (SRs) are often used to characterize a test signal using a few support training examples, and allow the number of supports to be adapted to the specific signal being categorized. Given the good performance of SRs compared to other classifiers for both image and phonetic classification, in this paper, we extend the use of SRs for text classification, a method which has thus far not been explored for this domain. Specifically, we demonstrate how sparse representations can be used for text classification and how their performance varies with the vocabulary size of the document features. In addition, we also show that this method offers promising results over the Naive Bayes (NB) classifier, a standard classifier used for text classification, thus introducing an alternative class of methods for text categorization.
17:40Sparse Auto-associative Neural Networks: Theory and Application to Speech Recognition
Sivaram Garimella (Johns Hopkins University)
Sriram Ganapathy (Johns Hopkins University)
Hynek Hermansky (Johns Hopkins University)
This paper introduces the sparse auto-associative neural network (SAANN), in which the internal hidden layer output is forced to be sparse. This is achieved by adding a sparsity regularization term to the original reconstruction error cost function and updating the parameters of the network to minimize the overall cost. We show the applicability of this network to phoneme recognition by extracting sparse hidden layer outputs (used as features) from a network trained on perceptual linear prediction (PLP) cepstral coefficients in an unsupervised manner. Experiments with the SAANN features on a state-of-the-art TIMIT phoneme recognition system show a relative improvement in phoneme error rate of 5.1% over the baseline PLP features.
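The cost function, reconstruction error plus a sparsity penalty on the hidden activations, can be written out in a few lines of NumPy. This is a toy gradient-descent sketch on random data standing in for PLP features, not the authors' network architecture or training recipe; all sizes and hyperparameters are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))             # stand-in for PLP feature frames
W1 = 0.1 * rng.standard_normal((8, 16)); b1 = np.zeros(16)
W2 = 0.1 * rng.standard_normal((16, 8)); b2 = np.zeros(8)
lam, lr = 0.05, 0.1                           # sparsity weight, learning rate
relu = lambda z: np.maximum(z, 0.0)

def cost():
    H = relu(X @ W1 + b1)
    return np.mean((H @ W2 + b2 - X) ** 2) + lam * np.abs(H).mean()

err0 = cost()
for _ in range(500):
    H = relu(X @ W1 + b1)                       # hidden code = sparse feature
    R = H @ W2 + b2                             # reconstruction
    dR = 2 * (R - X) / X.size                   # grad of mean squared error
    dH = dR @ W2.T + lam * np.sign(H) / H.size  # + grad of the L1 penalty
    dH[H <= 0] = 0.0                            # ReLU gate
    W2 -= lr * H.T @ dR; b2 -= lr * dR.sum(0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)
err1 = cost()
print(err1 < err0, float(np.mean(relu(X @ W1 + b1) < 1e-6)))
```

The hidden activations `H` play the role of the SAANN features; the L1 term drives many of them toward zero while the reconstruction term keeps the code informative.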

Speech Translation Technology: Communication beyond Language Barriers

Time:Thursday 08:30 Place:Hall A/B Type:Special
08:30Speech Translation, a Field in Transition: from Research Lab to Deployment
Alex Waibel (Carnegie Mellon University, USA)
09:00Large Scale Field Experiments and Analysis of Speech-to-speech translation in Japan
Satoshi Nakamura (National Institute of Information and Communications Technology, Japan)
10:00Two decades of statistical translation of language and speech: where do we stand?
Hermann Ney (RWTH Aachen University, Germany)
10:30IBM Real-Time Translation Services (RTTS)
Yaser Al-Onaizan (IBM T.J. Watson Research Center, USA)
11:00Demo Session
---- ---- (----)

Physiology and Pathology of Spoken Language

Time:Thursday 10:00 Place:201A Type:Oral
Chair:Francis Grenez
10:00Reliable tracking based on speech sample salience of vocal cycle length perturbations
Christophe Mertens (Université Libre de Bruxelles)
Francis Grenez (Université Libre de Bruxelles)
Lise Crevier-Buchman (Hôpital Européen Georges Pompidou, Paris, France)
Jean Schoentgen (National Fund for Scientific Research, Belgium)
The presentation concerns a method for tracking cycle lengths in voiced speech. Speech cycles are detected via the saliences of the signal samples, where the salience of a sample is the length of the temporal interval over which that sample is the maximum. The tracking of the cycle lengths is based on a dynamic programming algorithm which does not require that the signal be locally periodic or that the average period length be known a priori. The method has been validated on a corpus of normophonic speakers. The results report the tremor frequency and the modulation depth of the vocal frequency for 72 ALS and 8 normophonic speakers.
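The sample-salience idea can be sketched directly: a sample's salience is the length of the interval on which it is the maximum, so the per-cycle maxima of a voiced signal stand out. The code below is illustrative only (the paper's dynamic-programming tracker is not reproduced) and marks samples whose salience spans at least one expected period of a synthetic signal; the signal and threshold are invented for the example.

```python
import numpy as np

def salience(x):
    """salience[i] = length of the longest interval on which x[i] is the
    maximum (O(n^2) reference version; per-cycle peaks score highest)."""
    n, s = len(x), np.zeros(len(x), dtype=int)
    for i in range(n):
        l = i
        while l > 0 and x[l - 1] < x[i]:
            l -= 1
        r = i
        while r < n - 1 and x[r + 1] < x[i]:
            r += 1
        s[i] = r - l + 1
    return s

# Synthetic "voiced" signal with one dominant peak per 50-sample cycle.
period = 50
t = np.arange(400)
x = np.sin(2 * np.pi * t / period) + 0.3 * np.sin(4 * np.pi * t / period)
marks = np.flatnonzero(salience(x) >= period)   # salient over >= one period
print(np.diff(marks))                           # constant cycle length of 50
```

The distances between the marked samples recover the cycle length; on real disordered speech these distances vary from cycle to cycle, which is where the dynamic-programming tracking becomes necessary.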
10:20Longitudinal Changes of Selected Voice Source Parameters
Hideki Kasuya (Utsunomiya University)
Hajime Yoshida (Yoshida Clinic)
Satoshi Ebihara (Kyoundo Hospital)
Hiroki Mori (Utsunomiya University)
Longitudinal changes were investigated for selected voice source parameters: fundamental frequency (F0), jitter (period perturbation quotient, PPQ), shimmer (amplitude perturbation quotient, APQ) and glottal noise (normalized noise energy, NNE). Acoustic analyses were made on the sustained phonation of the Japanese vowel /a/ by 20 males and 38 females with no laryngeal disease, recorded over periods ranging from 10 to 18 years. The longitudinal change of the parameters was evaluated with the t-test, revealing that: 1) strong individuality existed in the significant longitudinal changes of the acoustic parameters, 2) falling F0 is a more typical sign of vocal aging in females than in males, while the rising F0 reported for males in previous studies was not found, 3) shimmer is a more observable indication of vocal aging than jitter, and 4) glottal noise in the high-frequency region tends to increase with aging.
10:40Automatic perceptual categorization of disordered connected speech
Ali Alpan (Laboratory of Images, Signals & Telecommunication Devices, Université Libre de Bruxelles)
Jean Schoentgen (Laboratory of Images, Signals & Telecommunication Devices, Université Libre de Bruxelles)
Youri Maryn (Department of Otorhinolaryngology and Head & Neck Surgery, Department of Speech-Language Pathology and Audiology, Sint-Jan General Hospital)
Francis Grenez (Laboratory of Images, Signals & Telecommunication Devices, Université Libre de Bruxelles)
The objective of the presentation is to report experiments involving the automatic classification of disordered connected speech into binary (normal, pathological) or multiple (modal, moderately hoarse, severely hoarse) categories. The multi-category classification according to the perceived degree of hoarseness is considered clinically meaningful and desirable, given that reliable perceptual classification of disordered voice stimuli by humans is known to be difficult and time-consuming. The acoustic cues are temporal signal-to-dysperiodicity ratios as well as mel-frequency cepstral coefficients. The classifiers are support vector machines which have been trained and tested on two connected speech corpora. The binary classification accuracy was high (98%) for both sets of acoustic cues. The multi-category classification accuracy was 70% when based on signal-to-dysperiodicity ratios and 59% when based on mel-frequency cepstral coefficients.
11:00Kinematic Analysis of Tongue Movement Control in Spastic Dysarthria
Heejin Kim (Beckman Institute, University of Illinois, Urbana, USA)
Panying Rong (Department of Speech and Hearing, University of Illinois, Urbana, USA)
Torrey Loucks (Department of Speech and Hearing, University of Illinois, Urbana, USA)
Mark Hasegawa-Johnson (Department of Electrical and Computer Engineering, University of Illinois, Urbana, USA)
This study provided a quantitative analysis of the kinematic deviances in dysarthria associated with spastic cerebral palsy. Of particular interest were tongue tip movements during alveolar consonant release. Our analysis based on kinematic measures indicated that speakers with spastic dysarthria had a restricted range of articulation and disturbances in articulatory-voicing coordination. The degree of kinematic deviances was greater for lower intelligibility speakers, supporting an association between articulatory dysfunctions and intelligibility in spastic dysarthria.
11:20Pre- and short-term posttreatment vocal functioning in patients with advanced head and neck cancer treated with concomitant chemoradiotherapy
Irene Jacobi (Department of Head and Neck Oncology and Surgery, The Netherlands Cancer Institute, Amsterdam, The Netherlands)
Lisette van der Molen (Department of Head and Neck Oncology and Surgery, The Netherlands Cancer Institute, Amsterdam, The Netherlands)
Maya van Rossum (Department of Otorhinolaryngology, Leiden University Medical Centre, Leiden, The Netherlands)
Frans Hilgers (Department of Head and Neck Oncology and Surgery, The Netherlands Cancer Institute, Amsterdam, The Netherlands)
Forty-seven patients with advanced larynx/hypopharynx, nasopharynx or oropharynx/oral cavity cancer were recorded before, and 10 weeks after, concomitant chemoradiotherapy (CCRT), to investigate the effect of the tumor versus the effects of treatment. To evaluate voice functioning before and after treatment, voice quality and glottal behavior of sustained /a/ vowels were analyzed acoustically and compared with patient-based data on cigarette and alcohol usage. Acoustic measures of effort, nasality and regularity, such as periodicity or harmonics-to-noise ratio, differed significantly and progressed differently depending on the three distinct cancer/radiation sites. Baseline measures of voice stability correlated significantly with alcohol/smoking behavior.
11:40Acoustic Analysis of Intonation in Parkinson’s Disease
Joan Ma (Queen Margaret University)
Ruediger Hoffmann (Dresden University of Technology)
The aim of this study was to explore the prosodic characteristics of speakers with Parkinson’s disease (PD) in the marking of intonation. Twenty-four German PD speakers with either a mild or moderate degree of dysarthria were compared with twelve non-dysarthric control speakers on the production of imperatives, questions and statements. Acoustic analyses of fundamental frequency (average F0, F0 range and F0 envelope), intensity (average intensity, intensity range and intensity envelope) and speech rate (number of syllables per second) were conducted to investigate the effect of PD on intonation marking. The results showed that the dysarthric and non-dysarthric speakers differed significantly in all F0 measures, with higher average F0 and reduced F0 variability noted for the PD speakers. Although the PD speakers were more monotonous in prosody, they showed intonation contrasts similar to those of the non-dysarthric speakers.

Pitch and glottal-waveform estimation and modeling II

Time:Thursday 10:00 Place:201B Type:Oral
Chair:Yannis Stylianou
10:00SAFE: a Statistical Algorithm for F0 Estimation for Both Clean and Noisy Speech
Wei Chu (University of California, Los Angeles)
Abeer Alwan (University of California, Los Angeles)
A novel Statistical Approach for F0 Estimation, SAFE, is proposed to improve the accuracy of F0 tracking under both clean and additive noise conditions. Prominent Signal-to-Noise Ratio (SNR) peaks in speech spectra are a robust information source from which F0 can be inferred. A probabilistic framework is proposed to model the effect of additive noise on voiced speech spectra. It is observed that prominent SNR peaks located in the low frequency band are important for F0 estimation, and that prominent SNR peaks in the middle and high frequency bands provide useful supplemental information under noisy conditions, especially babble noise. Experiments show that the SAFE algorithm has the lowest Gross Pitch Error (GPE) compared to prevailing F0 trackers (Get_F0, Praat, TEMPO, and YIN) in white and babble noise conditions at low SNRs.
10:20Robust and Efficient Pitch Estimation using an Iterative ARMA Technique
Jung Ook Hong (Harvard University)
Patrick J. Wolfe (Harvard University)
In this article, we propose an innovative way of estimating pitch from speech waveform data, using an iterative ARMA technique that efficiently estimates multiple frequency components of a time series. Additionally, the harmonic structure of voiced speech and the smoothness of its pitch period are incorporated into the iterative ARMA technique, and this novel integration results in an efficient, robust technique for pitch estimation. The KED-TIMIT database was used to evaluate the performance of our proposed algorithm against that of other state-of-the-art pitch estimators in terms of both root mean square error and gross error rate.
10:40Statistical Modeling of F0 Dynamics in Singing Voices Based on Gaussian Processes with Multiple Oscillation Bases
Yasunori Ohishi (NTT Communication Science Laboratories, NTT Corporation)
Hirokazu Kameoka (NTT Communication Science Laboratories, NTT Corporation)
Daichi Mochihashi (NTT Communication Science Laboratories, NTT Corporation)
Hidehisa Nagano (NTT Communication Science Laboratories, NTT Corporation)
Kunio Kashino (NTT Communication Science Laboratories, NTT Corporation)
We present a novel statistical model for dynamics of various singing behaviors, such as vibrato and overshoot, in a fundamental frequency (F0) contour. These dynamics are the important cues for perceiving individuality of a singer, and can be a useful measure for various applications, such as singing skill evaluation and singing voice synthesis. While most previous studies have modeled the dynamics using a second-order linear system, the automatic and accurate estimation of model parameters has yet to be accomplished. In this paper, we first develop a complete stochastic representation of the second-order system with Gaussian processes from parametric discretization, and propose a complete, efficient scheme for parameter estimation using the Expectation-Maximization (EM) algorithm. Experimental results show that the proposed method can decompose an F0 contour into a musical component and a dynamics component. Finally, we discuss estimating singing styles from the model parameters for each singer.
11:00Applying Geometric Source Separation for Improved Pitch Extraction in Human-Robot Interaction
Martin Heckmann (Honda Research Institute Europe GmbH)
Claudius Gläser (Honda Research Institute Europe GmbH)
Frank Joublin (Honda Research Institute Europe GmbH)
Kazuhiro Nakadai (Honda Research Institute Japan Co. Ltd.)
We present a system for robust pitch extraction in noisy and echoic environments, consisting of multi-channel signal enhancement, a biologically inspired pitch extraction algorithm, and pitch tracking based on a Bayesian filter. The multi-channel signal enhancement deploys an 8-channel Geometric Source Separation (GSS). During pitch extraction we apply a Gammatone filter bank and then calculate a histogram of zero-crossing distances from the band-pass signals. While calculating the histogram, spurious side peaks at harmonics and sub-harmonics of the true fundamental frequency are inhibited. The subsequent grid-based Bayesian tracker comprises Bayesian filtering in a forward step and Bayesian smoothing in a backward step. We evaluate the system in a realistic human-robot interaction scenario with several male and female speakers. We also compare against two well-established pitch extraction frameworks: get_f0, included in the WaveSurfer toolkit, and Praat.
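The zero-crossing-distance idea can be sketched on a single synthetic band (the paper applies it per Gammatone band and inhibits harmonic side peaks; neither step is reproduced here):

```python
import math
from collections import Counter

def zc_pitch(signal, fs):
    """Estimate f0 from the mode of the distances between successive
    positive-going zero crossings. This is a single-band sketch of the
    histogram step; the full system pools histograms over a Gammatone
    filter bank and suppresses (sub-)harmonic side peaks."""
    crossings = [n for n in range(1, len(signal))
                 if signal[n-1] < 0 <= signal[n]]
    dists = [b - a for a, b in zip(crossings, crossings[1:])]
    if not dists:
        return 0.0
    period, _ = Counter(dists).most_common(1)[0]
    return fs / period

fs, f0 = 16000, 220.0
tone = [math.sin(2*math.pi*f0*n/fs) for n in range(1600)]
est = zc_pitch(tone, fs)
```

On a clean tone the estimate is within one histogram bin of the true f0; the Bayesian tracker in the paper handles the noisy, multi-peaked histograms of real speech.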
11:20A spectral LF model based approach to voice source parameterisation
John Kane (Trinity College Dublin)
Mark Kane (University College Dublin)
Christer Gobl (Trinity College Dublin)
This paper presents a new method of extracting LF model based parameters using a spectral model matching approach. Strategies are described for overcoming some of the known difficulties of this type of approach, in particular high-frequency noise. The new method performed well compared to a typical time-based method, particularly in terms of robustness against distortions introduced by the recording system and the ability of the extracted parameters to differentiate three discrete voice qualities. Results from this study are very promising for the new method and offer a way of extracting a set of non-redundant spectral parameters that may be very useful in both recognition and synthesis systems.
11:40Glottal-based Analysis of the Lombard Effect
Thomas Drugman (University of Mons)
Thierry Dutoit (University of Mons)
The Lombard effect refers to the speech changes due to the immersion of the speaker in a noisy environment. Among these changes, studies have already reported acoustic modifications mainly related to the vocal tract behaviour. In a complementary way, this paper investigates the variation of the glottal flow in Lombard speech. For this, the glottal flow is estimated by a closed-phase analysis and parametrized by a set of time and spectral features. Through a study on a database containing 25 speakers uttering in clean and noisy environments (with 4 noise types at 2 levels), it is highlighted that the glottal source is significantly modified due to the increased vocal effort. Such changes are of interest in several applications of speech processing, such as speech or speaker recognition, or speech synthesis.

ASR: Feature Extraction II

Time:Thursday 10:00 Place:302 Type:Oral
Chair:Richard Stern
10:00Hidden Logistic Linear Regression for Support Vector Machine based Phone Verification
Li Bo (National University of Singapore)
Sim Khe Chai (National University of Singapore)
A phone verification approach to mispronunciation detection using a combination of a Neural Network (NN) and a Support Vector Machine (SVM) has been shown to yield improved verification performance. This approach uses a NN to predict the HMM state posterior probabilities. The average posterior probability vectors computed over each phone segment are used as input features to an SVM back-end to generate the final verification scores. In this paper, a novel Hidden Logistic Feature (HLF) for the SVM back-end is proposed, where the sigmoid activations of the hidden layer, which carry rich information from the NN, are used instead of the output layer; the generation of HLFs can be interpreted as a hidden logistic linear regression process. Experiments on the TIMIT database show that the proposed HLF gives the lowest Equal Error Rate of 3.63%.
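The feature-extraction step can be illustrated as follows; a sketch with random, hypothetical hidden-layer weights and dimensions, showing only how sigmoid hidden activations are averaged over a phone segment before the SVM back-end:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_logistic_feature(frames, W, b):
    """Average the sigmoid activations of an MLP hidden layer over a
    phone segment. This per-segment average is the kind of vector the
    paper feeds to the SVM back-end in place of output-layer posteriors."""
    acts = []
    for x in frames:
        h = [sigmoid(sum(wi*xi for wi, xi in zip(row, x)) + bi)
             for row, bi in zip(W, b)]
        acts.append(h)
    n = len(acts)
    return [sum(col) / n for col in zip(*acts)]

random.seed(0)
dim_in, dim_hid = 13, 8           # hypothetical: 13-dim frames, 8 hidden units
W = [[random.gauss(0, 0.1) for _ in range(dim_in)] for _ in range(dim_hid)]
b = [0.0] * dim_hid
segment = [[random.gauss(0, 1) for _ in range(dim_in)] for _ in range(20)]
hlf = hidden_logistic_feature(segment, W, b)
```

Each component of the resulting vector lies in (0, 1), one value per hidden unit, regardless of segment length.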
Tim Ng (Raytheon BBN Technologies)
Bing Zhang (Raytheon BBN Technologies)
Long Nguyen (Raytheon BBN Technologies)
In the past decade, methods to extract long-term acoustic features for speech recognition using Multi-Layer Perceptrons have been proposed. These features have proved to be good complementary features in some feature augmentations and/or through system combination. Usually, conventional linear dimension reduction algorithms, e.g. Linear Discriminant Analysis, are not applied to the combined features. In this paper, the Region Dependent Transform is applied to jointly optimize the feature combination under a discriminative training criterion. Compared to a conventional augmentation, a 3% to 6% relative character error rate reduction for Mandarin speech recognition has been achieved using the Region Dependent Transform.
10:40Invariant Integration Features Combined with Speaker-Adaptation Methods
Florian Müller (Institute for Signal Processing, University of Lübeck)
Alfred Mertins (Institute for Signal Processing, University of Lübeck)
Speaker-normalization and -adaptation methods are essential components of state-of-the-art speech recognition systems nowadays. Recently, so-called invariant integration features were presented, which are motivated by the theory of invariants. While it was shown that the integration features outperform MFCCs when used with a basic monophone recognition system, it remained open whether their benefits can still be observed when a more sophisticated recognition system with speaker-normalization and/or speaker-adaptation components is used. This work investigates the combination of the integration features with standard speaker-normalization and -adaptation methods. We show that the integration features benefit from adaptation methods and significantly outperform MFCCs in matching as well as in mismatching training-test conditions.
11:00Multi resolution discriminative models for subvocalic speech recognition
Mark Raugas (Raytheon BBN Technologies)
Vivek Kumar Rangarajan Sridhar (Raytheon BBN Technologies)
Rohit Prasad (Raytheon BBN Technologies)
Prem Natarajan (Raytheon BBN Technologies)
In this work, we investigate the use of discriminative models for automatic speech recognition of subvocalic speech via surface electromyography (sEMG). We also investigate the suitability of multiresolution analysis in the form of discrete wavelet transform (DWT) for sEMG-based speech recognition. We examine appropriate dimensionality reduction techniques for features extracted using different wavelet families and compare our results with the conventional mel-frequency cepstral coefficients (MFCC) used in speech recognition. Our results indicate that a simple model fusion between cepstral and wavelet domain features can achieve superior recognition performance. Fusing the MFCC and wavelet based SVM models using principal component analysis for feature reduction yields the best performance, with a mean accuracy of 95.13% over a set of nine speakers on a 65 word closed vocabulary task.
11:20A Comparative Large Scale Study of MLP Features for Mandarin ASR
Fabio Valente (IDIAP Research Institute, CH-1920 Martigny, Switzerland)
Mathew Magimai-Doss (IDIAP Research Institute, CH-1920 Martigny, Switzerland)
Christian Plahl (Human Language Technology and Pattern Recognition, RWTH Aachen University, Germany)
Suman Ravuri (International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704)
Wen Wang (Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, USA.)
MLP based front-ends have shown significant complementary properties to conventional spectral features. As part of the DARPA GALE program, different MLP features were developed for Mandarin ASR. In this paper, all the proposed front-ends are compared in a systematic manner, and we extensively investigate the scalability of these features in terms of amount of training data (from 100 hours to 1600 hours) and system complexity (maximum likelihood training, SAT training, lattice level combination, and discriminative training). Results on 5 hours of evaluation data from the GALE project reveal that the MLP features consistently produce relative improvements in the range of 15%-23% at the different steps of a multipass system when compared to conventional short-term spectral features like MFCC and PLP.
11:40Recognizing Cochlear Implant-like Spectrally Reduced Speech with HMM-based ASR: Experiments with MFCCs and PLP Coefficients
Cong-Thanh Do (Institut Télécom, Télécom Bretagne, UMR CNRS 3192 Lab-STICC)
Dominique Pastor (Institut Télécom, Télécom Bretagne, UMR CNRS 3192 Lab-STICC)
Gaël Le Lan (Institut Télécom, Télécom Bretagne, UMR CNRS 3192 Lab-STICC)
André Goalic (Institut Télécom, Télécom Bretagne, UMR CNRS 3192 Lab-STICC)
In this paper, we investigate the recognition of cochlear implant-like spectrally reduced speech (SRS) using conventional speech features (MFCCs and PLP coefficients) and HMM-based ASR. The SRS was synthesized from subband temporal envelopes extracted from original clean speech for testing, whereas the acoustic models were trained on a different set of original clean speech signals from the same speech database. It was shown that changing the bandwidth of the subband temporal envelopes had no significant effect on the ASR word accuracy. In addition, increasing the number of frequency subbands of the SRS from 4 to 16 significantly improved the system performance. Furthermore, the ASR word accuracy attained with the original clean speech, using both MFCC-based and PLP-based speech features, can be achieved with the 16-, 24-, or 32-subband SRS. The experiments were carried out using the TI-digits speech database and the HTK speech recognition toolkit.

Speaker diarization

Time:Thursday 10:00 Place:International Conference Room A Type:Poster
Chair:Gerald Friedland
#1A Hybrid Approach to Online Speaker Diarization
Carlos Vaquero (University of Zaragoza, Zaragoza Spain)
Oriol Vinyals (University of California and ICSI, Berkeley, CA, USA)
Gerald Friedland (International Computer Science Institute (ICSI), Berkeley, CA, USA)
This article presents a low-latency speaker diarization system (“who is speaking now?”) based on a hybrid approach that combines a traditional offline speaker diarization system (“who spoke when?”) with an online speaker identification system. The system fulfills all requirements of the diarization task, i.e. it does not need any a-priori information about the input, including no specific speaker models. After an initialization phase the approach allows a low-latency decision on the current speaker with an accuracy that is close to the underlying offline diarization system. The article describes the approach, evaluates the robustness of the system, and analyzes the latency/accuracy trade-off.
#2System output combination for improved speaker diarization
Simon Bozonnet (EURECOM)
Nicholas Evans (EURECOM)
Xavier Anguera (Telefonica Research)
Oriol Vinyals (ICSI, University of California at Berkeley)
Gerald Friedland (ICSI, University of California at Berkeley)
Corinne Fredouille (LIA, University of Avignon)
System combination or fusion is a popular, successful and sometimes straightforward means of improving performance in many fields of statistical pattern classification, including speech and speaker recognition. Whilst there is significant work in the literature which aims to improve speaker diarization performance by combining multiple feature streams, there is little work which aims to combine the outputs of multiple systems. This paper reports our first attempts to combine the outputs of two state-of-the-art speaker diarization systems, namely ICSI's bottom-up and LIA-EURECOM's top-down systems. We show that a cluster matching procedure reliably identifies corresponding speaker clusters in the two system outputs and that, when they are used in a new realignment and resegmentation stage, the combination leads to relative improvements of 13% and 7% DER on independent development and evaluation sets.
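A cluster-matching step of the kind described can be sketched by pairing clusters from two system outputs according to frame overlap; this greedy sketch is an illustration only and may differ from the paper's actual procedure:

```python
from collections import Counter

def match_clusters(labels_a, labels_b):
    """Greedily pair speaker clusters from two diarization outputs by
    the number of frames they share. labels_a and labels_b are
    frame-level cluster labels of equal length."""
    overlap = Counter(zip(labels_a, labels_b))
    pairs, used_a, used_b = {}, set(), set()
    for (a, b), _ in overlap.most_common():   # largest overlaps first
        if a not in used_a and b not in used_b:
            pairs[a] = b
            used_a.add(a)
            used_b.add(b)
    return pairs

# two hypothetical outputs: same two speakers, different cluster names,
# slightly disagreeing boundary
sys_a = ["s1"] * 50 + ["s2"] * 50
sys_b = ["x"] * 48 + ["y"] * 52
pairs = match_clusters(sys_a, sys_b)
```

Once corresponding clusters are identified, their agreed regions can seed the realignment and resegmentation stage the abstract describes.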
#3An Integrated Top-Down/Bottom-Up Approach To Speaker Diarization
Simon Bozonnet (EURECOM, BP193, F-06904 Sophia Antipolis Cedex, France)
Nicholas Evans (EURECOM, BP193, F-06904 Sophia Antipolis Cedex, France)
Corinne Fredouille (University of Avignon, LIA/CERI, BP1228, F-84911 Avignon Cedex 9, France)
Dong Wang (EURECOM, BP193, F-06904 Sophia Antipolis Cedex, France)
Raphael Troncy (EURECOM, BP193, F-06904 Sophia Antipolis Cedex, France)
Most speaker diarization systems fit into one of two categories: bottom-up or top-down. Bottom-up systems are the most popular but can sometimes suffer from instability from merging and stopping criteria difficulties. Top-down systems deliver competitive results but are particularly prone to poor model initialization which often leads to large variations in performance. This paper presents a new integrated bottom-up/top-down approach to speaker diarization which aims to harness the strengths of each system and thus to improve performance and stability. In contrast to previous work, here the two systems are fused at the heart of the segmentation and clustering stage. Experimental results show improvements in speaker diarization performance for both meeting and TV-show domain data indicating increased intra and inter-domain stability. On the TV-show data in particular, an average relative improvement of 26% DER is obtained.
#4Advances in Fast Multistream Diarization based on the Information Bottleneck Framework
Deepu Vijayasenan (Idiap Research Institute, CP 592, CH-1920, Martigny)
Fabio Valente (Idiap Research Institute, CP 592, CH-1920, Martigny)
Hervé Bourlard (Idiap Research Institute, CP 592, CH-1920, Martigny)
Multistream diarization is an effective way to improve diarization performance, MFCC and Time Delay Of Arrival (TDOA) features being the most commonly used. This paper extends our previous work on information bottleneck diarization, aiming to include a large number of features besides MFCC and TDOA while keeping computational costs low. At first, HMM/GMM and IB systems are compared in the case of two and four feature streams, and an analysis of errors is performed. Results on a dataset of 17 meetings show that, in spite of comparable oracle performances, the IB system is more robust to feature weight variations. Then a sequential optimization is introduced that further improves the speaker error by 5-8% relative. In the last part, computational issues are discussed. The proposed approach is significantly faster, and its complexity grows only marginally with the number of feature streams, running in 0.75% of real time even with four streams while achieving a speaker error of 6%.
#5Audio-Visual Synchronisation for Speaker Diarisation
Giulia Garau (Idiap Research Institute, Switzerland)
Alfred Dielmann (Idiap Research Institute, Switzerland)
Hervé Bourlard (Idiap Research Institute, Switzerland)
The role of audio-visual speech synchrony for speaker diarisation is investigated in the multiparty meeting domain. We measured both mutual information and canonical correlation on different sets of audio and video features. As acoustic features we considered energy and MFCCs. As visual features we experimented both with motion intensity features, computed on the whole image, and Kanade-Lucas-Tomasi (KLT) motion estimation. Thanks to KLT we decomposed the motion into its horizontal and vertical components. The vertical component was found to be more reliable for speech synchrony estimation. The mutual information between acoustic energy and KLT vertical motion of skin pixels not only resulted in a 20% relative improvement over an MFCC-only diarisation system, but also outperformed visual features such as motion intensities and head poses.
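A histogram-based mutual information estimate of the kind used to measure audio-visual synchrony can be sketched as follows (the feature quantization is a hypothetical simplification; the paper works with continuous energy and motion features):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate (in bits) between two discretized feature
    sequences, e.g. quantized acoustic energy vs. quantized vertical
    KLT motion. High MI indicates synchronized streams."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        mi += p * math.log2(p * n * n / (px[x] * py[y]))
    return mi

a = [0, 1, 0, 1, 0, 1, 0, 1]   # quantized audio energy (toy)
b = a[:]                        # perfectly synchronized visual stream
c = [0, 0, 1, 1, 0, 0, 1, 1]   # independent stream
```

Here `mutual_information(a, b)` is 1 bit while `mutual_information(a, c)` is 0, which is the contrast exploited when attributing speech to the speaker whose motion is most informative about the audio.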
#7An Improved Cluster Model Selection Method for Agglomerative Hierarchical Speaker Clustering using Incremental Gaussian Mixture Models
Kyu Han (University of Southern California)
Shrikanth Narayanan (University of Southern California)
In this paper, we improve our previous cluster model selection method for agglomerative hierarchical speaker clustering (AHSC) based on incremental Gaussian mixture models (iGMMs). In the previous work, we measured the likelihood of all the data points in a given cluster for each mixture component of the GMM modeling the cluster. Then, we selected the N best component Gaussians with the highest likelihoods to refine the GMM for better cluster representation. N was chosen empirically, but it is hard to set a universally optimal N. In this work, we propose an improved method that adaptively selects component Gaussians from the GMM by measuring the degree of representativeness of each Gaussian component, which we define in this paper. Experiments on two data sets including 17 meeting speech excerpts verify that the proposed approach improves the overall clustering performance by approximately 20% and 10% (relative), respectively, compared to the previous method.
#8Dialog Prediction for a General Model of Turn-Taking
Nigel Ward (University of Texas at El Paso)
Olac Fuentes (University of Texas at El Paso)
Alejandro Vega (University of Texas at El Paso)
Today there are solutions for some specific turn-taking problems, but no general model. We show how turn-taking can be reduced to two more general problems, prediction and selection. By predicting not only speech/silence but also prosodic features, this framework can also handle some related dialog decisions. To illustrate how such predictions can be made, we trained a neural network predictor. This was adequate to support some specific turn-taking decisions and was modestly accurate overall.
#9Speaker Tracking in an Unsupervised Speech Controlled System
Tobias Herbig (Nuance Communications Aachen GmbH, Ulm, Germany)
Franz Gerl (SVOX Deutschland GmbH, Ulm, Germany)
Wolfgang Minker (University of Ulm, Institute of Information Technology, Ulm, German)
In this paper we present a technique to increase the robustness of a self-learning speech controlled system comprising speech recognition, speaker identification and speaker adaptation. Our goal is the automatic personalization of a speech controlled device for groups of 5-10 recurring speakers. Speakers should be identified and tracked across speaker turns only by their voice patterns. Efficient information retrieval and the statistical representation of speaker characteristics have to be combined with a reliable and flexible speaker identification. Even on limited adaptation data, e.g. 2-3 command and control utterances, speakers have to be reliably tracked to allow continuous adaptation of complex statistical models. We present a novel approach of speaker identification on different time-scales based on a unified speech and speaker model. Experiments were carried out on a subset of the SPEECON database.
#10MultiBIC: an Improved Speaker Segmentation Technique for TV Shows
Paula Lopez-Otero (University of Vigo)
Laura Docio-Fernandez (University of Vigo)
Carmen Garcia-Mateo (University of Vigo)
Speaker segmentation systems usually have problems detecting short segments, which keeps the number of deletions high and thereby harms the performance of the system. This is a complication when segmenting multimedia material such as movies and TV shows, where dialogs among characters are very common. In this paper a modification of the BIC algorithm is presented, which remarkably reduces the number of deletions without increasing the number of false alarms. This modification, referred to as MultiBIC, assumes that two change-points may be present in a window of data, while the conventional BIC approach assumes there is just one. This allows the system to notice when there is more than one change-point in a window, finding shorter segments than traditional BIC.
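The single-change-point ΔBIC test that MultiBIC generalizes can be sketched as follows; a 1-D, single-Gaussian simplification (real systems model MFCC vectors with full covariances), so treat it as an illustration of the criterion, not of the paper's system:

```python
import math
import random

def delta_bic(x, t, lam=1.0):
    """Delta-BIC for a candidate change point t in a 1-D feature
    sequence x, modeling each side with one Gaussian. Positive values
    favor a speaker change at t. For d=1 the model-complexity term
    d + d(d+1)/2 equals 2, so the penalty reduces to lam * log(n)."""
    def nlog_var(seg):
        m = sum(seg) / len(seg)
        v = sum((s - m) ** 2 for s in seg) / len(seg)
        return len(seg) * math.log(v)
    n = len(x)
    penalty = lam * math.log(n)
    return 0.5 * (nlog_var(x) - nlog_var(x[:t]) - nlog_var(x[t:])) - penalty

# toy data: two "speakers" with different means, change at frame 200
random.seed(1)
x = [random.gauss(0, 1) for _ in range(200)] + \
    [random.gauss(3, 1) for _ in range(200)]
```

`delta_bic(x, 200)` is strongly positive and peaks near the true boundary; the MultiBIC idea is to hypothesize two such change points within one window so that a short middle segment is not missed.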

Multi-Modal ASR, Including Audio-Visual ASR

Time:Thursday 10:00 Place:International Conference Room B Type:Poster
Chair:Satoru Hayamizu
#1Automatic Speech Recognition for Assistive Writing in Speech Supplemented Word Prediction
John-Paul Hosom (CSLU, Oregon Health & Science University)
Tom Jakobs (InvoTek)
Allen Baker (InvoTek)
Susan Fager (Institute for Rehabilitation Science and Engineering, Madonna Rehabilitation Hospital)
This paper describes a system for assistive writing, the Speech Supplemented Word Prediction Program (SSWPP). This system uses the first letter of a word typed by the user as well as the user’s (possibly low-intelligibility) speech to predict the intended word. The ASR system, which is the focus of this paper, is a speaker-dependent isolated-word recognition system. Word-level results from a non-dysarthric speaker indicate that almost all errors could be corrected by the SSWPP language model. Results from five speakers with moderate to severe dysarthria (average intelligibility 61.7%) averaged 62% for word recognition and 65% for out-of-vocabulary identification.
#2Viseme-Dependent Weight Optimization for CHMM-Based Audio-Visual Speech Recognition
Alexey Karpov (St. Petersburg Institute for Informatics and Automation of Russian Academy of Sciences, Russia)
Andrey Ronzhin (St. Petersburg Institute for Informatics and Automation of Russian Academy of Sciences, Russia)
Konstantin Markov (Human Interface Laboratory, The University of Aizu, Fukushima, Japan)
Milos Zelezny (University of West Bohemia, Pilsen, Czech Republic)
The aim of the present study is to investigate some key challenges of the audio-visual speech recognition technology, such as asynchrony modeling of multimodal speech, estimation of auditory and visual speech significance, as well as stream weight optimization. Our research shows that the use of viseme-dependent significance weights improves the performance of state asynchronous CHMM-based speech recognizer. In addition, for a state synchronous MSHMM-based recognizer, fewer errors can be achieved using stationary time delays of visual data with respect to the corresponding audio signal. Evaluation experiments showed that individual audio-visual stream weights for each viseme-phoneme pair lead to relative reduction of WER by 20%.
#3Audio-Visual Anticipatory Coarticulation Modeling by Human and Machine
Louis Terry (Northwestern University, Department of EECS)
Karen Livescu (Toyota Technological Institute at Chicago)
Janet Pierrehumbert (Northwestern University, Department of Linguistics)
Aggelos Katsaggelos (Northwestern University, Department of EECS)
Anticipatory coarticulation provides a basis for the observed asynchrony between the acoustic and visual onsets of phones in certain linguistic contexts and is typically not explicitly modeled in audio-visual speech models. We study within-word audio-visual asynchrony using hand labeled words in which theory suggests that asynchrony should occur, and show that these labels confirm the theory. We introduce a new statistical model of AV speech, the asynchrony-dependent transition (ADT) model that allows asynchrony between AV states within word boundaries, where the state transitions depend on the instantaneous asynchrony as well as the modality's state. This model outperforms a baseline synchronous model in mimicking the hand labels in a forced alignment task, and its behavior as parameters are changed conforms to our expectations about anticipatory coarticulation. The same model could be used for ASR, although here we consider it for the task of forced alignment for linguistic analysis.
#4Impact of Lack of Acoustic Feedback in EMG-based Silent Speech Recognition
Matthias Janke (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Michael Wand (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Tanja Schultz (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
This paper presents our recent advances in speech recognition based on surface electromyography (EMG). This technology allows for Silent Speech Interfaces since EMG captures the electrical potentials of the human articulatory muscles rather than the acoustic speech signal. Our earlier experiments have shown that the EMG signal is greatly impacted by the mode of speaking. In this study we extend this line of research by comparing EMG signals from audible, whispered, and silent speaking mode. We distinguish between phonetic features like consonants and vowels and show that the lack of acoustic feedback in silent speech implies an increased focus on somatosensoric feedback, which is visible in the EMG signal. Based on this analysis we develop a spectral mapping method to compensate for these differences. Finally, we apply the spectral mapping to the front-end of our speech recognition system and show that recognition rates on silent speech improve by up to 11.59% relative.
#5Using prosody to improve Mandarin automatic speech recognition
Chong-Jia Ni (NLPR,CASIA; School of Statistics and Mathematics, Shandong University of Finance)
This paper discusses how to model and train prosody-dependent acoustic models for Mandarin and how to decode input speech with a prosody-dependent speech recognition system. We use automatic prosody labeling methods to annotate syllable prosodic break types and stress types on a continuous speech corpus, and apply our proposed methods to train prosody-dependent tonal syllable models, addressing the data-sparseness problem that arises after prosody labeling. We also utilize an MSD-HSMM to model pitch, duration, and other prosodic factors, and decode by combining the MSD-HSMM, a GMM-based prosody-dependent tonal syllable duration model, and a Maximum Entropy syntactic prosody model. Compared with the baseline system, our prosody-dependent speech recognition systems significantly improve the tonal syllable correct rate.
#6A Robust Audio-visual Speech Recognition Using Audio-visual Voice Activity Detection
Satoshi Tamura (Gifu University)
Masato Ishikawa (Gifu University)
Takashi Hashiba (Gifu University)
Shin'ichi Takeuchi (Gifu University)
Satoru Hayamizu (Gifu University)
This paper proposes a novel speech recognition method combining Audio-Visual Voice Activity Detection (AVVAD) and Audio-Visual Automatic Speech Recognition (AVASR). AVASR has been developed to enhance the robustness of ASR in noisy environments, using visual information in addition to acoustic features. Similarly, AVVAD increases the precision of VAD in noisy conditions, which detects presence of speech from an audio signal. In our approach, AVVAD is conducted as a preprocessing followed by an AVASR system, making a significantly robust speech recognizer. To evaluate the proposed system, recognition experiments were conducted using noisy audio-visual data, testing several AVVAD approaches. Then it is found that the proposed AVASR system using the model-free feature-fusion AVVAD method outperforms not only non-VAD audio-only ASR but also conventional AVASR.
#7Efficient Manycore CHMM Speech Recognition for Audiovisual and Multistream Data
Dorothea Kolossa (TU Berlin)
Jike Chong (UC Berkeley)
Steffen Zeiler (TU Berlin)
Kurt Keutzer (UC Berkeley)
Robustness of speech recognition can be significantly improved by multi-stream and especially audiovisual speech recognition, which is of interest e.g. for human-machine interaction in noisy and reverberant environments. The most robust implementations of audiovisual speech recognition often utilize Coupled Hidden Markov Models (CHMMs), which allow for both modalities to be asynchronous to a certain degree. In contrast to conventional speech recognition, this increases the search space significantly, so current implementations of CHMM systems are often not real-time capable. Thus, in order to obtain responsive multi-modal interfaces, using current processing capabilities is vital. This paper describes how general purpose graphics processors can be used to obtain a real-time implementation of audiovisual and multi-stream speech recognition. The design has been integrated both with a WFST-decoder and a token passing system, leading to a maximum speedup factor of 32 and 25, respectively.
#8Two-Layered Audio-Visual Integration in Voice Activity Detection and Automatic Speech Recognition for Robots
Takami Yoshida (Graduate School of Information Science and Engineering, Tokyo Institute of Technology)
Kazuhiro Nakadai (Honda Research Institute Japan co., ltd)
Automatic Speech Recognition (ASR), which plays an important role in human-robot interaction, should be noise-robust because robots are expected to work in noisy environments. Audio-Visual (AV) integration is one of the key ideas to improve robustness in such environments. This paper proposes two-layered AV integration for an ASR system, which applies AV integration to both the Voice Activity Detection (VAD) and ASR decoding processes. We implemented a prototype ASR system based on the proposed two-layered AV integration and evaluated it in dynamically-changing situations where audio and/or visual information can be noisy or missing. Preliminary results showed that the proposed method improves the robustness of the ASR system even in auditory- or visually-contaminated situations.
#9Non-Audible Murmur recognition based on fusion of audio and visual streams
Panikos Heracleous (ATR, Intelligent Robotics and Communication Laboratories)
Norihiro Hagita (ATR, Intelligent Robotics and Communication Laboratories)
Non-Audible Murmur (NAM) is an unvoiced speech signal that can be received through the body tissue with the use of special acoustic sensors (i.e., NAM microphones) attached behind the talker's ear. In a NAM microphone, body transmission and loss of lip radiation act as a low-pass filter. Consequently, higher frequency components are attenuated in a NAM signal. Owing to such factors as spectral reduction, the unvoiced nature of NAM, and the type of articulation, the NAM sounds become similar, thereby causing a larger number of confusions in comparison to normal speech. In the present article, the visual information extracted from the talker's facial movements is fused with NAM speech using three fusion methods, and phoneme classification experiments are conducted. The experimental results reveal a significant improvement when both fused NAM speech and facial information are used.

Speaker and language recognition

Time:Thursday 10:00 Place:International Conference Room C Type:Poster
Chair:Jean-Francois Bonastre
#1Improved N-gram Phonotactic Models For Language Recognition
Mohamed Faouzi BenZeghiba (LIMSI-CNRS)
Jean-luc Gauvain (LIMSI-CNRS)
Lori Lamel (LIMSI-CNRS)
This paper investigates various techniques to improve the estimation of n-gram phonotactic models for language recognition using single-best phone transcriptions and phone lattices. More precisely, we first report on the impact of the so-called acoustic scale factor on the system accuracy when using lattice-based training, and then we report on the use of n-gram cutoff and pruning techniques. Several system configurations are explored, such as the use of context-independent and context-dependent phone models, the use of single-best phone hypotheses versus phone lattices, and the use of various n-gram orders. Experiments are conducted using the LRE 2007 evaluation data and the results are reported using the a posteriori EER. The results show that the impact of these techniques on the system accuracy is highly dependent on the training conditions and that careful optimization can lead to performance improvements.
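The n-gram cutoff technique can be sketched on single-best phone transcriptions (lattice-based training would accumulate expected counts from the lattice instead; the phone symbols below are hypothetical):

```python
from collections import Counter

def bigram_counts(phone_seqs, cutoff=1):
    """Collect bigram counts from 1-best phone transcriptions and
    discard any n-gram whose count is at or below the cutoff -- the
    count-cutoff technique studied for phonotactic model estimation."""
    counts = Counter()
    for seq in phone_seqs:
        counts.update(zip(seq, seq[1:]))
    return Counter({ng: c for ng, c in counts.items() if c > cutoff})

data = [["sil", "a", "b", "a", "b", "sil"],
        ["sil", "a", "b", "sil"]]
kept = bigram_counts(data, cutoff=1)
```

Singleton bigrams such as `("b", "a")` are dropped, shrinking the model; the abstract's point is that the best cutoff interacts strongly with the training condition (1-best vs. lattice, n-gram order).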
#2A Study of Term Weighting in Phonotactic Approach to Spoken Language Recognition
Sirinoot Boonsuk (Chulalongkorn University,Thailand)
Donglai Zhu (Institute for Infocomm Research, Singapore)
Bin Ma (Institute for Infocomm Research, Singapore)
Atiwong Suchato (Chulalongkorn University,Thailand)
Proadpran Punyabukkana (Chulalongkorn University,Thailand)
Nattanun Thatphithakkul (National Electronics and Computer Technology Center (NECTEC), Thailand)
In the spoken language recognition approach of modeling phonetic lattices with the Support Vector Machine (SVM), term weighting on the supervector of N-gram probabilities is critical to recognition performance, because the weighting prevents the SVM kernel from being dominated by a few large probabilities. We investigate several term weighting functions used in text retrieval, which can incorporate long-term semantic modeling into short-term N-gram modeling. The functions are evaluated on the NIST 2007 Language Recognition Evaluation (LRE) task. Results favor term weighting with the redundancy of term frequency (rd), which eliminates the redundancy of unit-frequency co-occurrence across languages, and the combination of rd and logtf, which demonstrates the effectiveness of combining local and global weighting functions.
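The local/global weighting combination evaluated here follows the classical text-retrieval pattern sketched below. Only generic logtf and idf-style factors are shown (the paper's rd function is not reproduced), and the function names are hypothetical:

```python
import math

def log_tf(tf):
    """Local weighting: damp large term (N-gram) frequencies."""
    return math.log(1.0 + tf)

def idf(doc_freq, n_docs):
    """Global weighting: down-weight units common across many documents
    (or, in the LID setting, across languages)."""
    return math.log(n_docs / (1.0 + doc_freq))

def weight_supervector(tf_vec, doc_freqs, n_docs):
    """Apply combined local * global weighting to an N-gram count vector."""
    return [log_tf(tf) * idf(df, n_docs) for tf, df in zip(tf_vec, doc_freqs)]
```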
#3Exploiting Context-Dependency and Acoustic Resolution of Universal Speech Attribute Models in Spoken Language Recognition
Sabato Marco Siniscalchi (Universita' degli Studi di Enna "Kore")
Jeremy Reed (Gatech)
Torbjørn Svendsen (NTNU)
Chin-Hui Lee (Gatech)
This paper expands a previously proposed universal acoustic characterization approach to spoken language identification (LID) by studying different ways of modeling attributes to improve language recognition. The motivation is to describe any spoken language with a common set of fundamental units. Thus, a spoken utterance is first tokenized into a sequence of universal attributes. Then a vector space modeling approach delivers the final LID decision. Context-dependent attribute models are now used to better capture spectral and temporal characteristics. Also, an approach to expand the set of attributes to increase the acoustic resolution is studied. Our experiments show that the tokenization accuracy positively affects LID results by producing a 2.8% absolute improvement over our previous 30-second NIST 2003 performance. This result also compares favorably with the best results known to the authors on the same task when the tokenizers are trained on language-dependent OGI-TS data.
#4Hierarchical Multilayer Perceptron based Language Identification
David Imseng (Idiap Research Institute)
Mathew Magimai Doss (Idiap Research Institute)
Hervé Bourlard (Idiap Research Institute)
Automatic language identification (LID) systems generally exploit acoustic knowledge, possibly enriched by explicit language specific phonotactic or lexical constraints. This paper investigates a new LID approach based on hierarchical multilayer perceptron (MLP) classifiers, where the first layer is a ``universal phoneme set MLP classifier''. The resulting (multilingual) phoneme posterior sequence is fed into a second MLP taking a larger temporal context into account. The second MLP can learn/exploit implicitly different types of patterns/information such as confusion between phonemes and/or phonotactics for LID. We investigate the viability of the proposed approach by comparing it against two standard approaches which use phonotactic and lexical constraints with the universal phoneme set MLP classifier as emission probability estimator. On SpeechDat(II) datasets of 5 European languages, the proposed approach yields significantly better performance than the two standard approaches.
#5The NIST 2010 Speaker Recognition Evaluation
Alvin Martin (NIST)
Craig Greenberg (NIST)
The 2010 NIST Speaker Recognition Evaluation continues a series of evaluations of text independent speaker detection begun in 1996. It utilizes the newly collected Mixer-6 and Greybeard Corpora from the Linguistic Data Consortium. Major test conditions to be examined include variations in channel, speech style, vocal effort, and the effect of speaker aging over a multi-year period. A new primary evaluation metric giving increased weight to false alarm errors compared to misses is being used. A small evaluation test with a limited number of trials is also being offered for systems that include human expertise in their processing.
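The increased weight on false alarms in the new primary metric comes from a detection cost function with a very small target prior. The sketch below uses parameter values in the style of the 2010 "new DCF"; it is illustrative, not an official NIST implementation:

```python
def dcf(p_miss, p_fa, p_target=0.001, c_miss=1.0, c_fa=1.0):
    """Detection cost function in the NIST SRE style; the small target
    prior makes a false alarm far more costly than a miss."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
```

For example, with these parameters a 1% false alarm rate contributes roughly a hundred times more cost than a 1% miss rate.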
#6Bayesian Speaker Recognition Using Gaussian Mixture Model and Laplace Approximation
Shih-Sian Cheng (Institute of Information Science, Academica Sinica, Taipei, Taiwan)
I-Fan Chen (Institute of Information Science, Academica Sinica, Taipei, Taiwan)
Hsin-Min Wang (Institute of Information Science, Academica Sinica, Taipei, Taiwan)
This paper presents a Bayesian approach for Gaussian mixture model (GMM)-based speaker identification. Instead of evaluating the speaker score of a test speech utterance using a single data likelihood over the GMM learned by point estimation methods under the maximum likelihood or maximum a posteriori criteria, the Bayesian approach evaluates the score using the expectation of the data likelihood over the posterior distribution of the model parameters, expressed as a Bayesian integral. However, the integral cannot be evaluated analytically; therefore, we apply the Laplace approximation to its derivation. Theoretically, we show that the proposed Bayesian approach is equivalent to the GMM-UBM approach when infinite training data is available for each speaker. The results of speaker identification experiments on the TIMIT corpus show that the proposed Bayesian approach consistently outperforms GMM-UBM under very limited training data conditions.
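The point-estimate score that this Bayesian treatment generalizes is the plain GMM log-likelihood of the test utterance. A minimal diagonal-covariance version is sketched below (illustrative only, not the paper's code):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Frame-averaged log-likelihood of features X under a
    diagonal-covariance GMM (the point-estimate speaker score).
    X: (T, D); weights: (K,); means, variances: (K, D)."""
    diff = X[:, None, :] - means[None, :, :]                        # (T, K, D)
    expo = -0.5 * np.sum(diff ** 2 / variances, axis=2)             # (T, K)
    lognorm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)  # (K,)
    logp = np.log(weights) + lognorm + expo                         # (T, K)
    m = logp.max(axis=1, keepdims=True)                             # log-sum-exp
    return float(np.mean(m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))))
```

The Bayesian score instead averages this likelihood over a posterior on (weights, means, variances), which the paper approximates with Laplace's method.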
#7What Else is New Than the Hamming Window? Robust MFCCs for Speaker Recognition via Multitapering
Tomi Kinnunen (University of Eastern Finland)
Rahim Saeidi (University of Eastern Finland)
Johan Sandberg (Lund University)
Maria Hansson-Sandsten (Lund University)
Usually the mel-frequency cepstral coefficients (MFCCs) are derived via Hamming windowed DFT spectrum. In this paper, we advocate using a so-called multitaper method instead. Multitaper methods form a spectrum estimate using multiple window functions and frequency-domain averaging. Multitapers provide a robust spectrum estimate but have not received much attention in speech processing. Our speaker recognition experiment on NIST 2002 yields equal error rates (EERs) of 9.66 % (clean data) and 16.41 % (-10 dB SNR) for the conventional Hamming method and 8.13 % (clean data) and 14.63 % (-10 dB SNR) using multitapers. Multitapering is a simple and robust alternative to the Hamming window method.
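A multitaper estimate of the kind advocated here averages periodograms computed with several orthogonal window functions. The sketch below uses sine tapers as one common choice; the paper's specific taper family and per-taper weights are not reproduced:

```python
import numpy as np

def sine_tapers(n, k):
    """k orthonormal sine tapers of length n."""
    j = np.arange(1, n + 1)
    return np.array([np.sqrt(2.0 / (n + 1)) * np.sin(np.pi * (m + 1) * j / (n + 1))
                     for m in range(k)])

def multitaper_spectrum(frame, n_tapers=6, nfft=512):
    """Average the periodograms obtained with several orthogonal tapers,
    in place of a single Hamming-windowed periodogram."""
    tapers = sine_tapers(len(frame), n_tapers)
    spectra = np.abs(np.fft.rfft(tapers * frame, nfft)) ** 2
    return spectra.mean(axis=0)
```

The averaged estimate has lower variance than any single windowed periodogram, which is the robustness property exploited for MFCC extraction.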
#8Fast Computation of Speaker Characterization Vector using MLLR and Sufficient Statistics in Anchor Model Framework
Achintya Kumar Sarkar (Indian Institute of Technology Madras)
S. Umesh (Indian Institute of Technology Madras)
The anchor modeling technique has been shown to be useful in reducing computational complexity for speaker identification and indexing of large audio databases, where speakers are projected onto a talker space spanned by a set of pre-defined anchor models represented by GMMs. The characterization of each speaker involves a likelihood calculation with each anchor model and is therefore expensive even in the GMM-UBM framework with top-C mixture scoring. A computationally efficient method is proposed here to calculate the likelihood of speech utterances using the anchor speaker-specific MLLR matrix and sufficient statistics estimated from the utterance. Since anchor models use distance measures to identify speakers, they are used as a first stage to select the N most probable speakers, cascaded with a conventional GMM-UBM system which finally identifies the speaker from this reduced set. The proposed method is 4.21x faster than the conventional cascade anchor system with comparable performance on NIST-04 SRE.
#9Graph-Embedding for Speaker Recognition
Zahi N. Karam (DSPG, RLE at MIT / MIT Lincoln Laboratory)
William M. Campbell (MIT Lincoln Laboratory)
Popular methods for speaker classification perform speaker comparison in a high-dimensional space; however, recent work has shown that most of the speaker variability is captured by a low-dimensional subspace of that space. In this paper we examine whether additional structure in terms of nonlinear manifolds exists within the high-dimensional space. We will use graph embedding as a proxy to the manifold and show the use of the embedding in data visualization and exploration. ISOMAP will be used to explore the existence and dimension of the space. We also examine whether the manifold assumption can help in two classification tasks: data-mining and standard NIST speaker recognition evaluations (SRE). Our results show that the data lives on a manifold and that exploiting this structure can yield significant improvements on the data-mining task. The improvement in preliminary experiments on all trials of the NIST SRE Eval-06 core task is smaller but still significant.
#10A Hybrid Modeling Strategy for GMM-SVM Speaker Recognition System with Adaptive Relevance factor
Chang Huai You (Institute for Infocomm Research)
Haizhou Li (Institute for Infocomm Research)
Kong Aik Lee (Institute for Infocomm Research)
In the Gaussian mixture model (GMM) approach to speaker recognition, it has been found that the maximum a posteriori (MAP) estimation is greatly affected by undesired variability due to varying utterance duration as well as other hidden factors related to recording devices, session environment, and phonetic content. We propose an adaptive relevance factor (RF) to compensate for this variability. On the other hand, in realistic applications, different channels are likely to correspond to different training and test conditions in terms of the quantity and quality of the speech signals. In this connection, we develop a hybrid model that combines multiple complementary systems, each of which focuses on specific conditions. We show the effectiveness of the proposed method on the core task of the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2008.
Harshavardhan Sundar (Indian Institute of Science)
Thippur Sreenivas (Indian Institute of Science)
Robust stochastic modeling of speech is an important issue for the performance of practical applications. The Gaussian mixture model (GMM) is widely used in speaker ID, but its performance is limited in the presence of unseen noise and distortions. Such noisy data, referred to as "outliers" for the original distribution, can be better represented by heavy-tail distributions such as Student's t-distribution, which provides a natural choice in which the heavy tail can be controlled using the degrees-of-freedom parameter. We explore a finite mixture of t-distributions model (tMM) to represent noisy speech data and show its robustness for speaker ID compared to GMM. Using the TIMIT and NTIMIT databases, the recognition accuracies obtained are 100% and 79.68%, respectively, with a 34-mixture tMM, much better than those reported in the literature.
#12A variable frame length and rate algorithm based on the spectral kurtosis measure for speaker verification
Chi-Sang Jung (School of Electrical and Electronic Engineering, Yonsei University, Korea)
Kyu Han (Ming Hsieh Department of Electrical Engineering, University of Southern California, USA)
Hyunson Seo (School of Electrical and Electronic Engineering, Yonsei University, Korea)
Shrikanth Narayanan (Ming Hsieh Department of Electrical Engineering, University of Southern California, USA)
Hong-Goo Kang (School of Electrical and Electronic Engineering, Yonsei University, Korea)
In this paper, we propose a spectral kurtosis based approach to extract features with a variable frame length and rate for speaker verification. Since the speaker-specific information of features in each frame changes depending upon the characteristics of speech, it is important to determine the appropriate frame length and rate to extract the salient feature frames. In order to distinctively represent the characteristics of vowels and consonants both in time and frequency domains, we introduce a variable frame length and rate (VFLR) method based on spectral kurtosis, which provides a local measure of time-frequency concentration. Experimental results verify that the proposed VFLR method improves the performance of the speaker verification system on the NIST SRE-06 database by 9.725% (relative) compared to the feature extraction method with the fixed length and rate.
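Spectral kurtosis as a per-frame concentration measure can be read, in one simple form, as the sample kurtosis of the frame's magnitude spectrum: large when energy is concentrated in a few bins (vowel-like frames), smaller for flatter spectra. The sketch below is an assumption-laden illustration, not the paper's exact VFLR measure:

```python
import numpy as np

def spectral_kurtosis(frame, nfft=512):
    """Excess sample kurtosis of the frame's magnitude spectrum: high for
    spectra concentrated in few bins, lower for flat (noise-like) spectra."""
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft))
    mu, sigma = mag.mean(), mag.std()
    return float(np.mean(((mag - mu) / sigma) ** 4) - 3.0)
```

A VFLR scheme would then shorten the frame length and increase the frame rate where this measure changes rapidly, to catch salient transitions.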

Source localization and separation

Time:Thursday 10:00 Place:International Conference Room D Type:Poster
Chair:Hiroshi G. Okuno
#1Near Field Sound Source Localization Based on Cross-power Spectrum Phase Analysis with Multiple Channel Microphones
Kohei Hayashida (Graduate School of Science and Engineering, Ritsumeikan University)
Masanori Morise (College of Information and Science, Ritsumeikan University)
Takanobu Nishiura (College of Information and Science, Ritsumeikan University)
We study sound source localization in the near field. For sound source localization, 2D-MUSIC has already been developed; however, its performance degrades in diffuse noisy environments. A localization method based on CSP in the near field has also been developed, but its accuracy depends on the accuracy of the time-delay estimate between paired microphones. We propose 2D-CSP with multiple channel microphones for robust localization. We carried out an evaluation experiment in a conference room and confirmed that the proposed method localizes a sound source more robustly than conventional methods.
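The CSP analysis underlying such methods estimates the time delay between a microphone pair from the phase of the cross-power spectrum (equivalently, GCC-PHAT). A minimal sketch:

```python
import numpy as np

def csp_delay(x1, x2, nfft=1024):
    """Estimate the time delay of x2 relative to x1 (in samples) from the
    cross-power spectrum phase (CSP / GCC-PHAT)."""
    X1 = np.fft.rfft(x1, nfft)
    X2 = np.fft.rfft(x2, nfft)
    cross = np.conj(X1) * X2
    # phase-only (PHAT) weighting sharpens the correlation peak
    csp = np.fft.irfft(cross / (np.abs(cross) + 1e-12), nfft)
    csp = np.fft.fftshift(csp)  # zero lag at the center
    return int(np.argmax(csp)) - nfft // 2
```

With several microphone pairs, the intersection of the delay-consistent loci yields a 2D position estimate, which is the idea behind 2D-CSP.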
#2A Maximum a Posteriori Sound Source Localization in Reverberant and Noisy Conditions
Choi Jinho (KAIST)
Yoo Chang D. (KAIST)
In this paper, a maximum a posteriori sound source localization (MAP-SSL) algorithm is proposed for reverberant and noisy conditions. The algorithm is derived by incorporating a sparse prior on the source location into the existing maximum likelihood sound source localization (ML-SSL) framework. Assuming that the direction of an active source is sparse in the space of all possible finite source directions, the criterion used to derive MAP-SSL is similar to that of ML-SSL, except that a sparse source prior enforcing a sparse source-direction solution is added. This sparse prior plays a key role in improving SSL performance. Experimental results show that the proposed MAP-SSL algorithm outperforms variants of the ML-SSL framework.
#3Multichannel Source Separation Based on Source Location Cue with Log-Spectral Shaping by Hidden Markov Source Model
Tomohiro Nakatani (NTT Corporation)
Shoko Araki (NTT Corporation)
Takuya Yoshioka (NTT Corporation)
Masakiyo Fujimoto (NTT Corporation)
This paper proposes a multichannel source separation approach that exploits statistical characteristics of source location cues represented by inter-channel phase differences (IPD) and those of source log spectra represented by hidden Markov models (HMM). With this approach, source separation is achieved by iterating two simple sub-procedures, namely the clustering of the time-frequency (TF) bins into individual sources and the independent updating of the model parameters of each source. An advantage of this approach is that we can update the model parameters of each source independently of those of the other sources in each iteration, and thus the update can be computationally very efficient. We show by simulation experiments that the proposed method can greatly improve, in a computationally efficient manner, the quality of each source signal from sound mixtures in terms of cepstral distortion, using a speaker-independent HMM composed of a very small number of states.
#4A DOA Estimation algorithm based on Equalization-Cancellation Theory
Duc Chau (School of Information Science, Japan Advanced Institute of Science and Technology)
Junfeng Li (School of Information Science, Japan Advanced Institute of Science and Technology)
Akagi Masato (School of Information Science, Japan Advanced Institute of Science and Technology)
Direction of arrival (DOA) estimation plays an important role in binaural hearing systems. Recent methods usually require a large array of microphones or do not adapt to special conditions, e.g., a humanoid robot with the effect of the head-related transfer function. In this paper, we propose a two-microphone DOA estimation algorithm, namely EC-BEAM, which applies the equalization-cancellation (EC) model to DOA estimation through a beamforming-based technique. Specifically, the EC model is integrated into beamforming to remove the signal components from a given direction and yield the energy of the remaining signals from other directions. By searching over several DOA candidates, the true DOA is determined as the direction at which the energy of the remaining signals reaches its minimum. Experimental results showed that EC-BEAM not only adapts well to binaural hearing systems but also estimates the DOA of the target signal much more accurately in various noise conditions with only two microphones.
#5Concurrent Speaker Localization using Multi-band Position-Pitch (M-PoPi) Algorithm with Spectro-Temporal Pre-Processing
Tania Habib (Signal Processing and Speech Communication Lab, Graz University of Technology, Austria)
Harald Romsdorfer (Signal Processing and Speech Communication Lab, Graz University of Technology, Austria)
Accurate, microphone-based speaker localization in real-world environments, like office spaces or meeting rooms, must be able to track a single speaker and multiple concurrent speakers in the presence of reverberations and background noise. Our Multiband Joint Position-Pitch (M-PoPi) algorithm for circular microphone arrays already shows a frame-wise localization estimation score of about 95% for tracking a single speaker in a noisy, reverberant setting. In this paper, we present two extensions of the M-PoPi algorithm to improve the localization estimation accuracy also for multiple concurrent speakers. These extensions are a weighted spectro-temporal fragment analysis as a pre-processing step for the M-PoPi algorithm and a particle filter-based tracking as a post-processing step. Experiments using real-world recordings of two concurrent speakers in a typically reverberant meeting room show an improvement of the frame-wise localization estimation score from 43% using the plain M-PoPi algorithm to 66% using the M-PoPi algorithm with both extensions.
#6On Using Gaussian Mixture Model for Double-Talk Detection in Acoustic Echo Suppression
Ji-Hyun Song (Inha University, Korea)
Kyu-Ho Lee (Inha University, Korea)
Yun-Sik Park (Inha University, Korea)
Sang-Ick Kang (Inha University, Korea)
Joon-Hyuk Chang (Inha University, Korea)
In this paper, we propose a novel frequency-domain approach to double-talk detection based on the Gaussian mixture model. In contrast to a previous approach based on a simple and heuristic decision rule utilizing time-domain cross-correlations, GMM is applied to a set of feature vectors extracted from the frequency-domain cross-correlation coefficients. Performance of the proposed approach is evaluated through objective tests under various environments, and better results are obtained as compared to the time-domain method.
#7Catalog-Based Single-Channel Speech-Music Separation
Ali Taylan Cemgil (Computer Engineering Department, Bogazici University)
Murat Saraçlar (Electrical and Electronics Engineering Department, Bogazici University)
We propose a new catalog-based speech-music separation method for background music removal. Assuming that we know a catalog of the background music, we develop a generative model for the superposed speech and music spectrograms. We represent the speech spectrogram by a Non-negative Matrix Factorization (NMF) model and the music spectrogram by a conditional Poisson Mixture Model (PMM). By choosing the size of the catalog, i.e., the number of mixture components, we can trade off speed versus accuracy. The combined hierarchical model leads to a mixture of multinomial distributions as the joint posterior of music and speech. Separation and hyper-parameter adaptation can be achieved via an Expectation Maximization algorithm. Experimental results show that the separation performance of the algorithm is promising. Furthermore, we show that incorporating prior information such as a volume adjustment parameter can enhance the separation performance.
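The NMF part of such a model can be sketched with the standard KL-divergence multiplicative updates, which are consistent with a Poisson observation model; this is illustrative only, and the full PMM catalog and EM adaptation of the paper are not reproduced:

```python
import numpy as np

def nmf(V, rank=8, n_iter=200, seed=0):
    """KL-divergence NMF via multiplicative updates: V ≈ W @ H,
    with V a nonnegative (magnitude) spectrogram."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + 1e-12))) / (W.T @ ones + 1e-12)
        W *= ((V / (W @ H + 1e-12)) @ H.T) / (ones @ H.T + 1e-12)
    return W, H
```

In a separation setting, one set of basis vectors W would model speech and another the music catalog, with each source reconstructed from its share of W @ H.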
#8Unvoiced Speech Segregation Based on CASA and Spectral Subtraction
Ke Hu (Department of Computer Science and Engineering, The Ohio State University)
DeLiang Wang (Department of Computer Science and Engineering, The Ohio State University)
Unvoiced speech separation is an important and challenging problem that has not received much attention. We propose a CASA based approach to segregate unvoiced speech from nonspeech interference. As unvoiced speech does not contain periodic signals, we first remove the periodic portions of a mixture including voiced speech. With periodic components removed, the remaining interference becomes more stationary. We estimate the noise energy in unvoiced intervals on the basis of segregated voiced speech. Spectral subtraction is employed to extract time-frequency segments in unvoiced intervals, and we group the segments dominated by unvoiced speech by simple thresholding or Bayesian classification. Systematic evaluation and comparison show that the proposed method considerably improves the unvoiced speech segregation performance under various SNR conditions.
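The spectral subtraction step can be sketched in its basic magnitude-domain form with a spectral floor; parameter names and values here are illustrative, not the paper's configuration:

```python
import numpy as np

def spectral_subtract(mag, noise_mag, alpha=1.0, beta=0.01):
    """Basic magnitude-domain spectral subtraction: remove the noise
    estimate scaled by alpha, then clamp to a small spectral floor."""
    out = mag - alpha * noise_mag
    return np.maximum(out, beta * mag)
```

In the proposed system, noise_mag would come from the noise energy estimated in unvoiced intervals after voiced speech has been segregated and removed.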
#9Unsupervised sequential organization for cochannel speech separation
Ke Hu (Department of Computer Science and Engineering, The Ohio State University)
DeLiang Wang (Department of Computer Science and Engineering, The Ohio State University)
The problem of sequential organization in the cochannel speech situation has previously been studied using speaker-model based methods. A major limitation of these methods is that they require the availability of pretrained speaker models and prior knowledge (or detection) of participating speakers. We propose an unsupervised clustering approach to cochannel speech sequential organization. Given enhanced cepstral features, we search for the optimal assignment of simultaneous speech streams by maximizing the between- and within-cluster scatter matrix ratio penalized by concurrent pitches within individual speakers. A genetic algorithm is employed to speed up the search. Our method does not require trained speaker models, and experiments with both ideal and estimated simultaneous streams show the proposed method outperforms a speaker-model based method in both speech segregation and computational efficiency.

Special Session: Social Signals in Speech

Time:Thursday 10:00 Place:301 Type:Special
Chair:Khiet Truong & Dirk Heylen
10:00Detecting politeness and efficiency in a cooperative social interaction
Paul Brunet (Queen's University Belfast)
Marcela Charfuelan (Deutsches Forschungszentrum für Künstliche Intelligenz)
Roddy Cowie (Queen's University Belfast)
Marc Schroeder (Deutsches Forschungszentrum für Künstliche Intelligenz)
Hastings Donnan (Queen's University Belfast)
Ellen Douglas-Cowie (Queen's University Belfast)
We developed a cooperative time-sensitive task to study vocal expression of politeness and efficiency. Sixteen dyads completed 20 trials of the ‘Maze Task’, where one participant (the ‘navigator’) gave oral instructions (mainly ‘up’, ‘down’, ‘left’, ‘right’) for the other (the ‘pilot’) to follow. For half of the trials, navigators were instructed to be polite, and for the other half to be efficient. The simplicity of the task left few ways to express politeness. Nevertheless it significantly affected task accuracy, and pilots’ subjective ratings indicate that it was perceived. Efficiency was not as clearly perceived. Preliminary acoustic analysis suggests relevant dimensions.
10:20Comparing Measures of Synchrony and Alignment in Dialogue Speech Timing with respect to Turn-taking Activity
Nick Campbell (Trinity College Dublin)
Stefan Scherer (Ulm University)
This paper describes a system for predicting discourse-role features based on voice-activity detection. It takes as input a vector of values extracted from conversational speech and predicts turn-taking activity and active-listening patterns using an echo-state network. We observed evidence of frame-attunement using a measure of speech density which takes the ratio of speech to non-speech behaviour per utterance. We noted a synchrony of utterance timing and modelled this using the ESN. The system was trained on a subset of data from 100 telephone conversations from the 1,500-hour JST Expressive Speech Processing corpus, and predicts the interlocutor's timing behaviour with an error-rate of less than 15% based on one partner's speech-activity alone. An integrated system with access to content information would of course perform at higher rates.
10:40Resources for turn competition in overlap in multi-party conversations: Speech rate, pausing and duration
Emina Kurtic (University of Sheffield, Departments of Computer Science and Human Communication Sciences)
Guy Brown (University of Sheffield, Department of Computer Science)
Bill Wells (University of Sheffield, Department of Human Communication Sciences)
This paper investigates the prosodic features that speakers use to compete for the turn when they talk simultaneously. Most previous research has focused on F0 and energy variation as resources for turn competition; here, we investigate the relevance of speech rate, pausing and the duration of in-overlap talk. These features are extracted from a set of overlaps drawn from the ICSI Meetings Corpus, and used to derive decision trees that classify overlapping talk as competitive or non-competitive. The decision trees show that both pausing and the duration of the in-overlap speech are significantly related to turn competition for both overlappers and overlappees. Additionally, speech rate is used by overlappees to return competition upon a turn competitive incoming. These findings partially support and extend the observations made in previous studies within the framework of conversation analysis and interactional phonetics.
11:00Disambiguating the functions of conversational sounds with prosody: the case of `yeah'
Khiet Truong (University of Twente)
Dirk Heylen (University of Twente)
In this paper, we look at how prosody can be used to automatically distinguish between different dialogue act functions and how it determines degree of speaker incipiency. We focus on the different uses of `yeah'. Firstly, we investigate ambiguous dialogue act functions of `yeah': `yeah' is most frequently used as a backchannel or an assessment. Secondly, we look at the degree of speakership incipiency of `yeah': some `yeah' items display a greater intent of the speaker to take the floor. Classification experiments with decision trees were performed to assess the role of prosody: we found that prosody indeed plays a role in disambiguating dialogue act functions and in determining degree of speaker incipiency of `yeah'.
11:20Prosody and voice quality of vocal social signals: the case of dominance in scenario meetings
Marcela Charfuelan (DFKI)
Marc Schröder (DFKI)
Ingmar Steiner (DFKI)
In this paper we investigate the prosody and voice quality of dominance in scenario meetings. We have found that in these scenarios the most dominant person tends to speak with a louder-than-average voice quality and the least dominant person with a softer-than-average voice quality. We also found that the most dominant role in the meetings is the project manager and the least dominant the marketing expert. A set of raw and composite measures of prosody and voice quality are extracted from the meeting data followed by a Principal Components Analysis (PCA) to identify the core factors predicting the associated social signal or related annotation.
11:40The Prosody of Swedish Conversational Grunts
Daniel Neiberg (CTT, TMH, CSC, KTH)
Joakim Gustafson (CTT, TMH, CSC, KTH)
This paper explores conversational grunts in a face-to-face setting. The study investigates the prosody and turn-taking effect of fillers and feedback tokens that have been annotated for attitudes. The grunts were selected from the DEAL corpus and automatically annotated for their turn-taking effect. A novel supra-segmental prosodic signal representation and contextual timing features are used for classification and visualization. Classification results using linear discriminant analysis show that turn-initial feedback tokens lose some of their attitude-signaling prosodic cues compared to non-overlapping continuer feedback tokens. Turn-taking effects can be predicted well above chance level, except for Simultaneous Starts. However, feedback tokens before places where both speakers take the turn were more similar to feedback continuers than to turn-initial feedback tokens.

New Paradigms in ASR II

Time:Thursday 13:30 Place:Hall A/B Type:Oral
Chair:Douglas O'Shaughnessy
13:30Improved topic classification and keyword discovery using an HMM-based speech recognizer trained without supervision
Man-Hung Siu (Raytheon BBN Technologies)
Herbert Gish (Raytheon BBN Technologies)
Arthur Chan (Raytheon BBN Technologies)
William Belfield (Raytheon BBN Technologies)
In our previous publication, we presented a new approach to HMM training, viz., training without supervision. We used an HMM trained without supervision for transcribing audio into self-organized units (SOUs) for the purpose of topic classification. In this paper we report improvements made to the system, including the use of context dependent acoustic models and lattice based features that together reduce the topic verification equal error rate from 12% to 7%. In addition to discussing the effectiveness of the SOU approach we describe how we analyzed some selected SOU n-grams and found that they were highly correlated with keywords, demonstrating the ability of the SOU technology to discover topic relevant keywords.
13:50An Analysis of Sparseness and Regularization in Exemplar-Based Methods for Speech Classification
Dimitri Kanevsky (IBM T.J. Watson Research Center)
Tara Sainath (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
David Nahamoo (IBM T.J. Watson Research Center)
The use of exemplar-based techniques for both speech classification and recognition tasks has become increasingly popular in recent years. However, the notion of why sparseness is important for exemplar-based speech processing has been relatively unexplored. In addition, little analysis has been done in speech processing on the appropriateness of different types of sparsity regularization constraints. The goal of this paper is to answer the above two questions, both through mathematically analyzing different sparseness methods and also comparing these approaches for phonetic classification in TIMIT.
14:10Investigation of Full-Sequence Training of Deep Belief Networks for Speech Recognition
Abdel-rahman Mohamed (Department of Computer Science, University of Toronto, Toronto, ON, Canada)
Dong Yu (Speech Technology Group, Microsoft Research, Redmond, WA, USA)
Deng Li (Speech Technology Group, Microsoft Research, Redmond, WA, USA)
Recently, Deep Belief Networks (DBNs) have been proposed for phone recognition and were found to achieve highly competitive performance. In the original DBNs, only frame-level information was used for training DBN weights, while it has long been known that sequential or full-sequence information can help improve speech recognition accuracy. In this paper we investigate approaches to optimizing the DBN weights, state-to-state transition parameters, and language model scores using a sequential discriminative training criterion. We describe and analyze the proposed training algorithm and strategy, and discuss practical issues and how they affect the final results. Evaluated on TIMIT, DBNs trained with the sequence-based criterion outperform those trained with the frame-based criterion for three-layer DBNs; we explain why the gain vanishes for six-layer DBNs.
14:30Mandarin Tone Recognition using Affine Invariant Prosodic Feature and Tone Posteriorgram
Yow-Bang Wang (1. Institute of Information Science, Academia Sinica, Taipei, Taiwan(R.O.C.); 2. Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan(R.O.C.))
Lin-Shan Lee (1. Institute of Information Science, Academia Sinica, Taipei, Taiwan(R.O.C.); 2. Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan(R.O.C.))
Many recent studies on tone recognition have focused on model-level issues, either for tone and prosody labeling or for LVCSR. This paper, in contrast, focuses on feature-level issues. We propose to use both the syllable-level mean and the utterance-level standard deviation for pitch feature normalization, instead of the common approach that uses the utterance-level mean only. We show its robustness both through its affine-invariance property and through experimental results. We also incorporate tone posteriorgrams in second-pass tone recognition, which further improves tone recognition accuracy.
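The normalization the abstract describes can be illustrated with a minimal sketch: subtract the syllable-level mean from a log-F0 contour, then divide by the utterance-level standard deviation. The function name, the (start, end) syllable-boundary format, and the choice of taking the standard deviation over the raw contour are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def normalize_pitch(log_f0, syllable_bounds):
    """Illustrative sketch: subtract each syllable's mean from a log-F0
    contour, then divide by the utterance-level standard deviation."""
    log_f0 = np.asarray(log_f0, dtype=float)
    out = np.empty_like(log_f0)
    # Remove the syllable-level mean within each (start, end) span.
    for start, end in syllable_bounds:
        out[start:end] = log_f0[start:end] - log_f0[start:end].mean()
    # Scale by the utterance-level standard deviation.
    return out / log_f0.std()
```

Under this sketch, any affine change of the contour, f' = a·f + b with a > 0, leaves the output unchanged, which is the affine-invariance property the abstract refers to.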
14:50Continuous Speech Recognition with a TF-IDF Acoustic Model
Geoffrey Zweig (Microsoft)
Patrick Nguyen (Microsoft)
Jasha Droppo (Microsoft)
Alex Acero (Microsoft)
Information retrieval methods are frequently used for indexing and retrieving spoken documents, and more recently have been proposed for voice-search amongst a pre-defined set of business entries. In this paper, we show that these methods can be used in an even more fundamental way, as the core component in a continuous speech recognizer. Speech is initially processed and represented as a sequence of discrete symbols, specifically phoneme or multi-phone units. Recognition then operates on this sequence. The recognizer is segment-based, and the acoustic score for labeling a segment with a word is based on the TF-IDF similarity between the subword units detected in the segment, and those typically seen in association with the word. We present promising results on both a voice search task and the Wall Street Journal task. The development of this method brings us one step closer to being able to do speech recognition based on the detection of sub-word audio attributes.
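The core scoring idea, ranking words by the TF-IDF similarity between the subword units detected in a segment and those typically seen with each word, can be sketched generically. The function names, the unigram treatment of phone units, and the cosine similarity are illustrative assumptions, not the paper's system.

```python
import math
from collections import Counter

def tfidf_vector(units, idf):
    """Term-frequency times inverse-document-frequency over phone units."""
    counts = Counter(units)
    total = sum(counts.values())
    return {u: (c / total) * idf.get(u, 0.0) for u, c in counts.items()}

def cosine(v, w):
    dot = sum(v[k] * w.get(k, 0.0) for k in v)
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

def score_segment(detected_phones, word_templates, idf):
    """Return the (word, template) pair whose typical phone profile is
    most TF-IDF-similar to the phones detected in the segment."""
    seg = tfidf_vector(detected_phones, idf)
    return max(word_templates.items(),
               key=lambda kv: cosine(seg, tfidf_vector(kv[1], idf)))
```

A segment-based decoder of the kind described would apply such a score to each candidate segment labeling rather than to a whole utterance at once.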
15:10SCARF: A Segmental Conditional Random Field Toolkit for Speech Recognition
Geoffrey Zweig (Microsoft)
Patrick Nguyen (Microsoft)
This paper describes a new toolkit - SCARF - for doing speech recognition with segmental conditional random fields. It is designed to allow for the integration of numerous, possibly redundant segment level acoustic features, along with a complete language model, in a coherent speech recognition framework. SCARF performs a segmental analysis, where each segment corresponds to a word, thus allowing for the incorporation of acoustic features defined at the phoneme, multi-phone, syllable and word level. SCARF is designed to make it especially convenient to use acoustic detection events as input, such as the detection of energy bursts, phonemes, or other events. Language modeling is done by associating each state in the SCRF with a state in an underlying n-gram language model, and SCARF supports the joint and discriminative training of language model and acoustic model parameters. SCARF is available for download from

Spoken Language Understanding and Spoken Language Translation II

Time:Thursday 13:30 Place:201A Type:Oral
Chair:Wolfgang Minker
13:30Online SLU model adaptation with a partial Oracle
Pierre Gotab (Universite d'Avignon)
Geraldine Damnati (France Telecom - Orange Labs)
Frederic Bechet (Aix Marseille Universite)
Lionel Delphin-Poulat (France Telecom - Orange Labs)
Deployed Spoken Dialog Systems (SDS) evolve quickly as new services are added or dropped and as users' behavior changes. This dynamic aspect of SDS justifies the need for a process allowing the system to keep the Automatic Speech Recognition (ASR) and Spoken Language Understanding (SLU) models up to date in order to take this variability into account. This process usually consists of collecting new data from the deployed system, transcribing and annotating it, then adding these new examples to the ASR and SLU training corpora in order to retrain the models. This strategy, even when used with an active learning scheme, is costly because the transcription and annotation of the newly collected samples have to be done manually. Because of this cost, the models cannot be adapted on a daily basis and the SDS remains unchanged between two revisions. This paper proposes a supervised approach for updating the SLU models of a deployed SDS that does not require any additional manual transcription or annotation. The limited supervision needed for this alternative approach is given by the users calling the SDS: each user can be seen as a partial Oracle who can confirm whether a system prediction is right or wrong. We illustrate on a real deployed SDS the efficiency of this cost-free method for the online adaptation of an SLU model.
13:50Role of language models in spoken fluency evaluation
Om D Deshmukh (IBM Research India)
Harish Doddala (IBM Research India)
Ashish Verma (IBM Research India)
Karthik Visweswariah (IBM Research India)
This paper addresses the task of automatic evaluation of the spoken fluency skills of a speaker. Specifically, the paper evaluates the role of language models built from fluent and disfluent data in quantifying the fluency of a spoken monologue. We show that features based on the relative perplexities of the fluent and disfluent language models on a given utterance are indicative of the level of spoken fluency of the utterance. The proposed features lead to a spoken fluency classification accuracy of 39.8% for 4-class and 68.4% for 2-class classification. Combining these features with a set of prosodic features leads to further improvement in classification accuracy, highlighting that the information they contribute is complementary to the low-level disfluency information captured by the prosodic features.
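A relative-perplexity feature of the kind described can be sketched with simple add-one-smoothed unigram language models; the real system presumably uses higher-order n-grams, and all names here are illustrative assumptions.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram LM over a fixed vocabulary."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def perplexity(tokens, lm):
    logp = sum(math.log(lm[t]) for t in tokens)
    return math.exp(-logp / len(tokens))

def fluency_feature(tokens, fluent_lm, disfluent_lm):
    """Relative-perplexity feature: log ratio of disfluent to fluent
    perplexity; larger values suggest a more fluent utterance."""
    return math.log(perplexity(tokens, disfluent_lm) /
                    perplexity(tokens, fluent_lm))
```

An utterance that the fluent model explains better than the disfluent model yields a positive feature value, which a downstream classifier can combine with prosodic cues.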
14:10Social Role Discovery from Spoken Language using Dynamic Bayesian Networks
Sibel Yaman (International Computer Science Institute)
Dilek Hakkani-Tur (International Computer Science Institute)
Gokhan Tur (Microsoft)
In this paper, we focus on inferring social roles in conversations using information extracted only from the speaking styles of the speakers. We use dynamic Bayesian networks (DBNs) to model the turn-taking behavior of the speakers. DBNs provide the capability of naturally formulating the dependencies between random variables. Specifically, we first model our problem as a hidden Markov model (HMM). As it turns out, knowledge of the segments that belong to the same speaker can be incorporated into this HMM structure to form a DBN. This information places a constraint on two subsequent speaker roles such that the current speaker's role depends not only on the previous speaker's role but also on the most recent role assigned to the same speaker. We conducted an experimental study comparing these two modeling approaches on broadcast shows. In our experiments, the approach with the constraint on same-speaker segments assigned the correct role to 89.9% of turns, whereas the HMM-based approach did so for 79.2% of turns.
14:30Domain Adaptation and Compensation for Emotion Detection
Michelle Sanchez (SRI International, Speech Technology and Research Laboratory)
Gokhan Tur (Speech at Microsoft | Microsoft Research)
Luciana Ferrer (SRI International, Speech Technology and Research Laboratory)
Dilek Hakkani-Tür (International Computer Science Institute (ICSI))
Inspired by recent improvements in the domain adaptation and session variability compensation techniques used for speech and speaker processing, we study their effect on emotion prediction. More specifically, we investigate the use of publicly available out-of-domain data with emotion annotations for improving the performance of an in-domain model trained on 911 emergency-hotline calls. Following the emotion detection literature, we use prosodic (pitch, energy, and speaking rate) features as the inputs to a discriminative classifier. We performed segment-level n-fold cross-validation emotion prediction experiments. Our results indicate a significant improvement in emotion prediction performance when exploiting out-of-domain data.
14:50Phrase Alignment Confidence for Statistical Machine Translation
Sankaranarayanan Ananthakrishnan (BBN Technologies)
Rohit Prasad (BBN Technologies)
Prem Natarajan (BBN Technologies)
The performance of phrase-based SMT systems is crucially dependent on the quality of the extracted phrase pairs, which is in turn a function of word alignment quality. Data sparsity, an inherent problem in SMT even with large training corpora, has an adverse impact on the reliability of the extracted phrase translation pairs. We present a novel feature based on bootstrap resampling of the training corpus, termed phrase alignment confidence, that measures the goodness of a phrase translation pair. We integrate this feature within a phrase-based SMT system and show an improvement of 1.7% BLEU and 4.4% METEOR over a baseline English-to-Pashto (E2P) SMT system that does not use any measure of phrase pair quality. We then show that the proposed measure compares well to an existing indicator of phrase pair reliability, the lexical smoothing probability. We also demonstrate that combining the two measures leads to a further improvement of 0.4% BLEU and 0.3% METEOR on the E2P system.
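The bootstrap idea, resampling the training corpus with replacement and measuring how often a phrase pair survives extraction, can be sketched generically. Here `extract_pairs` stands in for a full word-alignment and phrase-extraction pipeline and is an assumption for illustration, not BBN's implementation.

```python
import random
from collections import Counter

def bootstrap_confidence(corpus, extract_pairs, n_samples=100, seed=0):
    """Phrase-pair confidence by bootstrap resampling: the fraction of
    resampled corpora from which a given phrase pair is extracted.

    corpus: list of sentence pairs; extract_pairs: callable mapping a
    sentence pair to the phrase pairs extracted from it (assumed)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        # Resample the corpus with replacement.
        sample = [rng.choice(corpus) for _ in corpus]
        seen = set()
        for sent_pair in sample:
            seen.update(extract_pairs(sent_pair))
        counts.update(seen)
    return {pair: c / n_samples for pair, c in counts.items()}
```

Phrase pairs supported by many sentence pairs survive most resamples and receive confidence near 1, while pairs extracted from a single sentence receive noticeably lower confidence.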
15:10Data-Driven Morphological Decomposition and Named-Entity Projection for Field Maintainable Speech-to-Speech Translation
Ian Lane (Carnegie Mellon University)
Alex Waibel (Carnegie Mellon University)
In this paper, we investigate methods to improve the handling of named-entities in speech-to-speech translation systems, specifically focusing on techniques applicable to under-resourced, morphologically complex languages. First, we introduce a method to efficiently bootstrap a named-entity recognizer for a new language by projecting tags from a well resourced language across a bilingual corpus; and second, we propose a novel approach to automatically induce decomposition rules for morphologically complex languages. In our English-Iraqi speech-to-speech translation system combining these two approaches significantly improved speech recognition and translation performance on military dialogs focused on the collection of information in the field.

Signal processing for music and song

Time:Thursday 13:30 Place:201B Type:Oral
Chair:Takanobu Nishiura
13:30Acoustic Correlates of Voice Quality Improvement by Voice Training
Kiyoaki Aikawa (Tokyo University of Technology, School of Media Science)
Junko Uenuma (Bijivo)
Tomoko Akitake (Bijivo)
This paper derives four acoustic parameters reflecting voice quality improvement through voice training. Voice training corrects respiration, phonation, articulation, and facial expression. The purpose of this study is to develop acoustic parameters that quantitatively represent the quality of each training item. This paper examines the performance of six acoustic feature parameters as measures of voice quality improvement. Experimental results indicated that four of the six parameters, the intensity of the harmonic structure, the dynamic range of the spectral envelope, formant intensity, and spectral slope, improved significantly after training. Further analyses were carried out on the correlation between subjective evaluation and the above acoustic properties. The results indicated that abdominal respiration was most important for improving these four parameters.
13:50Phonetic Segmentation of Singing Voice using MIDI and Parallel Speech
Minghui Dong (Institute for Infocomm Research)
Paul Chan (Institute for Infocomm Research)
Ling Cen (Institute for Infocomm Research)
Haizhou Li (Institute for Infocomm Research)
Jason Teo (Nanyang Technological University)
Ping Jen Kua (Nanyang Technological University)
When analyzing a singing voice signal, the boundaries of each phonetic unit in the singing voice samples must be known. However, due to prolonged vowels in the singing voice, it is not easy to accurately align a singing voice with the phonetic sequence of its lyrics using a conventional speech recognition approach. This paper proposes a solution for the phonetic annotation of the singing voice given a MIDI file and a parallel speech recording of the lyrics. The MIDI file, consisting of notation and lyric information, is used to locate lyrics in the singing voice. The recording of parallel speech data is used to generate a reference phonetic annotation by force-aligning it with the lyrics using a speech recognizer. The singing voice is then aligned with the speech, which has a phonetic annotation, and the phonetic boundaries are mapped to the singing voice. The results show that we are able to get an accurate annotation of phonetic boundaries in the singing voice.
14:10A Singing Style Modeling System for Singing Voice Synthesizers
Keijiro Saino (Corporate Research & Development Center, Yamaha Corporation, Japan)
Makoto Tachibana (Corporate Research & Development Center, Yamaha Corporation, Japan)
Hideki Kenmochi (Corporate Research & Development Center, Yamaha Corporation, Japan)
This paper describes a method of modeling singing styles statistically. In this system, singing expression parameters consisting of melody and dynamics, derived from F0 and power, are modeled by context-dependent hidden Markov models (HMMs). The modeling of these parameters is optimized for dealing with them. Since the parameters we focus on are essential yet general for singing synthesizers, parameters generated from the trained models may be applicable to many of them. In the experiment, we trained singing style models using highly expressive singing recordings, then generated parameters for songs not included in the training data and applied them to our singing synthesizer VOCALOID. As a result, the style was well perceived in the synthesized sound, with good synthetic quality.
14:30A Fast Query by Humming System Based on Notes
Jingzhou Yang (Tsinghua University, Beijing, China)
Jia Liu (Tsinghua University, Beijing, China)
Wei-Qiang Zhang (Tsinghua University, Beijing, China)
Query by humming (QBH), a content-based retrieval method, is an efficient way to search for a song in a large database. Frame-based systems can achieve good performance, but they are time-consuming. In this paper, we propose an efficient note-based system, mainly comprised of note-based linear scaling (NLS) and note-based recursive alignment (NRA). After post-processing, the system achieves a Top-5 accuracy of 96.1% with a retrieval time of 0.211 s.
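Note-based linear scaling can be illustrated by stretching a query note sequence to several candidate lengths and keeping the best match. The scale set, the mean-absolute-pitch-difference distance, and all names are assumptions for illustration, not the paper's method.

```python
import numpy as np

def linear_scale(notes, target_len):
    """Resample a note sequence to target_len by nearest-index picking."""
    idx = np.linspace(0, len(notes) - 1, target_len)
    return np.asarray(notes, dtype=float)[np.round(idx).astype(int)]

def nls_distance(query, candidate, scales=(0.8, 0.9, 1.0, 1.1, 1.2)):
    """Illustrative note-based linear scaling: stretch the query to
    several lengths and take the lowest mean absolute pitch difference."""
    best = float("inf")
    query = np.asarray(query, dtype=float)
    cand = np.asarray(candidate, dtype=float)
    for s in scales:
        n = max(1, int(round(len(query) * s)))
        scaled = linear_scale(query, n)
        m = min(n, len(cand))
        best = min(best, np.abs(scaled[:m] - cand[:m]).mean())
    return best
```

A retrieval front end of this kind would rank database songs by this distance before a more expensive alignment stage such as NRA.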
14:50Melody pitch estimation based on range estimation and candidate extraction using harmonic structure model
Seokhwan Jo (Dept. of EE, KAIST)
Sihyun Joo (Dept. of EE, KAIST)
Chang Dong Yoo (Dept. of EE, KAIST)
This paper proposes an algorithm to estimate the melody pitch line (the most dominant pitch sequence) of a given polyphonic audio signal based on melody range estimation and pitch candidate extraction using a harmonic structure model similar to that proposed by Goto. This paper defines the melody pitch candidates as the list of pitch candidates that produce the best-fit harmonic models to the polyphonic audio. In many melody extraction algorithms proposed in the past, a multiple-pitch extractor (MPE) is used to extract melody pitch candidates; however, the MPE serves the purpose of estimating all pitches within a frame of polyphonic audio and does not necessarily provide melody pitch candidates. The estimated weights of the harmonic structure model, which must be obtained for extracting the pitch candidates, are liable to octave errors and strong low-frequency interference, and therefore certain refinement after the estimation must be performed. As a refinement, the algorithm measures the degree of harmonic fitness of each candidate. Furthermore, a melody pitch range is estimated to reduce false-positive pitch candidates. The melody pitch range is estimated based on the distribution of the best long-duration pitch candidates. Experimental results show that the proposed extraction algorithm performs better than many of the algorithms proposed in the past.
15:10Modified Spatial Audio Object Coding Scheme with Harmonic Extraction and Elimination Structure for Interactive Audio Service
Jihoon Park (Korea Advanced Institute of Science and Technology)
Kwangki Kim (Korea Advanced Institute of Science and Technology)
Jeongil Seo (Electronics and Telecommunications Research Institute)
Minsoo Hahn (Korea Advanced Institute of Science and Technology)
An interactive audio service provides audio editing functionality to users. In the service, users can control the desired audio objects to make their own audio sound using a spatial audio object coding (SAOC) scheme. SAOC has a problem in the Karaoke mode, because the vocal object cannot be removed perfectly from the down-mix signal. In this paper, a modified SAOC scheme with harmonic extraction and elimination structures is proposed. The proposed scheme perfectly removes a vocal object using harmonic information of the vocal object. Subjective and objective evaluation results show that the proposed scheme is superior to the conventional ones.

Modeling first language acquisition

Time:Thursday 13:30 Place:302 Type:Oral
Chair:Keiichi Tajima
13:30Modelling the effect of speaker familiarity and noise on infant word recognition
Christina Bergmann (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands; International Max Planck Research School for Language Sciences, Radboud University Nijmegen, The Netherlands)
Michele Gubian (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Lou Boves (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
In the present paper we show that a general-purpose word learning model can simulate several important findings from recent experiments in language acquisition. Both the addition of background noise and varying the speaker have been found to influence infants' performance during word recognition experiments. We were able to replicate this behaviour in our artificial word learning agent. We use the results to discuss both advantages and limitations of computational models of language acquisition.
13:50Unsupervised Learning of Vowels from Continuous Speech based on Self-organized Phoneme Acquisition Model
Kouki Miyazawa (Graduate School of Human Sciences, Waseda University, Japan)
Hideaki Kikuchi (Graduate School of Human Sciences, Waseda University, Japan)
Reiko Mazuka (RIKEN Brain Science Institute, Japan)
All normal humans acquire their native phoneme system naturally. However, it is unclear how infants learn the acoustic expression of each phoneme of their language. Recent studies have inspected phoneme acquisition using computational models; however, these studies used read speech with a limited vocabulary as input and did not handle continuous speech. Therefore, we use natural speech and build a self-organization model that simulates this cognitive ability, and we analyze the information that is necessary for the acquisition of the native vowels. Our model is designed to learn natural continuous utterances and to estimate the number and boundaries of the vowel categories. In the simulation trial, we investigate the relationship between the quantity of learning and the accuracy for the vowels in a single Japanese speaker's speech. As a result, we find that the vowel recognition rate of our model is comparable to that of an adult.
14:10Learning speaker normalization using semisupervised manifold alignment
Andrew Plummer (Department of Linguistics, The Ohio State University, Columbus, OH, USA)
Mary Beckman (Department of Linguistics, The Ohio State University, Columbus, OH, USA)
Mikhail Belkin (Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA)
Eric Fosler-Lussier (Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA)
Benjamin Munson (Department of Speech-Language-Hearing Sciences, University of Minnesota, USA)
As a child acquires language, he or she: perceives acoustic information in his or her surrounding environment; identifies portions of the ambient acoustic information as language-related; and associates that language-related information with his or her perception of his or her own language-related acoustic productions. The present work models the third task. We use a semisupervised alignment algorithm based on manifold learning. We discuss the concepts behind this approach, and the application of the algorithm to this task. We present experimental evidence indicating the usefulness of manifold alignment in learning speaker normalization.
14:30Fully Unsupervised Word Learning from Continuous Speech Using Transitional Probabilities of Atomic Acoustic Events
Okko Johannes Räsänen (Department of Signal Processing and Acoustics, Aalto University School of Science and Technology, Finland)
This work presents a learning algorithm based on transitional probabilities of atomic acoustic events. The algorithm learns models for word-like units in speech without any supervision, and without a priori knowledge of phonemic or linguistic units. The learned models can be used to segment novel utterances into word-like units, supporting the theory that transitional probabilities of acoustic events could work as a bootstrapping mechanism of language learning. The performance of the algorithm is evaluated using a corpus of Finnish infant-directed speech.
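The core mechanism the abstract describes, placing word boundaries at local minima of the transitional probability between successive atomic events, can be sketched in a toy form. The event inventory, estimation details, and names here are illustrative assumptions; the paper's model differs.

```python
from collections import Counter

def transition_probs(sequences):
    """Estimate P(next | current) from sequences of event labels."""
    uni, bi = Counter(), Counter()
    for seq in sequences:
        uni.update(seq[:-1])
        bi.update(zip(seq, seq[1:]))
    return {pair: c / uni[pair[0]] for pair, c in bi.items()}

def segment(seq, tp):
    """Place a boundary at each strict local minimum of the transitional
    probability along the sequence (a toy Saffran-style cue)."""
    probs = [tp.get((a, b), 0.0) for a, b in zip(seq, seq[1:])]
    words, start = [], 0
    for i in range(1, len(probs) - 1):
        if probs[i] < probs[i - 1] and probs[i] < probs[i + 1]:
            words.append(seq[start:i + 1])
            start = i + 1
    words.append(seq[start:])
    return words
```

Within recurring "words" the transitional probability stays high, while it dips at word boundaries, so the minima recover word-like units without any lexicon.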
14:50Language acquisition and cross-modal associations - computational simulation of the results of infant studies
Louis ten Bosch (Radboud University Nijmegen)
Lou Boves (Radboud University Nijmegen)
This paper discusses recent results obtained with a computational model of language acquisition. This model, developed in the ACORNS project, has been shown to learn word-like units from stimuli in which utterances are paired with visual information. In this paper we extend the ACORNS experiments to ambiguous stimuli, so as to obtain a computational correlate of the findings of Smith and Yu in 2008. Smith and Yu stipulate that a young infant is confronted with an uncertainty problem: how to pair a word, embedded in a sentence, with a referent, embedded in a rich visual scene. They show that young infants can resolve this uncertainty by evaluating the statistical evidence across many individually ambiguous words and scenes. We investigate to what extent the ACORNS model is able to deal with cross-modal ambiguity. Moreover, we show the positive effect of an 'active' role during learning when confronted with ambiguity, based on internal confidence.
15:10Active word learning under uncertain input conditions
Maarten Versteegh (Radboud University Nijmegen, International Max Planck Research School for Language Sciences, Nijmegen)
Louis ten Bosch (Radboud University Nijmegen)
Lou Boves (Radboud University Nijmegen)
In this paper we investigate a computational model of word learning that is cognitively plausible. The model is partly trained on incorrect form-referent pairings, modelling the input to a word-learning child that may contain such mismatches due to inattention to a joint communicative scene. We introduce a procedure of active learning, based on attested cognitive processes. We then show how this procedure can help overcome the unreliability of the input by detecting and correcting the mismatches by reliance on previously built up experience.

ASR: Acoustic Models III

Time:Thursday 13:30 Place:International Conference Room A Type:Poster
Chair:Bhuvana Ramabhadran
#1Parallel Training of Neural Networks for Speech Recognition
Karel Veselý (Speech@FIT, Brno University of Technology, Czech Republic)
Lukáš Burget (Speech@FIT, Brno University of Technology, Czech Republic)
František Grézl (Speech@FIT, Brno University of Technology, Czech Republic)
In this paper we describe a parallel implementation of an ANN training procedure based on the block-mode back-propagation learning algorithm. Two different approaches to training parallelization were implemented. The first is data parallelization using POSIX threads, suitable for multi-core computers. The second is node parallelization using the high-performance SIMD architecture of a GPU with CUDA, suitable for CUDA-enabled computers. We compare the speedup of both approaches by training a typically-sized network on a real-world phoneme-state classification task, showing a nearly 10-fold reduction in training time with the CUDA version, while the multi-threaded version on an 8-core server gives only a 4-fold reduction. In both cases we compared against an already BLAS-optimized implementation. The training tool will be released as open-source software under the project name TNet.
#2The use of sense in unsupervised training of acoustic models for HMM-based ASR systems
Rita Singh (Carnegie Mellon University)
Benjamin Lambert (Carnegie Mellon University)
Bhiksha Raj (Carnegie Mellon University)
In unsupervised training of ASR systems, no annotated data are assumed to exist. Word-level annotations for training audio are generated iteratively using an ASR system. At each iteration a subset of data judged as having the most reliable transcriptions is selected to train the next set of acoustic models. Data selection however remains a difficult problem, particularly when the error rate of the recognizer providing the initial annotation is very high. In this paper we propose an iterative algorithm that uses a combination of likelihoods and a simple model of sense to select data. We show that the algorithm is effective for unsupervised training of acoustic models, particularly when the initial annotation is highly erroneous. Experiments conducted on Fisher-1 data using initial models from Switchboard, and a vocabulary and LM derived from the Google N-grams, show that performance on a selected held-out test set from Fisher data improves when we use the proposed iterative approach.
#3Boosted Mixture Learning of Gaussian Mixture HMMs for Speech Recognition
Jun Du (iFlytek Research, Hefei, Anhui, P. R. China)
Yu Hu (iFlytek Research, Hefei, Anhui, P. R. China)
Hui Jiang (York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, Canada)
In this paper, we propose a novel boosted mixture learning (BML) framework for Gaussian mixture HMMs in speech recognition. BML is an incremental method to learn mixture models for classification problems. In each step of BML, one new mixture component is calculated according to the functional gradient of an objective function, ensuring that it is added along the direction that maximizes the objective function the most. Several techniques are proposed to extend BML from simple mixture models like the Gaussian mixture model (GMM) to the Gaussian mixture hidden Markov model (HMM), including a Viterbi approximation to obtain state segmentation, weight decay when initializing sample weights to avoid overfitting, combining partial with global parameter updating, and using the Bayesian information criterion (BIC) for parsimonious modeling. Experimental results on the WSJ0 task show that the proposed BML yields relative word and sentence error rate reductions of 10.9% and 12.9%, respectively, over the conventional training procedure.
#4On the Exploitation of Hidden Markov Models and Linear Dynamic Models in a Hybrid Decoder Architecture for Continuous Speech Recognition
Volker Leutnant (University of Paderborn)
Reinhold Haeb-Umbach (University of Paderborn)
Linear dynamic models (LDMs) have been shown to be a viable alternative to hidden Markov models (HMMs) on small-vocabulary recognition tasks, such as phone classification. In this paper we investigate various statistical model combination approaches for a hybrid HMM-LDM recognizer, resulting in a phone classification performance that outperforms the best individual classifier. Further, we report on continuous speech recognition experiments on the AURORA4 corpus, where the model combination is carried out via word-graph rescoring. While the hybrid system improves on the HMM system in the case of monophone HMMs, the performance of the triphone HMM could not be improved by monophone LDMs, suggesting the need to introduce context dependency into the LDM inventory as well.
#5Context dependent modelling approaches for hybrid speech recognizers
Alberto Abad (INESC-ID Lisboa, Portugal)
Thomas Pellegrini (INESC-ID Lisboa, Portugal)
Isabel Trancoso (IST/INESC-ID Lisboa, Portugal)
João Neto (IST/INESC-ID Lisboa, Portugal)
Speech recognition based on connectionist approaches is one of the most successful alternatives to widespread Gaussian systems. One of the main claims against hybrid recognizers is the increased complexity of context-dependent phone modeling, which is a key aspect in medium to large vocabulary tasks. In this paper, we investigate the use of context-dependent triphone models in a connectionist speech recognizer. To this end, the most common triphone-state clustering procedures for Gaussian models are compared and applied to our hybrid recognizer. The developed systems with clustered context-dependent triphones show more than 20% relative word error rate reduction compared to a baseline hybrid system on two selected WSJ evaluation test sets. Additionally, recent efforts to port the proposed context modelling approaches to an LVCSR system for English Broadcast News transcription are reported.
#6A Regularized Discriminative Training Method of Acoustic Models Derived by Minimum Relative Entropy Discrimination
Yotaro Kubo (Waseda University)
Shinji Watanabe (NTT Communication Science Laboratory)
Atsushi Nakamura (NTT Communication Science Laboratory)
Tetsunori Kobayashi (Waseda University)
We present a realization method for the principle of minimum relative entropy discrimination (MRED) in order to derive a regularized discriminative training method. MRED is advantageous since it provides a Bayesian interpretation of the conventional discriminative training methods and regularization techniques. In order to realize MRED for speech recognition, we propose an approximation method for MRED that strictly preserves the constraints used in MRED. Further, in order to perform MRED in practice, an optimization method based on convex optimization and a solver based on the cutting-plane algorithm are also proposed. The proposed methods were evaluated on continuous phoneme recognition tasks. We confirmed in the experiments that the MRED-based training system outperformed conventional discriminative training methods.
#7Decision Tree State Clustering with Word and Syllable Features
Hank Liao (Google)
Chris Alberti (Google)
Michiel Bacchiani (Google)
Olivier Siohan (Google)
In large vocabulary continuous speech recognition, decision trees are widely used to cluster triphone states. In addition to commonly used phonetically based questions, others have proposed additional questions such as phone position within the word or syllable. This paper examines using the word or syllable context itself as a feature in the decision tree, providing an elegant way of introducing word- or syllable-specific models into the system. Positive results are reported on two state-of-the-art systems, voicemail transcription and search by voice, across a variety of acoustic model and training set sizes.
#8A Duration Modeling Technique with Incremental Speech Rate Normalization
Mitsuyoshi Tachimori (TOSHIBA CORPORATION)
This paper describes a novel technique to exploit duration information for low-resource speech recognition systems. Using explicit duration models significantly increases computational cost due to a large search space. To avoid this problem, most techniques using duration information adopt two-pass, N-best rescoring approaches. In contrast, we propose an algorithm using word duration models with incremental speech-rate normalization for one-pass decoding. In the proposed technique, penalties are added only to the scores of words with outlier durations, and not all words need to have duration models. Experimental results show that the proposed technique reduces errors by up to 17% on in-car digit string tasks without a significant increase in computational cost.
#9Long Short-Term Memory Networks for Noise Robust Speech Recognition
Martin Woellmer (Technische Universitaet Muenchen)
Yang Sun (Technische Universitaet Muenchen)
Florian Eyben (Technische Universitaet Muenchen)
Bjoern Schuller (Technische Universitaet Muenchen)
In this paper we introduce a novel hybrid model architecture for speech recognition and investigate its noise robustness on the Aurora 2 database. Our model is composed of a bidirectional Long Short-Term Memory (BLSTM) recurrent neural network exploiting long-range context information for phoneme prediction and a Dynamic Bayesian Network (DBN) for decoding. The DBN is able to learn pronunciation variants as well as typical phoneme confusions of the BLSTM predictor in order to compensate for signal disturbances. Unlike conventional Hidden Markov Model (HMM) systems, the proposed architecture is not based on Gaussian mixture modeling. Even without any feature enhancement, our BLSTM-DBN system outperforms a baseline HMM recognizer by up to 18%.
#10One-Model Speech Recognition and Synthesis Based on Articulatory Movement HMMs
Tsuneo Nitta (Toyohashi University of Technology)
Takayuki Onoda (Toyohashi University of Technology)
Masashi Kimura (Toyohashi University of Technology)
Yurie Iribe (Toyohashi University of Technology)
Koichi Katsurada (Toyohashi University of Technology)
One-model speech recognition (SR) and synthesis (SS) based on a common articulatory movement model is described. The SR engine has an articulatory feature (AF) extractor and an HMM-based classifier that models articulatory gestures. Experimental results on a phoneme recognition task show that AFs outperform MFCCs even if the training data are limited to a single speaker. In the SS engine, the same speaker-invariant HMM is applied to generate an AF sequence; after converting the AFs into vocal tract parameters, the speech signal is synthesized by a PARCOR filter together with a residual signal. Phoneme-to-phoneme speech conversion using AF exchange is also described.
#11Acoustic Modeling with Bootstrap and Restructuring for Low-resourced Languages
Xiaodong Cui (IBM T. J. Watson Research Center)
Jian Xue (IBM T. J. Watson Research Center)
Pierre Dognin (IBM T. J. Watson Research Center)
Upendra Chaudhari (IBM T. J. Watson Research Center)
Bowen Zhou (IBM T. J. Watson Research Center)
This paper investigates an acoustic modeling approach for low-resourced languages based on bootstrap and model restructuring. The approach first creates an acoustic model with redundancy by averaging over bootstrapped models from resampled subsets of sparse training data, which is followed by model restructuring to scale down the model to a desired cardinality. A variety of techniques for Gaussian clustering and model refinement are discussed for the model restructuring. LVCSR experiments are carried out on the Pashto language with up to 105 hours of training data. The proposed approach is shown to yield more robust acoustic models given sparse training data and to obtain superior performance over the traditional training procedure.
#12Lecture Speech Recognition by Combining Word Graphs of Various Acoustic Models
Tetsuo Kosaka (Yamagata University)
Keisuke Goto (Yamagata University)
Takashi Ito (Yamagata University)
Masaharu Kato (Yamagata University)
The aim of this work is to improve the performance of lecture speech recognition by using a system combination approach. In this paper, we propose a new combination technique in which various types of acoustic models are combined. In the combination approach, the use of complementary information is important. In order to prepare acoustic models that incorporate a variety of acoustic features, we employ both continuous-mixture hidden Markov models (CMHMMs) and discrete-mixture hidden Markov models (DMHMMs). These models have different patterns of recognition errors. In addition, we propose a new maximum mutual information (MMI) estimation of the DMHMM parameters. In order to evaluate the performance of the proposed method, we conduct recognition experiments on "Corpus of Spontaneous Japanese." In the experiments, a combination of CMHMMs and DMHMMs whose parameters were estimated by using the MMI criterion exhibited the best recognition performance.
#13Semi-parametric Trajectory Modelling Using Temporally Varying Linear Feature Transformation for Speech Recognition
Khe Chai Sim (National University of Singapore)
Shilin Liu (National University of Singapore)
Recently, the trajectory HMM has been shown to improve the performance of both speech recognition and speech synthesis. For efficiency, the state sequence is required to compute the likelihood of a trajectory HMM, which limits its use to N-best rescoring for speech recognition. Motivated by the success of models with temporally varying parameters, this paper proposes a Temporally Varying Feature Mapping (TVFM) model that transforms the feature vector sequence such that the trajectory information modelled by the trajectory HMM is suppressed. TVFM can therefore be perceived as an implicit trajectory modelling technique. Two approaches for estimating the TVFM parameters are presented. Experimental results for phone recognition on TIMIT and word recognition on the Wall Street Journal corpus show that promising results can be obtained using TVFM.
#14Deep-Structured Hidden Conditional Random Fields for Phonetic Recognition
Dong Yu (Microsoft Research)
Li Deng (Microsoft Research)
We extend our earlier work on the deep-structured conditional random field (DCRF) and develop the deep-structured hidden conditional random field (DHCRF). We investigate the use of this new sequential deep-learning model for phonetic recognition. The DHCRF is a hierarchical model in which the final layer is a hidden conditional random field (HCRF) and the intermediate layers are zeroth-order conditional random fields (CRFs). Parameter estimation and sequence inference in the DHCRF are carried out layer by layer. Note that the training label is available only at the final layer and the state boundary is unknown. This difficulty is addressed by using unsupervised learning for the intermediate layers and lattice-based supervised learning in the final layer. Experiments on the TIMIT phone recognition task show a small performance improvement of a three-layer DHCRF over a two-layer DHCRF, both of which are superior to the single-layer DHCRF and to a discriminatively trained triphone HMM with the same features.
#15Using Semi-Supervised Learning to Smooth Class Transitions
Jon Malkin (University of Washington, Seattle)
Jeff Bilmes (University of Washington, Seattle)
Seeking classifier models that are not overconfident and that better represent the inherent uncertainty over a set of choices, we extend an objective for semi-supervised learning for neural networks to two models from the ratio semi-definite classifier (RSC) family. We show that the RSC family of classifiers produces smoother transitions between classes on a vowel classification task, and that the semi-supervised framework provides further benefits for smooth transitions. Finally, our testing methodology presents a novel way to evaluate the smoothness of classifier transitions (interpolating between vowels) by using samples from classes unseen during training time.
#16Modeling Posterior Probabilities using the Linear Exponential Family
Peder Olsen (IBM)
Vaibhava Goel (IBM)
Charles Micchelli (SUNY Albany)
John Hershey (IBM)
A commonly used distribution on the probability simplex is the Dirichlet distribution. In this paper we present the linear exponential family as an alternative. The distribution is known in the statistics community, but we present here a numerically stable method to compute its parameters. Although the Dirichlet distribution is known to be a good Bayesian prior for probabilities, we believe this paper shows that the linear exponential model offers a good alternative in other contexts, such as when we want to use posterior probabilities as features for automatic speech recognition. We show how to incorporate posterior probabilities as additional features in an existing GMM, and show that the resulting model gives a 3% relative gain on a broadcast news speech recognition system.
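As a brief sketch of the model family the abstract names (notation ours, not necessarily the authors'): the linear exponential family places a density on the probability simplex that is log-linear in x, in contrast to the Dirichlet, which is log-linear in log x.

```latex
% Linear exponential family on the simplex \Delta_{n-1} (sketch, our notation):
p(x \mid \theta) = \exp\big(\theta^{\top} x - G(\theta)\big), \qquad x \in \Delta_{n-1},
\qquad G(\theta) = \log \int_{\Delta_{n-1}} \exp(\theta^{\top} x)\, dx,
% versus the Dirichlet density:
p(x \mid \alpha) \propto \prod_{i} x_i^{\alpha_i - 1}.
```

Evaluating the log-partition function G(θ) stably when components of θ nearly coincide is the kind of numerical difficulty the abstract's "numerically stable method" presumably addresses.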

Spoken dialogue systems II

Time:Thursday 13:30 Place:International Conference Room B Type:Poster
Chair:Masahiro Araki
#1New Technique to Enhance the Performance of Spoken Dialogue Systems Based on Dialogue States-Dependent Language Models and Grammatical Rules
Ramón López-Cózar (University of Granada)
David Griol (University Carlos III of Madrid)
This paper proposes a new technique to enhance the performance of spoken dialogue systems, whose novel contribution is the automatic correction of some ASR errors by using language models dependent on dialogue states in conjunction with grammatical rules. These models are optimally selected by computing similarity scores between patterns obtained from uttered sentences and patterns learnt during training. Experimental results with a spoken dialogue system designed for the fast food domain show that our technique enhances the word accuracy, speech understanding and task completion rates of the system by 8.5%, 16.54% and 44.17% absolute, respectively.
#2A Stochastic Finite-State Transducer Approach to Spoken Dialog Management
Lluís-F Hurtado (Universitat Politècnica de València)
Joaquín Planells (Universitat Politècnica de València)
Encarna Segarra (Universitat Politècnica de València)
Emilio Sanchis (Universitat Politècnica de València)
David Griol (Universidad Carlos III de Madrid)
In this paper, we present an approach to spoken dialog management based on the use of a Stochastic Finite-State Transducer estimated from a dialog corpus. The states of the Stochastic Finite-State Transducer represent the dialog states, the input alphabet includes all the possible user utterances, without considering specific values, and the set of system answers constitutes the output alphabet. A dialog then corresponds to a path in the transducer model from the initial state to the final one. An automatic dialog generation technique was used to generate the dialog corpus from which the transducer parameters are estimated. Our proposal for dialog management has been evaluated in a sports facilities booking task.
#3Enhanced Monitoring Tools and Online Dialogue Optimisation Merged into a New Spoken Dialogue System Design Experience
Romain Laroche (Orange Labs, Issy-les-Moulineaux, France)
Philippe Bretier (Orange Labs, Lannion, France)
Ghislain Putois (Orange Labs, Lannion, France)
This paper shows how the convergence between design and monitoring tools, together with the integration of a dedicated reinforcement learning algorithm, can offer a new design experience for Spoken Dialogue System (SDS) developers. The article first proposes integrating dialogue logs into the design tool, so that it also constitutes a monitoring tool, by revealing call flows and their associated Key Performance Indicators (KPIs). Second, the SDS developer is given the possibility of designing several alternatives and of visually comparing the performance of these design choices. Third, a reinforcement learning algorithm is integrated to automatically optimise the SDS choices. The design/monitoring tool helps SDS developers understand and analyse user behaviour, with the assistance of the learning algorithm. The SDS developers can then compare the different KPIs and control the further SDS choices by removing or adding alternatives.
#4Optimising a Handcrafted Dialogue System Design
Romain Laroche (Orange Labs, Issy-les-Moulineaux, France)
Ghislain Putois (Orange Labs, Lannion, France)
Philippe Bretier (Orange Labs, Lannion, France)
In the Spoken Dialogue System literature, all studies consider the dialogue move as the unquestionable unit for reinforcement learning. Rather than learning at the dialogue move level, we apply the learning at the design level, for three reasons: 1/ to alleviate the high-skill prerequisite for developers, 2/ to reduce the learning complexity by taking into account just the relevant subset of the context, and 3/ to obtain interpretable learning results that carry reusable usage feedback. Unfortunately, tackling the problem at the design level breaks the Markovian assumptions required by most Reinforcement Learning techniques. Consequently, we use a recent non-Markovian algorithm called Compliance Based Reinforcement Learning. This paper presents the first experiment on online optimisation in dialogue systems. It reveals a fast and significant improvement in system performance, with on average one system misunderstanding less per dialogue.
#5Utterance Selection for Speech Acts in a Cognitive Tourguide Scenario
Felix Putze (Cognitive Systems Lab, Karlsruhe Institute of Technology)
Tanja Schultz (Cognitive Systems Lab, Karlsruhe Institute of Technology)
This paper describes the integration of a cognitive memory model into a spoken dialog system for an in-car tourguide application. This memory model enhances the capabilities of the system and of the simulated user by estimating if and which information is relevant and useful in a given situation. An evaluation study with 15 human judges is performed to demonstrate the feasibility of the described approach. The results show that the proposed utterance selection strategy and the memory model significantly improve the human-like interaction behavior of the spoken dialog system in terms of the amount and quality of given information, relevance, manner, and naturalness of the spoken interaction.
#6Lexical Entrainment of Real Users in the Let’s Go Spoken Dialog System
Gabriel Parent (Language Technologies Institute, Carnegie Mellon University)
Maxine Eskenazi (Language Technologies Institute, Carnegie Mellon University)
This paper examines the lexical entrainment of real users in the Let’s Go spoken dialog system. First it presents a study of the presence of entrainment in a year of human-transcribed dialogs, by using a linear regression model, and concludes that users adapt their vocabulary to the system’s. This is followed by a study of the effect of changing the system vocabulary on the distribution of words used by the callers. The latter analysis provides strong evidence for the presence of lexical entrainment between users and spoken dialog systems.
#7Combining User Intention and Error Modeling for Statistical Dialog Simulators
Silvia Quarteroni (DISI - University of Trento, Italy)
Meritxell González (DISI - University of Trento, Italy and UPC - Universitat Politecnica de Catalunya)
Giuseppe Riccardi (DISI - University of Trento, Italy)
Sebastian Varges (DISI - University of Trento, Italy)
Statistical user simulation is an efficient and effective way to train and test the performance of a (spoken) dialog system. In this paper, we design and evaluate a modular data-driven dialog simulator. We decouple the “intentional” component of the User Simulator, composed of a Dialog Act Model, a Concept Model and a User Model, from the Error Simulator, where an Error Model represents different types of ASR/SLU noisy-channel distortion. We test different Dialog Act Models and two Error Models against the same dialog manager and compare our results with those of real dialogs obtained using that dialog manager in the same domain. Our results show, on the one hand, that finer Dialog Act Models achieve increasing levels of accuracy with respect to real user behavior and, on the other, that data-driven Error Models bring task completion times and rates closer to real data.
#8Parallel Processing of Interruptions and Feedback in Companions Affective Dialogue System
Jaakko Hakulinen (Department of Computer Sciences, University of Tampere, Finland)
Markku Turunen (Department of Computer Sciences, University of Tampere, Finland)
Raúl Santos de la Camara (Telefónica I+D, Spain)
Nigel Crook (Oxford University Computing Laboratory, UK)
Much interest has recently been given to making dialogue systems more natural by implementing more flexible software solutions, such as parallel and incremental processing. In the How-Was-Your-Day prototype, parallel processing paths provide complementary information and the parallel processing loops enable the system to respond to user activity in a more flexible manner than traditional pipeline processing. While most of the components work as though they were in a pipeline, the Interruption Manager is a component which uses the available information to generate the system responses outside of the pipeline and handles situations such as user interruptions.
#9Dynamic Language Models using Bayesian Networks for Spoken Dialog Systems
Antoine Raux (Honda Research Institute USA)
Neville Mehta (School of Electrical Engineering and Computer Science, Oregon State University)
Deepak Ramachandran (Honda Research Institute USA)
Rakesh Gupta (Honda Research Institute USA)
We introduce a new framework employing statistical language models (SLMs) for spoken dialog systems that facilitates the dynamic update of word probabilities based on dialog history. In combination with traditional state-dependent SLMs, we use a Bayesian Network to capture dependencies between user goal concepts and compute accurate distributions over words that express these concepts. This allows the framework to exploit information provided by the user in previous turns to predict the value of the unobserved concepts. We evaluate this approach on a large corpus of publicly available dialogs from the CMU Let's Go bus information system, and show that our approach significantly improves concept understanding precision over purely state-dependent SLMs.
#10Automatic detection of task-incompleted dialog for spoken dialog system based on dialog act N-gram
Sunao Hara (Graduate School of Information Science, Nagoya University)
Norihide Kitaoka (Graduate School of Information Science, Nagoya University)
Kazuya Takeda (Graduate School of Information Science, Nagoya University)
In this paper, we propose a method of detecting task-incompleted users for a spoken dialog system using an N-gram-based dialog history model. We collected a large amount of spoken dialog data, accompanied by usability evaluation scores from users, in real environments. The database was built in a field test in which naive users used a client-server music retrieval system with a spoken dialog interface on their own PCs. An N-gram model was trained from sequences consisting of user dialog acts and/or system dialog acts for two dialog classes: dialogs in which the music retrieval task was completed and dialogs in which it was not. The system then detects unknown dialogs in which the task was not completed based on the N-gram likelihood. Experiments were conducted on a large amount of real data, and the results show that our proposed method achieved good classification performance. When the classifier correctly detected all of the task-incompleted dialogs, our proposed method achieved a false detection rate of 6%.
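The likelihood comparison this abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name, bigram order, and add-one smoothing are our assumptions.

```python
from collections import defaultdict
import math


class BigramDAModel:
    """Bigram model over dialog-act sequences with add-one smoothing
    (illustrative sketch; the paper's exact N-gram order and smoothing
    scheme are not specified here)."""

    def __init__(self):
        self.bigram = defaultdict(int)   # counts of (prev_act, act) pairs
        self.unigram = defaultdict(int)  # counts of prev_act contexts
        self.vocab = set()

    def train(self, sequences):
        for seq in sequences:
            acts = ["<s>"] + list(seq) + ["</s>"]
            self.vocab.update(acts)
            for prev, cur in zip(acts, acts[1:]):
                self.bigram[(prev, cur)] += 1
                self.unigram[prev] += 1

    def log_likelihood(self, seq):
        acts = ["<s>"] + list(seq) + ["</s>"]
        v = len(self.vocab) + 1  # +1 for unseen acts
        ll = 0.0
        for prev, cur in zip(acts, acts[1:]):
            p = (self.bigram[(prev, cur)] + 1) / (self.unigram[prev] + v)
            ll += math.log(p)
        return ll


def detect_incomplete(dialog_acts, completed_model, incompleted_model):
    """Label a dialog as task-incompleted if the incompleted-class
    N-gram model assigns it the higher likelihood."""
    return (incompleted_model.log_likelihood(dialog_acts)
            > completed_model.log_likelihood(dialog_acts))
```

The detector is a two-class maximum-likelihood decision; a threshold on the likelihood ratio would let one trade detection against false-detection rate, as in the 6% operating point quoted above.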
#11Dialogue Act Detection in Error-Prone Spoken Dialogue Systems Using Partial Sentence Tree and Latent Dialogue Act Matrix
Wei-Bin Liang (Dept. of CSIE, NCKU, Taiwan)
Chung-Hsien Wu (Dept. of CSIE, NCKU, Taiwan)
Yu-Cheng Hsiao (Dept. of CSIE, NCKU, Taiwan)
In a spoken dialogue system, the major aim of spoken language understanding (SLU) is to detect the dialogue acts (DAs) of a speaker’s utterance. However, error-prone speech recognition often degrades the performance of the SLU. In this work, a DA detection approach using partial sentence trees (PSTs) and a latent dialogue act matrix (LDAM) is presented for SLU. For each input utterance with speech recognition errors, several partial sentences (PSs) derived from the recognized sentence can be obtained to construct a PST. A set of sentence grammar rules (GRs) is obtained for each PS using the Stanford parser. The relationship between the GRs and the DAs is modeled by an LDAM. Finally, the DA with the highest probability estimated from the speech recognition likelihood, the LDAM and the historical information is determined as the detected DA. In evaluation, compared to the slot-based method which achieved 48.1% detection accuracy, the proposed approach can achieve 84.3% accuracy.
#12Detection of Hot Spots in Poster Conversations based on Reactive Tokens of Audience
Tatsuya Kawahara (Kyoto University)
Kouhei Sumi (Kyoto University)
Zhi-Qiang Chang (Kyoto University)
Katsuya Takanashi (Kyoto University)
We present a novel scheme for indexing ``hot spots'' in conversations, such as poster sessions, based on the reaction of the audience. Specifically, we focus on laughter and non-lexical reactive tokens, which are presumably related to funny spots and interesting spots, respectively. A robust detection method for these acoustic events is realized by combining BIC-based segmentation and GMM-based classification, with additional verifiers for reactive tokens. Subjective evaluations suggest that hot spots associated with reactive tokens are consistently useful, while those associated with laughter are less reliable. Furthermore, we investigate the prosodic patterns of those reactive tokens that are closely related to the interest level.
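The BIC-based segmentation step mentioned above can be illustrated with the standard delta-BIC change-point criterion. This is a sketch under our own assumptions about the features and penalty weight, not the authors' implementation:

```python
import numpy as np


def delta_bic(frames, t, lam=1.0):
    """Delta-BIC at candidate change point t within an (n, d) array of
    feature frames. Positive values favor placing a segment boundary:
    two Gaussians fit the data better than one even after the model
    complexity penalty. Illustrative sketch of the standard BIC
    criterion; the paper's features and penalty weight lam are
    assumptions here.
    """
    n, d = frames.shape

    def n_logdet_cov(x):
        # n_seg * log|Sigma| term of the Gaussian log-likelihood
        # (constants cancel in the difference below)
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(d)
        return len(x) * np.linalg.slogdet(cov)[1]

    full = n_logdet_cov(frames)                                   # one Gaussian
    split = n_logdet_cov(frames[:t]) + n_logdet_cov(frames[t:])   # two Gaussians
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return full - split - lam * penalty
```

A segmenter would scan t over the window, place a boundary at the maximum if the delta-BIC there is positive, and pass the resulting segments to the GMM classifier.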
#13Psychological Evaluation of A Group Communication Activation Robot in A Party Game
Yoichi Matsuyama (Department of Computer Science and Engineering, Waseda University)
Hikaru Taniyama (Department of Computer Science and Engineering, Waseda University)
Shinya Fujie (Waseda Institute of Advanced Study, Waseda University)
Tetsunori Kobayashi (Department of Computer Science and Engineering, Waseda University)
We propose a communication activation robot and evaluate the effectiveness of communication activation. As an example application, we developed a system participating in a quiz-style party game called NANDOKU quiz on the multi-modal conversation robot SCHEMA, and conducted a laboratory experiment to evaluate its capability to activate group communication. We evaluated the interaction in the NANDOKU quiz game, with subjects as panelists, using video analysis and the SD (Semantic Differential) method with questionnaires. The results of the SD method indicate that subjects feel more pleased and perceive the interaction as noisier when the robot participates. The video analysis shows that the smiling duration ratio is greater when the robot participates. These results provide evidence of the robot’s communication activation function in the party game.
#14Analyzing User Utterances in Barge-in-able Spoken Dialogue System for Improving Identification Accuracy
Kyoko Matsuyama (Graduate School of Informatics, Kyoto University)
Kazunori Komatani (Graduate School of Informatics, Kyoto University)
Ryu Takeda (Graduate School of Informatics, Kyoto University)
Toru Takahashi (Graduate School of Informatics, Kyoto University)
Tetsuya Ogata (Graduate School of Informatics, Kyoto University)
Hiroshi Okuno (Graduate School of Informatics, Kyoto University)
In our barge-in-able spoken dialogue system, user behaviors such as barge-in timing and utterance expressions vary according to the user's characteristics and situation. The system adapts to these behaviors by modeling them. We analyzed 1584 utterances collected by our systems for quiz and news-listing tasks and showed that the ratio of referential expression use depends on the individual user and on the average length of the listed items. This tendency was incorporated as a prior probability into our method, improving the identification accuracy of the user's intended items.
#15Pitch similarity in the vicinity of backchannels
Mattias Heldner (KTH Speech Music and Hearing, Stockholm)
Jens Edlund (KTH Speech Music and Hearing, Stockholm)
Julia Hirschberg (Department of Computer Science, Columbia University, New York City)
Dynamic modeling of spoken dialogue seeks to capture how interlocutors change their speech over the course of a conversation. Much work has focused on how speakers adapt or entrain to different aspects of one another’s speaking style. In this paper we focus on local aspects of this adaptation. We investigate the relationship between backchannels and the interlocutor utterances that precede them with respect to pitch. We demonstrate that the pitch of backchannels is mo