Benigno Uria, Steve Renals, and Korin Richmond. A deep neural network for acoustic-articulatory speech inversion. In Proc. NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, Sierra Nevada, Spain, December 2011. [ bib | .pdf ]
In this work, we implement a deep belief network for the acoustic-articulatory inversion mapping problem. We find that adding up to 3 hidden layers improves inversion accuracy. We also show that this improvement is due to the higher expressive capability of a deep model and not a consequence of adding more adjustable parameters. Additionally, we show unsupervised pretraining of the system improves its performance in all cases, even for a 1 hidden-layer model. Our implementation obtained an average root mean square error of 0.95 mm on the MNGU0 test dataset, beating all previously published results.
Atef Ben Youssef. Control of talking heads by acoustic-to-articulatory inversion for language learning and rehabilitation. PhD thesis, Grenoble University, October 2011. [ bib | .pdf ]
This thesis presents a visual articulatory feedback system in which the visible and non-visible articulators of a talking head are controlled by inversion from a speaker's voice. Our approach to this inversion problem is based on statistical models built on acoustic and articulatory data recorded from a French speaker by means of an electromagnetic articulograph. A first system combines acoustic speech recognition and articulatory speech synthesis techniques based on hidden Markov models (HMMs). A second system uses Gaussian mixture models (GMMs) to estimate the articulatory trajectories directly from the speech sound. In order to generalise the single-speaker system to a multi-speaker system, we have implemented a speaker adaptation method based on maximum likelihood linear regression (MLLR) that we have assessed by means of a reference articulatory recognition system. Finally, we present a complete visual articulatory feedback demonstrator.
Keywords: visual articulatory feedback; acoustic-to-articulatory speech inversion mapping; ElectroMagnetic Articulography (EMA); hidden Markov models (HMMs); Gaussian mixture models (GMMs); speaker adaptation; face-to-tongue mapping
Oliver Watts, Junichi Yamagishi, and Simon King. Unsupervised continuous-valued word features for phrase-break prediction without a part-of-speech tagger. In Proc. Interspeech, pages 2157-2160, Florence, Italy, August 2011. [ bib | .pdf ]
Part of speech (POS) tags are foremost among the features conventionally used to predict intonational phrase-breaks for text to speech (TTS) conversion. The construction of such systems therefore presupposes the availability of a POS tagger for the relevant language, or of a corpus manually tagged with POS. However, such tools and resources are not available in the majority of the world’s languages, and manually labelling text with POS tags is an expensive and time-consuming process. We therefore propose the use of continuous-valued features that summarise the distributional characteristics of word types as surrogates for POS features. Importantly, such features are obtained in an unsupervised manner from an untagged text corpus. We present results on the phrase-break prediction task, where use of the features closes the gap in performance between a baseline system (using only basic punctuation-related features) and a topline system (incorporating a state-of-the-art POS tagger).
Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King. Can objective measures predict the intelligibility of modified HMM-based synthetic speech in noise? In Proc. Interspeech, August 2011. [ bib | .pdf ]
Synthetic speech can be modified to improve intelligibility in noise. In order to perform modifications automatically, it would be useful to have an objective measure that could predict the intelligibility of modified synthetic speech for human listeners. We analysed the impact on intelligibility – and on how well objective measures predict it – when we separately modify speaking rate, fundamental frequency, line spectral pairs and spectral peaks. Shifting LSPs can increase intelligibility for human listeners; other modifications had weaker effects. Among the objective measures we evaluated, the Dau model and the Glimpse proportion were the best predictors of human performance.
Korin Richmond, Phil Hoole, and Simon King. Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In Proc. Interspeech, pages 1505-1508, Florence, Italy, August 2011. [ bib | .pdf ]
This paper serves as an initial announcement of the availability of a corpus of articulatory data called mngu0. This corpus will ultimately consist of a collection of multiple sources of articulatory data acquired from a single speaker: electromagnetic articulography (EMA), audio, video, volumetric MRI scans, and 3D scans of dental impressions. This data will be provided free for research use. In this first stage of the release, we are making available one subset of EMA data, consisting of more than 1,300 phonetically diverse utterances recorded with a Carstens AG500 electromagnetic articulograph. Distribution of mngu0 will be managed by a dedicated “forum-style” web site. This paper both outlines the general goals motivating the distribution of the data and the creation of the mngu0 web forum, and also provides a description of the EMA data contained in this initial release.
Ming Lei, Junichi Yamagishi, Korin Richmond, Zhen-Hua Ling, Simon King, and Li-Rong Dai. Formant-controlled HMM-based speech synthesis. In Proc. Interspeech, pages 2777-2780, Florence, Italy, August 2011. [ bib | .pdf ]
This paper proposes a novel framework that enables us to manipulate and control formants in HMM-based speech synthesis. In this framework, the dependency between formants and spectral features is modelled by piecewise linear transforms; formant parameters are effectively mapped by these to the means of Gaussian distributions over the spectral synthesis parameters. The spectral envelope features generated under the influence of formants in this way may then be passed to high-quality vocoders to generate the speech waveform. This provides two major advantages over conventional frameworks. First, we can achieve spectral modification by changing formants only in those parts where we want control, whereas the user must specify all formants manually in conventional formant synthesisers (e.g. Klatt). Second, this can produce high-quality speech. Our results show the proposed method can control vowels in the synthesized speech by manipulating F1 and F2 without any degradation in synthesis quality.
Oliver Watts and Bowen Zhou. Unsupervised features from text for speech synthesis in a speech-to-speech translation system. In Proc. Interspeech, pages 2153-2156, Florence, Italy, August 2011. [ bib | .pdf ]
We explore the use of linguistic features that can be extracted from unannotated text in an unsupervised, language-independent fashion for text-to-speech (TTS) conversion in the context of a speech-to-speech translation system. The features are intended to act as surrogates for conventional part-of-speech (POS) features. Unlike POS features, the experimental features assume only the availability of tools and data that must already be in place for the construction of other components of the translation system, and can therefore be used for the TTS module without incurring additional TTS-specific costs. We here describe the use of the experimental features in a speech synthesiser, using six different configurations of the system to allow the comparison of the proposed features with conventional, knowledge-based POS features. We present results of objective and subjective evaluations of the usefulness of the new features.
Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi. Feature-space transform tying in unified acoustic-articulatory modelling of articulatory control of HMM-based speech synthesis. In Proc. Interspeech, pages 117-120, Florence, Italy, August 2011. [ bib | .pdf ]
In previous work, we have proposed a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into hidden Markov model (HMM) based parametric speech synthesis. A unified acoustic-articulatory model was trained and a piecewise linear transform was adopted to describe the dependency between these two feature streams. The transform matrices were trained for each HMM state and were tied based on each state's context. In this paper, an improved acoustic-articulatory modelling method is proposed. A Gaussian mixture model (GMM) is introduced to model the articulatory space and the cross-stream transform matrices are trained for each Gaussian mixture instead of context-dependently. This means the dependency relationship can vary flexibly as the articulatory features change. Our results show this method improves the effectiveness of control over vowel quality by modifying articulatory trajectories without degrading naturalness.
Atef Ben Youssef, Thomas Hueber, Pierre Badin, and Gérard Bailly. Toward a multi-speaker visual articulatory feedback system. In Proc. Interspeech, pages 589-592, Florence, Italy, August 2011. [ bib | .pdf ]
In this paper, we present recent developments on the HMM-based acoustic-to-articulatory inversion approach that we develop for a "visual articulatory feedback" system. In this approach, multi-stream phoneme HMMs are trained jointly on synchronous streams of acoustic and articulatory data, acquired by electromagnetic articulography (EMA). Acoustic-to-articulatory inversion is achieved in two steps. Phonetic and state decoding is first performed. Then articulatory trajectories are inferred from the decoded phone and state sequence using the maximum-likelihood parameter generation algorithm (MLPG). We introduce here a new procedure for the re-estimation of the HMM parameters, based on the Minimum Generation Error criterion (MGE). We also investigate the use of model adaptation techniques based on maximum likelihood linear regression (MLLR), as a first step toward a multi-speaker visual articulatory feedback system.
Fergus R. McInnes and Sharon J. Goldwater. Unsupervised extraction of recurring words from infant-directed speech. In Proceedings of CogSci 2011, Boston, Massachusetts, July 2011. [ bib | .pdf ]
To date, most computational models of infant word segmentation have worked from phonemic or phonetic input, or have used toy datasets. In this paper, we present an algorithm for word extraction that works directly from naturalistic acoustic input: infant-directed speech from the CHILDES corpus. The algorithm identifies recurring acoustic patterns that are candidates for identification as words or phrases, and then clusters together the most similar patterns. The recurring patterns are found in a single pass through the corpus using an incremental method, where only a small number of utterances are considered at once. Despite this limitation, we show that the algorithm is able to extract a number of recurring words, including some that infants learn earliest, such as "Mommy" and the child’s name. We also introduce a novel information-theoretic evaluation measure.
Myroslava Dzikovska, Amy Isard, Peter Bell, Johanna Moore, Natalie Steinhauser, and Gwendolyn Campbell. Beetle II: an adaptable tutorial dialogue system. In Proceedings of the SIGDIAL 2011 Conference, demo session, pages 338-340, Portland, Oregon, June 2011. Association for Computational Linguistics. [ bib | http ]
We present Beetle II, a tutorial dialogue system which accepts unrestricted language input and supports experimentation with different tutorial planning and dialogue strategies. Our first system evaluation compared two tutorial policies and demonstrated that the system can be used to study the impact of different approaches to tutoring. The system is also designed to allow experimentation with a variety of natural language techniques, and discourse and dialogue strategies.
S. Andraszewicz, J. Yamagishi, and S. King. Vocal attractiveness of statistical speech synthesisers. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5368-5371, May 2011. [ bib | DOI ]
Our previous analysis of speaker-adaptive HMM-based speech synthesis methods suggested that there are two possible reasons why average voices can obtain higher subjective scores than any individual adapted voice: 1) model adaptation degrades speech quality proportionally to the distance 'moved' by the transforms, and 2) psychoacoustic effects relating to the attractiveness of the voice. This paper is a follow-on from that analysis and aims to separate these effects out. Our latest perceptual experiments focus on attractiveness, using average voices and speaker-dependent voices without model transformation, and show that using several speakers to create a voice improves smoothness (measured by Harmonics-to-Noise Ratio), reduces distance from the average voice in the log F0-F1 space of the final voice and hence makes it more attractive at the segmental level. However, this is weakened or overridden at supra-segmental or sentence levels.
Keywords: speaker-adaptive HMM-based speech synthesis methods; speaker-dependent voices; statistical speech synthesisers; vocal attractiveness; hidden Markov models; speaker recognition; speech synthesis
P.L. De Leon, I. Hernaez, I. Saratxaga, M. Pucher, and J. Yamagishi. Detection of synthetic speech for the problem of imposture. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 4844-4847, May 2011. [ bib | DOI ]
In this paper, we present new results from our research into the vulnerability of a speaker verification (SV) system to synthetic speech. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, and both GMM-UBM and support vector machine (SVM) SV systems. Using 283 speakers from the Wall-Street Journal (WSJ) corpus, our SV systems have a 0.35% EER. When the systems are tested with synthetic speech generated from speaker models derived from the WSJ corpus, over 91% of the matched claims are accepted. We propose the use of relative phase shift (RPS) in order to detect synthetic speech and develop a GMM-based synthetic speech classifier (SSC). Using the SSC, we are able to correctly classify human speech in 95% of tests and synthetic speech in 88% of tests, thus significantly reducing the vulnerability.
Keywords: EER; GMM-UBM; GMM-based synthetic speech classifier; HMM-based speech synthesizer; RPS; SSC; SV system; WSJ corpus; Wall-Street Journal corpus; relative phase shift; speaker verification system; support vector machine; hidden Markov models; speaker recognition; speech synthesis; support vector machines
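A note on the EER figure quoted in the abstract above: the equal error rate is the operating point at which the false-acceptance and false-rejection rates coincide. The following is a minimal sketch of one way to approximate it, assuming plain lists of verification scores where higher means "more likely the claimed speaker". The function name and the sample scores are hypothetical illustrations and are not taken from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as a threshold.

    genuine  -- scores for true-speaker trials (hypothetical input)
    impostor -- scores for impostor trials (hypothetical input)
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False-acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False-rejection rate: genuine trials that fail the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up scores, for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine_scores, impostor_scores))
```

With only a handful of trials the two rates rarely cross exactly, so the sketch reports the midpoint at the closest threshold; evaluation toolkits typically interpolate instead.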
Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King. Evaluation of objective measures for intelligibility prediction of HMM-based synthetic speech in noise. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5112-5115, May 2011. [ bib | DOI | .pdf ]
In this paper we evaluate four objective measures of speech with regard to intelligibility prediction of synthesized speech in diverse noisy situations. We evaluated three intelligibility measures, the Dau measure, the glimpse proportion and the Speech Intelligibility Index (SII), and a quality measure, the Perceptual Evaluation of Speech Quality (PESQ). For the generation of synthesized speech we used a state-of-the-art HMM-based speech synthesis system. The noisy conditions comprised four additive noises. The measures were compared with subjective intelligibility scores obtained in listening tests. The results show the Dau and the glimpse measures to be the best predictors of intelligibility, with correlations of around 0.83 to subjective scores. All measures gave less accurate predictions of intelligibility for synthetic speech than have previously been found for natural speech; in particular the SII measure. In additional experiments, we processed the synthesized speech by an ideal binary mask before adding noise. The glimpse measure gave the most accurate intelligibility predictions in this situation.
J.P. Cabral, S. Renals, J. Yamagishi, and K. Richmond. HMM-based speech synthesiser using the LF-model of the glottal source. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 4704-4707, May 2011. [ bib | DOI | .pdf ]
A major factor which causes a deterioration in speech quality in HMM-based speech synthesis is the use of a simple delta pulse signal to generate the excitation of voiced speech. This paper sets out a new approach to using an acoustic glottal source model in HMM-based synthesisers instead of the traditional pulse signal. The goal is to improve speech quality and to better model and transform voice characteristics. We have found the new method decreases buzziness and also improves prosodic modelling. A perceptual evaluation has supported this finding by showing a 55.6% preference for the new system, as against the baseline. This improvement, while not being as significant as we had initially expected, does encourage us to work on developing the proposed speech synthesiser further.
K. Hashimoto, J. Yamagishi, W. Byrne, S. King, and K. Tokuda. An analysis of machine translation and speech synthesis in speech-to-speech translation system. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5108-5111, May 2011. [ bib | DOI ]
This paper provides an analysis of the impacts of machine translation and speech synthesis on speech-to-speech translation systems. The speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques for integration of speech recognition and machine translation have been proposed. However, speech synthesis has not yet been considered. Therefore, in this paper, we focus on machine translation and speech synthesis, and report a subjective evaluation to analyze the impact of each component. The results of these analyses show that the naturalness and intelligibility of synthesized speech are strongly affected by the fluency of the translated sentences.
Keywords: machine translation; speech recognition; speech synthesis; speech-to-speech translation system
Dong Wang, Nicholas Evans, Raphael Troncy, and Simon King. Handling overlaps in spoken term detection. In Proc. International Conference on Acoustics, Speech and Signal Processing, pages 5656-5659, May 2011. [ bib | DOI | .pdf ]
Spoken term detection (STD) systems usually arrive at many overlapping detections which are often addressed with some pragmatic approaches, e.g. choosing the best detection to represent all the overlaps. In this paper we present a theoretical study based on a concept of acceptance space. In particular, we present two confidence estimation approaches based on Bayesian and evidence perspectives respectively. Analysis shows that both approaches possess respective advantages and shortcomings, and that their combination has the potential to provide an improved confidence estimation. Experiments conducted on meeting data confirm our analysis and show considerable performance improvement with the combined approach, in particular for out-of-vocabulary spoken term detection with stochastic pronunciation modeling.
Dong Wang and Simon King. Letter-to-sound pronunciation prediction using conditional random fields. IEEE Signal Processing Letters, 18(2):122-125, February 2011. [ bib | DOI | .pdf ]
Pronunciation prediction, or letter-to-sound (LTS) conversion, is an essential task for speech synthesis, open-vocabulary spoken term detection and other applications dealing with novel words. Most current approaches (at least for English) employ data-driven methods to learn and represent pronunciation “rules” using statistical models such as decision trees, hidden Markov models (HMMs) or joint-multigram models (JMMs). The LTS task remains challenging, particularly for languages with a complex relationship between spelling and pronunciation such as English. In this paper, we propose to use a conditional random field (CRF) to perform LTS because it avoids having to model a distribution over observations and can perform global inference, suggesting that it may be more suitable for LTS than decision trees, HMMs or JMMs. One challenge in applying CRFs to LTS is that the phoneme and grapheme sequences of a word are generally of different lengths, which makes CRF training difficult. To solve this problem, we employed a joint-multigram model to generate aligned training exemplars. Experiments conducted with the AMI05 dictionary demonstrate that a CRF significantly outperforms other models, especially if n-best lists of predictions are generated.
Reima Karhila and Mirjam Wester. Rapid adaptation of foreign-accented HMM-based speech synthesis. In Proc. Interspeech, Florence, Italy, 2011. [ bib | .pdf ]
This paper presents findings of listeners’ perception of speaker identity in synthetic speech. Specifically, we investigated what the effect is on the perceived identity of a speaker when using differently accented average voice models and limited amounts (five and fifteen sentences) of a speaker’s data to create the synthetic stimuli. A speaker discrimination task was used to measure speaker identity. Native English listeners were presented with natural and synthetic speech stimuli in English and were asked to decide whether they thought the sentences were spoken by the same person or not. An accent rating task was also carried out to measure the perceived accents of the synthetic speech stimuli. The results show that listeners, for the most part, perform as well at speaker discrimination when the stimuli have been created using five or fifteen adaptation sentences as when using 105 sentences. Furthermore, the accent of the average voice model does not affect listeners’ speaker discrimination performance even though the accent rating task shows listeners are perceiving different accents in the synthetic stimuli. Listeners do not base their speaker similarity decisions on perceived accent.
Myroslava Dzikovska, Amy Isard, Peter Bell, Johanna D. Moore, Natalie B. Steinhauser, Gwendolyn E. Campbell, Leanne S. Taylor, Simon Caine, and Charlie Scott. Adaptive intelligent tutorial dialogue in the Beetle II system. In Artificial Intelligence in Education - 15th International Conference (AIED 2011), interactive event, volume 6738 of Lecture Notes in Computer Science, page 621, Auckland, New Zealand, 2011. Springer. [ bib | DOI ]
Mirjam Wester and Hui Liang. Cross-lingual speaker discrimination using natural and synthetic speech. In Proc. Interspeech, Florence, Italy, 2011. [ bib | .pdf ]
This paper describes speaker discrimination experiments in which native English listeners were presented with either natural speech stimuli in English and Mandarin, synthetic speech stimuli in English and Mandarin, or natural Mandarin speech and synthetic English speech stimuli. In each experiment, listeners were asked to decide whether they thought the sentences were spoken by the same person or not. We found that the results for Mandarin/English speaker discrimination are very similar to results found in previous work on German/English and Finnish/English speaker discrimination. We conclude from this and previous work that listeners are able to identify speakers across languages and they are able to identify speakers across speech types, but the combination of these two factors leads to a speaker discrimination task which is too difficult for listeners to perform successfully, given the quality of across-language speaker adapted speech synthesis at present.
T. Raitio, A. Suni, J. Yamagishi, H. Pulakka, J. Nurminen, M. Vainio, and P. Alku. HMM-based speech synthesis utilizing glottal inverse filtering. IEEE Transactions on Audio, Speech and Language Processing, 19(1):153-165, January 2011. [ bib | DOI ]
This paper describes a hidden Markov model (HMM)-based speech synthesizer that utilizes glottal inverse filtering for generating natural sounding synthetic speech. In the proposed method, speech is first decomposed into the glottal source signal and the model of the vocal tract filter through glottal inverse filtering, and thus parametrized into excitation and spectral features. The source and filter features are modeled individually in the framework of HMM and generated in the synthesis stage according to the text input. The glottal excitation is synthesized through interpolating and concatenating natural glottal flow pulses, and the excitation signal is further modified according to the spectrum of the desired voice source characteristics. Speech is synthesized by filtering the reconstructed source signal with the vocal tract filter. Experiments show that the proposed system is capable of generating natural sounding speech, and the quality is clearly better compared to two HMM-based speech synthesis systems based on widely used vocoder techniques.
Keywords: Glottal inverse filtering, hidden Markov model (HMM), speech synthesis
Theresa Wilson and Gregor Hofer. Using linguistic and vocal expressiveness in social role recognition. In Proc. Int. Conf. on Intelligent User Interfaces, IUI2011, Palo Alto, USA, 2011. ACM. [ bib | .pdf ]
In this paper, we investigate two types of expressiveness, linguistic and vocal, and whether they are useful for recognising the social roles of participants in meetings. Our experiments show that combining expressiveness features with speech activity does improve social role recognition over speech activity features alone.
J. Dines, J. Yamagishi, and S. King. Measuring the gap between HMM-based ASR and TTS. IEEE Selected Topics in Signal Processing, 2011. (in press). [ bib | DOI ]
The EMIME European project is conducting research in the development of technologies for mobile, personalised speech-to-speech translation systems. The hidden Markov model (HMM) is being used as the underlying technology in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components; thus, the investigation of unified statistical modelling approaches has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper we present results and analysis of a series of experiments that have been conducted on English ASR and TTS systems measuring their performance with respect to phone set and lexicon; acoustic feature type and dimensionality; HMM topology; and speaker adaptation. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance often demands diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of such unified modelling approaches.
Keywords: Acoustics, Adaptation model, Context modeling, Hidden Markov models, Speech, Speech recognition, Training, speech recognition, speech synthesis, unified models
Mirjam Wester and Reima Karhila. Speaker similarity evaluation of foreign-accented speech synthesis using HMM-based speaker adaptation. In Proc. ICASSP, pages 5372-5375, Prague, Czech Republic, 2011. [ bib | .pdf ]
This paper describes a speaker discrimination experiment in which native English listeners were presented with natural and synthetic speech stimuli in English and were asked to judge whether they thought the sentences were spoken by the same person or not. The natural speech consisted of recordings of Finnish speakers speaking English. The synthetic stimuli were created using adaptation data from the same Finnish speakers. Two average voice models were compared: one trained on Finnish-accented English and the other on American-accented English. The experiments illustrate that listeners perform well at speaker discrimination when the stimuli are both natural or both synthetic, but when the speech types are crossed performance drops significantly. We also found that the type of accent in the average voice model had no effect on the listeners’ speaker discrimination performance.
Maria Klara Wolters, Christine Johnson, and Karl B Isaac. Can the hearing handicap inventory for adults be used as a screen for perception experiments? In Proc. ICPhS XVII, Hong Kong, 2011. [ bib | .pdf ]
When screening participants for speech perception experiments, formal audiometric screens are often not an option, especially when studies are conducted over the Internet. We investigated whether a brief standardized self-report questionnaire, the screening version of the Hearing Handicap Inventory for Adults (HHIA-S), could be used to approximate the results of audiometric screening. Our results suggest that while the HHIA-S is useful, it needs to be used with extremely strict cut-off values that could exclude around 25% of people with no hearing impairment who are interested in participating. Well constructed, standardized single questions might be a more feasible alternative, in particular for web experiments.
Adriana Stan, Junichi Yamagishi, Simon King, and Matthew Aylett. The Romanian speech synthesis (RSS) corpus: Building a high quality HMM-based speech synthesis system using a high sampling rate. Speech Communication, 53(3):442-450, 2011. [ bib | DOI | http ]
This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called “RSS”, along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of these are now freely available for academic use in order to promote Romanian speech technology research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker and was recorded using multiple microphones at 96 kHz sampling frequency in a hemianechoic chamber. The details of the new Romanian text processor we have developed are also given. Using the database, we then revisit some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. As we demonstrate using perceptual tests, these configuration choices can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling frequencies can offer enhanced feature extraction and improved speaker similarity for HMM-based speech synthesis.
Keywords: Speech synthesis, HTS, Romanian, HMMs, Sampling frequency, Auditory scale
L. Lu, A. Ghoshal, and S. Renals. Regularized subspace Gaussian mixture models for speech recognition. IEEE Signal Processing Letters, 18(7):419-422, 2011. [ bib | .pdf ]
Subspace Gaussian mixture models (SGMMs) provide a compact representation of the Gaussian parameters in an acoustic model, but may still suffer from over-fitting with insufficient training data. In this letter, the SGMM state parameters are estimated using a penalized maximum-likelihood objective, based on ℓ1 and ℓ2 regularization, as well as their combination, referred to as the elastic net, for robust model estimation. Experiments on the 5000-word Wall Street Journal transcription task show word error rate reduction and improved model robustness with regularization.
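For readers unfamiliar with the term, the generic shape of an elastic-net penalized maximum-likelihood estimate can be sketched as follows. The symbols here are assumptions chosen for illustration ($\mathbf{v}$ for a state parameter vector, $\mathcal{L}$ for the log-likelihood, $\lambda_1, \lambda_2 \ge 0$ for the penalty weights) and may not match the notation or exact formulation used in the letter.

```latex
\hat{\mathbf{v}}
  = \arg\max_{\mathbf{v}}\;
    \mathcal{L}(\mathbf{v})
    - \lambda_1 \lVert \mathbf{v} \rVert_1
    - \lambda_2 \lVert \mathbf{v} \rVert_2^2
```

Setting $\lambda_2 = 0$ leaves a pure ℓ1 penalty, setting $\lambda_1 = 0$ leaves a pure ℓ2 penalty, and keeping both non-zero gives the combined elastic-net case.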
A. G. Pipe, R. Vaidyanathan, C. Melhuish, P. Bremner, P. Robinson, R. A. J. Clark, A. Lenz, K. Eder, N. Hawes, Z. Ghahramani, M. Fraser, M. Mermehdi, P. Healey, and S. Skachek. Affective robotics: Human motion and behavioural inspiration for cooperation between humans and assistive robots. In Yoseph Bar-Cohen, editor, Biomimetics: Nature-Based Innovation, chapter 15. Taylor and Francis, 2011. [ bib ]
Michael A. Berger, Gregor Hofer, and Hiroshi Shimodaira. Carnival - combining speech technology and computer animation. IEEE Computer Graphics and Applications, 31:80-89, 2011. [ bib | DOI ]
Jonathan Kilgour, Jean Carletta, and Steve Renals. The Ambient Spotlight: Personal meeting capture with a microphone array. In Proc. HSCMA, 2011. [ bib | DOI | .pdf ]
We present the Ambient Spotlight system for personal meeting capture based on a portable USB microphone array and a laptop. The system combines distant speech recognition and content linking with personal productivity tools, and enables recognised meeting recordings to be integrated with desktop search, calendar, and email.
S. Renals. Automatic analysis of multiparty meetings. SADHANA - Academy Proceedings in Engineering Sciences, 36(5):917-932, 2011. [ bib | DOI | .pdf ]
This paper is about the recognition and interpretation of multiparty meetings captured as audio, video and other signals. This is a challenging task since the meetings consist of spontaneous and conversational interactions between a number of participants: it is a multimodal, multiparty, multistream problem. We discuss the capture and annotation of the AMI meeting corpus, the development of a meeting speech recognition system, and systems for the automatic segmentation, summarisation and social processing of meetings, together with some example applications based on these systems.
Mirjam Wester and Hui Liang. The EMIME Mandarin Bilingual Database. Technical Report EDI-INF-RR-1396, The University of Edinburgh, 2011. [ bib | .pdf ]
This paper describes the collection of a bilingual database of Mandarin/English data. In addition, the accents of the talkers in the database have been rated. English and Mandarin listeners assessed the English and Mandarin talkers' degree of foreign accent in English.
Andi K. Winterboer, Martin I. Tietze, Maria K. Wolters, and Johanna D. Moore. The user-model based summarize and refine approach improves information presentation in spoken dialog systems. Computer Speech and Language, 25(2):175-191, 2011. [ bib | .pdf ]
A common task for spoken dialog systems (SDS) is to help users select a suitable option (e.g., flight, hotel, and restaurant) from the set of options available. As the number of options increases, the system must have strategies for generating summaries that enable the user to browse the option space efficiently and successfully. In the user-model based summarize and refine approach (UMSR, Demberg and Moore, 2006), options are clustered to maximize utility with respect to a user model, and linguistic devices such as discourse cues and adverbials are used to highlight the trade-offs among the presented items. In a Wizard-of-Oz experiment, we show that the UMSR approach leads to improvements in task success, efficiency, and user satisfaction compared to an approach that clusters the available options to maximize coverage of the domain (Polifroni et al., 2003). In both a laboratory experiment and a web-based experimental paradigm employing the Amazon Mechanical Turk platform, we show that the discourse cues in UMSR summaries help users compare different options and choose between options, even though they do not improve verbatim recall. This effect was observed for both written and spoken stimuli.
C. Mayo, R. A. J. Clark, and S. King. Listeners' weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis. Speech Communication, 53(3):311-326, 2011. [ bib | DOI ]
The quality of current commercial speech synthesis systems is now so high that system improvements are being made at subtle sub- and supra-segmental levels. Human perceptual evaluation of such subtle improvements requires a highly sophisticated level of perceptual attention to specific acoustic characteristics or cues. However, it is not well understood what acoustic cues listeners attend to by default when asked to evaluate synthetic speech. It may, therefore, be potentially quite difficult to design an evaluation method that allows listeners to concentrate on only one dimension of the signal, while ignoring others that are perceptually more important to them. The aim of the current study was to determine which acoustic characteristics of unit-selection synthetic speech are most salient to listeners when evaluating the naturalness of such speech. This study made use of multidimensional scaling techniques to analyse listeners' pairwise comparisons of synthetic speech sentences. Results indicate that listeners place a great deal of perceptual importance on the presence of artifacts and discontinuities in the speech, somewhat less importance on aspects of segmental quality, and very little importance on stress/intonation appropriateness. These relative differences in importance will impact on listeners' ability to attend to these different acoustic characteristics of synthetic speech, and should therefore be taken into account when designing appropriate methods of synthetic speech evaluation.
Keywords: Speech synthesis; Evaluation; Speech perception; Acoustic cue weighting; Multidimensional scaling
L. Lu, A. Ghoshal, and S. Renals. Regularized subspace Gaussian mixture models for cross-lingual speech recognition. In Proc. ASRU, 2011. [ bib | .pdf ]
We investigate cross-lingual acoustic modelling for low resource languages using the subspace Gaussian mixture model (SGMM). We assume the presence of acoustic models trained on multiple source languages, and use the global subspace parameters from those models for improved modelling in a target language with limited amounts of transcribed speech. Experiments on the GlobalPhone corpus using Spanish, Portuguese, and Swedish as source languages and German as target language (with 1 hour and 5 hours of transcribed audio) show that multilingually trained SGMM shared parameters result in lower word error rates (WERs) than using those from a single source language. We also show that regularizing the estimation of the SGMM state vectors by penalizing their ℓ1-norm helps to overcome numerical instabilities and leads to lower WER.
Atef Ben Youssef, Thomas Hueber, Pierre Badin, Gérard Bailly, and Frédéric Elisei. Toward a speaker-independent visual articulatory feedback system. In 9th International Seminar on Speech Production, ISSP9, Montreal, Canada, 2011. [ bib | .pdf ]
Thomas Hueber, Pierre Badin, Gérard Bailly, Atef Ben Youssef, Frédéric Elisei, Bruce Denby, and Gérard Chollet. Statistical mapping between articulatory and acoustic data: Application to silent speech interface and visual articulatory feedback. In Proceedings of the 1st International Workshop on Performative Speech and Singing Synthesis (p3s), Vancouver, Canada, 2011. [ bib | .pdf ]
This paper reviews some theoretical and practical aspects of different statistical mapping techniques used to model the relationships between the articulatory gestures and the resulting speech sound. These techniques are based on the joint modeling of articulatory and acoustic data using Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM). These methods are implemented in two systems: (1) the silent speech interface developed at SIGMA and LTCI laboratories which converts tongue and lip motions, captured during silent articulation by ultrasound and video imaging, into audible speech, and (2) the visual articulatory feedback system, developed at GIPSA-lab, which automatically animates, from the speech sound, a 3D orofacial clone displaying all articulators (including the tongue). These mapping techniques are also discussed in terms of real-time implementation.
Keywords: statistical mapping, silent speech, ultrasound, visual articulatory feedback, talking head, HMM, GMM