P. Swietojanski, A. Ghoshal, and S. Renals. Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR. In Proc. IEEE Workshop on Spoken Language Technology, pages 246-251, Miami, Florida, USA, December 2012. [ bib | DOI | .pdf ]

We investigate the use of cross-lingual acoustic data to initialise deep neural network (DNN) acoustic models by means of unsupervised restricted Boltzmann machine (RBM) pretraining. DNNs for German are pretrained using one or all of German, Portuguese, Spanish and Swedish. The DNNs are used in a tandem configuration, where the network outputs are used as features for a hidden Markov model (HMM) whose emission densities are modeled by Gaussian mixture models (GMMs), as well as in a hybrid configuration, where the network outputs are used as the HMM state likelihoods. The experiments show that unsupervised pretraining is more crucial for the hybrid setups, particularly with limited amounts of transcribed training data. More importantly, unsupervised pretraining is shown to be language-independent.
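As an illustration of the hybrid configuration mentioned in the abstract above, the following minimal sketch converts network state posteriors into scaled log-likelihoods by dividing by state priors. This posterior-over-prior convention is common in hybrid DNN/HMM systems but is an assumption here; the abstract does not say how the paper does it. The function name, the flooring constant and the example values are hypothetical.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert state posteriors p(s|x) to scaled log-likelihoods.

    Uses log p(s|x) - log p(s), a common hybrid-system convention
    (assumed here, not taken from the paper). Values are floored to
    avoid taking the log of zero.
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    return [
        math.log(max(p, floor)) - math.log(max(q, floor))
        for p, q in zip(posteriors, priors)
    ]


# Hypothetical three-state example.
example = scaled_log_likelihoods([0.5, 0.25, 0.25], [0.25, 0.25, 0.5])
```

In a tandem configuration, by contrast, the network outputs would be passed on as input features to a GMM-HMM rather than used directly as state scores.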

P. Bell, M. Gales, P. Lanchantin, X. Liu, Y. Long, S. Renals, P. Swietojanski, and P. Woodland. Transcription of multi-genre media archives using out-of-domain data. In Proc. IEEE Workshop on Spoken Language Technology, pages 324-329, Miami, Florida, USA, December 2012. [ bib | DOI | .pdf ]

We describe our work on developing a speech recognition system for multi-genre media archives. The high diversity of the data makes this a challenging recognition task, which may benefit from systems trained on a combination of in-domain and out-of-domain data. Working with tandem HMMs, we present Multi-level Adaptive Networks (MLAN), a novel technique for incorporating information from out-of-domain posterior features using deep neural networks. We show that it provides a substantial reduction in WER over other systems, with relative WER reductions of 15% over a PLP baseline, 9% over in-domain tandem features and 8% over the best out-of-domain tandem features.
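The relative WER reductions quoted in the abstract above follow the usual definition of relative improvement, sketched below. The function name and the WER values in the example are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Return the relative WER reduction of a system over a baseline.

    Both arguments are word error rates in the same unit (for
    example, percent). The result is a fraction: 0.15 means 15%.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical values: a baseline at 40.0% WER and a system at 34.0% WER
# correspond to a 15% relative reduction (6.0 points absolute).
reduction = relative_wer_reduction(40.0, 34.0)
```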

Adriana Stan, Peter Bell, and Simon King. A grapheme-based method for automatic alignment of speech and text data. In Proc. IEEE Workshop on Spoken Language Technology, Miami, Florida, USA, December 2012. [ bib | .pdf ]

This paper introduces a method for automatic alignment of speech data with unsynchronised, imperfect transcripts, for a domain where no initial acoustic models are available. Using grapheme-based acoustic models, word skip networks and orthographic speech transcripts, we are able to harvest 55% of the speech with a 93% utterance-level accuracy and 99% word accuracy for the produced transcriptions. The work is based on the assumption that there is a high degree of correspondence between the speech and text, and that a full transcription of all of the speech is not required. The method is language independent and the only prior knowledge and resources required are the speech and text transcripts, and a few minor user interventions.

P. L. De Leon, M. Pucher, J. Yamagishi, I. Hernaez, and I. Saratxaga. Evaluation of speaker verification security and detection of HMM-based synthetic speech. Audio, Speech, and Language Processing, IEEE Transactions on, 20(8):2280-2290, October 2012. [ bib | DOI ]

In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.
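The equal error rate (EER) mentioned in the abstract above is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below estimates it with a simple threshold sweep over observed scores; the function name and the scores are hypothetical, and this is not the evaluation procedure used in the paper.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. Returns
    (eer, threshold) at the point where the false acceptance rate (FAR)
    and false rejection rate (FRR) are closest; the EER is reported as
    their average at that point.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    best = None
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Hypothetical verification scores (higher means more likely genuine).
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

A low EER on human impostors says little about spoofing: synthetic speech adapted to the target speaker can still score above the chosen threshold, which is the vulnerability the abstract reports.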

Korin Richmond and Steve Renals. Ultrax: An animated midsagittal vocal tract display for speech therapy. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]

Speech sound disorders (SSD) are the most common communication impairment in childhood, and can hamper social development and learning. Current speech therapy interventions rely predominantly on the auditory skills of the child, as little technology is available to assist in diagnosis and therapy of SSDs. Realtime visualisation of tongue movements has the potential to bring enormous benefit to speech therapy. Ultrasound scanning offers this possibility, although its display may be hard to interpret. Our ultimate goal is to exploit ultrasound to track tongue movement, while displaying a simplified, diagrammatic vocal tract that is easier for the user to interpret. In this paper, we outline a general approach to this problem, combining a latent space model with a dimensionality reducing model of vocal tract shapes. We assess the feasibility of this approach using magnetic resonance imaging (MRI) scans to train a model of vocal tract shapes, which is animated using electromagnetic articulography (EMA) data from the same speaker.

Keywords: Ultrasound, speech therapy, vocal tract visualisation

Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi. Vowel creation by articulatory control in HMM-based parametric speech synthesis. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]

This paper presents a method to produce a new vowel by articulatory control in hidden Markov model (HMM) based parametric speech synthesis. A multiple regression HMM (MRHMM) is adopted to model the distribution of acoustic features, with articulatory features used as external auxiliary variables. The dependency between acoustic and articulatory features is modelled by a group of linear transforms that are either estimated context-dependently or determined by the distribution of articulatory features. Vowel identity is removed from the set of context features used to ensure compatibility between the context-dependent model parameters and the articulatory features of a new vowel. At synthesis time, acoustic features are predicted according to the input articulatory features as well as context information. With an appropriate articulatory feature sequence, a new vowel can be generated even when it does not exist in the training set. Experimental results show this method is effective in creating the English vowel /ʌ/ by articulatory control without using any acoustic samples of this vowel.

Keywords: Speech synthesis, articulatory features, multiple-regression hidden Markov model

Heng Lu and Simon King. Using Bayesian networks to find relevant context features for HMM-based speech synthesis. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]

Speech units are highly context-dependent, so taking contextual features into account is essential for speech modelling. Context is employed in HMM-based text-to-speech synthesis systems via context-dependent phone models. A very wide context is taken into account, represented by a large set of contextual factors. However, most of these factors probably have no significant influence on the speech, most of the time. To discover which combinations of features should be taken into account, decision tree-based context clustering is used. But the space of context-dependent models is vast, and the number of contexts seen in the training data is only a tiny fraction of this space, so the task of the decision tree is very hard: to generalise from observations of a tiny fraction of the space to the rest of the space, whilst ignoring uninformative or redundant context features. The structure of the context feature space has not been systematically studied for speech synthesis. In this paper we discover a dependency structure by learning a Bayesian Network over the joint distribution of the features and the speech. We demonstrate that it is possible to discard the majority of context features with minimal impact on quality, measured by a perceptual test.

Keywords: HMM-based speech synthesis, Bayesian Networks, context information

Phillip L. De Leon, Bryan Stewart, and Junichi Yamagishi. Synthetic speech discrimination using pitch pattern statistics derived from image analysis. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib ]

In this paper, we extend the work by Ogihara, et al. to discriminate between human and synthetic speech using features based on pitch patterns. As previously demonstrated, significant differences in pitch patterns between human and synthetic speech can be leveraged to classify speech as being human or synthetic in origin. We propose using mean pitch stability, mean pitch stability range, and jitter as features extracted after image analysis of pitch patterns. We have observed that for synthetic speech, these features lie in a small and distinct space as compared to human speech and have modeled them with a multivariate Gaussian distribution. Our classifier is trained using synthetic speech collected from the 2008 and 2011 Blizzard Challenge along with Festival pre-built voices and human speech from the NIST2002 corpus. We evaluate the classifier on a much larger corpus than previously studied using human speech from the Switchboard corpus, synthetic speech from the Resource Management corpus, and synthetic speech generated from Festival trained on the Wall Street Journal corpus. Results show 98% accuracy in correctly classifying human speech and 96% accuracy in correctly classifying synthetic speech.

J. Lorenzo, B. Martinez, R. Barra-Chicote, V. Lopez-Ludena, J. Ferreiros, J. Yamagishi, and J.M. Montero. Towards an unsupervised speaking style voice building framework: Multi-style speaker diarization. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib ]

Current text-to-speech systems are developed using studio-recorded speech in a neutral style or based on acted emotions. However, the proliferation of media sharing sites would allow developing a new generation of speech-based systems which could cope with spontaneous and styled speech. This paper proposes an architecture to deal with realistic recordings and carries out some experiments on unsupervised speaker diarization. In order to maximize the speaker purity of the clusters while keeping a high speaker coverage, the paper evaluates the F-measure of a diarization module, achieving high scores (>85%) especially when the clusters are longer than 30 seconds, even for the more spontaneous and expressive styles (such as talk shows or sports).

Rasmus Dall, Christophe Veaux, Junichi Yamagishi, and Simon King. Analysis of speaker clustering techniques for HMM-based speech synthesis. In Proc. Interspeech, September 2012. [ bib | .pdf ]

This paper describes a method for speaker clustering, with the application of building average voice models for speaker-adaptive HMM-based speech synthesis that are a good basis for adapting to specific target speakers. Our main hypothesis is that using perceptually similar speakers to build the average voice model will be better than using unselected speakers, even if the amount of data available from perceptually similar speakers is smaller. We measure the perceived similarities among a group of 30 female speakers in a listening test and then apply multiple linear regression to automatically predict these listener judgements of speaker similarity and thus to identify similar speakers automatically. We then compare a variety of average voice models trained on either speakers who were perceptually judged to be similar to the target speaker, or speakers selected by the multiple linear regression, or a large global set of unselected speakers. We find that the average voice model trained on perceptually similar speakers provides better performance than the global model, even though the latter is trained on more data, confirming our main hypothesis. However, the average voice model using speakers selected automatically by the multiple linear regression does not reach the same level of performance.

Jaime Lorenzo-Trueba, Roberto Barra-Chicote, Tuomo Raitio, Nicolas Obin, Paavo Alku, Junichi Yamagishi, and Juan M Montero. Towards glottal source controllability in expressive speech synthesis. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib ]

In order to obtain more human-like sounding human-machine interfaces we must first be able to give them expressive capabilities in the way of emotional and stylistic features so as to closely adapt them to the intended task. If we want to replicate those features it is not enough to merely replicate the prosodic information of fundamental frequency and speaking rhythm. The proposed additional layer is the modification of the glottal model, for which we make use of the GlottHMM parameters. This paper analyzes the viability of such an approach by verifying that the expressive nuances are captured by the aforementioned features, obtaining 95% recognition rates on styled speaking and 82% on emotional speech. Then we evaluate the effect of speaker bias and recording environment on the source modeling in order to quantify possible problems when analyzing multi-speaker databases. Finally we propose a speaking styles separation for Spanish based on prosodic features and check its perceptual significance.

Peter Bell, Myroslava Dzikovska, and Amy Isard. Designing a spoken language interface for a tutorial dialogue system. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]

We describe our work in building a spoken language interface for a tutorial dialogue system. Our goal is to allow natural, unrestricted student interaction with the computer tutor, which has been shown to improve the student's learning gain, but presents challenges for speech recognition and spoken language understanding. We discuss the choice of system components and present the results of development experiments in both acoustic and language modelling for speech recognition in this domain.

C. Valentini-Botinhao, J. Yamagishi, and S. King. Evaluating speech intelligibility enhancement for HMM-based synthetic speech in noise. In Proc. Sapa Workshop, Portland, USA, September 2012. [ bib | .pdf ]

It is possible to increase the intelligibility of speech in noise by enhancing the clean speech signal. In this paper we demonstrate the effects of modifying the spectral envelope of synthetic speech according to the environmental noise. To achieve this, we modify Mel cepstral coefficients according to an intelligibility measure that accounts for glimpses of speech in noise: the Glimpse Proportion measure. We evaluate this method against a baseline synthetic voice trained only with normal speech and a topline voice trained with Lombard speech, as well as natural speech. The intelligibility of these voices was measured when mixed with speech-shaped noise and with a competing speaker at three different levels. The Lombard voices, both natural and synthetic, were more intelligible than the normal voices in all conditions. For speech-shaped noise, the proposed modified voice was as intelligible as the Lombard synthetic voice without requiring any recordings of Lombard speech, which are hard to obtain. However, in the case of competing talker noise, the Lombard synthetic voice was more intelligible than the proposed modified voice.

C. Valentini-Botinhao, S. Degenkolb-Weyers, A. Maier, E. Noeth, U. Eysholdt, and T. Bocklet. Automatic detection of sigmatism in children. In Proc. WOCCI, Portland, USA, September 2012. [ bib | .pdf ]

We propose in this paper an automatic system to detect sigmatism from the speech signal. Sigmatism occurs when the tongue is positioned incorrectly during articulation of sibilant phones like /s/ and /z/. For our task we extracted various sets of features from speech: Mel frequency cepstral coefficients, energies in specific bandwidths of the spectral envelope, and the so-called supervectors, which are the parameters of an adapted speaker model. We then trained several classifiers on a speech database of German adults simulating three different types of sigmatism. Recognition results were calculated at a phone, word and speaker level for both the simulated database and for a database of pathological speakers. For the simulated database, we achieved recognition rates of up to 86%, 87% and 94% at a phone, word and speaker level. The best classifier was then integrated as part of a Java applet that allows patients to record their own speech, either by pronouncing isolated phones, a specific word or a list of words, and provides them with feedback on whether the sibilant phones are being correctly pronounced.

Ruben San-Segundo, Juan M. Montero, Veronica Lopez-Ludena, and Simon King. Detecting acronyms from capital letter sequences in Spanish. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]

This paper presents an automatic strategy to decide how to pronounce a Capital Letter Sequence (CLS) in a Text to Speech system (TTS). If the CLS is well known to the TTS, it can be expanded into several words. But when the CLS is unknown, the system has two alternatives: spelling it (abbreviation) or pronouncing it as a new word (acronym). In Spanish, there is a close relationship between letters and phonemes. Because of this, when a CLS is similar to other words in Spanish, there is a strong tendency to pronounce it as a standard word. This paper proposes an automatic method for detecting acronyms. Additionally, this paper analyses the discrimination capability of some features, and several strategies for combining them in order to obtain the best classifier. For the best classifier, the classification error is 8.45%. In the feature analysis, the best features were the Letter Sequence Perplexity and the Average N-gram order.

C. Valentini-Botinhao, J. Yamagishi, and S. King. Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. In Proc. Interspeech, Portland, USA, September 2012. [ bib ]

We propose a method that modifies the Mel cepstral coefficients of HMM-generated synthetic speech in order to increase the intelligibility of the generated speech when heard by a listener in the presence of a known noise. This method is based on an approximation we previously proposed for the Glimpse Proportion measure. Here we show how to update the Mel cepstral coefficients using this measure as an optimization criterion and how to control the amount of distortion by limiting the frequency resolution of the modifications. To evaluate the method we built eight different voices from normal read-text speech data from a male speaker. Some voices were also built from Lombard speech data produced by the same speaker. Listening experiments with speech-shaped noise and with a single competing talker indicate that our method significantly improves intelligibility when compared to unmodified synthetic speech. The voices built from Lombard speech outperformed the proposed method particularly for the competing talker case. However, compared to a voice using only the spectral parameters from Lombard speech, the proposed method obtains similar or higher performance.

Benigno Uria, Iain Murray, Steve Renals, and Korin Richmond. Deep architectures for articulatory inversion. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]

We implement two deep architectures for the acoustic-articulatory inversion mapping problem: a deep neural network and a deep trajectory mixture density network. We find that in both cases, deep architectures produce more accurate predictions than shallow architectures and that this is due to the higher expressive capability of a deep model and not a consequence of adding more adjustable parameters. We also find that a deep trajectory mixture density network is able to obtain better inversion accuracies than smoothing the results of a deep neural network. Our best model obtained an average root mean square error of 0.885 mm on the MNGU0 test dataset.
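For reference, the root mean square error reported in the abstract above can be computed for a single articulator trajectory as in the sketch below. The function name and the sample values are hypothetical; how the paper averages the error over articulator channels is not stated in the abstract and is not modelled here.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between two equal-length sequences.

    With positions in millimetres, the result is also in millimetres.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical trajectory samples (mm) for one articulator coordinate.
error_mm = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```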

Keywords: Articulatory inversion, deep neural network, deep belief network, deep regression network, pretraining

Zhenhua Ling, Korin Richmond, and Junichi Yamagishi. Vowel creation by articulatory control in HMM-based parametric speech synthesis. In Proc. The Listening Talker Workshop, page 72, Edinburgh, UK, May 2012. [ bib | .pdf ]

C. Valentini-Botinhao, J. Yamagishi, and S. King. Using an intelligibility measure to create noise robust cepstral coefficients for HMM-based speech synthesis. In Proc. LISTA Workshop, Edinburgh, UK, May 2012. [ bib | .pdf ]

Myroslava O. Dzikovska, Peter Bell, Amy Isard, and Johanna D. Moore. Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 471-481, Avignon, France, April 2012. Association for Computational Linguistics. [ bib | http ]

C. Valentini-Botinhao, R. Maia, J. Yamagishi, S. King, and H. Zen. Cepstral analysis based on the Glimpse proportion measure for improving the intelligibility of HMM-based synthetic speech in noise. In Proc. ICASSP, pages 3997-4000, Kyoto, Japan, March 2012. [ bib | DOI | .pdf ]

In this paper we introduce a new cepstral coefficient extraction method based on an intelligibility measure for speech in noise, the Glimpse Proportion measure. This new method aims to increase the intelligibility of speech in noise by modifying the clean speech, and has applications in scenarios such as public announcement and car navigation systems. We first explain how the Glimpse Proportion measure operates and further show how we approximated it to integrate it into an existing spectral envelope parameter extraction method commonly used in the HMM-based speech synthesis framework. We then demonstrate how this new method changes the modelled spectrum according to the characteristics of the noise and show results for a listening test with vocoded and HMM-based synthetic speech. The test indicates that the proposed method can significantly improve intelligibility of synthetic speech in speech-shaped noise.

L. Saheer, J. Yamagishi, P.N. Garner, and J. Dines. Combining vocal tract length normalization with hierarchial linear transformations. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4493-4496, March 2012. [ bib | DOI ]

Recent research has demonstrated the effectiveness of vocal tract length normalization (VTLN) as a rapid adaptation technique for statistical parametric speech synthesis. VTLN produces speech with naturalness preferable to that of MLLR-based adaptation techniques, being much closer in quality to that generated by the original average voice model. However, with only a single parameter, VTLN captures very few speaker-specific characteristics when compared to linear transform based adaptation techniques. This paper proposes that the merits of VTLN can be combined with those of linear transform based adaptation in a hierarchical Bayesian framework, where VTLN is used as the prior information. A novel technique for propagating the gender information from the VTLN prior through constrained structural maximum a posteriori linear regression (CSMAPLR) adaptation is presented. Experiments show that the resulting transformation has improved speech quality with better naturalness, intelligibility and improved speaker similarity.

Keywords: CSMAPLR adaptation, MLLR based adaptation technique, constrained structural maximum a posteriori linear regression, hierarchical Bayesian framework, hierarchical linear transformation, intelligibility, rapid adaptation technique, speaker similarity, statistical parametric speech synthesis, vocal tract length normalization, Bayes methods, speech intelligibility

Chen-Yu Yang, G. Brown, Liang Lu, J. Yamagishi, and S. King. Noise-robust whispered speech recognition using a non-audible-murmur microphone with VTS compensation. In Chinese Spoken Language Processing (ISCSLP), 2012 8th International Symposium on, pages 220-223, 2012. [ bib | DOI ]

In this paper, we introduce a newly-created corpus of whispered speech simultaneously recorded via a close-talking microphone and a non-audible murmur (NAM) microphone in both clean and noisy conditions. To benchmark the corpus, which has been freely released recently, experiments on automatic recognition of continuous whispered speech were conducted. When training and test conditions are matched, the NAM microphone is found to be more robust against background noise than the close-talking microphone. In mismatched conditions (noisy data, models trained on clean speech), we found that Vector Taylor Series (VTS) compensation is particularly effective for the NAM signal.

Jaime Lorenzo-Trueba, Oliver Watts, Roberto Barra-Chicote, Junichi Yamagishi, Simon King, and Juan M Montero. Simple4all proposals for the Albayzin evaluations in speech synthesis. In Proc. Iberspeech 2012, 2012. [ bib | .pdf ]

Simple4All is a European-funded project that aims to streamline the production of multilanguage expressive synthetic voices by means of unsupervised data extraction techniques, allowing the automatic processing of freely available data into flexible task-specific voices. In this paper we describe three different approaches for this task, the first two covering enhancements in expressivity and flexibility with the final one focusing on the development of unsupervised voices. The first technique introduces the principle of speaker adaptation from average models consisting of multiple voices, with the second being an extension of this adaptation concept into allowing the control of the expressive strength of the synthetic voice. Finally, an unsupervised approach to synthesis capable of learning from unlabelled text data is introduced in detail.

Eva Hasler, Peter Bell, Arnab Ghoshal, Barry Haddow, Philipp Koehn, Fergus McInnes, Steve Renals, and Pawel Swietojanski. The UEDIN system for the IWSLT 2012 evaluation. In Proc. International Workshop on Spoken Language Translation, 2012. [ bib | .pdf ]

This paper describes the University of Edinburgh (UEDIN) systems for the IWSLT 2012 Evaluation. We participated in the ASR (English), MT (English-French, German-English) and SLT (English-French) tracks.

Ravichander Vipperla, Maria Wolters, and Steve Renals. Spoken dialogue interfaces for older people. In Kenneth J. Turner, editor, Advances in Home Care Technologies. IOS Press, 2012. [ bib | .pdf ]

Although speech is a highly natural mode of communication, building robust and usable speech-based interfaces is still a challenge, even if the target user group is restricted to younger users. When designing for older users, there are added complications due to cognitive, physiological, and anatomical ageing. Users may also find it difficult to adapt to the interaction style required by the speech interface. In this chapter, we summarise the work on spoken dialogue interfaces that was carried out during the MATCH project. After a brief overview of relevant aspects of ageing and previous work on spoken dialogue interfaces for older people, we summarise our work on managing spoken interactions (dialogue management), understanding older people's speech (speech recognition), and generating spoken messages that older people can understand (speech synthesis). We conclude with suggestions for design guidelines that have emerged from our work and suggest directions for future research.

E. Zwyssig, S. Renals, and M. Lincoln. On the effect of SNR and superdirective beamforming in speaker diarisation in meetings. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4177-4180, 2012. [ bib | DOI | .pdf ]

This paper examines the effect of sensor performance on speaker diarisation in meetings and investigates the use of more advanced beamforming techniques, beyond the typically employed delay-sum beamformer, for mitigating the effects of poorer sensor performance. We present super-directive beamforming and investigate how different time difference of arrival (TDOA) smoothing and beamforming techniques influence the performance of state-of-the-art diarisation systems. We produced and transcribed a new corpus of meetings recorded in the instrumented meeting room using a high SNR analogue and a newly developed low SNR digital MEMS microphone array (DMMA.2). This research demonstrates that TDOA smoothing has a significant effect on the diarisation error rate and that simple noise reduction and beamforming schemes suffice to overcome audio signal degradation due to the lower SNR of modern MEMS microphones.

E. Zwyssig, S. Renals, and M. Lincoln. Determining the number of speakers in a meeting using microphone array features. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4765-4768, 2012. [ bib | DOI | .pdf ]

The accuracy of speaker diarisation in meetings relies heavily on determining the correct number of speakers. In this paper we present a novel algorithm based on time difference of arrival (TDOA) features that aims to find the correct number of active speakers in a meeting and thus aid the speaker segmentation and clustering process. With our proposed method the microphone array TDOA values and known geometry of the array are used to calculate a speaker matrix from which we determine the correct number of active speakers with the aid of the Bayesian information criterion (BIC). In addition, we analyse several well-known voice activity detection (VAD) algorithms and verify their fitness for meeting recordings. Experiments were performed using the NIST RT06, RT07 and RT09 data sets, and resulted in reduced error rates compared with BIC-based approaches.

Sebastian Andersson, Junichi Yamagishi, and Robert A.J. Clark. Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis. Speech Communication, 54(2):175-188, 2012. [ bib | DOI | http ]

Spontaneous conversational speech has many characteristics that are currently not modelled well by HMM-based speech synthesis and in order to build synthetic voices that can give an impression of someone partaking in a conversation, we need to utilise data that exhibits more of the speech phenomena associated with conversations than the more generally used carefully read aloud sentences. In this paper we show that synthetic voices built with HMM-based speech synthesis techniques from conversational speech data preserved segmental and prosodic characteristics of frequent conversational speech phenomena. An analysis of an evaluation investigating the perception of quality and speaking style of HMM-based voices confirms that speech with conversational characteristics is instrumental for listeners to perceive successful integration of conversational speech phenomena in synthetic speech. The achieved synthetic speech quality provides an encouraging start for the continued use of conversational speech in HMM-based speech synthesis.

Keywords: Speech synthesis, HMM, Conversation, Spontaneous speech, Filled pauses, Discourse marker

Ingmar Steiner, Korin Richmond, Ian Marshall, and Calum D. Gray. The magnetic resonance imaging subset of the mngu0 articulatory corpus. The Journal of the Acoustical Society of America, 131(2):EL106-EL111, January 2012. [ bib | DOI | .pdf ]

This paper announces the availability of the magnetic resonance imaging (MRI) subset of the mngu0 corpus, a collection of articulatory speech data from one speaker containing different modalities. This subset comprises volumetric MRI scans of the speaker's vocal tract during sustained production of vowels and consonants, as well as dynamic mid-sagittal scans of repetitive consonant-vowel (CV) syllable production. For reference, high-quality acoustic recordings of the speech material are also available. The raw data are made freely available for research purposes.

Keywords: audio recording; magnetic resonance imaging; speech processing

Christopher Burton, Brian McKinstry, Aurora Szentagotai Tatar, Antoni Serrano-Blanco, Claudia Pagliari, and Maria Wolters. Activity monitoring in patients with depression: A systematic review. Journal of Affective Disorders, 145(1):21-28, 2012. [ bib | DOI | http ]

Background: Altered physical activity is an important feature of depression. It is manifested in psychomotor retardation, agitation and withdrawal from engagement in normal activities. Modern devices for activity monitoring (actigraphs) make it possible to monitor physical activity unobtrusively but the validity of actigraphy as an indicator of mood state is uncertain. We carried out a systematic review of digital actigraphy in patients with depression to investigate the associations between measured physical activity and depression. Methods: Systematic review and meta-analysis. Studies were identified from Medline, EMBASE and Psycinfo databases and included if they were either case control or longitudinal studies of actigraphy in adults aged between 18 and 65 diagnosed with a depressive disorder. Outcomes were daytime and night-time activity and actigraphic measures of sleep. Results: We identified 19 eligible papers from 16 studies (412 patients). Case control studies showed less daytime activity in patients with depression (standardised mean difference −0.76, 95% confidence intervals −1.05 to −0.47). Longitudinal studies showed moderate increase in daytime activity (0.53, 0.20 to 0.87) and a reduction in night-time activity (−0.36, −0.65 to −0.06) over the course of treatment. Limitations: All study participants were unblinded. Only seven papers included patients treated in the community. Conclusions: Actigraphy is a potentially valuable source of additional information about patients with depression. However, there are no clear guidelines for use of actigraphy in studies of patients with depression. Further studies should investigate patients treated in the community. Additional work to develop algorithms for differentiating behaviour patterns is also needed.

Dong Wang, Javier Tejedor, Simon King, and Joe Frankel. Term-dependent confidence normalization for out-of-vocabulary spoken term detection. Journal of Computer Science and Technology, 27(2), 2012. [ bib | DOI ]

Spoken Term Detection (STD) is a fundamental component of spoken information retrieval systems. A key task of an STD system is to determine reliable detections and reject false alarms based on certain confidence measures. The detection posterior probability, which is often computed from lattices, is a widely used confidence measure. However, a potential problem of this confidence measure is that the confidence scores of detections of all search terms are treated uniformly, regardless of how much they may differ in terms of phonetic or linguistic properties. This problem is particularly evident for out-of-vocabulary (OOV) terms which tend to exhibit high intra-term diversity. To address the discrepancy on confidence levels that the same confidence score may convey for different terms, a term-dependent decision strategy is desirable - for example, the term-specific threshold (TST) approach. In this work, we propose a term-dependent normalisation technique which compensates for term diversity on confidence estimation. Particularly, we propose a linear bias compensation and a discriminative compensation to deal with the bias problem that is inherent in lattice-based confidence measuring from which the TST approach suffers. We tested the proposed technique on speech data from the multi-party meeting domain with two state-of-the-art STD systems based on phonemes and words respectively. The experimental results demonstrate that the confidence normalisation approach leads to a significant performance improvement in STD, particularly for OOV terms with phoneme-based systems.

Maria Wolters, Karl Isaac, and Jason Doherty. Hold that thought: are spearcons less disruptive than spoken reminders? In CHI '12 Extended Abstracts on Human Factors in Computing Systems, CHI EA '12, pages 1745-1750, New York, NY, USA, 2012. ACM. [ bib | DOI | http ]

Keywords: irrelevant speech effect, reminders, spearcon, speech, working memory

Maria Wolters and Colin Matheson. Designing Help4Mood: Trade-offs and choices. In Juan Miguel Garcia-Gomez and Patricia Paniagua-Paniagua, editors, Information and Communication Technologies applied to Mental Health. Editorial Universitat Politecnica de Valencia, 2012. [ bib ]

Oliver Watts. Unsupervised Learning for Text-to-Speech Synthesis. PhD thesis, University of Edinburgh, 2012. [ bib | .pdf ]

This thesis introduces a general method for incorporating the distributional analysis of textual and linguistic objects into text-to-speech (TTS) conversion systems. Conventional TTS conversion uses intermediate layers of representation to bridge the gap between text and speech. Collecting the annotated data needed to produce these intermediate layers is a far from trivial task, possibly prohibitively so for languages in which no such resources are in existence. Distributional analysis, in contrast, proceeds in an unsupervised manner, and so enables the creation of systems using textual data that are not annotated. The method therefore aids the building of systems for languages in which conventional linguistic resources are scarce, but is not restricted to these languages. The distributional analysis proposed here places the textual objects analysed in a continuous-valued space, rather than specifying a hard categorisation of those objects. This space is then partitioned during the training of acoustic models for synthesis, so that the models generalise over objects' surface forms in a way that is acoustically relevant. The method is applied to three levels of textual analysis: to the characterisation of sub-syllabic units, word units and utterances. Entire systems for three languages (English, Finnish and Romanian) are built with no reliance on manually labelled data or language-specific expertise. Results of a subjective evaluation are presented.

Maria Wolters, Lucy McCloughan, Martin Gibson, Chris Weatherall, Colin Matheson, Tim Maloney, Juan Carlos Castro-Robles, and Soraya Estevez. Monitoring people with depression in the community - regulatory aspects. In Workshop on People, Computers and Psychiatry at the British Computer Society's Conference on Human Computer Interaction, pages 1745-1750, 2012. [ bib ]

C. Mayo, V. Aubanel, and M. Cooke. Effect of prosodic changes on speech intelligibility. In Proc. Interspeech, Portland, OR, USA, 2012. [ bib ]

Claudia Pagliari, Maria Wolters, Chris Burton, Brian McKinstry, Aurora Szentagotai, Antoni Serrano-Blanco, Daniel David, Luis Ferrini, Susanna Albertini, Joan Carlos Castro, and Soraya Estévez. Psychosocial implications of avatar use in supporting therapy of depression. In CYBER17-17th Annual CyberPsychology & CyberTherapy Conference, 2012. [ bib ]

Mirjam Wester. Talker discrimination across languages. Speech Communication, 54:781-790, 2012. [ bib | DOI | .pdf ]

This study investigated the extent to which listeners are able to discriminate between bilingual talkers in three language pairs – English–German, English–Finnish and English–Mandarin. Native English listeners were presented with two sentences spoken by bilingual talkers and were asked to judge whether they thought the sentences were spoken by the same person. Equal numbers of cross-language and matched-language trials were presented. The results show that native English listeners are able to carry out this task well, achieving percent-correct levels well above chance for all three language pairs. Previous research has shown this for English–German; this research shows that listeners' ability also extends to Finnish and Mandarin, languages that are quite distinct from English from a genetic and phonetic similarity perspective. However, listeners are significantly less accurate on cross-language talker trials (English–foreign) than on matched-language trials (English–English and foreign–foreign). Understanding listeners’ behaviour in cross-language talker discrimination using natural speech is the first step in developing principled evaluation techniques for synthesis systems in which the goal is for the synthesised voice to sound like the original speaker, for instance, in speech-to-speech translation systems, voice conversion and reconstruction.

L. Lu, A. Ghoshal, and S. Renals. Maximum a posteriori adaptation of subspace Gaussian mixture models for cross-lingual speech recognition. In Proc. ICASSP, pages 4877-4880, 2012. [ bib | DOI | .pdf ]

This paper concerns cross-lingual acoustic modeling in the case when there are limited target language resources. We build on an approach in which a subspace Gaussian mixture model (SGMM) is adapted to the target language by reusing the globally shared parameters estimated from out-of-language training data. In current cross-lingual systems, these parameters are fixed when training the target system, which can give rise to a mismatch between the source and target systems. We investigate a maximum a posteriori (MAP) adaptation approach to alleviate the potential mismatch. In particular, we focus on the adaptation of phonetic subspace parameters using a matrix variate Gaussian prior distribution. Experiments on the GlobalPhone corpus using the MAP adaptation approach result in word error rate reductions, compared with the cross-lingual baseline systems and systems updated using maximum likelihood, for training conditions with 1 hour and 5 hours of target language data.

Keywords: Subspace Gaussian Mixture Model, Maximum a Posteriori Adaptation, Cross-lingual Speech Recognition

Martin Cooke, Maria Luisa García Lecumberri, Yan Tang, and Mirjam Wester. Do non-native listeners benefit from speech modifications designed to promote intelligibility for native listeners? In Proceedings of The Listening Talker Workshop, page 59, 2012. http://listening-talker.org/workshop/programme.html. [ bib ]

Keiichiro Oura, Junichi Yamagishi, Mirjam Wester, Simon King, and Keiichi Tokuda. Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping. Speech Communication, 54(6):703-714, 2012. [ bib | DOI | http ]

In the EMIME project, we developed a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrated two techniques into a single architecture: unsupervised adaptation for HMM-based TTS using word-based large-vocabulary continuous speech recognition, and cross-lingual speaker adaptation (CLSA) for HMM-based TTS. The CLSA is based on a state-level transform mapping learned using minimum Kullback-Leibler divergence between pairs of HMM states in the input and output languages. Thus, an unsupervised cross-lingual speaker adaptation system was developed. End-to-end speech-to-speech translation systems for four languages (English, Finnish, Mandarin, and Japanese) were constructed within this framework. In this paper, the English-to-Japanese adaptation is evaluated. Listening tests demonstrate that adapted voices sound more similar to a target speaker than average voices and that differences between supervised and unsupervised cross-lingual speaker adaptation are small. Calculating the KLD state-mapping on only the first 10 mel-cepstral coefficients leads to huge savings in computational costs, without any detrimental effect on the quality of the synthetic speech.

Keywords: HMM-based speech synthesis, Unsupervised speaker adaptation, Cross-lingual speaker adaptation, Speech-to-speech translation

Leonardo Badino, Robert A.J. Clark, and Mirjam Wester. Towards hierarchical prosodic prominence generation in TTS synthesis. In Proc. Interspeech, Portland, USA, 2012. [ bib | .pdf ]

Kei Hashimoto, Junichi Yamagishi, William Byrne, Simon King, and Keiichi Tokuda. Impacts of machine translation and speech synthesis on speech-to-speech translation. Speech Communication, 54(7):857-866, 2012. [ bib | DOI | http ]

This paper analyzes the impacts of machine translation and speech synthesis on speech-to-speech translation systems. A typical speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques have been proposed for integration of speech recognition and machine translation. However, corresponding techniques have not yet been considered for speech synthesis. The focus of the current work is machine translation and speech synthesis, and we present a subjective evaluation designed to analyze their impact on speech-to-speech translation. The results of these analyses show that the naturalness and intelligibility of the synthesized speech are strongly affected by the fluency of the translated sentences. In addition, several features were found to correlate well with the average fluency of the translated sentences and the average naturalness of the synthesized speech.

Keywords: Speech-to-speech translation, Machine translation, Speech synthesis, Subjective evaluation

Maria Wolters, Louis Ferrini, Juan Martinez-Miranda, Helen Hastie, and Chris Burton. Help4Mood - a flexible solution for supporting people with depression in the community across europe. In Proceedings of The International eHealth, Telemedicine and Health ICT Forum For Education, Networking and Business (MedeTel, 2012). International Society for Telemedicine & eHealth (ISfTeH), 2012. [ bib ]

Anna C. Janska, Erich Schröger, Thomas Jacobsen, and Robert A. J. Clark. Asymmetries in the perception of synthesized speech. In Proc. Interspeech, Portland, USA, 2012. [ bib | .pdf ]

M. Koutsogiannaki, M. Pettinato, C. Mayo, V. Kandia, and Y. Stylianou. Can modified casual speech reach the intelligibility of clear speech? In Proc. Interspeech, Portland, OR, USA, 2012. [ bib ]

Managing data in Help4Mood. ICST Transactions in Ambient Systems, (Special Issue on Technology in Mental Health):-, 2012. [ bib ]

L. Lu, A. Ghoshal, and S. Renals. Joint uncertainty decoding with unscented transform for noise robust subspace Gaussian mixture model. In Proc. Sapa-Scale workshop, 2012. [ bib | .pdf ]

Common noise compensation techniques use vector Taylor series (VTS) to approximate the mismatch function. Recent work shows that the approximation accuracy may be improved by sampling. One such sampling technique is the unscented transform (UT), which draws samples deterministically from the clean speech and noise models to derive the noise-corrupted speech parameters. This paper applies UT to noise compensation of the subspace Gaussian mixture model (SGMM). Since UT requires a relatively small number of samples for accurate estimation, it has significantly lower computational cost than random sampling techniques. However, the number of surface Gaussians in an SGMM is typically very large, making the direct application of UT, for compensating individual Gaussian components, computationally impractical. In this paper, we avoid the computational burden by employing UT in the framework of joint uncertainty decoding (JUD), which groups all the Gaussian components into a small number of classes, sharing the compensation parameters by class. We evaluate the JUD-UT technique for an SGMM system using the Aurora 4 corpus. Experimental results indicate that UT can lead to increased accuracy compared to VTS approximation if the JUD phase factor is untuned, and to similar accuracy if the phase factor is tuned empirically.

Keywords: noise compensation, SGMM, JUD, UT

V. Aubanel, M. Cooke, E. Foster, M. L. Garcia-Lecumberri, and C. Mayo. Effects of the availability of visual information and presence of competing conversations on speech production. In Proc. Interspeech, Portland, OR, USA, 2012. [ bib ]

Soraya Estevez, Juan Carlos Castro-Robles, and Maria Wolters. Help4Mood: First release of a computational distributed system to support the treatment of patients with major depression. In Proceedings of The International eHealth, Telemedicine and Health ICT Forum For Education, Networking and Business (MedeTel, 2012), pages 1745-1750. International Society for Telemedicine & eHealth (ISfTeH), 2012. [ bib ]

L. Lu, KK Chin, A. Ghoshal, and S. Renals. Noise compensation for subspace Gaussian mixture models. In Proc. Interspeech, 2012. [ bib | .pdf ]

Joint uncertainty decoding (JUD) is an effective model-based noise compensation technique for conventional Gaussian mixture model (GMM) based speech recognition systems. In this paper, we apply JUD to subspace Gaussian mixture model (SGMM) based acoustic models. The total number of Gaussians in the SGMM acoustic model is usually much larger than for conventional GMMs, which limits the application of approaches that explicitly compensate each Gaussian, such as vector Taylor series (VTS). However, by clustering the Gaussian components into a number of regression classes, JUD-based noise compensation can be successfully applied to SGMM systems. We evaluate the JUD/SGMM technique using the Aurora 4 corpus, and the experimental results indicate that it is more accurate than conventional GMM-based systems using either VTS or JUD noise compensation.

Keywords: acoustic modelling, noise compensation, SGMM, JUD

Maria Wolters, Juan Martínez-Miranda, Helen Hastie, and Colin Matheson. Managing data in Help4Mood. In The 2nd International Workshop on Computing Paradigms for Mental Health - MindCare 2012, 2012. [ bib ]

Junichi Yamagishi, Christophe Veaux, Simon King, and Steve Renals. Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction. Acoustical Science and Technology, 33(1):1-5, 2012. [ bib | DOI | http | .pdf ]

In this invited paper, we give an overview of the clinical applications of speech synthesis technologies and describe a few selected research studies. We also introduce the University of Edinburgh’s new project “Voice Banking and reconstruction” for patients with degenerative diseases, such as motor neurone disease and Parkinson's disease, and show how speech synthesis technologies can improve the quality of life for the patients.

Sarah Creer, Stuart Cunningham, Phil Green, and Junichi Yamagishi. Building personalised synthetic voices for individuals with severe speech impairment. Computer Speech and Language, 27(6):1178-1193, 2012. [ bib | DOI | http ]

For individuals with severe speech impairment accurate spoken communication can be difficult and require considerable effort. Some may choose to use a voice output communication aid (or VOCA) to support their spoken communication needs. A VOCA typically takes input from the user through a keyboard or switch-based interface and produces spoken output using either synthesised or recorded speech. The type and number of synthetic voices that can be accessed with a VOCA is often limited and this has been implicated as a factor for rejection of the devices. Therefore, there is a need to be able to provide voices that are more appropriate and acceptable for users. This paper reports on a study that utilises recent advances in speech synthesis to produce personalised synthetic voices for 3 speakers with mild to severe dysarthria, one of the most common speech disorders. Using a statistical parametric approach to synthesis, an average voice trained on data from several unimpaired speakers was adapted using recordings of the impaired speech of 3 dysarthric speakers. By careful selection of the speech data and the model parameters, several exemplar voices were produced for each speaker. A qualitative evaluation was conducted with the speakers and listeners who were familiar with the speaker. The evaluation showed that for one of the 3 speakers a voice could be created which conveyed many of his personal characteristics, such as regional identity, sex and age.

Keywords: Speech synthesis, Augmentative and alternative communication, Disordered speech, Voice output communication aid

Thomas Hueber, Atef Ben Youssef, Gérard Bailly, Pierre Badin, and Frédéric Elisei. Cross-speaker acoustic-to-articulatory inversion using phone-based trajectory HMM for pronunciation training. In Proc. Interspeech, Portland, Oregon, USA, 2012. [ bib | .pdf ]

The article presents a statistical mapping approach for cross-speaker acoustic-to-articulatory inversion. The goal is to estimate the most likely articulatory trajectories for a reference speaker from the speech audio signal of another speaker. This approach is developed in the framework of our system of visual articulatory feedback developed for computer-assisted pronunciation training applications (CAPT). The proposed technique is based on the joint modeling of articulatory and acoustic features, for each phonetic class, using full-covariance trajectory HMM. The acoustic-to-articulatory inversion is achieved in 2 steps: 1) finding the most likely HMM state sequence from the acoustic observations; 2) inferring the articulatory trajectories from both the decoded state sequence and the acoustic observations. The problem of speaker adaptation is addressed using a voice conversion approach, based on trajectory GMM.

Gérard Bailly, Pierre Badin, Lionel Revéret, and Atef Ben Youssef. Sensorimotor characteristics of speech production. Cambridge University Press, 2012. [ bib | DOI ]

Ingmar Steiner, Korin Richmond, and Slim Ouni. Using multimodal speech production data to evaluate articulatory animation for audiovisual speech synthesis. In 3rd International Symposium on Facial Analysis and Animation, Vienna, Austria, 2012. [ bib | .pdf ]

Steve Renals, Hervé Bourlard, Jean Carletta, and Andrei Popescu-Belis, editors. Multimodal Signal Processing: Human Interactions in Meetings. Cambridge University Press, 2012. [ bib ]

Aciel Eshky, Ben Allison, and Mark Steedman. Generative goal-driven user simulation for dialog management. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 71-81. Association for Computational Linguistics, 2012. [ bib | .pdf ]

User simulation is frequently used to train statistical dialog managers for task-oriented domains. At present, goal-driven simulators (those that have a persistent notion of what they wish to achieve in the dialog) require some task-specific engineering, making them impossible to evaluate intrinsically. Instead, they have been evaluated extrinsically by means of the dialog managers they are intended to train, leading to circularity of argument. In this paper, we propose the first fully generative goal-driven simulator that is fully induced from data, without hand-crafting or goal annotation. Our goals are latent, and take the form of topics in a topic model, clustering together semantically equivalent and phonetically confusable strings, implicitly modelling synonymy and speech recognition noise. We evaluate on two standard dialog resources, the Communicator and Let’s Go datasets, and demonstrate that our model has substantially better fit to held out data than competing approaches. We also show that features derived from our model allow significantly greater improvement over a baseline at distinguishing real from randomly permuted dialogs.