The Centre for Speech Technology Research, The University of Edinburgh

Publications by Simon King

[1] Srikanth Ronanki, Siva Reddy, Bajibabu Bollepalli, and Simon King. DNN-based Speech Synthesis for Indian Languages from ASCII text. In Proc. 9th ISCA Speech Synthesis Workshop (SSW9), Sunnyvale, CA, USA, September 2016. [ bib | .pdf ]
Text-to-Speech synthesis in Indian languages has seen a lot of progress over the past decade, partly due to the annual Blizzard Challenges. These systems assume the text to be written in Devanagari or Dravidian scripts, which are nearly phonemic orthographies. However, the most common form of computer interaction among Indians is ASCII-written transliterated text. Such text is generally noisy, with many variations in spelling for the same word. In this paper we evaluate three approaches to synthesize speech from such noisy ASCII text: a naive Uni-Grapheme approach, a Multi-Grapheme approach, and a supervised Grapheme-to-Phoneme (G2P) approach. These methods first convert the ASCII text to a phonetic script, and then train a Deep Neural Network to synthesize speech from it. We train and test our models on Blizzard Challenge datasets that were transliterated to ASCII using crowdsourcing. Our experiments on Hindi, Tamil and Telugu demonstrate that our models generate speech of competitive quality from ASCII text compared to the speech synthesized from the native scripts. All the accompanying transliterated datasets are released for public access.
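The naive Uni-Grapheme approach mentioned in the abstract can be illustrated with a toy sketch: every ASCII character of the transliterated word is treated as its own synthesis unit. The unit inventory below (bare lower-case letters) is purely illustrative, not the paper's actual symbol set or DNN front end.

```python
# Toy sketch of a naive uni-grapheme front end for ASCII transliterated text:
# each character becomes one unit, so spelling variants of the same word
# yield different unit sequences -- exactly the noise the Multi-Grapheme and
# G2P approaches are designed to absorb.
def uni_grapheme_units(word):
    """Lower-case the word and emit one unit per alphabetic character."""
    return [ch for ch in word.lower() if ch.isalpha()]

# Two common romanisations of the same word produce different unit sequences.
print(uni_grapheme_units("namaste"))   # ['n', 'a', 'm', 'a', 's', 't', 'e']
print(uni_grapheme_units("namastey"))  # one extra unit for the variant spelling
```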

[2] Srikanth Ronanki, Gustav Eje Henter, Zhizheng Wu, and Simon King. A template-based approach for speech synthesis intonation generation using LSTMs. In Proc. Interspeech, San Francisco, USA, September 2016. [ bib | .pdf ]
The absence of convincing intonation makes current parametric speech synthesis systems sound dull and lifeless, even when trained on expressive speech data. Typically, these systems use regression techniques to predict the fundamental frequency (F0) frame-by-frame. This approach leads to overly-smooth pitch contours and fails to construct an appropriate prosodic structure across the full utterance. In order to capture and reproduce larger-scale pitch patterns, this paper proposes a template-based approach for automatic F0 generation, where per-syllable pitch-contour templates (from a small, automatically learned set) are predicted by a recurrent neural network (RNN). The use of syllable templates mitigates the over-smoothing problem and is able to reproduce pitch patterns observed in the data. The use of an RNN, paired with connectionist temporal classification (CTC), enables the prediction of structure in the pitch contour spanning the entire utterance. This novel F0 prediction system is used alongside separate LSTMs for predicting phone durations and the other acoustic features, to construct a complete text-to-speech system. We report the results of objective and subjective tests on an expressive speech corpus of children's audiobooks, and include comparisons to a conventional baseline that predicts F0 directly at the frame level.
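As a rough illustration of the per-syllable template idea described in the abstract above, the sketch below resamples a syllable F0 contour to a fixed length and picks the closest shape from a tiny hand-written codebook. The templates and contour values are invented; the paper learns its templates automatically from data and predicts them with an RNN trained using CTC, not by nearest-neighbour search.

```python
# Hypothetical per-syllable pitch templates (normalised F0 shapes).
TEMPLATES = {
    "rise": (0.0, 0.25, 0.5, 0.75, 1.0),
    "fall": (1.0, 0.75, 0.5, 0.25, 0.0),
    "flat": (0.5, 0.5, 0.5, 0.5, 0.5),
}

def resample(contour, n=5):
    """Pick n roughly evenly spaced points from a variable-length contour."""
    step = (len(contour) - 1) / (n - 1)
    return [contour[round(i * step)] for i in range(n)]

def nearest_template(contour, templates=TEMPLATES):
    """Return the name of the template closest (squared error) to the contour."""
    fixed = resample(contour)
    dist = lambda name: sum((a - b) ** 2 for a, b in zip(fixed, templates[name]))
    return min(templates, key=dist)

print(nearest_template([0.1, 0.2, 0.4, 0.55, 0.7, 0.85, 0.95]))  # rise
```

Because each syllable is represented by one whole-contour template rather than frame-by-frame regression outputs, the generated pitch keeps its local shape instead of collapsing toward the mean.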

[3] Korin Richmond and Simon King. Smooth talking: Articulatory join costs for unit selection. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5150-5154, March 2016. [ bib | .pdf ]
Join cost calculation has so far dealt exclusively with acoustic speech parameters, and a large number of distance metrics have previously been tested in conjunction with a wide variety of acoustic parameterisations. In contrast, we propose here to calculate distance in articulatory space. The motivation for this is simple: physical constraints mean a human talker's mouth cannot “jump” from one configuration to a different one, so smooth evolution of articulator positions would also seem desirable for a good candidate unit sequence. To test this, we built Festival Multisyn voices using a large articulatory-acoustic dataset. We first synthesised 460 TIMIT sentences and confirmed our articulatory join cost gives appreciably different unit sequences compared to the standard Multisyn acoustic join cost. A listening test (3 sets of 25 sentence pairs, 30 listeners) then showed our articulatory cost is preferred at a rate of 58% compared to the standard Multisyn acoustic join cost.

Keywords: speech synthesis, unit selection, electromagnetic articulography, join cost
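A minimal sketch of the articulatory join cost idea from entry [3]: instead of comparing acoustic parameters at a candidate join, compare articulator positions (e.g. EMA coil coordinates for tongue, lips, jaw) of the last frame of the left unit and the first frame of the right unit. The frame format and unweighted Euclidean distance are assumptions for illustration; Multisyn's actual cost combines further terms.

```python
import math

def articulatory_join_cost(left_unit_frames, right_unit_frames):
    """Euclidean distance between articulator positions across the join.

    Each frame is a tuple of articulator coordinates; a smooth join means
    the articulators barely move across the concatenation point.
    """
    a = left_unit_frames[-1]   # last frame of the preceding candidate unit
    b = right_unit_frames[0]   # first frame of the following candidate unit
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Identical configurations join for free; a physically implausible "jump"
# in articulator space is penalised.
print(articulatory_join_cost([(0.0, 1.0)], [(0.0, 1.0)]))  # 0.0
print(articulatory_join_cost([(0.0, 1.0)], [(3.0, 5.0)]))  # 5.0
```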
[4] Gustav Eje Henter, Srikanth Ronanki, Oliver Watts, Mirjam Wester, Zhizheng Wu, and Simon King. Robust TTS duration modelling using DNNs. In Proc. ICASSP, volume 41, pages 5130-5134, Shanghai, China, March 2016. [ bib | http | .pdf ]
Accurate modelling and prediction of speech-sound durations is an important component in generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful modelling paradigm, and large, found corpora of natural and expressive speech are easy to acquire for training them. Unfortunately, found datasets are seldom subject to the quality-control that traditional synthesis methods expect. Common issues likely to affect duration modelling include transcription errors, reductions, filled pauses, and forced-alignment inaccuracies. To combat this, we propose to improve modelling and prediction of speech durations using methods from robust statistics, which are able to disregard ill-fitting points in the training material. We describe a robust fitting criterion based on the density power divergence (the beta-divergence) and a robust generation heuristic using mixture density networks (MDNs). Perceptual tests indicate that subjects prefer synthetic speech generated using robust models of duration over the baselines.

Keywords: Speech synthesis, duration modelling, robust statistics
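The robust fitting criterion in entry [4] can be sketched for the simplest possible case: a one-dimensional Gaussian duration model with fixed variance, fitted by minimising the empirical density power divergence (beta-divergence) over a grid. The paper applies the criterion to DNN/MDN training rather than to a toy grid search, and the duration values below are invented; the point is only that for beta > 0 the objective down-weights points the model assigns low density, so a gross alignment error barely moves the estimate.

```python
import math

def dpd_objective(mu, data, sigma=1.0, beta=0.5):
    """Empirical density power divergence criterion for a Gaussian N(mu, sigma^2)."""
    norm = lambda x: math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    # The integral of f^(1+beta) has a closed form for a Gaussian:
    integral = (2 * math.pi * sigma ** 2) ** (-beta / 2) / math.sqrt(1 + beta)
    return integral - (1 + 1 / beta) * sum(norm(x) ** beta for x in data) / len(data)

# Phone durations (arbitrary units) with one forced-alignment blunder at 9.0.
durations = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 9.0]
grid = [i / 100 for i in range(0, 300)]

mle = sum(durations) / len(durations)  # maximum-likelihood mean, dragged by the outlier
robust = min(grid, key=lambda mu: dpd_objective(mu, durations))
print(round(mle, 2), round(robust, 2))  # MLE pulled to ~2.14; robust estimate stays near 1.0
```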
[5] Oliver Watts, Gustav Eje Henter, Thomas Merritt, Zhizheng Wu, and Simon King. From HMMs to DNNs: where do the improvements come from? In Proc. ICASSP, volume 41, pages 5505-5509, Shanghai, China, March 2016. [ bib | http | .pdf ]
Deep neural networks (DNNs) have recently been the focus of much text-to-speech research as a replacement for decision trees and hidden Markov models (HMMs) in statistical parametric synthesis systems. Performance improvements have been reported; however, the configuration of systems evaluated makes it impossible to judge how much of the improvement is due to the new machine learning methods, and how much is due to other novel aspects of the systems. Specifically, whereas the decision trees in HMM-based systems typically operate at the state-level, and separate trees are used to handle separate acoustic streams, most DNN-based systems are trained to make predictions simultaneously for all streams at the level of the acoustic frame. This paper isolates the influence of three factors (machine learning method; state vs. frame predictions; separate vs. combined stream predictions) by building a continuum of systems along which only a single factor is varied at a time. We find that replacing decision trees with DNNs and moving from state-level to frame-level predictions both significantly improve listeners' naturalness ratings of synthetic speech produced by the systems. No improvement is found to result from switching from separate-stream to combined-stream predictions.

Keywords: speech synthesis, hidden Markov model, decision tree, deep neural network
[6] Adriana Stan, Yoshitaka Mamiya, Junichi Yamagishi, Peter Bell, Oliver Watts, Rob Clark, and Simon King. ALISA: An automatic lightly supervised speech segmentation and alignment tool. Computer Speech and Language, 35:116-133, 2016. [ bib | DOI | http | .pdf ]
This paper describes the ALISA tool, which implements a lightly supervised method for sentence-level alignment of speech with imperfect transcripts. Its intended use is to enable the creation of new speech corpora from a multitude of resources in a language-independent fashion, thus avoiding the need to record or transcribe speech data. The method is designed so that it requires minimum user intervention and expert knowledge, and it is able to align data in languages which employ alphabetic scripts. It comprises a GMM-based voice activity detector and a highly constrained grapheme-based speech aligner. The method is evaluated objectively against a gold standard segmentation and transcription, as well as subjectively through building and testing speech synthesis systems from the retrieved data. Results show that on average, 70% of the original data is correctly aligned, with a word error rate of less than 0.5%. In one case, subjective listening tests show a statistically significant preference for voices built on the gold transcript, but the effect is small; in the other tests, no statistically significant differences are found between systems built from the fully supervised training data and those built using the proposed method.

[7] Thomas Merritt, Robert A J Clark, Zhizheng Wu, Junichi Yamagishi, and Simon King. Deep neural network-guided unit selection synthesis. In Proc. ICASSP, 2016. [ bib | .pdf ]
Vocoding of speech is a standard part of statistical parametric speech synthesis systems. It imposes an upper bound on the naturalness that can possibly be achieved. Hybrid systems using parametric models to guide the selection of natural speech units can combine the benefits of robust statistical models with the high level of naturalness of waveform concatenation. Existing hybrid systems use Hidden Markov Models (HMMs) as the statistical model. This paper demonstrates that the superiority of Deep Neural Network (DNN) acoustic models over HMMs in conventional statistical parametric speech synthesis also carries over to hybrid synthesis. We compare various DNN and HMM hybrid configurations, guiding the selection of waveform units in either the vocoder parameter domain, or in the domain of embeddings (bottleneck features).

[8] C. Valentini-Botinhao, Z. Wu, and S. King. Towards minimum perceptual error training for DNN-based speech synthesis. In Proc. Interspeech, Dresden, Germany, September 2015. [ bib | .pdf ]
We propose to use a perceptually-oriented domain to improve the quality of text-to-speech generated by deep neural networks (DNNs). We train a DNN that predicts the parameters required for speech reconstruction but whose cost function is calculated in another domain. In this paper, to represent this perceptual domain we extract an approximated version of the Spectro-Temporal Excitation Pattern that was originally proposed as part of a model of hearing speech in noise. We train DNNs that predict band aperiodicity, fundamental frequency and Mel cepstral coefficients and compare generated speech when the spectral cost function is defined in the Mel cepstral, warped log spectrum or perceptual domains. Objective results indicate that the perceptual domain system achieves the highest quality.

[9] Thomas Merritt, Junichi Yamagishi, Zhizheng Wu, Oliver Watts, and Simon King. Deep neural network context embeddings for model selection in rich-context HMM synthesis. In Proc. Interspeech, Dresden, September 2015. [ bib | .pdf ]
This paper introduces a novel form of parametric synthesis that uses context embeddings produced by the bottleneck layer of a deep neural network to guide the selection of models in a rich-context HMM-based synthesiser. Rich-context synthesis – in which Gaussian distributions estimated from single linguistic contexts seen in the training data are used for synthesis, rather than more conventional decision tree-tied models – was originally proposed to address over-smoothing due to averaging across contexts. Our previous investigations have confirmed experimentally that averaging across different contexts is indeed one of the largest factors contributing to the limited quality of statistical parametric speech synthesis. However, a possible weakness of the rich context approach as previously formulated is that a conventional tied model is still used to guide selection of Gaussians at synthesis time. Our proposed approach replaces this with context embeddings derived from a neural network.

[10] Marcus Tomalin, Mirjam Wester, Rasmus Dall, Bill Byrne, and Simon King. A lattice-based approach to automatic filled pause insertion. In Proc. DiSS 2015, Edinburgh, August 2015. [ bib | .pdf ]
This paper describes a novel method for automatically inserting filled pauses (e.g., UM) into fluent texts. Although filled pauses are known to serve a wide range of psychological and structural functions in conversational speech, they have not traditionally been modelled overtly by state-of-the-art speech synthesis systems. However, several recent systems have started to model disfluencies specifically, and so there is an increasing need to create disfluent speech synthesis input by automatically inserting filled pauses into otherwise fluent text. The approach presented here interpolates Ngrams and Full-Output Recurrent Neural Network Language Models (f-RNNLMs) in a lattice-rescoring framework. It is shown that the interpolated system outperforms separate Ngram and f-RNNLM systems, where performance is analysed using the Precision, Recall, and F-score metrics.

[11] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King. Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. In Proc. ICASSP, pages 4460-4464, Brisbane, Australia, April 2015. [ bib | .pdf ]
Deep neural networks (DNNs) use a cascade of hidden representations to enable the learning of complex mappings from input to output features. They are able to learn the complex mapping from text-based linguistic features to speech acoustic features, and so perform text-to-speech synthesis. Recent results suggest that DNNs can produce more natural synthetic speech than conventional HMM-based statistical parametric systems. In this paper, we show that the hidden representation used within a DNN can be improved through the use of Multi-Task Learning, and that stacking multiple frames of hidden layer activations (stacked bottleneck features) also leads to improvements. Experimental results confirmed the effectiveness of the proposed methods, and in listening tests we find that stacked bottleneck features in particular offer a significant improvement over both a baseline DNN and a benchmark HMM system.

[12] Thomas Merritt, Javier Latorre, and Simon King. Attributing modelling errors in HMM synthesis by stepping gradually from natural to modelled speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 4220-4224, Brisbane, April 2015. [ bib | .pdf ]
Even the best statistical parametric speech synthesis systems do not achieve the naturalness of good unit selection. We investigated possible causes of this. By constructing speech signals that lie in between natural speech and the output from a complete HMM synthesis system, we investigated various effects of modelling. We manipulated the temporal smoothness and the variance of the spectral parameters to create stimuli, then presented these to listeners alongside natural and vocoded speech, as well as output from a full HMM-based text-to-speech system and from an idealised `pseudo-HMM'. All speech signals, except the natural waveform, were created using vocoders employing one of two popular spectral parameterisations: Mel-Cepstra or Mel-Line Spectral Pairs. Listeners made `same or different' pairwise judgements, from which we generated a perceptual map using Multidimensional Scaling. We draw conclusions about which aspects of HMM synthesis are limiting the naturalness of the synthetic speech.

[13] Zhizheng Wu and Simon King. Minimum trajectory error training for deep neural networks, combined with stacked bottleneck features. In Interspeech, 2015. [ bib | .pdf ]
[14] Zhizheng Wu, Pawel Swietojanski, Christophe Veaux, Steve Renals, and Simon King. A study of speaker adaptation for DNN-based speech synthesis. In Interspeech, 2015. [ bib | .pdf ]
[16] Zhizheng Wu, Ali Khodabakhsh, Cenk Demiroglu, Junichi Yamagishi, Daisuke Saito, Tomoki Toda, and Simon King. SAS: A speaker verification spoofing database containing diverse attacks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015. [ bib | .pdf ]
[17] Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King. Intelligibility enhancement of speech in noise. In Proceedings of the Institute of Acoustics, volume 36, pages 96-103, Birmingham, UK, October 2014. [ bib | .pdf ]
To maintain communication success, humans change the way they speak and hear according to many factors, such as the age, gender, native language and social relationship between talker and listener. Other factors are dictated by how communication takes place, such as environmental factors like an active competing speaker, or limitations of the communication channel. As in natural interaction, we expect to communicate with synthetic voices that can also adapt to different listening scenarios and keep intelligibility high. Research in speech technology needs to account for this, changing the way we transmit, store and artificially generate speech accordingly.

[18] Thomas Merritt, Tuomo Raitio, and Simon King. Investigating source and filter contributions, and their interaction, to statistical parametric speech synthesis. In Proc. Interspeech, pages 1509-1513, Singapore, September 2014. [ bib | .pdf ]
This paper presents an investigation of the separate perceptual degradations introduced by the modelling of source and filter features in statistical parametric speech synthesis. This is achieved using stimuli in which various permutations of natural, vocoded and modelled source and filter are combined, optionally with the addition of filter modifications (e.g. global variance or modulation spectrum scaling). We also examine the assumption of independence between source and filter parameters. Two complementary perceptual testing paradigms are adopted. In the first, we ask listeners to perform “same or different quality” judgements between pairs of stimuli from different configurations. In the second, we ask listeners to give an opinion score for individual stimuli. Combining the findings from these tests, we draw some conclusions regarding the relative contributions of source and filter to the currently rather limited naturalness of statistical parametric synthetic speech, and test whether current independence assumptions are justified.

[19] Gustav Eje Henter, Thomas Merritt, Matt Shannon, Catherine Mayo, and Simon King. Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech. In Proc. Interspeech, volume 15, pages 1504-1508, September 2014. [ bib | .pdf ]
Acoustic models used for statistical parametric speech synthesis typically incorporate many modelling assumptions. It is an open question to what extent these assumptions limit the naturalness of synthesised speech. To investigate this question, we recorded a speech corpus where each prompt was read aloud multiple times. By combining speech parameter trajectories extracted from different repetitions, we were able to quantify the perceptual effects of certain commonly used modelling assumptions. Subjective listening tests show that taking the source and filter parameters to be conditionally independent, or using diagonal covariance matrices, significantly limits the naturalness that can be achieved. Our experimental results also demonstrate the shortcomings of mean-based parameter generation.

Keywords: speech synthesis, acoustic modelling, stream independence, diagonal covariance matrices, repeated speech
[20] Oliver Watts, Siva Gangireddy, Junichi Yamagishi, Simon King, Steve Renals, Adriana Stan, and Mircea Giurgiu. Neural net word representations for phrase-break prediction without a part of speech tagger. In Proc. ICASSP, pages 2618-2622, Florence, Italy, May 2014. [ bib | .pdf ]
The use of shared projection neural nets of the sort used in language modelling is proposed as a way of sharing parameters between multiple text-to-speech system components. We experiment with pretraining the weights of such a shared projection on an auxiliary language modelling task and then apply the resulting word representations to the task of phrase-break prediction. Doing so allows us to build phrase-break predictors that rival conventional systems without any reliance on conventional knowledge-based resources such as part of speech taggers.

[21] Rasmus Dall, Junichi Yamagishi, and Simon King. Rating naturalness in speech synthesis: The effect of style and expectation. In Proc. Speech Prosody, May 2014. [ bib | .pdf ]
In this paper we present evidence that speech produced spontaneously in a conversation is considered more natural than read prompts. We also explore the relationship between participants' expectations of the speech style under evaluation and their actual ratings. In successive listening tests subjects rated the naturalness of either spontaneously produced, read aloud or written sentences, with instructions toward either conversational, reading or general naturalness. It was found that, when presented with spontaneous or read aloud speech, participants consistently rated spontaneous speech more natural, even when asked to rate naturalness in the reading case. Presented with only text, participants generally preferred transcriptions of spontaneous utterances, except when asked to evaluate naturalness in terms of reading aloud. This has implications for the application of MOS-scale naturalness ratings in Speech Synthesis, and potentially for the type of data suitable for use in general TTS, dialogue systems, and specifically in Conversational TTS, in which the goal is to reproduce speech as it is produced in a spontaneous conversational setting.

[22] C. Valentini-Botinhao, J. Yamagishi, S. King, and R. Maia. Intelligibility enhancement of HMM-generated speech in additive noise by modifying mel cepstral coefficients to increase the glimpse proportion. Computer Speech and Language, 28(2):665-686, 2014. [ bib | DOI | .pdf ]
This paper describes speech intelligibility enhancement for hidden Markov model (HMM) generated synthetic speech in noise. We present a method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech. We update these coefficients such that the Glimpse Proportion – an objective measure of the intelligibility of speech in noise – increases, while keeping the speech energy fixed. An acoustic analysis reveals that the modified speech is boosted in the region 1-4kHz, particularly for vowels, nasals and approximants. Results from listening tests employing speech-shaped noise show that the modified speech is as intelligible as a synthetic voice trained on plain speech whose duration, Mel cepstral coefficients and excitation signal parameters have been adapted to Lombard speech from the same speaker. Our proposed method does not require these additional recordings of Lombard speech. In the presence of a competing talker, both modification and adaptation of spectral coefficients give more modest gains.

[23] Moses Ekpenyong, Eno-Abasi Urua, Oliver Watts, Simon King, and Junichi Yamagishi. Statistical parametric speech synthesis for Ibibio. Speech Communication, 56:243-251, January 2014. [ bib | DOI | http | .pdf ]
Ibibio is a Nigerian tone language, spoken in the south-east coastal region of Nigeria. Like most African languages, it is resource-limited. This presents a major challenge to conventional approaches to speech synthesis, which typically require the training of numerous predictive models of linguistic features such as the phoneme sequence (i.e., a pronunciation dictionary plus a letter-to-sound model) and prosodic structure (e.g., a phrase break predictor). This training is invariably supervised, requiring a corpus of training data labelled with the linguistic feature to be predicted. In this paper, we investigate what can be achieved in the absence of many of these expensive resources, and also with a limited amount of speech recordings. We employ a statistical parametric method, because this has been found to offer good performance even on small corpora, and because it is able to directly learn the relationship between acoustics and whatever linguistic features are available, potentially mitigating the absence of explicit representations of intermediate linguistic layers such as prosody. We present an evaluation that compares systems that have access to varying degrees of linguistic structure. The simplest system only uses phonetic context (quinphones), and this is compared to systems with access to a richer set of context features, with or without tone marking. It is found that the use of tone marking contributes significantly to the quality of synthetic speech. Future work should therefore address the problem of tone assignment using a dictionary and the building of a prediction module for out-of-vocabulary words.

[24] P. Lanchantin, M. J. F. Gales, S. King, and J. Yamagishi. Multiple-average-voice-based speech synthesis. In Proc. ICASSP, 2014. [ bib ]
This paper describes a novel approach for the speaker adaptation of statistical parametric speech synthesis systems based on the interpolation of a set of average voice models (AVMs). Recent results have shown that the quality/naturalness of adapted voices directly depends on the distance from the average voice model that the speaker adaptation starts from. This suggests using several AVMs trained on carefully chosen speaker clusters, from which a more suitable AVM can be selected or interpolated during adaptation. In the proposed approach, a Multiple-AVM is trained on clusters of speakers which are iteratively re-assigned during the estimation process, initialised according to metadata. In contrast with the cluster adaptive training (CAT) framework, the training stage is computationally less expensive as the amount of training data and the number of clusters grow. Additionally, during adaptation, each AVM constituting the Multiple-AVM is first adapted towards the target speaker, which suggests a better tuning of the interpolation space to the individual speaker. Experiments run on a corpus of British speakers with various regional accents show that the quality/naturalness of synthetic speech from adapted voices is significantly higher than when using a single factor-independent AVM selected according to the target speaker characteristics.

[25] Rasmus Dall, Marcus Tomalin, Mirjam Wester, William Byrne, and Simon King. Investigating automatic & human filled pause insertion for speech synthesis. In Proc. Interspeech, 2014. [ bib | .pdf ]
Filled pauses are pervasive in conversational speech and have been shown to serve several psychological and structural purposes. Despite this, they are seldom modelled overtly by state-of-the-art speech synthesis systems. This paper seeks to motivate the incorporation of filled pauses into speech synthesis systems by exploring their use in conversational speech, and by comparing the performance of several automatic systems inserting filled pauses into fluent text. Two initial experiments are described which seek to determine whether people's predicted insertion points are consistent with actual practice and/or with each other. The experiments also investigate whether there are `right' and `wrong' places to insert filled pauses. The results show good consistency between people's predictions of usage and their actual practice, as well as a perceptual preference for the `right' placement. The third experiment contrasts the performance of several automatic systems that insert filled pauses into fluent sentences. The best performance (determined by F-score) was achieved through the by-word interpolation of probabilities predicted by Recurrent Neural Network and 4-gram Language Models. The results offer insights into the use and perception of filled pauses by humans, and how automatic systems can be used to predict their locations.
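The by-word interpolation scheme described in this abstract can be sketched in a few lines: at each inter-word slot, two language models score the filled-pause token, their probabilities are linearly interpolated, and a filled pause is inserted wherever the mixture exceeds a threshold. The per-slot scores below are invented numbers standing in for real RNNLM/4-gram model outputs, and the threshold is an assumption.

```python
# Toy by-word interpolation for filled-pause (FP) insertion.
def insert_filled_pauses(words, p_rnnlm, p_ngram, lam=0.5, threshold=0.3):
    """Insert "UM" after word i when the interpolated FP probability exceeds threshold.

    p_rnnlm[i], p_ngram[i]: each model's probability of a FP after word i.
    lam: interpolation weight between the two models.
    """
    out = []
    for i, w in enumerate(words):
        out.append(w)
        p = lam * p_rnnlm[i] + (1 - lam) * p_ngram[i]  # interpolated FP probability
        if p > threshold:
            out.append("UM")
    return out

words = ["so", "I", "was", "thinking"]
print(insert_filled_pauses(words, [0.6, 0.1, 0.05, 0.4], [0.5, 0.2, 0.1, 0.3]))
# ['so', 'UM', 'I', 'was', 'thinking', 'UM']
```

The paper performs this interpolation inside a lattice-rescoring framework rather than with a fixed threshold, but the combination rule is the same linear mixture.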

[26] Herman Kamper, Aren Jansen, Simon King, and S. J. Goldwater. Unsupervised lexical clustering of speech segments using fixed-dimensional acoustic embeddings. In Proc. SLT, 2014. [ bib | .pdf ]
Unsupervised speech processing methods are essential for applications ranging from zero-resource speech technology to modelling child language acquisition. One challenging problem is discovering the word inventory of the language: the lexicon. Lexical clustering is the task of grouping unlabelled acoustic word tokens according to type. We propose a novel lexical clustering model: variable-length word segments are embedded in a fixed-dimensional acoustic space in which clustering is then performed. We evaluate several clustering algorithms and find that the best methods produce clusters with wide variation in sizes, as observed in natural language. The best probabilistic approach is an infinite Gaussian mixture model (IGMM), which automatically chooses the number of clusters. Performance is comparable to that of non-probabilistic Chinese Whispers and average-linkage hierarchical clustering. We conclude that IGMM clustering of fixed-dimensional embeddings holds promise as the lexical clustering component in unsupervised speech processing systems.
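The core trick in this abstract, mapping variable-length word segments to fixed-dimensional vectors so that off-the-shelf clustering applies, can be illustrated with the simplest common baseline: uniform downsampling. The paper's embeddings (and its IGMM clustering) are more sophisticated; this sketch only shows why a fixed dimensionality matters.

```python
# Embed a variable-length segment (a list of per-frame feature vectors) by
# sampling k frames at uniform intervals and concatenating them. Any segment
# length maps to a vector of length k * frame_dim, so standard clustering
# algorithms can compare word tokens of different durations directly.
def embed(frames, k=3):
    step = (len(frames) - 1) / (k - 1) if len(frames) > 1 else 0
    picked = [frames[round(i * step)] for i in range(k)]
    return [value for frame in picked for value in frame]

short = [[0.1, 0.2], [0.3, 0.4]]
long = [[0.1, 0.2], [0.2, 0.3], [0.3, 0.4], [0.4, 0.5], [0.5, 0.6]]
print(len(embed(short)), len(embed(long)))  # 6 6  -- same dimension either way
```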

[27] C. Valentini-Botinhao, J. Yamagishi, S. King, and Y. Stylianou. Combining perceptually-motivated spectral shaping with loudness and duration modification for intelligibility enhancement of HMM-based synthetic speech in noise. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]
[28] Cassia Valentini-Botinhao, Mirjam Wester, Junichi Yamagishi, and Simon King. Using neighbourhood density and selective SNR boosting to increase the intelligibility of synthetic speech in noise. In 8th ISCA Workshop on Speech Synthesis, pages 133-138, Barcelona, Spain, August 2013. [ bib | .pdf ]
Motivated by the fact that words are not equally confusable, we explore the idea of using word-level intelligibility predictions to selectively boost the harder-to-understand words in a sentence, aiming to improve overall intelligibility in the presence of noise. First, the intelligibility of a set of words from dense and sparse phonetic neighbourhoods was evaluated in isolation. The resulting intelligibility scores were used to inform two sentence-level experiments. In the first experiment the signal-to-noise ratio of one word was boosted to the detriment of another word. Sentence intelligibility did not generally improve. The intelligibility of words in isolation and in a sentence were found to be significantly different, both in clean and in noisy conditions. For the second experiment, one word was selectively boosted while slightly attenuating all other words in the sentence. This strategy was successful for words that were poorly recognised in that particular context. However, a reliable predictor of word-in-context intelligibility remains elusive, since this involves – as our results indicate – semantic, syntactic and acoustic information about the word and the sentence.
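The second experiment's strategy, boosting one word while attenuating the rest so the sentence energy is unchanged, amounts to a simple energy-budget calculation. The sketch below is an illustration under assumed inputs (per-word sample lists and a 2x amplitude gain); it also assumes the boosted word's energy stays below the sentence total so the remaining budget is positive.

```python
import math

def boost_word(words, target, gain=2.0):
    """Boost one word's samples by `gain`, rescale the others so that the
    total sentence energy (sum of squared samples) is preserved.

    words: dict mapping word -> list of waveform samples.
    """
    energy = lambda samples: sum(x * x for x in samples)
    total = sum(energy(s) for s in words.values())
    boosted = [gain * x for x in words[target]]
    rest_budget = total - energy(boosted)          # energy left for the other words
    rest_energy = sum(energy(s) for w, s in words.items() if w != target)
    scale = math.sqrt(rest_budget / rest_energy)   # uniform attenuation factor
    return {w: (boosted if w == target else [scale * x for x in s])
            for w, s in words.items()}

sentence = {"the": [0.5] * 4, "cat": [0.2] * 4, "sat": [0.5] * 4}
out = boost_word(sentence, "cat")
print(out["cat"])  # the hard word, amplified; the rest are slightly attenuated
```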

[29] Thomas Merritt and Simon King. Investigating the shortcomings of HMM synthesis. In 8th ISCA Workshop on Speech Synthesis, pages 185-190, Barcelona, Spain, August 2013. [ bib | .pdf ]
This paper presents the beginnings of a framework for formally testing the causes of the current limited quality of HMM (Hidden Markov Model) speech synthesis. The framework separates the individual effects of modelling in order to observe their independent impact on vocoded speech parameters, and thereby to address the issues that are restricting progress towards highly intelligible and natural-sounding speech synthesis. Simulated HMM synthesis conditions are applied to spectral speech parameters and tested via a pairwise listening test, in which listeners make a “same or different” quality judgement between conditions. These responses are then processed using multidimensional scaling to identify the qualities of modelled speech that listeners attend to, and thus why it is distinguishable from natural speech. Finally, we discuss future improvements to the framework, including its extension to more of the parameters modelled during speech synthesis.

[30] Maria Astrinaki, Alexis Moinet, Junichi Yamagishi, Korin Richmond, Zhen-Hua Ling, Simon King, and Thierry Dutoit. Mage - reactive articulatory feature control of HMM-based parametric speech synthesis. In 8th ISCA Workshop on Speech Synthesis, pages 227-231, Barcelona, Spain, August 2013. [ bib | .pdf ]
[31] Adriana Stan, Peter Bell, Junichi Yamagishi, and Simon King. Lightly supervised discriminative training of grapheme models for improved sentence-level alignment of speech and text data. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]
This paper introduces a method for lightly supervised discriminative training using MMI to improve the alignment of speech and text data for use in training HMM-based TTS systems for low-resource languages. In TTS applications, due to the use of long-span contexts, it is important to select training utterances which have wholly correct transcriptions. In a low-resource setting, when using poorly trained grapheme models, we show that the use of MMI discriminative training at the grapheme-level enables us to increase the amount of correctly aligned data by 40%, while maintaining a 7% sentence error rate and 0.8% word error rate. We present the procedure for lightly supervised discriminative training with regard to the objective of minimising sentence error rate.

[32] H. Christensen, M. Aniol, P. Bell, P. Green, T. Hain, S. King, and P. Swietojanski. Combining in-domain and out-of-domain speech data for automatic recognition of disordered speech. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]
Recently there has been increasing interest in ways of using out-of-domain (OOD) data to improve automatic speech recognition performance in domains where only limited data is available. This paper focuses on one such domain, namely that of disordered speech for which only very small databases exist, but where normal speech can be considered OOD. Standard approaches for handling small data domains use adaptation from OOD models into the target domain, but here we investigate an alternative approach with its focus on the feature extraction stage: OOD data is used to train feature-generating deep belief neural networks. Using AMI meeting and TED talk datasets, we investigate various tandem-based speaker independent systems as well as maximum a posteriori adapted speaker dependent systems. Results on the UAspeech isolated word task of disordered speech are very promising with our overall best system (using a combination of AMI and TED data) giving a correctness of 62.5%; an increase of 15% on previously best published results based on conventional model adaptation. We show that the relative benefit of using OOD data varies considerably from speaker to speaker and is only loosely correlated with the severity of a speaker's impairments.

[33] Kayoko Yanagisawa, Javier Latorre, Vincent Wan, Mark J. F. Gales, and Simon King. Noise robustness in HMM-TTS speaker adaptation. In 8th ISCA Workshop on Speech Synthesis, pages 139-144, Barcelona, Spain, August 2013. [ bib | .pdf ]
Speaker adaptation for TTS applications has been receiving more attention in recent years for applications such as voice customisation or voice banking. If these applications are offered as an Internet service, there is no control on the quality of the data that can be collected. It can be noisy with people talking in the background or recorded in a reverberant environment. This makes the adaptation more difficult. This paper explores the effect of different levels of additive and convolutional noise on speaker adaptation techniques based on cluster adaptive training (CAT) and average voice model (AVM). The results indicate that although both techniques suffer degradation to some extent, CAT is in general more robust than AVM.

[34] Rubén San-Segundo, Juan Manuel Montero, Mircea Giurgiu, Ioana Muresan, and Simon King. Multilingual number transcription for text-to-speech conversion. In 8th ISCA Workshop on Speech Synthesis, pages 85-89, Barcelona, Spain, August 2013. [ bib | .pdf ]
This paper describes the text normalization module of a fully-trainable text-to-speech conversion system and its application to number transcription. The main goal is a language-independent text normalization module learned from data rather than from expert rules. The paper proposes a general architecture based on statistical machine translation techniques, composed of three main modules: a tokenizer that splits the input text into a token graph, a phrase-based translation module for token translation, and a post-processing module that removes some tokens. This architecture has been evaluated for number transcription in several languages: English, Spanish and Romanian. Number transcription is an important aspect of the text normalization problem.

[35] Heng Lu, Simon King, and Oliver Watts. Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis. In 8th ISCA Workshop on Speech Synthesis, pages 281-285, Barcelona, Spain, August 2013. [ bib | .pdf ]
Conventional statistical parametric speech synthesis relies on decision trees to cluster together similar contexts, resulting in tied-parameter context-dependent hidden Markov models (HMMs). However, decision tree clustering has a major weakness: it uses hard divisions and subdivides the model space based on one feature at a time, fragmenting the data and failing to exploit interactions between linguistic context features. These linguistic features themselves are also problematic, being noisy and of varied relevance to the acoustics. We propose to combine our previous work on vector-space representations of linguistic context, which have the added advantage of working directly from textual input, with Deep Neural Networks (DNNs), which can directly accept such continuous representations as input. The outputs of the network are probability distributions over speech features. Maximum Likelihood Parameter Generation is then used to create parameter trajectories, which in turn drive a vocoder to generate the waveform. Various configurations of the system are compared, using both conventional and vector space context representations and with the DNN making speech parameter predictions at two different temporal resolutions: frames, or states. Both objective and subjective results are presented.

[36] Yoshitaka Mamiya, Adriana Stan, Junichi Yamagishi, Peter Bell, Oliver Watts, Robert Clark, and Simon King. Using adaptation to improve speech transcription alignment in noisy and reverberant environments. In 8th ISCA Workshop on Speech Synthesis, pages 61-66, Barcelona, Spain, August 2013. [ bib | .pdf ]
When using data retrieved from the internet to create new speech databases, the recording conditions can often be highly variable within and between sessions. This variance influences the overall performance of any automatic speech and text alignment techniques used to process this data. In this paper we discuss the use of speaker adaptation methods to address this issue. Starting from a baseline system for automatic sentence-level segmentation and speech and text alignment based on GMMs and grapheme HMMs, respectively, we employ Maximum A Posteriori (MAP) and Constrained Maximum Likelihood Linear Regression (CMLLR) techniques to model the variation in the data in order to increase the amount of confidently aligned speech. We tested 29 different scenarios, which include reverberation, 8 talker babble noise and white noise, each in various combinations and SNRs. Results show that the MAP-based segmentation's performance is very much influenced by the noise type, as well as the presence or absence of reverberation. On the other hand, the CMLLR adaptation of the acoustic models gives an average 20% increase in the aligned data percentage for the majority of the studied scenarios.

[37] Oliver Watts, Adriana Stan, Rob Clark, Yoshitaka Mamiya, Mircea Giurgiu, Junichi Yamagishi, and Simon King. Unsupervised and lightly-supervised learning for rapid construction of TTS systems in multiple languages from 'found' data: evaluation and analysis. In 8th ISCA Workshop on Speech Synthesis, pages 121-126, Barcelona, Spain, August 2013. [ bib | .pdf ]
This paper presents techniques for building text-to-speech front-ends in a way that avoids the need for language-specific expert knowledge, but instead relies on universal resources (such as the Unicode character database) and unsupervised learning from unannotated data to ease system development. The acquisition of expert language-specific knowledge and expert annotated data is a major bottleneck in the development of corpus-based TTS systems in new languages. The methods presented here side-step the need for such resources as pronunciation lexicons, phonetic feature sets, part of speech tagged data, etc. The paper explains how the techniques introduced are applied to the 14 languages of a corpus of `found' audiobook data. Results of an evaluation of the intelligibility of the systems resulting from applying these novel techniques to this data are presented.

[38] Adriana Stan, Oliver Watts, Yoshitaka Mamiya, Mircea Giurgiu, Rob Clark, Junichi Yamagishi, and Simon King. TUNDRA: A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]
Simple4All Tundra (version 1.0) is the first release of a standardised multilingual corpus designed for text-to-speech research with imperfect or found data. The corpus consists of approximately 60 hours of speech data from audiobooks in 14 languages, as well as utterance-level alignments obtained with a lightly-supervised process. Future versions of the corpus will include finer-grained alignment and prosodic annotation, all of which will be made freely available. This paper gives a general outline of the data collected so far, as well as a detailed description of how this has been done, emphasizing the minimal language-specific knowledge and manual intervention used to compile the corpus. To demonstrate its potential use, text-to-speech systems have been built for all languages using unsupervised or lightly supervised methods, also briefly presented in the paper.

[39] James Scobbie, Alice Turk, Christian Geng, Simon King, Robin Lickley, and Korin Richmond. The Edinburgh speech production facility DoubleTalk corpus. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]
The DoubleTalk articulatory corpus was collected at the Edinburgh Speech Production Facility (ESPF) using two synchronized Carstens AG500 electromagnetic articulometers. The first release of the corpus comprises orthographic transcriptions aligned at phrasal level to EMA and audio data for each of 6 mixed-dialect speaker pairs. It is available from the ESPF online archive. A variety of tasks were used to elicit a wide range of speech styles, including monologue (a modified Comma Gets a Cure and spontaneous story-telling), structured spontaneous dialogue (Map Task and Diapix), a wordlist task, a memory-recall task, and a shadowing task. In this session we will demo the corpus with various examples.

Keywords: discourse, EMA, spontaneous speech
[40] Maria Astrinaki, Alexis Moinet, Junichi Yamagishi, Korin Richmond, Zhen-Hua Ling, Simon King, and Thierry Dutoit. Mage - HMM-based speech synthesis reactively controlled by the articulators. In 8th ISCA Workshop on Speech Synthesis, page 243, Barcelona, Spain, August 2013. [ bib | .pdf ]
In this paper, we present recent progress in the MAGE project. MAGE is a library for realtime and interactive (reactive) parametric speech synthesis using hidden Markov models (HMMs). Here, it is broadened so that it supports not only the standard acoustic features (spectrum and f0) for modelling and synthesizing speech, but also combines acoustic and articulatory features, such as tongue, lips and jaw positions. This integration gives the user a straightforward and meaningful control space to intuitively modify the synthesized phones in real time, simply by configuring the positions of the articulators.

Keywords: speech synthesis, reactive, articulators
[41] Chee-Ming Ting, Simon King, Sh-Hussain Salleh, and A. K. Ariff. Discriminative tandem features for HMM-based EEG classification. In Proc. 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 13), Osaka, Japan, July 2013. [ bib | .pdf ]
We investigate the use of discriminative feature extractors in tandem configuration with a generative EEG classification system. Existing studies on dynamic EEG classification typically use hidden Markov models (HMMs), which lack discriminative capability. In this paper, a linear and a non-linear classifier are discriminatively trained to produce complementary input features for the conventional HMM system. Two sets of tandem features are derived, from the linear discriminant analysis (LDA) projection output and from multilayer perceptron (MLP) class-posterior probabilities, before being appended to the standard autoregressive (AR) features. Evaluation on a two-class motor-imagery classification task shows that both proposed tandem feature sets yield consistent gains over the AR baseline, resulting in significant relative improvements of 6.2% and 11.2% for the LDA and MLP features respectively. We also explore the portability of these features across different subjects.
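The tandem configuration described above, appending discriminatively trained outputs to the generative model's input features, can be sketched as follows. The data, the one-dimensional Fisher LDA, and the random stand-in for a trained MLP are all illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 100 frames of 6-dim autoregressive (AR) features from a
# two-class motor-imagery task (labels are hypothetical).
ar_feats = rng.normal(size=(100, 6))
labels = rng.integers(0, 2, size=100)

def lda_projection(X, y):
    """One-dimensional Fisher LDA projection for a two-class problem."""
    m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
    Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)  # within-class scatter
    w = np.linalg.solve(Sw, m1 - m0)                # discriminant direction
    return X @ w                                    # one scalar per frame

def mlp_posteriors(X):
    """Stand-in for a trained MLP's class-posterior outputs (random here)."""
    logits = rng.normal(size=(len(X), 2))
    e = np.exp(logits - logits.max(1, keepdims=True))
    return e / e.sum(1, keepdims=True)

# Tandem configuration: stack discriminative features onto the AR features.
tandem = np.column_stack([ar_feats,
                          lda_projection(ar_feats, labels),
                          mlp_posteriors(ar_feats)])
print(tandem.shape)  # (100, 9): 6 AR + 1 LDA + 2 posterior dimensions
```

The resulting 9-dimensional frames would then be modelled by the HMM system exactly as the plain AR features were.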

[42] C. Valentini-Botinhao, E. Godoy, Y. Stylianou, B. Sauert, S. King, and J. Yamagishi. Improving intelligibility in noise of HMM-generated speech via noise-dependent and -independent methods. In Proc. ICASSP, Vancouver, Canada, May 2013. [ bib | .pdf ]
[43] H. Lu and S. King. Factorized context modelling for text-to-speech synthesis. In Proc. ICASSP, Vancouver, Canada, May 2013. [ bib | .pdf ]
Because speech units are so context-dependent, HMM-based Text-to-Speech (TTS) synthesis systems generally use a large number of linguistic context features, via context-dependent models. Since it is impossible to train separate models for every context, decision trees are used to discover the most important combinations of features to model. The decision tree's task is very hard: it must generalise from the very small observed part of the context feature space to the rest. It also has a major weakness: it cannot directly take advantage of factorial properties, because it subdivides the model space based on one feature at a time. We propose a Dynamic Bayesian Network (DBN) based Mixed Memory Markov Model (MMMM) to provide factorization of the context space. The results of a listening test are provided as evidence that the model successfully learns the factorial nature of this space.

[44] Mark Sinclair and Simon King. Where are the challenges in speaker diarization? In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, Vancouver, British Columbia, USA, May 2013. [ bib | .pdf ]
We present a study of the contributions made to Diarization Error Rate by the various components of a speaker diarization system. Following on from an earlier study by Huijbregts and Wooters, we extend into more areas and draw somewhat different conclusions. From a series of experiments combining real, oracle and ideal system components, we find that the primary cause of error in diarization is the training of speaker models on impure data, something that is in fact done in every current system. We conclude by suggesting ways to improve future systems, including a focus on training the speaker models from smaller quantities of pure data instead of all of the data, as is currently done.
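For reference, the Diarization Error Rate analysed above is conventionally the sum of three timing errors over the total scored speaker time. A minimal sketch of the NIST-style definition with made-up numbers, not figures from the paper:

```python
# DER = (false alarm + missed speech + speaker confusion) / scored speaker time
def diarization_error_rate(false_alarm, missed, confusion, scored_time):
    """All arguments in seconds; returns the error rate as a fraction."""
    return (false_alarm + missed + confusion) / scored_time

# Illustrative values: 2 s of false-alarm speech, 3.5 s missed, 6 s of
# speaker confusion over 100 s of scored time. Speaker confusion is the
# component most affected by training speaker models on impure data.
der = diarization_error_rate(2.0, 3.5, 6.0, 100.0)
print(round(der, 3))  # 0.115
```

Oracle experiments of the kind described amount to zeroing out one term at a time and observing how much of the total the remaining terms explain.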

[45] John Dines, Hui Liang, Lakshmi Saheer, Matthew Gibson, William Byrne, Keiichiro Oura, Keiichi Tokuda, Junichi Yamagishi, Simon King, Mirjam Wester, Teemu Hirsimäki, Reima Karhila, and Mikko Kurimo. Personalising speech-to-speech translation: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. Computer Speech and Language, 27(2):420-437, February 2013. [ bib | DOI | http ]
In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation in which we employ a HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions including the need for better evaluation metrics.

Keywords: Speech-to-speech translation, Cross-lingual speaker adaptation, HMM-based speech synthesis, Speaker adaptation, Voice conversion
[46] Javier Tejedor, Doroteo T. Toledano, Dong Wang, Simon King, and Jose Colas. Feature analysis for discriminative confidence estimation in spoken term detection. Computer Speech and Language, To appear, 2013. [ bib | .pdf ]
Discriminative confidence based on multi-layer perceptrons (MLPs) and multiple features has shown significant advantages compared to the widely used lattice-based confidence in spoken term detection (STD). Although the MLP-based framework can handle features derived from a multitude of sources, choosing all possible features may lead to overly complex models and hence less generality. In this paper, we design an extensive set of features and analyze their contribution to STD individually and as a group. The main goal is to choose a small set of features that are sufficiently informative while keeping the model simple and generalizable. We employ two established models to conduct the analysis: linear regression, which targets the most relevant features, and logistic linear regression, which targets the most discriminative features. We find that the most informative features are those derived from diverse sources (ASR decoding, duration and lexical properties), and that the two models deliver highly consistent feature rankings. STD experiments on both English and Spanish data demonstrate significant performance gains with the proposed feature sets.

[47] P. Lal and S. King. Cross-lingual automatic speech recognition using tandem features. IEEE Transactions on Audio, Speech, and Language Processing, To appear, 2013. [ bib | DOI | .pdf ]
Automatic speech recognition depends on large amounts of transcribed speech recordings in order to estimate the parameters of the acoustic model. Recording such large speech corpora is time-consuming and expensive; as a result, sufficient quantities of data exist only for a handful of languages — there are many more languages for which little or no data exist. Given that there are acoustic similarities between speech in different languages, it may be fruitful to use data from a well-resourced source language to estimate the acoustic models for a recogniser in a poorly-resourced target language. Previous approaches to this task have often involved making assumptions about shared phonetic inventories between the languages. Unfortunately, pairs of languages do not generally share a common phonetic inventory. We propose an indirect way of transferring information from a source language acoustic model to a target language acoustic model without having to make any assumptions about the phonetic inventory overlap. To do this, we employ tandem features, in which class-posteriors from a separate classifier are decorrelated and appended to conventional acoustic features. Tandem features have the advantage that the language of the speech data used to train the classifier need not be the same as the target language to be recognised. This is because the class-posteriors are not used directly, so do not have to be over any particular set of classes. We demonstrate the use of tandem features in cross-lingual settings, including training on one or several source languages. We also examine factors which may predict a priori how much relative improvement will be brought about by using such tandem features, for a given source and target pair. In addition to conventional phoneme class-posteriors, we also investigate whether articulatory features (AFs) — a multistream, discrete, multi-valued labelling of speech — can be used instead. This is motivated by an assumption that AFs are less language-specific than a phoneme set.

Keywords: Acoustics; Data models; Hidden Markov models; Speech; Speech recognition; Training; Transforms; Automatic speech recognition; Multilayer perceptrons
[48] Yoshitaka Mamiya, Junichi Yamagishi, Oliver Watts, Robert A.J. Clark, Simon King, and Adriana Stan. Lightly supervised GMM VAD to use audiobook for speech synthesiser. In Proc. ICASSP, 2013. [ bib | .pdf ]
Audiobooks have attracted attention as promising data for training Text-to-Speech (TTS) systems. However, they usually do not provide a correspondence between audio and text data, and are usually divided only into chapter units. In practice, we have to establish a correspondence between audio and text before using them to build TTS synthesisers. However, aligning audio and text data is time-consuming, involves manual labour, and requires persons skilled in speech processing. Previously, we proposed using graphemes to automatically align speech and text data. This paper further integrates a lightly supervised voice activity detection (VAD) technique to detect sentence boundaries, as a pre-processing step before the grapheme approach. This lightly supervised technique requires time stamps of speech and silence only for the first fifty sentences. Combining these, we can semi-automatically build TTS systems from audiobooks with minimal manual intervention. From subjective evaluations, we analyse how the grapheme-based aligner and/or the proposed VAD technique impact the quality of HMM-based speech synthesisers trained on audiobooks.

[49] Christian Geng, Alice Turk, James M. Scobbie, Cedric Macmartin, Philip Hoole, Korin Richmond, Alan Wrench, Marianne Pouplier, Ellen Gurman Bard, Ziggy Campbell, Catherine Dickie, Eddie Dubourg, William Hardcastle, Evia Kainada, Simon King, Robin Lickley, Satsuki Nakai, Steve Renals, Kevin White, and Ronny Wiegand. Recording speech articulation in dialogue: Evaluating a synchronized double electromagnetic articulography setup. Journal of Phonetics, 41(6):421 - 431, 2013. [ bib | DOI | http | .pdf ]
We demonstrate the workability of an experimental facility that is geared towards the acquisition of articulatory data from a variety of speech styles common in language use, by means of two synchronized electromagnetic articulography (EMA) devices. This approach synthesizes the advantages of real dialogue settings for speech research with a detailed description of the physiological reality of speech production. We describe the facility's method for acquiring synchronized audio streams of two speakers and the system that enables communication among control room technicians, experimenters and participants. Further, we demonstrate the feasibility of the approach by evaluating problems inherent to this specific setup: the first is the accuracy of temporal synchronization of the two EMA machines, the second is the severity of electromagnetic interference between the two machines. Our results suggest that the synchronization method used yields an accuracy of approximately 1 ms. Electromagnetic interference was derived from the complex-valued signal amplitudes. This dependent variable was analyzed as a function of the recording status - i.e. on/off - of the interfering machine's transmitters. The intermachine distance was varied between 1 m and 8.5 m. Results suggest that a distance of approximately 6.5 m is appropriate to achieve data quality comparable to that of single speaker recordings.

[50] Adriana Stan, Peter Bell, and Simon King. A grapheme-based method for automatic alignment of speech and text data. In Proc. IEEE Workshop on Spoken Language Technology, Miami, Florida, USA, December 2012. [ bib | .pdf ]
This paper introduces a method for automatic alignment of speech data with unsynchronised, imperfect transcripts, for a domain where no initial acoustic models are available. Using grapheme-based acoustic models, word skip networks and orthographic speech transcripts, we are able to harvest 55% of the speech with a 93% utterance-level accuracy and 99% word accuracy for the produced transcriptions. The work is based on the assumption that there is a high degree of correspondence between the speech and text, and that a full transcription of all of the speech is not required. The method is language independent and the only prior knowledge and resources required are the speech and text transcripts, and a few minor user interventions.

[51] Heng Lu and Simon King. Using Bayesian networks to find relevant context features for HMM-based speech synthesis. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]
Speech units are highly context-dependent, so taking contextual features into account is essential for speech modelling. Context is employed in HMM-based Text-to-Speech speech synthesis systems via context-dependent phone models. A very wide context is taken into account, represented by a large set of contextual factors. However, most of these factors probably have no significant influence on the speech, most of the time. To discover which combinations of features should be taken into account, decision tree-based context clustering is used. But the space of context-dependent models is vast, and the number of contexts seen in the training data is only a tiny fraction of this space, so the task of the decision tree is very hard: to generalise from observations of a tiny fraction of the space to the rest of the space, whilst ignoring uninformative or redundant context features. The structure of the context feature space has not been systematically studied for speech synthesis. In this paper we discover a dependency structure by learning a Bayesian Network over the joint distribution of the features and the speech. We demonstrate that it is possible to discard the majority of context features with minimal impact on quality, measured by a perceptual test.

Keywords: HMM-based speech synthesis, Bayesian Networks, context information
[52] Rasmus Dall, Christophe Veaux, Junichi Yamagishi, and Simon King. Analysis of speaker clustering techniques for HMM-based speech synthesis. In Proc. Interspeech, September 2012. [ bib | .pdf ]
This paper describes a method for speaker clustering, with the application of building average voice models for speaker-adaptive HMM-based speech synthesis that are a good basis for adapting to specific target speakers. Our main hypothesis is that using perceptually similar speakers to build the average voice model will be better than using unselected speakers, even if the amount of data available from perceptually similar speakers is smaller. We measure the perceived similarities among a group of 30 female speakers in a listening test and then apply multiple linear regression to automatically predict these listener judgements of speaker similarity, and thus to identify similar speakers automatically. We then compare a variety of average voice models trained on speakers perceptually judged to be similar to the target speaker, on speakers selected by the multiple linear regression, or on a large global set of unselected speakers. We find that the average voice model trained on perceptually similar speakers provides better performance than the global model, even though the latter is trained on more data, confirming our main hypothesis. However, the average voice model using speakers selected automatically by the multiple linear regression does not reach the same level of performance.
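The automatic selection step, fitting a multiple linear regression to listener similarity judgements and ranking candidate speakers by the predicted score, can be sketched as follows; the features and similarity scores here are synthetic stand-ins, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-ins: 5 acoustic summary features for 30 speakers and
# mean listener similarity judgements relative to a target speaker.
features = rng.normal(size=(30, 5))
true_w = np.array([0.8, -0.3, 0.5, 0.0, 0.2])
similarity = features @ true_w + 0.05 * rng.normal(size=30)

# Multiple linear regression: least squares with an intercept column.
X = np.column_stack([np.ones(30), features])
w, *_ = np.linalg.lstsq(X, similarity, rcond=None)
predicted = X @ w

# Select the speakers predicted to be most similar to the target, to be
# pooled for average voice model training.
top5 = np.argsort(predicted)[-5:]
print(len(top5))  # 5
```

In the paper's setting the regression is fitted on the listening-test scores, so it can then rank speakers for whom no perceptual judgements were collected.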

[53] C. Valentini-Botinhao, J. Yamagishi, and S. King. Evaluating speech intelligibility enhancement for HMM-based synthetic speech in noise. In Proc. Sapa Workshop, Portland, USA, September 2012. [ bib | .pdf ]
It is possible to increase the intelligibility of speech in noise by enhancing the clean speech signal. In this paper we demonstrate the effects of modifying the spectral envelope of synthetic speech according to the environmental noise. To achieve this, we modify Mel cepstral coefficients according to an intelligibility measure that accounts for glimpses of speech in noise: the Glimpse Proportion measure. We evaluate this method against a baseline synthetic voice trained only with normal speech and a topline voice trained with Lombard speech, as well as natural speech. The intelligibility of these voices was measured when mixed with speech-shaped noise and with a competing speaker at three different levels. The Lombard voices, both natural and synthetic, were more intelligible than the normal voices in all conditions. For speech-shaped noise, the proposed modified voice was as intelligible as the Lombard synthetic voice without requiring any recordings of Lombard speech, which are hard to obtain. However, in the case of competing talker noise, the Lombard synthetic voice was more intelligible than the proposed modified voice.
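The Glimpse Proportion measure referred to above counts the fraction of spectro-temporal regions in which the speech exceeds the noise by a local criterion (3 dB in Cooke's original formulation). A rough sketch on plain STFT power grids; the actual measure operates on an auditory (gammatone-filterbank) excitation pattern:

```python
import numpy as np

def glimpse_proportion(speech_power, noise_power, local_criterion_db=3.0):
    """Fraction of time-frequency cells whose local SNR exceeds the criterion."""
    local_snr_db = 10.0 * np.log10(speech_power / noise_power)
    return float(np.mean(local_snr_db > local_criterion_db))

# Toy |STFT|^2 grids (64 frequency bins x 100 frames) standing in for real
# speech and noise spectrograms.
rng = np.random.default_rng(1)
speech = rng.exponential(size=(64, 100))
noise = rng.exponential(size=(64, 100))
gp = glimpse_proportion(speech, noise)
print(0.0 <= gp <= 1.0)  # True
```

Modifying the spectral envelope to raise this proportion for a known noise is the core idea behind the cepstral-coefficient modifications evaluated in this line of work.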

[54] Ruben San-Segundo, Juan M. Montero, Veronica Lopez-Luden, and Simon King. Detecting acronyms from capital letter sequences in Spanish. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]
This paper presents an automatic strategy for deciding how to pronounce a Capital Letter Sequence (CLS) in a Text-to-Speech (TTS) system. If the CLS is known to the TTS system, it can be expanded into several words. But when the CLS is unknown, the system has two alternatives: spelling it out (abbreviation) or pronouncing it as a new word (acronym). In Spanish there is a close correspondence between letters and phonemes; because of this, when a CLS resembles other Spanish words, there is a strong tendency to pronounce it as a standard word. This paper proposes an automatic method for detecting acronyms. Additionally, it analyses the discrimination capability of several features, and several strategies for combining them in order to obtain the best classifier. For the best classifier, the classification error is 8.45%. Regarding the feature analysis, the best features were the Letter Sequence Perplexity and the Average N-gram order.
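The Letter Sequence Perplexity feature found most useful above can be illustrated with a character bigram model: a CLS whose letter sequence looks like an ordinary word gets low perplexity (suggesting acronym pronunciation), while an unword-like sequence gets high perplexity (suggesting spelling). A toy sketch with an invented five-word lexicon, not the paper's model:

```python
import math

# Tiny Spanish-like lexicon standing in for the real training data.
lexicon = ["casa", "mesa", "lasa", "masa", "pesa"]
counts = {}
for word in lexicon:
    padded = "^" + word + "$"          # word-boundary markers
    for a, b in zip(padded, padded[1:]):
        counts.setdefault(a, {}).setdefault(b, 0)
        counts[a][b] += 1

def bigram_prob(a, b, alpha=0.1, vocab=30):
    """Add-alpha smoothed character bigram probability."""
    follow = counts.get(a, {})
    return (follow.get(b, 0) + alpha) / (sum(follow.values()) + alpha * vocab)

def letter_sequence_perplexity(cls):
    padded = "^" + cls.lower() + "$"
    logp = sum(math.log(bigram_prob(a, b)) for a, b in zip(padded, padded[1:]))
    return math.exp(-logp / (len(padded) - 1))

# A word-like CLS scores lower perplexity than a consonant cluster,
# so it would be classified as an acronym rather than spelled out.
print(letter_sequence_perplexity("nasa") < letter_sequence_perplexity("bbc"))  # True
```

A classifier would threshold this perplexity, possibly in combination with the other features analysed in the paper.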

[55] C. Valentini-Botinhao, J. Yamagishi, and S. King. Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. In Proc. Interspeech, Portland, USA, September 2012. [ bib ]
We propose a method that modifies the Mel cepstral coefficients of HMM-generated synthetic speech in order to increase the intelligibility of the generated speech when heard by a listener in the presence of a known noise. This method is based on an approximation we previously proposed for the Glimpse Proportion measure. Here we show how to update the Mel cepstral coefficients using this measure as an optimization criterion and how to control the amount of distortion by limiting the frequency resolution of the modifications. To evaluate the method we built eight different voices from normal read-text speech data from a male speaker. Some voices were also built from Lombard speech data produced by the same speaker. Listening experiments with speech-shaped noise and with a single competing talker indicate that our method significantly improves intelligibility when compared to unmodified synthetic speech. The voices built from Lombard speech outperformed the proposed method particularly for the competing talker case. However, compared to a voice using only the spectral parameters from Lombard speech, the proposed method obtains similar or higher performance.

[56] C. Valentini-Botinhao, J. Yamagishi, and S. King. Using an intelligibility measure to create noise robust cepstral coefficients for HMM-based speech synthesis. In Proc. LISTA Workshop, Edinburgh, UK, May 2012. [ bib | .pdf ]
[57] C. Valentini-Botinhao, R. Maia, J. Yamagishi, S. King, and H. Zen. Cepstral analysis based on the Glimpse proportion measure for improving the intelligibility of HMM-based synthetic speech in noise. In Proc. ICASSP, pages 3997-4000, Kyoto, Japan, March 2012. [ bib | DOI | .pdf ]
In this paper we introduce a new cepstral coefficient extraction method based on an intelligibility measure for speech in noise, the Glimpse Proportion measure. This new method aims to increase the intelligibility of speech in noise by modifying the clean speech, and has applications in scenarios such as public announcement and car navigation systems. We first explain how the Glimpse Proportion measure operates and further show how we approximated it to integrate it into an existing spectral envelope parameter extraction method commonly used in the HMM-based speech synthesis framework. We then demonstrate how this new method changes the modelled spectrum according to the characteristics of the noise and show results for a listening test with vocoded and HMM-based synthetic speech. The test indicates that the proposed method can significantly improve intelligibility of synthetic speech in speech shaped noise.
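As a rough illustration of how the Glimpse Proportion measure operates: it is the fraction of spectro-temporal regions where the speech exceeds the noise by a local-SNR threshold (about 3 dB in the literature). The sketch below uses a toy dB grid; the actual measure works on gammatone filterbank energies, which this omits.

```python
import numpy as np

def glimpse_proportion(speech_db, noise_db, local_snr_db=3.0):
    """Fraction of time-frequency cells where speech exceeds the noise by at
    least `local_snr_db` dB -- the cells a listener can 'glimpse'."""
    glimpses = speech_db > (noise_db + local_snr_db)
    return glimpses.mean()

# Toy spectrograms (frequency x time, in dB): speech dominates half the cells
speech = np.array([[60.0, 40.0], [65.0, 30.0]])
noise = np.array([[50.0, 50.0], [50.0, 50.0]])
print(glimpse_proportion(speech, noise))  # 0.5
```

Using this as an optimisation criterion, as the paper does, means adjusting the cepstral coefficients so that more cells clear the threshold for a known noise.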

[58] Chen-Yu Yang, G. Brown, Liang Lu, J. Yamagishi, and S. King. Noise-robust whispered speech recognition using a non-audible-murmur microphone with VTS compensation. In Chinese Spoken Language Processing (ISCSLP), 2012 8th International Symposium on, pages 220-223, 2012. [ bib | DOI ]
In this paper, we introduce a newly-created corpus of whispered speech simultaneously recorded via a close-talking microphone and a non-audible murmur (NAM) microphone in both clean and noisy conditions. To benchmark the corpus, which has been freely released recently, experiments on automatic recognition of continuous whispered speech were conducted. When training and test conditions are matched, the NAM microphone is found to be more robust against background noise than the close-talking microphone. In mismatched conditions (noisy data, models trained on clean speech), we found that Vector Taylor Series (VTS) compensation is particularly effective for the NAM signal.

[59] Jaime Lorenzo-Trueba, Oliver Watts, Roberto Barra-Chicote, Junichi Yamagishi, Simon King, and Juan M Montero. Simple4All proposals for the Albayzin evaluations in speech synthesis. In Proc. Iberspeech 2012, 2012. [ bib | .pdf ]
Simple4All is a European-funded project that aims to streamline the production of multi-language expressive synthetic voices by means of unsupervised data extraction techniques, allowing the automatic processing of freely available data into flexible task-specific voices. In this paper we describe three different approaches for this task, the first two covering enhancements in expressivity and flexibility, with the final one focusing on the development of unsupervised voices. The first technique introduces the principle of speaker adaptation from average models consisting of multiple voices, with the second being an extension of this adaptation concept that allows control of the expressive strength of the synthetic voice. Finally, an unsupervised approach to synthesis capable of learning from unlabelled text data is introduced in detail.

[60] Dong Wang, Javier Tejedor, Simon King, and Joe Frankel. Term-dependent confidence normalization for out-of-vocabulary spoken term detection. Journal of Computer Science and Technology, 27(2), 2012. [ bib | DOI ]
Spoken Term Detection (STD) is a fundamental component of spoken information retrieval systems. A key task of an STD system is to determine reliable detections and reject false alarms based on certain confidence measures. The detection posterior probability, which is often computed from lattices, is a widely used confidence measure. However, a potential problem of this confidence measure is that the confidence scores of detections of all search terms are treated uniformly, regardless of how much they may differ in terms of phonetic or linguistic properties. This problem is particularly evident for out-of-vocabulary (OOV) terms, which tend to exhibit high intra-term diversity. To address the discrepancy in the confidence levels that the same confidence score may convey for different terms, a term-dependent decision strategy is desirable, for example the term-specific threshold (TST) approach. In this work, we propose a term-dependent normalisation technique which compensates for term diversity in confidence estimation. In particular, we propose a linear bias compensation and a discriminative compensation to deal with the bias problem that is inherent in lattice-based confidence measurement, from which the TST approach suffers. We tested the proposed technique on speech data from the multi-party meeting domain with two state-of-the-art STD systems based on phonemes and words respectively. The experimental results demonstrate that the confidence normalisation approach leads to a significant performance improvement in STD, particularly for OOV terms with phoneme-based systems.
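The linear bias compensation idea can be sketched as follows: fit, by least squares, how far a term's raw confidence sits above its 0/1 correctness, as a function of term features, then subtract that predicted bias at detection time. The feature choice (an intercept plus term length) is a hypothetical simplification; the paper's features and its discriminative variant are richer.

```python
import numpy as np

def fit_linear_bias(term_features, raw_confidences, labels):
    """Least-squares fit of a term-dependent bias: predict the raw
    confidence's over-shoot above the 0/1 correctness label."""
    X = np.asarray(term_features, dtype=float)
    over_confidence = np.asarray(raw_confidences, dtype=float) - np.asarray(labels, dtype=float)
    w, *_ = np.linalg.lstsq(X, over_confidence, rcond=None)
    return w

def normalise(confidence, features, w):
    """Subtract the predicted term-dependent bias, clipping back to [0, 1]."""
    return float(np.clip(confidence - np.asarray(features) @ w, 0.0, 1.0))

# Toy data: features are [1, term_length]; longer terms are more over-confident
w = fit_linear_bias([[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]],
                    [1.0, 0.1, 0.2],   # raw confidence scores
                    [1, 0, 0])         # was each detection correct?
print(normalise(0.2, [1.0, 6.0], w))  # predicted bias of 0.2 removed -> 0.0
```

After normalisation, a single global threshold can be applied across terms, which is the practical payoff over per-term thresholds.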

[61] Keiichiro Oura, Junichi Yamagishi, Mirjam Wester, Simon King, and Keiichi Tokuda. Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping. Speech Communication, 54(6):703-714, 2012. [ bib | DOI | http ]
In the EMIME project, we developed a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrated two techniques into a single architecture: unsupervised adaptation for HMM-based TTS using word-based large-vocabulary continuous speech recognition, and cross-lingual speaker adaptation (CLSA) for HMM-based TTS. The CLSA is based on a state-level transform mapping learned using minimum Kullback-Leibler divergence between pairs of HMM states in the input and output languages. Thus, an unsupervised cross-lingual speaker adaptation system was developed. End-to-end speech-to-speech translation systems for four languages (English, Finnish, Mandarin, and Japanese) were constructed within this framework. In this paper, the English-to-Japanese adaptation is evaluated. Listening tests demonstrate that adapted voices sound more similar to a target speaker than average voices and that differences between supervised and unsupervised cross-lingual speaker adaptation are small. Calculating the KLD state-mapping on only the first 10 mel-cepstral coefficients leads to huge savings in computational costs, without any detrimental effect on the quality of the synthetic speech.

Keywords: HMM-based speech synthesis, Unsupervised speaker adaptation, Cross-lingual speaker adaptation, Speech-to-speech translation
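The minimum-KLD state mapping in [61] can be illustrated with diagonal-covariance Gaussians, for which the KL divergence has a closed form. This sketch shows only the mapping criterion; the actual system maps decision-tree-clustered HMM states and, as noted above, restricts the computation to the first 10 mel-cepstral coefficients.

```python
import numpy as np

def kl_gauss_diag(m1, v1, m2, v2):
    """KL(N1 || N2) for diagonal-covariance Gaussians, summed over dimensions."""
    m1, v1, m2, v2 = map(np.asarray, (m1, v1, m2, v2))
    return 0.5 * float(np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0))

def map_states(input_states, output_states):
    """For each input-language state (mean, variance), pick the output-language
    state with the smallest symmetric KLD between output distributions."""
    mapping = []
    for mi, vi in input_states:
        klds = [kl_gauss_diag(mi, vi, mo, vo) + kl_gauss_diag(mo, vo, mi, vi)
                for mo, vo in output_states]
        mapping.append(int(np.argmin(klds)))
    return mapping

# Toy 2-dimensional states: the input state is close to output state 0
input_states = [([0.0, 0.0], [1.0, 1.0])]
output_states = [([0.2, 0.0], [1.0, 1.0]), ([5.0, 5.0], [1.0, 1.0])]
mapping = map_states(input_states, output_states)
print(mapping)  # [0]
```

Adaptation transforms estimated on input-language states can then be applied to their mapped output-language counterparts.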
[62] Kei Hashimoto, Junichi Yamagishi, William Byrne, Simon King, and Keiichi Tokuda. Impacts of machine translation and speech synthesis on speech-to-speech translation. Speech Communication, 54(7):857-866, 2012. [ bib | DOI | http ]
This paper analyzes the impacts of machine translation and speech synthesis on speech-to-speech translation systems. A typical speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques have been proposed for integration of speech recognition and machine translation. However, corresponding techniques have not yet been considered for speech synthesis. The focus of the current work is machine translation and speech synthesis, and we present a subjective evaluation designed to analyze their impact on speech-to-speech translation. The results of these analyses show that the naturalness and intelligibility of the synthesized speech are strongly affected by the fluency of the translated sentences. In addition, several features were found to correlate well with the average fluency of the translated sentences and the average naturalness of the synthesized speech.

Keywords: Speech-to-speech translation, Machine translation, Speech synthesis, Subjective evaluation
[63] Junichi Yamagishi, Christophe Veaux, Simon King, and Steve Renals. Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction. Acoustical Science and Technology, 33(1):1-5, 2012. [ bib | DOI | http | .pdf ]
In this invited paper, we give an overview of the clinical applications of speech synthesis technologies and describe a few selected research projects. We also introduce the University of Edinburgh’s new project “Voice Banking and Reconstruction” for patients with degenerative diseases, such as motor neurone disease and Parkinson's disease, and show how speech synthesis technologies can improve the quality of life of these patients.

[64] Oliver Watts, Junichi Yamagishi, and Simon King. Unsupervised continuous-valued word features for phrase-break prediction without a part-of-speech tagger. In Proc. Interspeech, pages 2157-2160, Florence, Italy, August 2011. [ bib | .pdf ]
Part of speech (POS) tags are foremost among the features conventionally used to predict intonational phrase-breaks for text to speech (TTS) conversion. The construction of such systems therefore presupposes the availability of a POS tagger for the relevant language, or of a corpus manually tagged with POS. However, such tools and resources are not available in the majority of the world’s languages, and manually labelling text with POS tags is an expensive and time-consuming process. We therefore propose the use of continuous-valued features that summarise the distributional characteristics of word types as surrogates for POS features. Importantly, such features are obtained in an unsupervised manner from an untagged text corpus. We present results on the phrase-break prediction task, where use of the features closes the gap in performance between a baseline system (using only basic punctuation-related features) and a topline system (incorporating a state-of-the-art POS tagger).
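The distributional idea in [64] can be sketched with raw co-occurrence counts: words that appear in the same contexts get similar vectors, which is the property a POS surrogate needs. A real system would use many context words and a dimensionality reduction; the toy corpus and one-word window below are assumptions for illustration.

```python
from collections import Counter, defaultdict

def word_vectors(tokenised_sentences, n_context=2):
    """Unsupervised word features: for each word type, count co-occurrences
    with the corpus's most frequent words within a one-word window."""
    freq = Counter(w for s in tokenised_sentences for w in s)
    context_words = [w for w, _ in freq.most_common(n_context)]
    vecs = defaultdict(lambda: [0] * n_context)
    for s in tokenised_sentences:
        for i, w in enumerate(s):
            for j in (i - 1, i + 1):  # immediate left and right neighbours
                if 0 <= j < len(s) and s[j] in context_words:
                    vecs[w][context_words.index(s[j])] += 1
    return dict(vecs), context_words

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vecs, ctx = word_vectors(sentences)
# "cat" and "dog" occur in identical contexts, so they receive identical
# vectors -- distributionally similar words cluster like a POS class
print(vecs["cat"] == vecs["dog"])
```

Such continuous vectors can then feed the phrase-break predictor directly, with no tagger or tagged corpus required.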

[65] Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King. Can objective measures predict the intelligibility of modified HMM-based synthetic speech in noise? In Proc. Interspeech, August 2011. [ bib | .pdf ]
Synthetic speech can be modified to improve intelligibility in noise. In order to perform modifications automatically, it would be useful to have an objective measure that could predict the intelligibility of modified synthetic speech for human listeners. We analysed the impact on intelligibility – and on how well objective measures predict it – when we separately modify speaking rate, fundamental frequency, line spectral pairs and spectral peaks. Shifting LSPs can increase intelligibility for human listeners; other modifications had weaker effects. Among the objective measures we evaluated, the Dau model and the Glimpse proportion were the best predictors of human performance.

[66] Korin Richmond, Phil Hoole, and Simon King. Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In Proc. Interspeech, pages 1505-1508, Florence, Italy, August 2011. [ bib | .pdf ]
This paper serves as an initial announcement of the availability of a corpus of articulatory data called mngu0. This corpus will ultimately consist of a collection of multiple sources of articulatory data acquired from a single speaker: electromagnetic articulography (EMA), audio, video, volumetric MRI scans, and 3D scans of dental impressions. This data will be provided free for research use. In this first stage of the release, we are making available one subset of EMA data, consisting of more than 1,300 phonetically diverse utterances recorded with a Carstens AG500 electromagnetic articulograph. Distribution of mngu0 will be managed by a dedicated “forum-style” web site. This paper both outlines the general goals motivating the distribution of the data and the creation of the mngu0 web forum, and also provides a description of the EMA data contained in this initial release.

[67] Ming Lei, Junichi Yamagishi, Korin Richmond, Zhen-Hua Ling, Simon King, and Li-Rong Dai. Formant-controlled HMM-based speech synthesis. In Proc. Interspeech, pages 2777-2780, Florence, Italy, August 2011. [ bib | .pdf ]
This paper proposes a novel framework that enables us to manipulate and control formants in HMM-based speech synthesis. In this framework, the dependency between formants and spectral features is modelled by piecewise linear transforms; formant parameters are effectively mapped by these to the means of Gaussian distributions over the spectral synthesis parameters. The spectral envelope features generated under the influence of formants in this way may then be passed to high-quality vocoders to generate the speech waveform. This provides two major advantages over conventional frameworks. First, we can achieve spectral modification by changing formants only in those parts where we want control, whereas the user must specify all formants manually in conventional formant synthesisers (e.g. Klatt). Second, this can produce high-quality speech. Our results show the proposed method can control vowels in the synthesized speech by manipulating F1 and F2 without any degradation in synthesis quality.

[68] S. Andraszewicz, J. Yamagishi, and S. King. Vocal attractiveness of statistical speech synthesisers. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5368-5371, May 2011. [ bib | DOI ]
Our previous analysis of speaker-adaptive HMM-based speech synthesis methods suggested that there are two possible reasons why average voices can obtain higher subjective scores than any individual adapted voice: 1) model adaptation degrades speech quality proportionally to the distance 'moved' by the transforms, and 2) psychoacoustic effects relating to the attractiveness of the voice. This paper is a follow-on from that analysis and aims to separate these effects out. Our latest perceptual experiments focus on attractiveness, using average voices and speaker-dependent voices without model transformation, and show that using several speakers to create a voice improves smoothness (measured by Harmonics-to-Noise Ratio) and reduces the distance of the final voice from the average voice in the log F0-F1 space, hence making it more attractive at the segmental level. However, this is weakened or overridden at supra-segmental or sentence levels.

Keywords: speaker-adaptive HMM-based speech synthesis methods;speaker-dependent voices;statistical speech synthesisers;vocal attractiveness;hidden Markov models;speaker recognition;speech synthesis;
[69] Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King. Evaluation of objective measures for intelligibility prediction of HMM-based synthetic speech in noise. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5112-5115, May 2011. [ bib | DOI | .pdf ]
In this paper we evaluate four objective measures of speech with regard to intelligibility prediction of synthesized speech in diverse noisy situations. We evaluated three intelligibility measures, the Dau measure, the glimpse proportion and the Speech Intelligibility Index (SII), and a quality measure, the Perceptual Evaluation of Speech Quality (PESQ). For the generation of synthesized speech we used a state-of-the-art HMM-based speech synthesis system. The noisy conditions comprised four additive noises. The measures were compared with subjective intelligibility scores obtained in listening tests. The results show the Dau and the glimpse measures to be the best predictors of intelligibility, with correlations of around 0.83 to subjective scores. All measures gave less accurate predictions of intelligibility for synthetic speech than have previously been found for natural speech; this was particularly true of the SII measure. In additional experiments, we processed the synthesized speech by an ideal binary mask before adding noise. The glimpse measure gave the most accurate intelligibility predictions in this situation.

[70] K. Hashimoto, J. Yamagishi, W. Byrne, S. King, and K. Tokuda. An analysis of machine translation and speech synthesis in speech-to-speech translation system. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5108-5111, May 2011. [ bib | DOI ]
This paper provides an analysis of the impacts of machine translation and speech synthesis on speech-to-speech translation systems. The speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques for integration of speech recognition and machine translation have been proposed. However, speech synthesis has not yet been considered. Therefore, in this paper, we focus on machine translation and speech synthesis, and report a subjective evaluation to analyze the impact of each component. The results of these analyses show that the naturalness and intelligibility of synthesized speech are strongly affected by the fluency of the translated sentences.

Keywords: machine translation;speech recognition;speech synthesis;speech-to-speech translation system;
[71] Dong Wang, Nicholas Evans, Raphael Troncy, and Simon King. Handling overlaps in spoken term detection. In Proc. International Conference on Acoustics, Speech and Signal Processing, pages 5656-5659, May 2011. [ bib | DOI | .pdf ]
Spoken term detection (STD) systems usually arrive at many overlapping detections, which are often addressed with some pragmatic approach, e.g. choosing the best detection to represent all the overlaps. In this paper we present a theoretical study based on a concept of acceptance space. In particular, we present two confidence estimation approaches based on Bayesian and evidence perspectives respectively. Analysis shows that both approaches possess respective advantages and shortcomings, and that their combination has the potential to provide improved confidence estimation. Experiments conducted on meeting data confirm our analysis and show considerable performance improvement with the combined approach, in particular for out-of-vocabulary spoken term detection with stochastic pronunciation modeling.

[72] Dong Wang and Simon King. Letter-to-sound pronunciation prediction using conditional random fields. IEEE Signal Processing Letters, 18(2):122-125, February 2011. [ bib | DOI | .pdf ]
Pronunciation prediction, or letter-to-sound (LTS) conversion, is an essential task for speech synthesis, open vocabulary spoken term detection and other applications dealing with novel words. Most current approaches (at least for English) employ data-driven methods to learn and represent pronunciation “rules” using statistical models such as decision trees, hidden Markov models (HMMs) or joint-multigram models (JMMs). The LTS task remains challenging, particularly for languages with a complex relationship between spelling and pronunciation such as English. In this paper, we propose to use a conditional random field (CRF) to perform LTS because it avoids having to model a distribution over observations and can perform global inference, suggesting that it may be more suitable for LTS than decision trees, HMMs or JMMs. One challenge in applying CRFs to LTS is that the phoneme and grapheme sequences of a word are generally of different lengths, which makes CRF training difficult. To solve this problem, we employed a joint-multigram model to generate aligned training exemplars. Experiments conducted with the AMI05 dictionary demonstrate that a CRF significantly outperforms other models, especially if n-best lists of predictions are generated.
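The per-letter context features a linear-chain CRF for LTS typically conditions on can be sketched as below. The feature names are invented for illustration, and this shows only feature extraction; [72] additionally needs the JMM-based grapheme-phoneme alignment and CRF training itself.

```python
def letter_features(word, i, window=2):
    """Context-window features for letter i of `word`: the letter itself,
    position flags, and surrounding letters, padded with '#' at the edges.
    These are the observation features a linear-chain CRF would weight."""
    feats = {
        "letter": word[i],
        "pos_first": i == 0,
        "pos_last": i == len(word) - 1,
    }
    for off in range(-window, window + 1):
        j = i + off
        feats[f"ctx[{off:+d}]"] = word[j] if 0 <= j < len(word) else "#"
    return feats

print(letter_features("cat", 1))
```

Because the CRF scores whole label sequences, these local features can still support the global inference over the phoneme sequence mentioned above.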

[73] J. Dines, J. Yamagishi, and S. King. Measuring the gap between HMM-based ASR and TTS. IEEE Journal of Selected Topics in Signal Processing, 2011. (in press). [ bib | DOI ]
The EMIME European project is conducting research in the development of technologies for mobile, personalised speech-to-speech translation systems. The hidden Markov model (HMM) is being used as the underlying technology in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components, thus, the investigation of unified statistical modelling approaches has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper we present results and analysis of a series of experiments that have been conducted on English ASR and TTS systems measuring their performance with respect to phone set and lexicon; acoustic feature type and dimensionality; HMM topology; and speaker adaptation. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance often demands diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of such unified modelling approaches.

Keywords: Acoustics, Adaptation model, Context modeling, Hidden Markov models, Speech, Speech recognition, Training, speech recognition, speech synthesis, unified models
[74] Adriana Stan, Junichi Yamagishi, Simon King, and Matthew Aylett. The Romanian speech synthesis (RSS) corpus: Building a high quality HMM-based speech synthesis system using a high sampling rate. Speech Communication, 53(3):442-450, 2011. [ bib | DOI | http ]
This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called “RSS”, along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of these are now freely available for academic use in order to promote Romanian speech technology research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker and was recorded using multiple microphones at 96 kHz sampling frequency in a hemianechoic chamber. The details of the new Romanian text processor we have developed are also given. Using the database, we then revisit some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. As we demonstrate using perceptual tests, these configuration choices can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling frequencies can offer enhanced feature extraction and improved speaker similarity for HMM-based speech synthesis.

Keywords: Speech synthesis, HTS, Romanian, HMMs, Sampling frequency, Auditory scale
[75] C. Mayo, R. A. J. Clark, and S. King. Listeners' weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis. Speech Communication, 53(3):311-326, 2011. [ bib | DOI ]
The quality of current commercial speech synthesis systems is now so high that system improvements are being made at subtle sub- and supra-segmental levels. Human perceptual evaluation of such subtle improvements requires a highly sophisticated level of perceptual attention to specific acoustic characteristics or cues. However, it is not well understood what acoustic cues listeners attend to by default when asked to evaluate synthetic speech. It may, therefore, be potentially quite difficult to design an evaluation method that allows listeners to concentrate on only one dimension of the signal, while ignoring others that are perceptually more important to them. The aim of the current study was to determine which acoustic characteristics of unit-selection synthetic speech are most salient to listeners when evaluating the naturalness of such speech. This study made use of multidimensional scaling techniques to analyse listeners' pairwise comparisons of synthetic speech sentences. Results indicate that listeners place a great deal of perceptual importance on the presence of artifacts and discontinuities in the speech, somewhat less importance on aspects of segmental quality, and very little importance on stress/intonation appropriateness. These relative differences in importance will impact on listeners' ability to attend to these different acoustic characteristics of synthetic speech, and should therefore be taken into account when designing appropriate methods of synthetic speech evaluation.

Keywords: Speech synthesis; Evaluation; Speech perception; Acoustic cue weighting; Multidimensional scaling
[76] Dong Wang, Simon King, Nick Evans, and Raphael Troncy. Direct posterior confidence for out-of-vocabulary spoken term detection. In Proc. ACM Multimedia 2010 Searching Spontaneous Conversational Speech Workshop, October 2010. [ bib | DOI | .pdf ]
Spoken term detection (STD) is a fundamental task in spoken information retrieval. Compared to conventional speech transcription and keyword spotting, STD is an open-vocabulary task and must therefore address out-of-vocabulary (OOV) terms. Approaches based on subword units, e.g. phonemes, are widely used to solve the OOV issue; however, performance on OOV terms is still significantly inferior to that for in-vocabulary (INV) terms. The performance degradation on OOV terms can be attributed to a multitude of factors. A particular factor we address in this paper is that the acoustic and language models used for speech transcribing are highly vulnerable to OOV terms, which leads to unreliable confidence measures and error-prone detections. A direct posterior confidence measure that is derived from discriminative models has been proposed for STD. In this paper, we utilize this technique to tackle the weakness of OOV terms in confidence estimation. Since neither acoustic models nor language models are included in the computation, the new confidence measure avoids the weak modeling problem with OOV terms. Our experiments, set up on multi-party meeting speech which is highly spontaneous and conversational, demonstrate that the proposed technique improves STD performance on OOV terms significantly; when combined with conventional lattice-based confidence, a significant improvement in performance is obtained on both INVs and OOVs. Furthermore, the new confidence measure technique can be combined with other advanced techniques for OOV treatment, such as stochastic pronunciation modeling and term-dependent confidence discrimination, which leads to an integrated solution for OOV STD with greatly improved performance.

[78] Dong Wang, Simon King, Nick Evans, and Raphael Troncy. CRF-based stochastic pronunciation modelling for out-of-vocabulary spoken term detection. In Proc. Interspeech, Makuhari, Chiba, Japan, September 2010. [ bib ]
Out-of-vocabulary (OOV) terms present a significant challenge to spoken term detection (STD). This challenge, to a large extent, lies in the high degree of uncertainty in pronunciations of OOV terms. In previous work, we presented a stochastic pronunciation modeling (SPM) approach to compensate for this uncertainty. A shortcoming of our original work, however, is that the SPM was based on a joint-multigram model (JMM), which is suboptimal. In this paper, we propose to use conditional random fields (CRFs) for letter-to-sound conversion, which significantly improves the quality of the predicted pronunciations. When applied to OOV STD, we achieve considerable performance improvement with both a 1-best system and an SPM-based system.

[79] Oliver Watts, Junichi Yamagishi, and Simon King. The role of higher-level linguistic features in HMM-based speech synthesis. In Proc. Interspeech, pages 841-844, Makuhari, Japan, September 2010. [ bib | .pdf ]
We analyse the contribution of higher-level elements of the linguistic specification of a data-driven speech synthesiser to the naturalness of the synthetic speech which it generates. The system is trained using various subsets of the full feature-set, in which features relating to syntactic category, intonational phrase boundary, pitch accent and boundary tones are selectively removed. Utterances synthesised by the different configurations of the system are then compared in a subjective evaluation of their naturalness. The work presented forms background analysis for an ongoing set of experiments in performing text-to-speech (TTS) conversion based on shallow features: features that can be trivially extracted from text. By building a range of systems, each assuming the availability of a different level of linguistic annotation, we obtain benchmarks for our on-going work.

[80] Junichi Yamagishi, Oliver Watts, Simon King, and Bela Usabaev. Roles of the average voice in speaker-adaptive HMM-based speech synthesis. In Proc. Interspeech, pages 418-421, Makuhari, Japan, September 2010. [ bib | .pdf ]
In speaker-adaptive HMM-based speech synthesis, there are typically a few speakers for which the output synthetic speech sounds worse than that of other speakers, despite having the same amount of adaptation data from within the same corpus. This paper investigates these fluctuations in quality and concludes that as mel-cepstral distance from the average voice becomes larger, the MOS naturalness scores generally become worse. Although this negative correlation is not that strong, it suggests a way to improve the training and adaptation strategies. We also draw comparisons between our findings and the work of other researchers regarding “vocal attractiveness.”

Keywords: speech synthesis, HMM, average voice, speaker adaptation
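The mel-cepstral distance referred to in [80] is conventionally computed per frame as (10/ln 10) * sqrt(2 * sum_d (c_d - c_hat_d)^2), excluding the energy coefficient c0. A minimal sketch (the per-utterance value reported in papers averages this over dynamically time-warped frames, which is omitted here):

```python
import math

def mel_cepstral_distance(c_ref, c_syn):
    """Frame-level mel-cepstral distance in dB between two mel-cepstral
    coefficient vectors; c0 (the energy term) is excluded by convention."""
    assert len(c_ref) == len(c_syn)
    sq = sum((a - b) ** 2 for a, b in zip(c_ref[1:], c_syn[1:]))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * sq)

# A unit difference in a single coefficient gives (10/ln 10) * sqrt(2) dB
print(round(mel_cepstral_distance([0.0, 1.0, 0.0], [0.0, 0.0, 0.0]), 2))  # 6.14
```

Averaging this distance between each adapted voice and the average voice gives the kind of quantity the correlation with MOS naturalness was computed over.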
[81] Mirjam Wester, John Dines, Matthew Gibson, Hui Liang, Yi-Jian Wu, Lakshmi Saheer, Simon King, Keiichiro Oura, Philip N. Garner, William Byrne, Yong Guan, Teemu Hirsimäki, Reima Karhila, Mikko Kurimo, Matt Shannon, Sayaka Shiota, Jilei Tian, Keiichi Tokuda, and Junichi Yamagishi. Speaker adaptation and the evaluation of speaker similarity in the EMIME speech-to-speech translation project. In Proc. 7th ISCA Speech Synthesis Workshop, Kyoto, Japan, September 2010. [ bib | .pdf ]
This paper provides an overview of speaker adaptation research carried out in the EMIME speech-to-speech translation (S2ST) project. We focus on how speaker adaptation transforms can be learned from speech in one language and applied to the acoustic models of another language. The adaptation is transferred across languages and/or from recognition models to synthesis models. The various approaches investigated can all be viewed as a process in which a mapping is defined in terms of either acoustic model states or linguistic units. The mapping is used to transfer either speech data or adaptation transforms between the two models. Because the success of speaker adaptation in text-to-speech synthesis is measured by judging speaker similarity, we also discuss issues concerning evaluation of speaker similarity in an S2ST scenario.

[82] Javier Tejedor, Doroteo T. Toledano, Miguel Bautista, Simon King, Dong Wang, and Jose Colas. Augmented set of features for confidence estimation in spoken term detection. In Proc. Interspeech, September 2010. [ bib | .pdf ]
Discriminative confidence estimation, along with confidence normalisation, has been shown to construct robust decision-maker modules in spoken term detection (STD) systems. Discriminative confidence estimation making use of term-dependent features has been shown to improve on the widely used lattice-based confidence estimation in STD. In this work, we augment this set of term-dependent features and show a significant improvement in STD performance, in terms of both ATWV and DET curves, in experiments conducted on a Spanish geographical corpus. This work also proposes a multiple linear regression analysis to carry out feature selection; the most informative features identified by this analysis are then used within the discriminative confidence estimation of the STD system.

[83] Oliver Watts, Junichi Yamagishi, and Simon King. Letter-based speech synthesis. In Proc. Speech Synthesis Workshop 2010, pages 317-322, Nara, Japan, September 2010. [ bib | .pdf ]
Initial attempts at performing text-to-speech conversion based on standard orthographic units are presented, forming part of a larger scheme of training TTS systems on features that can be trivially extracted from text. We evaluate the possibility of using the technique of decision-tree-based context clustering conventionally used in HMM-based systems for parameter-tying to handle letter-to-sound conversion. We present the application of a method of compound-feature discovery to corpus-based speech synthesis. Finally, an evaluation of intelligibility of letter-based systems and more conventional phoneme-based systems is presented.

[84] O. Watts, J. Yamagishi, S. King, and K. Berkling. Synthesis of child speech with HMM adaptation and voice conversion. IEEE Transactions on Audio, Speech, and Language Processing, 18(5):1005-1016, July 2010. [ bib | DOI | .pdf ]
The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesizer from that data. We chose to build a statistical parametric synthesizer using the hidden Markov model (HMM)-based system HTS, as this technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. Six different configurations of the synthesizer were compared, using both speaker-dependent and speaker-adaptive modeling techniques, and using varying amounts of data. For comparison with HMM adaptation, techniques from voice conversion were used to transform existing synthesizers to the characteristics of the target speaker. Speaker-adaptive voices generally outperformed child speaker-dependent voices in the evaluation. HMM adaptation outperformed voice conversion style techniques when using the full target speaker corpus; with fewer adaptation data, however, no significant listener preference for either HMM adaptation or voice conversion methods was found.

Keywords: HMM adaptation techniques; child speech synthesis; hidden Markov model; speaker adaptive modeling technique; speaker dependent technique; speaker-adaptive voice; statistical parametric synthesizer; target speaker corpus; voice conversion; hidden Markov models; speech synthesis
[85] Alice Turk, James Scobbie, Christian Geng, Barry Campbell, Catherine Dickie, Eddie Dubourg, Ellen Gurman Bard, William Hardcastle, Mariam Hartinger, Simon King, Robin Lickley, Cedric Macmartin, Satsuki Nakai, Steve Renals, Korin Richmond, Sonja Schaeffler, Kevin White, Ronny Wiegand, and Alan Wrench. An Edinburgh speech production facility. Poster presented at the 12th Conference on Laboratory Phonology, Albuquerque, New Mexico., July 2010. [ bib | .pdf ]
[86] D. Wang, S. King, and J. Frankel. Stochastic pronunciation modelling for out-of-vocabulary spoken term detection. IEEE Transactions on Audio, Speech, and Language Processing, PP(99), July 2010. [ bib | DOI ]
Spoken term detection (STD) is the name given to the task of searching large amounts of audio for occurrences of spoken terms, which are typically single words or short phrases. One reason that STD is a hard task is that search terms tend to contain a disproportionate number of out-of-vocabulary (OOV) words. The most common approach to STD uses subword units. This, in conjunction with some method for predicting pronunciations of OOVs from their written form, enables the detection of OOV terms but performance is considerably worse than for in-vocabulary terms. This performance differential can be largely attributed to the special properties of OOVs. One such property is the high degree of uncertainty in the pronunciation of OOVs. We present a stochastic pronunciation model (SPM) which explicitly deals with this uncertainty. The key insight is to search for all possible pronunciations when detecting an OOV term, explicitly capturing the uncertainty in pronunciation. This requires a probabilistic model of pronunciation, able to estimate a distribution over all possible pronunciations. We use a joint-multigram model (JMM) for this and compare the JMM-based SPM with the conventional soft match approach. Experiments using speech from the meetings domain demonstrate that the SPM performs better than soft match in most operating regions, especially at low false alarm probabilities. Furthermore, SPM and soft match are found to be complementary: their combination provides further performance gains.
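The core idea of the SPM in [86], marginalising over a distribution of pronunciations rather than committing to a single 1-best guess, can be sketched as follows. The pronunciation posteriors and lattice scores below are invented placeholders; in the paper they come from a joint-multigram model and a subword lattice search:

```python
# Hypothetical P(pronunciation | spelling) for one OOV term, as a
# joint-multigram model might predict.
pron_posteriors = {
    ("n", "ih", "k", "ow"): 0.55,
    ("n", "iy", "k", "ow"): 0.30,
    ("n", "ih", "k", "aa"): 0.15,
}

def lattice_score(pron):
    """Stand-in for searching the subword lattice with one pronunciation;
    the values are invented for illustration."""
    toy_scores = {("n", "ih", "k", "ow"): 0.8,
                  ("n", "iy", "k", "ow"): 0.4}
    return toy_scores.get(pron, 0.0)

# Stochastic pronunciation model: sum over all predicted pronunciations,
# weighting each lattice search result by the pronunciation's posterior.
spm_score = sum(p * lattice_score(pron)
                for pron, p in pron_posteriors.items())

# Conventional baseline: commit to the single most probable pronunciation.
best_pron = max(pron_posteriors, key=pron_posteriors.get)
onebest_score = lattice_score(best_pron)

print(spm_score, onebest_score)
```

The marginalisation is what lets the detector recover terms whose true pronunciation is not the model's first choice.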

[87] Mikko Kurimo, William Byrne, John Dines, Philip N. Garner, Matthew Gibson, Yong Guan, Teemu Hirsimäki, Reima Karhila, Simon King, Hui Liang, Keiichiro Oura, Lakshmi Saheer, Matt Shannon, Sayaka Shiota, Jilei Tian, Keiichi Tokuda, Mirjam Wester, Yi-Jian Wu, and Junichi Yamagishi. Personalising speech-to-speech translation in the EMIME project. In Proc. ACL 2010 System Demonstrations, Uppsala, Sweden, July 2010. [ bib | .pdf ]
In the EMIME project we have studied unsupervised cross-lingual speaker adaptation. We have employed an HMM statistical framework for both speech recognition and synthesis which provides transformation mechanisms to adapt the synthesized voice in TTS (text-to-speech) using the recognized voice in ASR (automatic speech recognition). An important application for this research is personalised speech-to-speech translation that will use the voice of the speaker in the input language to utter the translated sentences in the output language. In mobile environments this enhances the users' interaction across language barriers by making the output speech sound more like the original speaker's way of speaking, even if she or he could not speak the output language.

[88] J. Yamagishi, B. Usabaev, S. King, O. Watts, J. Dines, J. Tian, R. Hu, Y. Guan, K. Oura, K. Tokuda, R. Karhila, and M. Kurimo. Thousands of voices for HMM-based speech synthesis - analysis and application of TTS systems built on various ASR corpora. IEEE Transactions on Audio, Speech and Language Processing, 18(5):984-1004, July 2010. [ bib | DOI ]
In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.

Keywords: Automatic speech recognition (ASR), H Triple S (HTS), SPEECON database, WSJ database, average voice, hidden Markov model (HMM)-based speech synthesis, speaker adaptation, speech synthesis, voice conversion
[89] R. Barra-Chicote, J. Yamagishi, S. King, J.M. Montero, and J. Macias-Guarasa. Analysis of statistical parametric and unit-selection speech synthesis systems applied to emotional speech. Speech Communication, 52(5):394-404, May 2010. [ bib | DOI ]
We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests to evaluate speech quality, emotion identification rates and emotional strength were used for the six emotions which we recorded - happiness, sadness, anger, surprise, fear, disgust. For the HMM-based method, we evaluated spectral and source components separately and identified which components contribute to which emotion. Our analysis shows that, although the HMM method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions having context-dependent prosodic patterns. Whilst synthetic speech produced using the unit selection method has better emotional strength scores than the HMM-based method, the HMM-based method has the ability to manipulate the emotional strength. For emotions that are characterized by both spectral and prosodic components, synthetic speech using unit selection methods was more accurately identified by listeners. For emotions mainly characterized by prosodic components, HMM-based synthetic speech was more accurately identified. This finding differs from previous results regarding listener judgements of speaker similarity for neutral speech. We conclude that unit selection methods require improvements to prosodic modeling and that HMM-based methods require improvements to spectral modeling for emotional speech. Certain emotions cannot be reproduced well by either method.

Keywords: Emotional speech synthesis; HMM-based synthesis; Unit selection
[90] Dong Wang, Simon King, Joe Frankel, and Peter Bell. Stochastic pronunciation modelling and soft match for out-of-vocabulary spoken term detection. In Proc. ICASSP, Dallas, Texas, USA, March 2010. [ bib | .pdf ]
A major challenge faced by a spoken term detection (STD) system is the detection of out-of-vocabulary (OOV) terms. Although a subword-based STD system is able to detect OOV terms, a performance reduction is always observed compared to in-vocabulary terms. One challenge that OOV terms bring to STD is pronunciation uncertainty. A commonly used approach to address this problem is a soft matching procedure; another is the stochastic pronunciation modelling (SPM) proposed by the authors. In this paper we compare these two approaches, and combine them using a discriminative decision strategy. Experimental results demonstrate that SPM and soft match are highly complementary, and that their combination gives a significant performance improvement for OOV term detection.

Keywords: confidence estimation, spoken term detection, speech recognition
[91] Simon King. Speech synthesis. In Morgan and Ellis, editors, Speech and Audio Signal Processing. Wiley, 2010. [ bib ]
No abstract (this is a book chapter)

[92] Steve Renals and Simon King. Automatic speech recognition. In William J. Hardcastle, John Laver, and Fiona E. Gibbon, editors, Handbook of Phonetic Sciences, chapter 22. Wiley Blackwell, 2010. [ bib ]
[93] Keiichiro Oura, Keiichi Tokuda, Junichi Yamagishi, Mirjam Wester, and Simon King. Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. In Proc. ICASSP, volume I, pages 4954-4957, 2010. [ bib | .pdf ]
In the EMIME project, we are developing a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrate two techniques, unsupervised adaptation for HMM-based TTS using a word-based large-vocabulary continuous speech recognizer and cross-lingual speaker adaptation for HMM-based TTS, into a single architecture. Thus, an unsupervised cross-lingual speaker adaptation system can be developed. Listening tests show very promising results, demonstrating that adapted voices sound similar to the target speaker and that differences between supervised and unsupervised cross-lingual speaker adaptation are small.

[94] Volker Strom and Simon King. A classifier-based target cost for unit selection speech synthesis trained on perceptual data. In Proc. Interspeech, Makuhari, Japan, 2010. [ bib | .ps | .pdf ]
Our goal is to automatically learn a perceptually-optimal target cost function for a unit selection speech synthesiser. The approach we take here is to train a classifier on human perceptual judgements of synthetic speech. The output of the classifier is used to make a simple three-way distinction rather than to estimate a continuously-valued cost. In order to collect the necessary perceptual data, we synthesised 145,137 short sentences with the usual target cost switched off, so that the search was driven by the join cost only. We then selected the 7200 sentences with the best joins and asked 60 listeners to judge them, providing their ratings for each syllable. From this, we derived a rating for each demiphone. Using as input the same context features employed in our conventional target cost function, we trained a classifier on these human perceptual ratings. We synthesised two sets of test sentences with both our standard target cost and the new target cost based on the classifier. A/B preference tests showed that the classifier-based target cost, which was learned completely automatically from modest amounts of perceptual data, is almost as good as our carefully- and expertly-tuned standard target cost.

[95] Alice Turk, James Scobbie, Christian Geng, Cedric Macmartin, Ellen Bard, Barry Campbell, Catherine Dickie, Eddie Dubourg, Bill Hardcastle, Phil Hoole, Evia Kanaida, Robin Lickley, Satsuki Nakai, Marianne Pouplier, Simon King, Steve Renals, Korin Richmond, Sonja Schaeffler, Ronnie Wiegand, Kevin White, and Alan Wrench. The Edinburgh Speech Production Facility's articulatory corpus of spontaneous dialogue. The Journal of the Acoustical Society of America, 128(4):2429-2429, 2010. [ bib | DOI ]
The EPSRC-funded Edinburgh Speech Production Facility is built around two synchronized Carstens AG500 electromagnetic articulographs (EMAs) in order to capture articulatory/acoustic data from spontaneous dialogue. An initial articulatory corpus was designed with two aims. The first was to elicit a range of speech styles/registers from speakers, and therefore provide an alternative to fully scripted corpora. The second was to extend the corpus beyond monologue, by using tasks that promote natural discourse and interaction. A subsidiary driver was to use dialects from outwith North America: dialogues paired up a Scottish English and a Southern British English speaker. Tasks. Monologue: story reading of “Comma Gets a Cure” [Honorof et al. (2000)], lexical sets [Wells (1982)], spontaneous story telling, diadochokinetic tasks. Dialogue: map tasks [Anderson et al. (1991)], “Spot the Difference” picture tasks [Bradlow et al. (2007)], story recall, and shadowing of the spontaneous story telling by the second participant. Each dialogue session includes approximately 30 min of speech, and there are acoustics-only baseline materials. We will introduce the corpus and highlight the role of articulatory production data in helping provide a fuller understanding of various spontaneous speech phenomena by presenting examples of naturally occurring covert speech errors, accent accommodation, turn-taking negotiation, and shadowing.

[96] J. Yamagishi and S. King. Simple methods for improving speaker-similarity of HMM-based speech synthesis. In Proc. ICASSP 2010, Dallas, Texas, USA, 2010. [ bib | .pdf ]
[97] Simon King. A tutorial on HMM speech synthesis (invited paper). In Sadhana - Academy Proceedings in Engineering Sciences, Indian Institute of Sciences, 2010. [ bib | .pdf ]
Statistical parametric speech synthesis, based on HMM-like models, has become competitive with established concatenative techniques over the last few years. This paper offers a non-mathematical introduction to this method of speech synthesis. It is intended to be complementary to the wide range of excellent technical publications already available. Rather than offer a comprehensive literature review, this paper instead gives a small number of carefully chosen references which are good starting points for further reading.

[98] Peter Bell and Simon King. Diagonal priors for full covariance speech recognition. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Merano, Italy, December 2009. [ bib | DOI | .pdf ]
We investigate the use of full covariance Gaussians for large-vocabulary speech recognition. The large number of parameters gives high modelling power, but when training data is limited, the standard sample covariance matrix is often poorly conditioned, and has high variance. We explain how these problems may be solved by the use of a diagonal covariance smoothing prior, and relate this to the shrinkage estimator, for which the optimal shrinkage parameter may itself be estimated from the training data. We also compare the use of generatively and discriminatively trained priors. Results are presented on a large vocabulary conversational telephone speech recognition task.
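The estimator described in [98] can be sketched numerically: form the sample covariance S, then shrink it toward its diagonal, Σ(λ) = λ diag(S) + (1 − λ)S. A toy version in plain Python; the fixed λ below is illustrative only, whereas the paper estimates the optimal shrinkage intensity from the training data:

```python
def sample_cov(data):
    """Maximum-likelihood sample covariance of a list of feature vectors."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    return [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in data) / n
             for j in range(d)] for i in range(d)]

def shrink_to_diagonal(S, lam):
    """Sigma(lam) = lam*diag(S) + (1-lam)*S: diagonal entries are kept,
    off-diagonal entries are scaled by (1 - lam), improving conditioning."""
    d = len(S)
    return [[S[i][j] if i == j else (1.0 - lam) * S[i][j] for j in range(d)]
            for i in range(d)]

# With few samples relative to the feature dimension, S has high variance;
# shrinkage damps the unreliably estimated off-diagonal correlations.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
S = sample_cov(data)
S_shrunk = shrink_to_diagonal(S, 0.3)
```

Setting λ = 0 recovers the raw sample covariance and λ = 1 the purely diagonal model, so the estimator interpolates between the two systems compared in the paper.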

[99] Dong Wang, Simon King, and Joe Frankel. Stochastic pronunciation modelling for spoken term detection. In Proc. Interspeech, pages 2135-2138, Brighton, UK, September 2009. [ bib | .pdf ]
A major challenge faced by a spoken term detection (STD) system is the detection of out-of-vocabulary (OOV) terms. Although a subword-based STD system is able to detect OOV terms, performance reduction is always observed compared to in-vocabulary terms. Current approaches to STD do not acknowledge the particular properties of OOV terms, such as pronunciation uncertainty. In this paper, we use a stochastic pronunciation model to deal with the uncertain pronunciations of OOV terms. By considering all possible term pronunciations, predicted by a joint-multigram model, we observe a significant performance improvement.

[100] Oliver Watts, Junichi Yamagishi, Simon King, and Kay Berkling. HMM adaptation and voice conversion for the synthesis of child speech: A comparison. In Proc. Interspeech 2009, pages 2627-2630, Brighton, U.K., September 2009. [ bib | .pdf ]
This study compares two different methodologies for producing data-driven synthesis of child speech from existing systems that have been trained on the speech of adults. On one hand, an existing statistical parametric synthesiser is transformed using model adaptation techniques, informed by linguistic and prosodic knowledge, to the speaker characteristics of a child speaker. This is compared with the application of voice conversion techniques to convert the output of an existing waveform concatenation synthesiser with no explicit linguistic or prosodic knowledge. In a subjective evaluation of the similarity of synthetic speech to natural speech from the target speaker, the HMM-based systems evaluated are generally preferred, although this is at least in part due to the higher dimensional acoustic features supported by these techniques.

[101] Simon King and Vasilis Karaiskos. The Blizzard Challenge 2009. In Proc. Blizzard Challenge Workshop, Edinburgh, UK, September 2009. [ bib | .pdf ]
The Blizzard Challenge 2009 was the fifth annual Blizzard Challenge. As in 2008, UK English and Mandarin Chinese were the chosen languages for the 2009 Challenge. The English corpus was the same one used in 2008. The Mandarin corpus was provided by iFLYTEK. As usual, participants with limited resources or limited experience in these languages had the option of using unaligned labels that were provided for both corpora and for the test sentences. An accent-specific pronunciation dictionary was also available for the English speaker. This year, the tasks were organised in the form of 'hubs' and 'spokes', where each hub task involved building a general-purpose voice and each spoke task involved building a voice for a specific application. A set of test sentences was released to participants, who were given a limited time in which to synthesise them and submit the synthetic speech. An online listening test was conducted to evaluate naturalness, intelligibility, degree of similarity to the original speaker and, for one of the spoke tasks, "appropriateness".

Keywords: Blizzard
[102] Dong Wang, Simon King, Joe Frankel, and Peter Bell. Term-dependent confidence for out-of-vocabulary term detection. In Proc. Interspeech, pages 2139-2142, Brighton, UK, September 2009. [ bib | .pdf ]
Within a spoken term detection (STD) system, the decision maker plays an important role in retrieving reliable detections. Most of the state-of-the-art STD systems make decisions based on a confidence measure that is term-independent, which poses a serious problem for out-of-vocabulary (OOV) term detection. In this paper, we study a term-dependent confidence measure based on confidence normalisation and discriminative modelling, particularly focusing on its remarkable effectiveness for detecting OOV terms. Experimental results indicate that the term-dependent confidence provides much more significant improvement for OOV terms than terms in-vocabulary.
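One simple form of the term-dependent idea in [102], hypothetical here and deliberately simpler than the paper's discriminative model, is to normalise each detection's confidence by the confidence mass its term has accumulated, so that a single global threshold behaves sensibly across very different terms:

```python
from collections import defaultdict

def term_normalise(detections):
    """detections: list of (term, confidence) pairs.
    Rescale each confidence by the total confidence mass of its term,
    a toy term-dependent normalisation for illustration only."""
    totals = defaultdict(float)
    for term, conf in detections:
        totals[term] += conf
    return [(term, conf / totals[term]) for term, conf in detections]

# A rare (often OOV) term with low raw scores is rescaled so its best
# detection becomes competitive with those of frequent terms.
hits = [("edinburgh", 0.6), ("edinburgh", 0.2), ("albayzin", 0.1)]
print(term_normalise(hits))
```

Per-term rescaling of this kind is the intuition behind confidence normalisation; the paper goes further by learning the term-dependent adjustment discriminatively.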

[103] Junichi Yamagishi, Mike Lincoln, Simon King, John Dines, Matthew Gibson, Jilei Tian, and Yong Guan. Analysis of unsupervised and noise-robust speaker-adaptive HMM-based speech synthesis systems toward a unified ASR and TTS framework. In Proc. Interspeech 2009, Edinburgh, U.K., September 2009. [ bib ]
For the 2009 Blizzard Challenge we have built an unsupervised version of the HTS-2008 speaker-adaptive HMM-based speech synthesis system for English, and a noise-robust version of the system for Mandarin. They are designed from a multidisciplinary application point of view, in that we attempt to integrate the components of the TTS system with other technologies such as ASR. All the average voice models are trained exclusively on recognized, publicly available ASR databases. Multi-pass LVCSR and confidence scores calculated from confusion networks are used for the unsupervised systems, and noisy data recorded in cars or public spaces is used for the noise-robust system. We believe the developed systems form solid benchmarks and provide good connections to the ASR field. This paper describes the development of the systems and reports the results and analysis of their evaluation.

[104] J. Dines, J. Yamagishi, and S. King. Measuring the gap between HMM-based ASR and TTS. In Proc. Interspeech, pages 1391-1394, Brighton, U.K., September 2009. [ bib ]
The EMIME European project is conducting research in the development of technologies for mobile, personalised speech-to-speech translation systems. The hidden Markov model is being used as the underlying technology in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components, thus, the investigation of unified statistical modelling approaches has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper we present results and analysis of a series of experiments that have been conducted on English ASR and TTS systems, measuring their performance with respect to phone set and lexicon, acoustic feature type and dimensionality and HMM topology. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance often demands diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of such unified modelling approaches.

[105] Javier Tejedor, Dong Wang, Simon King, Joe Frankel, and Jose Colas. A posterior probability-based system hybridisation and combination for spoken term detection. In Proc. Interspeech, pages 2131-2134, Brighton, UK, September 2009. [ bib | .pdf ]
Spoken term detection (STD) is a fundamental task for multimedia information retrieval. To improve the detection performance, we have presented a direct posterior-based confidence measure generated from a neural network. In this paper, we propose a detection-independent confidence estimation based on the direct posterior confidence measure, in which the decision making is totally separated from the term detection. Based on this idea, we first present a hybrid system which conducts the term detection and confidence estimation based on different sub-word units, and then propose a combination method which merges detections from heterogeneous term detectors based on the direct posterior-based confidence. Experimental results demonstrated that the proposed methods improved system performance considerably for both English and Spanish.

[106] J. Yamagishi, Bela Usabaev, Simon King, Oliver Watts, John Dines, Jilei Tian, Rile Hu, Yong Guan, Keiichiro Oura, Keiichi Tokuda, Reima Karhila, and Mikko Kurimo. Thousands of voices for HMM-based speech synthesis. In Proc. Interspeech, pages 420-423, Brighton, U.K., September 2009. [ bib | http ]
Our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an 'average voice model' plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on 'non-TTS' corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper we show thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal databases (WSJ0/WSJ1/WSJCAM0), Resource Management, Globalphone and Speecon. We report some perceptual evaluation results and outline the outstanding issues.

[107] Dong Wang, Javier Tejedor, Joe Frankel, and Simon King. Posterior-based confidence measures for spoken term detection. In Proc. ICASSP, Taipei, Taiwan, April 2009. [ bib | .pdf ]
Confidence measures play a key role in spoken term detection (STD) tasks. The confidence measure expresses the posterior probability of the search term appearing in the detection period, given the speech. Traditional approaches are based on the acoustic and language model scores for candidate detections found using automatic speech recognition, with Bayes' rule being used to compute the desired posterior probability. In this paper, we present a novel direct posterior-based confidence measure which, instead of resorting to the Bayesian formula, calculates posterior probabilities from a multi-layer perceptron (MLP) directly. Compared with traditional Bayesian-based methods, the direct-posterior approach is conceptually and mathematically simpler. Moreover, the MLP-based model does not require assumptions to be made about the acoustic features such as their statistical distribution and the independence of static and dynamic co-efficients. Our experimental results in both English and Spanish demonstrate that the proposed direct posterior-based confidence improves STD performance.
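The contrast drawn in [107] between Bayes-rule confidence and a direct posterior can be sketched as below. The tiny one-layer network standing in for the MLP, and all of the numbers, are invented for illustration:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def bayes_confidence(acoustic_lh, prior, competitor_mass):
    """Traditional route: P(term | O) via Bayes' rule from generative
    acoustic/language model scores, normalised against the mass of
    competing hypotheses."""
    return acoustic_lh * prior / (acoustic_lh * prior + competitor_mass)

def direct_posterior(features, weights, biases):
    """Direct route: a discriminative model maps detection features
    straight to P(correct | features); no generative scores involved.
    A single linear layer stands in for the paper's MLP."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)[1]  # index 1 = class "correct detection"
```

Both routes yield a probability in [0, 1]; the direct route simply dispenses with the generative modelling assumptions that the Bayesian route requires.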

[108] Matthew P. Aylett, Simon King, and Junichi Yamagishi. Speech synthesis without a phone inventory. In Interspeech, pages 2087-2090, 2009. [ bib | .pdf ]
In speech synthesis the unit inventory is decided using phonological and phonetic expertise. This process is resource-intensive and potentially sub-optimal. In this paper we investigate how acoustic clustering, together with lexicon constraints, can be used to build a self-organised inventory. Six English speech synthesis systems were built using two frameworks (unit selection and parametric HTS) for three inventory conditions: 1) a traditional phone set, 2) orthographic units, and 3) a self-organised inventory. A listening test showed a strong preference for the classic system, and for the orthographic system over the self-organised system. Results also varied with letter-to-sound complexity and database coverage. This suggests the self-organised approach failed to generalise pronunciation, as well as introducing noise above and beyond that caused by orthographic sound mismatch.

[109] Junichi Yamagishi, Takashi Nose, Heiga Zen, Zhenhua Ling, Tomoki Toda, Keiichi Tokuda, Simon King, and Steve Renals. Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Transactions on Audio, Speech and Language Processing, 17(6):1208-1230, 2009. [ bib | http | www: ]
This paper describes a speaker-adaptive HMM-based speech synthesis system. The new system, called “HTS-2007,” employs speaker adaptation (CSMAPLR+MAP), feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in our previous systems. Subjective evaluation results show that the new system generates significantly better quality synthetic speech than speaker-dependent approaches with realistic amounts of speech data, and that it bears comparison with speaker-dependent approaches even when large amounts of speech data are available. In addition, a comparison study with several speech synthesis techniques shows the new system is very robust: It is able to build voices from less-than-ideal speech data and synthesize good-quality speech even for out-of-domain sentences.

[110] R. Barra-Chicote, J. Yamagishi, J.M. Montero, S. King, S. Lutfi, and J. Macias-Guarasa. Generación de una voz sintética en Castellano basada en HSMM para la Evaluación Albayzin 2008: conversión texto a voz. In V Jornadas en Tecnología del Habla, pages 115-118, November 2008. (in Spanish). [ bib | .pdf ]
[111] Javier Tejedor, Dong Wang, Joe Frankel, Simon King, and José Colás. A comparison of grapheme and phoneme-based units for Spanish spoken term detection. Speech Communication, 50(11-12):980-991, November 2008. [ bib | DOI ]
The ever-increasing volume of audio data available online through the world wide web means that automatic methods for indexing and search are becoming essential. Hidden Markov model (HMM) keyword spotting and lattice search techniques are the two most common approaches used by such systems. In keyword spotting, models or templates are defined for each search term prior to accessing the speech and used to find matches. Lattice search (referred to as spoken term detection), uses a pre-indexing of speech data in terms of word or sub-word units, which can then quickly be searched for arbitrary terms without referring to the original audio. In both cases, the search term can be modelled in terms of sub-word units, typically phonemes. For in-vocabulary words (i.e. words that appear in the pronunciation dictionary), the letter-to-sound conversion systems are accepted to work well. However, for out-of-vocabulary (OOV) search terms, letter-to-sound conversion must be used to generate a pronunciation for the search term. This is usually a hard decision (i.e. not probabilistic and with no possibility of backtracking), and errors introduced at this step are difficult to recover from. We therefore propose the direct use of graphemes (i.e., letter-based sub-word units) for acoustic modelling. This is expected to work particularly well in languages such as Spanish, where despite the letter-to-sound mapping being very regular, the correspondence is not one-to-one, and there will be benefits from avoiding hard decisions at early stages of processing. In this article, we compare three approaches for Spanish keyword spotting or spoken term detection, and within each of these we compare acoustic modelling based on phone and grapheme units. Experiments were performed using the Spanish geographical-domain Albayzin corpus. 
Results for the two approaches proposed for spoken term detection show that trigrapheme units for acoustic modelling match or exceed the performance of phone-based acoustic models. In the method proposed for keyword spotting, the results achieved with each acoustic model are very similar.

[112] Oliver Watts, Junichi Yamagishi, Kay Berkling, and Simon King. HMM-based synthesis of child speech. In Proc. 1st Workshop on Child, Computer and Interaction (ICMI'08 post-conference workshop), Crete, Greece, October 2008. [ bib | .pdf ]
The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesiser from that data. Because only limited data can be collected, and the domain of that data is constrained, it is difficult to obtain the type of phonetically-balanced corpus usually used in speech synthesis. As a consequence, building a synthesiser from this data is difficult. Concatenative synthesisers are not robust to corpora with many missing units (as is likely when the corpus content is not carefully designed), so we chose to build a statistical parametric synthesiser using the HMM-based system HTS. This technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. We compared 6 different configurations of the synthesiser, using both speaker-dependent and speaker-adaptive modelling techniques, and using varying amounts of data. The output from these systems was evaluated alongside natural and vocoded speech, in a Blizzard-style listening test.

[113] Peter Bell and Simon King. A shrinkage estimator for speech recognition with full covariance HMMs. In Proc. Interspeech, Brisbane, Australia, September 2008. Shortlisted for best student paper award. [ bib | .pdf ]
We consider the problem of parameter estimation in full-covariance Gaussian mixture systems for automatic speech recognition. Due to the high dimensionality of the acoustic feature vector, the standard sample covariance matrix has a high variance and is often poorly-conditioned when the amount of training data is limited. We explain how the use of a shrinkage estimator can solve these problems, and derive a formula for the optimal shrinkage intensity. We present results of experiments on a phone recognition task, showing that the estimator gives a performance improvement over a standard full-covariance system.
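The paper derives the optimal shrinkage intensity analytically; that derivation is not reproduced here, but the basic shrinkage idea, a convex combination of the noisy sample covariance with a well-conditioned target, can be sketched as follows. The diagonal target and the fixed intensity are illustrative assumptions, not the authors' formula:

```python
import numpy as np

def shrinkage_covariance(X, lam):
    """Shrink the sample covariance toward a diagonal target.

    X   : (n_samples, dim) data matrix
    lam : shrinkage intensity in [0, 1] (fixed by hand here;
          the paper derives an optimal value)
    """
    S = np.cov(X, rowvar=False)       # sample covariance: high variance when n is small
    T = np.diag(np.diag(S))           # diagonal shrinkage target (an illustrative choice)
    return lam * T + (1.0 - lam) * S  # convex combination: better conditioned than S

# 10 samples in 39 dimensions: S is rank-deficient, the shrunk estimate is not
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 39))
S = np.cov(X, rowvar=False)
Sigma = shrinkage_covariance(X, lam=0.3)
print(np.linalg.matrix_rank(S))                     # below 39: S is singular
print(bool(np.all(np.linalg.eigvalsh(Sigma) > 0)))  # shrunk estimate is positive definite
```

Because the target is positive definite and the sample covariance is positive semi-definite, any intensity in (0, 1] yields an invertible estimate, which is exactly what a full-covariance Gaussian mixture needs when training data are scarce.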

[114] Junichi Yamagishi, Zhenhua Ling, and Simon King. Robustness of HMM-based speech synthesis. In Proc. Interspeech 2008, pages 581-584, Brisbane, Australia, September 2008. [ bib | .pdf ]
As speech synthesis techniques become more advanced, we are able to consider building high-quality voices from data collected outside the usual highly-controlled recording studio environment. This presents new challenges that are not present in conventional text-to-speech synthesis: the available speech data are not perfectly clean, the recording conditions are not consistent, and/or the phonetic balance of the material is not ideal. Although a clear picture of the performance of various speech synthesis techniques (e.g., concatenative, HMM-based or hybrid) under good conditions is provided by the Blizzard Challenge, it is not well understood how robust these algorithms are to less favourable conditions. In this paper, we analyse the performance of several speech synthesis methods under such conditions. This is, as far as we know, a new research topic: “Robust speech synthesis.” As a consequence of our investigations, we propose a new robust training method for HMM-based speech synthesis, for use with speech data collected in unfavourable conditions.

[115] Dong Wang, Ivan Himawan, Joe Frankel, and Simon King. A posterior approach for microphone array based speech recognition. In Proc. Interspeech, pages 996-999, September 2008. [ bib | .pdf ]
Automatic speech recognition (ASR) becomes rather difficult in the meetings domain because of the adverse acoustic conditions, including more background noise, more echo and reverberation, and frequent cross-talk. Microphone arrays, with various beamforming algorithms, have been demonstrated to boost ASR performance dramatically in such noisy and reverberant environments. However, almost all existing beamforming measures work in the acoustic domain, resorting to signal processing theory and geometric explanation. This limits their application, and induces significant performance degradation when the geometric properties are unavailable or hard to estimate, or if heterogeneous channels exist in the audio system. In this paper, we present a new posterior-based approach for array-based speech recognition. The main idea is that, instead of enhancing the speech signals, we enhance the posterior probabilities that frames belong to recognition units, e.g., phones. These enhanced posteriors are then transformed into posterior-probability-based features and modelled by HMMs, leading to a tandem ANN-HMM hybrid system as presented by Hermansky et al. Experimental results demonstrate the validity of this posterior approach. With posterior accumulation or enhancement, significant improvement was achieved over the single-channel baseline. Moreover, we can combine acoustic enhancement and posterior enhancement, leading to a hybrid acoustic-posterior beamforming approach which works significantly better than acoustic beamforming alone, especially in scenarios with moving speakers.

[116] Joe Frankel, Dong Wang, and Simon King. Growing bottleneck features for tandem ASR. In Proc. Interspeech, page 1549, September 2008. [ bib | .pdf ]
We present a method for training bottleneck MLPs for use in tandem ASR. Experiments on meetings data show that this approach leads to improved performance compared with training MLPs from a random initialization.

[117] Simon King, Keiichi Tokuda, Heiga Zen, and Junichi Yamagishi. Unsupervised adaptation for HMM-based speech synthesis. In Proc. Interspeech, pages 1869-1872, Brisbane, Australia, September 2008. [ bib | .PDF ]
It is now possible to synthesise speech using HMMs with a comparable quality to unit-selection techniques. Generating speech from a model has many potential advantages over concatenating waveforms. The most exciting is model adaptation. It has been shown that supervised speaker adaptation can yield high-quality synthetic voices with an order of magnitude less data than required to train a speaker-dependent model or to build a basic unit-selection system. Such supervised methods require labelled adaptation data for the target speaker. In this paper, we introduce a method capable of unsupervised adaptation, using only speech from the target speaker without any labelling.

[118] Laszlo Toth, Joe Frankel, Gabor Gosztolya, and Simon King. Cross-lingual portability of MLP-based tandem features - a case study for English and Hungarian. In Proc. Interspeech, pages 2695-2698, Brisbane, Australia, September 2008. [ bib | .PDF ]
One promising approach for building ASR systems for less-resourced languages is cross-lingual adaptation. Tandem ASR is particularly well suited to such adaptation, as it includes two cascaded modelling steps: feature extraction using multi-layer perceptrons (MLPs), followed by modelling using a standard HMM. The language-specific tuning can be performed by adjusting the HMM only, leaving the MLP untouched. Here we examine the portability of feature extractor MLPs between an Indo-European (English) and a Finno-Ugric (Hungarian) language. We present experiments which use both conventional phone-posterior and articulatory feature (AF) detector MLPs, both trained on a much larger quantity of (English) data than the monolingual (Hungarian) system. We find that the cross-lingual configurations achieve similar performance to the monolingual system, and that, interestingly, the AF detectors lead to slightly worse performance, despite the expectation that they should be more language-independent than phone-based MLPs. However, the cross-lingual system outperforms all other configurations when the English phone MLP is adapted on the Hungarian data.

Keywords: tandem, ASR
[119] Vasilis Karaiskos, Simon King, Robert A. J. Clark, and Catherine Mayo. The Blizzard Challenge 2008. In Proc. Blizzard Challenge Workshop, Brisbane, Australia, September 2008. [ bib | .pdf ]
The Blizzard Challenge 2008 was the fourth annual Blizzard Challenge. This year, participants were asked to build two voices from a UK English corpus and one voice from a Mandarin Chinese corpus. This is the first time that a language other than English has been included and also the first time that a large UK English corpus has been available. In addition, the English corpus contained somewhat more expressive speech than that found in corpora used in previous Blizzard Challenges. To assist participants with limited resources or limited experience in UK-accented English or Mandarin, unaligned labels were provided for both corpora and for the test sentences. Participants could use the provided labels or create their own. An accent-specific pronunciation dictionary was also available for the English speaker. A set of test sentences was released to participants, who were given a limited time in which to synthesise them and submit the synthetic speech. An online listening test was conducted, to evaluate naturalness, intelligibility and degree of similarity to the original speaker.

Keywords: Blizzard
[120] Peter Bell and Simon King. Covariance updates for discriminative training by constrained line search. In Proc. Interspeech, Brisbane, Australia, September 2008. [ bib | .pdf ]
We investigate the recent Constrained Line Search algorithm for discriminative training of HMMs and propose an alternative formula for variance update. We compare the method to standard techniques on a phone recognition task.

[121] Olga Goubanova and Simon King. Bayesian networks for phone duration prediction. Speech Communication, 50(4):301-311, April 2008. [ bib | DOI ]
In a text-to-speech system, the duration of each phone may be predicted by a duration model. This model is usually trained using a database of phones with known durations; each phone (and the context it appears in) is characterised by a feature vector that is composed of a set of linguistic factor values. We describe the use of a graphical model - a Bayesian network - for predicting the duration of a phone, given the values for these factors. The network has one discrete variable for each of the linguistic factors and a single continuous variable for the phone's duration. Dependencies between variables (or the lack of them) are represented in the BN structure by arcs (or missing arcs) between pairs of nodes. During training, both the topology of the network and its parameters are learned from labelled data. We compare the results of the BN model with results for sums-of-products (SoP) and CART models on the same data. In terms of the root mean square error, the BN model performs much better than both CART and SoP models. In terms of correlation coefficient, the BN model performs better than the SoP model, and as well as the CART model. A BN model has certain advantages over CART and SoP models. Training SoP models requires a high degree of expertise. CART models do not deal with interactions between factors in any explicit way. As we demonstrate, a BN model can also make accurate predictions of a phone's duration, even when the values for some of the linguistic factors are unknown.

[122] Dong Wang, Joe Frankel, Javier Tejedor, and Simon King. A comparison of phone and grapheme-based spoken term detection. In Proc. ICASSP, pages 4969-4972, March 2008. [ bib | DOI ]
We propose grapheme-based sub-word units for spoken term detection (STD). Compared to phones, graphemes have a number of potential advantages. For out-of-vocabulary search terms, phone-based approaches must generate a pronunciation using letter-to-sound rules. Using graphemes obviates this potentially error-prone hard decision, shifting pronunciation modelling into the statistical models describing the observation space. In addition, long-span grapheme language models can be trained directly from large text corpora. We present experiments on Spanish and English data, comparing phone and grapheme-based STD. For Spanish, where phone and grapheme-based systems give similar transcription word error rates (WERs), grapheme-based STD significantly outperforms a phone-based approach. The converse is found for English, where the phone-based system outperforms a grapheme approach. However, we present additional analysis which suggests that phone-based STD performance levels may be achieved by a grapheme-based approach despite lower transcription accuracy, and that the two approaches may usefully be combined. We propose a number of directions for future development of these ideas, and suggest that if grapheme-based STD can match phone-based performance, the inherent flexibility in dealing with out-of-vocabulary terms makes this a desirable approach.

[123] Matthew P. Aylett and Simon King. Single speaker segmentation and inventory selection using dynamic time warping self organization and joint multigram mapping. In SSW06, pages 258-263, 2008. [ bib | .pdf ]
In speech synthesis the inventory of units is decided by inspection and on the basis of phonological and phonetic expertise. The ephone (or emergent phone) project at CSTR is investigating how self-organisation techniques can be applied to build an inventory based on collected acoustic data together with the constraints of a synthesis lexicon. In this paper we describe a prototype inventory creation method using dynamic time warping (DTW) for acoustic clustering and a joint multigram approach for relating a series of symbols that represent the speech to these emerged units. We initially examined two symbol sets: (1) a baseline of standard phones, and (2) orthographic symbols. The success of the approach is evaluated by comparing word boundaries generated by the emergent phones against those created using state-of-the-art HMM segmentation. Initial results suggest the DTW segmentation can match word boundaries with a root mean square error (RMSE) of 35ms. Results from mapping units onto phones resulted in a higher RMSE of 103ms. This error was increased when multiple multigram types were added and when the default unit clustering was altered from 40 (our baseline) to 10. Results for orthographic matching had a higher RMSE of 125ms. To conclude we discuss future work that we believe can reduce this error rate to a level sufficient for the techniques to be applied to a unit selection synthesis system.
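The acoustic clustering above relies on dynamic time warping to align variable-length acoustic segments. A generic textbook DTW distance (Euclidean local cost, standard three predecessor moves; not the project's actual clustering code) can be sketched in a few lines:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between feature sequences a (n, d)
    and b (m, d), using a Euclidean local cost and the standard three
    predecessor moves (match, insertion, deletion)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)  # cumulative-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# A sequence aligns to a time-stretched copy of itself at zero cost
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [0.0], [1.0], [2.0]])  # first frame repeated
print(dtw_distance(a, a))  # 0.0
print(dtw_distance(a, b))  # 0.0: the warp absorbs the repeated frame
```

This insensitivity to local time-stretching is what makes DTW a natural similarity measure for clustering acoustic realisations of the same underlying unit.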

[124] Volker Strom and Simon King. Investigating Festival's target cost function using perceptual experiments. In Proc. Interspeech, Brisbane, 2008. [ bib | .ps | .pdf ]
We describe an investigation of the target cost used in the Festival unit selection speech synthesis system. Our ultimate goal is to automatically learn a perceptually optimal target cost function. In this study, we investigated the behaviour of the target cost for one segment type. The target cost is based on counting the mismatches in several context features. A carrier sentence (“My name is Roger”) was synthesised using all 147,820 possible combinations of the diphones /n_ei/ and /ei_m/. 92 representative versions were selected and presented to listeners as 460 pairwise comparisons. The listeners' preference votes were used to analyse the behaviour of the target cost, with respect to the values of its component linguistic context features.

[125] J. Frankel and S. King. Factoring Gaussian precision matrices for linear dynamic models. Pattern Recognition Letters, 28(16):2264-2272, December 2007. [ bib | DOI | .pdf ]
The linear dynamic model (LDM), also known as the Kalman filter model, has been the subject of research in the engineering, control, and more recently, machine learning and speech technology communities. The Gaussian noise processes are usually assumed to have diagonal, or occasionally full, covariance matrices. A number of recent papers have considered modelling the precision rather than covariance matrix of a Gaussian distribution, and this work applies such ideas to the LDM. A Gaussian precision matrix P can be factored into the form P = U^T S U, where U is a transform and S is a diagonal matrix. By varying the form of U, the covariance can be specified as being diagonal or full, or used to model a given set of spatial dependencies. Furthermore, the transform and scaling components can be shared between models, allowing richer distributions with only marginally more parameters than required to specify diagonal covariances. The method described in this paper allows the construction of models with an appropriate number of parameters for the amount of available training data. We provide illustrative experimental results on synthetic and real speech data in which models with factored precision matrices and automatically-selected numbers of parameters are as good as or better than models with diagonal covariances on small data sets and as good as models with full covariance matrices on larger data sets.
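The factorisation P = U^T S U can be illustrated numerically; the choice of a unit upper-triangular U below is an assumption made for the sketch (the paper lets the form of U vary):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
# Unit upper-triangular transform U and positive diagonal scaling S
U = np.eye(p) + np.triu(0.1 * rng.standard_normal((p, p)), k=1)
S = np.diag(rng.uniform(0.5, 2.0, p))
P = U.T @ S @ U  # factored precision matrix

# P is symmetric positive definite by construction:
# S is positive definite and U is invertible (unit triangular)
assert np.allclose(P, P.T)
assert np.all(np.linalg.eigvalsh(P) > 0)

# With U = I the factorisation reduces to a diagonal precision,
# i.e. the familiar diagonal-covariance Gaussian
I = np.eye(p)
assert np.allclose(I.T @ S @ I, S)
```

Sharing U across models while keeping per-model diagonals S is what gives the "richer distributions with only marginally more parameters" mentioned in the abstract: each extra model costs only p scaling values.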

[126] Ö. Çetin, M. Magimai-Doss, A. Kantor, S. King, C. Bartels, J. Frankel, and K. Livescu. Monolingual and crosslingual comparison of tandem features derived from articulatory and phone MLPs. In Proc. ASRU, Kyoto, December 2007. IEEE. [ bib | .pdf ]
In recent years, the features derived from posteriors of a multilayer perceptron (MLP), known as tandem features, have proven to be very effective for automatic speech recognition. Most tandem features to date have relied on MLPs trained for phone classification. We recently showed on a relatively small data set that MLPs trained for articulatory feature classification can be equally effective. In this paper, we provide a similar comparison using MLPs trained on a much larger data set - 2000 hours of English conversational telephone speech. We also explore how portable phone- and articulatory-feature-based tandem features are to an entirely different language - Mandarin - without any retraining. We find that while phone-based features perform slightly better in the matched-language condition, they perform significantly better in the cross-language condition. Yet, in the cross-language condition, neither approach is as effective as the tandem features extracted from an MLP trained on a relatively small amount of in-domain data. Beyond feature concatenation, we also explore novel observation modelling schemes that allow for greater flexibility in combining the tandem and standard features at hidden Markov model (HMM) outputs.

[127] J. Frankel, M. Wester, and S. King. Articulatory feature recognition using dynamic Bayesian networks. Computer Speech & Language, 21(4):620-640, October 2007. [ bib | .pdf ]
We describe a dynamic Bayesian network for articulatory feature recognition. The model is intended to be a component of a speech recognizer that avoids the problems of conventional “beads-on-a-string” phoneme-based models. We demonstrate that the model gives superior recognition of articulatory features from the speech signal compared with a state-of-the-art neural network system. We also introduce a training algorithm that offers two major advances: it does not require time-aligned feature labels and it allows the model to learn a set of asynchronous feature changes in a data-driven manner.

[128] J. Frankel, M. Magimai-Doss, S. King, K. Livescu, and Ö. Çetin. Articulatory feature classifiers trained on 2000 hours of telephone speech. In Proc. Interspeech, Antwerp, Belgium, August 2007. [ bib | .pdf ]
This paper is intended to advertise the public availability of the articulatory feature (AF) classification multi-layer perceptrons (MLPs) which were used in the Johns Hopkins 2006 summer workshop. We describe the design choices, data preparation, AF label generation, and the training of MLPs for feature classification on close to 2000 hours of telephone speech. In addition, we present some analysis of the MLPs in terms of classification accuracy and confusions along with a brief summary of the results obtained during the workshop using the MLPs. We invite interested parties to make use of these MLPs.

[129] Junichi Yamagishi, Takao Kobayashi, Steve Renals, Simon King, Heiga Zen, Tomoki Toda, and Keiichi Tokuda. Improved average-voice-based speech synthesis using gender-mixed modeling and a parameter generation algorithm considering GV. In Proc. 6th ISCA Workshop on Speech Synthesis (SSW-6), August 2007. [ bib | .pdf ]
For constructing a speech synthesis system which can achieve diverse voices, we have been developing a speaker independent approach of HMM-based speech synthesis in which statistical average voice models are adapted to a target speaker using a small amount of speech data. In this paper, we incorporate a high-quality speech vocoding method STRAIGHT and a parameter generation algorithm with global variance into the system for improving quality of synthetic speech. Furthermore, we introduce a feature-space speaker adaptive training algorithm and a gender mixed modeling technique for conducting further normalization of the average voice model. We build an English text-to-speech system using these techniques and show the performance of the system.

[130] Robert A. J. Clark, Monika Podsiadlo, Mark Fraser, Catherine Mayo, and Simon King. Statistical analysis of the Blizzard Challenge 2007 listening test results. In Proc. Blizzard 2007 (in Proc. Sixth ISCA Workshop on Speech Synthesis), Bonn, Germany, August 2007. [ bib | .pdf ]
Blizzard 2007 is the third Blizzard Challenge, in which participants build voices from a common dataset. A large listening test is conducted which allows comparison of systems in terms of naturalness and intelligibility. New sections were added to the listening test for 2007 to test the perceived similarity of the speaker's identity between natural and synthetic speech. In this paper, we present the results of the listening test and the subsequent statistical analysis.

Keywords: Blizzard
[131] Mark Fraser and Simon King. The Blizzard Challenge 2007. In Proc. Blizzard 2007 (in Proc. Sixth ISCA Workshop on Speech Synthesis), Bonn, Germany, August 2007. [ bib | .pdf ]
In Blizzard 2007, the third Blizzard Challenge, participants were asked to build voices from a dataset, a defined subset and, following certain constraints, a subset of their choice. A set of test sentences was then released to be synthesised. An online evaluation of the submitted synthesised sentences focused on naturalness and intelligibility, and added new sections for degree of similarity to the original speaker, and similarity in terms of naturalness of pairs of sentences from different systems. We summarise this year's Blizzard Challenge and look ahead to possible designs for Blizzard 2008 in the light of participant and listener feedback.

Keywords: Blizzard
[132] Volker Strom, Ani Nenkova, Robert Clark, Yolanda Vazquez-Alvarez, Jason Brenier, Simon King, and Dan Jurafsky. Modelling prominence and emphasis improves unit-selection synthesis. In Proc. Interspeech 2007, Antwerp, Belgium, August 2007. [ bib | .pdf ]
We describe the results of large scale perception experiments showing improvements in synthesising two distinct kinds of prominence: standard pitch-accent and strong emphatic accents. Previously prominence assignment has been mainly evaluated by computing accuracy on a prominence-labelled test set. By contrast we integrated an automatic pitch-accent classifier into the unit selection target cost and showed that listeners preferred these synthesised sentences. We also describe an improved recording script for collecting emphatic accents, and show that generating emphatic accents leads to further improvements in the fiction genre over incorporating pitch accent only. Finally, we show differences in the effects of prominence between child-directed speech and news and fiction genres. Index Terms: speech synthesis, prosody, prominence, pitch accent, unit selection

[133] Peter Bell and Simon King. Sparse Gaussian graphical models for speech recognition. In Proc. Interspeech 2007, Antwerp, Belgium, August 2007. [ bib | .pdf ]
We address the problem of learning the structure of Gaussian graphical models for use in automatic speech recognition, a means of controlling the form of the inverse covariance matrices of such systems. With particular focus on data sparsity issues, we implement a method for imposing graphical model structure on a Gaussian mixture system, using a convex optimisation technique to maximise a penalised likelihood expression. The results of initial experiments on a phone recognition task show a performance improvement over an equivalent full-covariance system.

[134] Ö. Çetin, A. Kantor, S. King, C. Bartels, M. Magimai-Doss, J. Frankel, and K. Livescu. An articulatory feature-based tandem approach and factored observation modeling. In Proc. ICASSP, Honolulu, April 2007. [ bib | .pdf ]
The so-called tandem approach, where the posteriors of a multilayer perceptron (MLP) classifier are used as features in an automatic speech recognition (ASR) system, has proven to be a very effective method. Most tandem approaches to date have relied on MLPs trained for phone classification, and appended the posterior features to standard features in a hidden Markov model (HMM) system. In this paper, we develop an alternative tandem approach based on MLPs trained for articulatory feature (AF) classification. We also develop a factored observation model for characterizing the posterior and standard features at the HMM outputs, allowing for separate hidden mixture and state-tying structures for each factor. In experiments on a subset of Switchboard, we show that the AF-based tandem approach is as effective as the phone-based approach, and that the factored observation model significantly outperforms the simple feature concatenation approach while using fewer parameters.

[135] K. Livescu, Ö. Çetin, M. Hasegawa-Johnson, S. King, C. Bartels, N. Borges, A. Kantor, P. Lal, L. Yung, S. Bezman, Dawson-Haggerty, B. Woods, J. Frankel, M. Magimai-Doss, and K. Saenko. Articulatory feature-based methods for acoustic and audio-visual speech recognition: Summary from the 2006 JHU Summer Workshop. In Proc. ICASSP, Honolulu, April 2007. [ bib | .pdf ]
We report on investigations, conducted at the 2006 Johns Hopkins Workshop, into the use of articulatory features (AFs) for observation and pronunciation models in speech recognition. In the area of observation modeling, we use the outputs of AF classifiers both directly, in an extension of hybrid HMM/neural network models, and as part of the observation vector, an extension of the tandem approach. In the area of pronunciation modeling, we investigate a model having multiple streams of AF states with soft synchrony constraints, for both audio-only and audio-visual recognition. The models are implemented as dynamic Bayesian networks, and tested on tasks from the Small-Vocabulary Switchboard (SVitchboard) corpus and the CUAVE audio-visual digits corpus. Finally, we analyze AF classification and forced alignment using a newly collected set of feature-level manual transcriptions.

[136] K. Livescu, A. Bezman, N. Borges, L. Yung, Ö. Çetin, J. Frankel, S. King, M. Magimai-Doss, X. Chi, and L. Lavoie. Manual transcription of conversational speech at the articulatory feature level. In Proc. ICASSP, Honolulu, April 2007. [ bib | .pdf ]
We present an approach for the manual labeling of speech at the articulatory feature level, and a new set of labeled conversational speech collected using this approach. A detailed transcription, including overlapping or reduced gestures, is useful for studying the great pronunciation variability in conversational speech. It also facilitates the testing of feature classifiers, such as those used in articulatory approaches to automatic speech recognition. We describe an effort to transcribe a small set of utterances drawn from the Switchboard database using eight articulatory tiers. Two transcribers have labeled these utterances in a multi-pass strategy, allowing for correction of errors. We describe the data collection methods and analyze the data to determine how quickly and reliably this type of transcription can be done. Finally, we demonstrate one use of the new data set by testing a set of multilayer perceptron feature classifiers against both the manual labels and forced alignments.

[137] S. King, J. Frankel, K. Livescu, E. McDermott, K. Richmond, and M. Wester. Speech production knowledge in automatic speech recognition. Journal of the Acoustical Society of America, 121(2):723-742, February 2007. [ bib | .pdf ]
Although much is known about how speech is produced, and research into speech production has resulted in measured articulatory data, feature systems of different kinds and numerous models, speech production knowledge is almost totally ignored in current mainstream approaches to automatic speech recognition. Representations of speech production allow simple explanations for many phenomena observed in speech which cannot be easily analyzed from either acoustic signal or phonetic transcription alone. In this article, we provide a survey of a growing body of work in which such representations are used to improve automatic speech recognition.

[138] J. Frankel and S. King. Speech recognition using linear dynamic models. IEEE Transactions on Speech and Audio Processing, 15(1):246-256, January 2007. [ bib | .ps | .pdf ]
The majority of automatic speech recognition (ASR) systems rely on hidden Markov models, in which Gaussian mixtures model the output distributions associated with sub-phone states. This approach, whilst successful, models consecutive feature vectors (augmented to include derivative information) as statistically independent. Furthermore, spatial correlations present in speech parameters are frequently ignored through the use of diagonal covariance matrices. This paper continues the work of Digalakis and others who proposed instead a first-order linear state-space model which has the capacity to model underlying dynamics, and furthermore give a model of spatial correlations. This paper examines the assumptions made in applying such a model and shows that the addition of a hidden dynamic state leads to increases in accuracy over otherwise equivalent static models. We also propose a time-asynchronous decoding strategy suited to recognition with segment models. We describe implementation of decoding for linear dynamic models and present TIMIT phone recognition results.
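The generative equations of the first-order linear state-space model discussed here can be simulated directly; the matrices below are toy values chosen for illustration, not parameters from the paper:

```python
import numpy as np

# First-order linear dynamic model (Kalman-filter style):
#   x_t = F x_{t-1} + w_t,  w_t ~ N(0, Q)   hidden dynamic state
#   y_t = H x_t + v_t,      v_t ~ N(0, R)   observed feature vector
rng = np.random.default_rng(2)
F = np.array([[0.9, 0.1],
              [0.0, 0.8]])       # state transition (toy values)
H = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])       # maps 2-dim state to 3-dim observation
Q = 0.01 * np.eye(2)             # state (process) noise covariance
R = 0.05 * np.eye(3)             # observation noise covariance

x = np.zeros(2)
ys = []
for _ in range(50):
    x = F @ x + rng.multivariate_normal(np.zeros(2), Q)
    ys.append(H @ x + rng.multivariate_normal(np.zeros(3), R))
ys = np.array(ys)
print(ys.shape)  # (50, 3)
```

Because successive observations share the hidden state x, consecutive frames are statistically dependent and spatially correlated through H, in contrast to the independence assumptions of a standard HMM with diagonal covariances.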

[139] Robert A. J. Clark, Korin Richmond, and Simon King. Multisyn: Open-domain unit selection for the Festival speech synthesis system. Speech Communication, 49(4):317-330, 2007. [ bib | DOI | .pdf ]
We present the implementation and evaluation of an open-domain unit selection speech synthesis engine designed to be flexible enough to encourage further unit selection research and allow rapid voice development by users with minimal speech synthesis knowledge and experience. We address the issues of automatically processing speech data into a usable voice using automatic segmentation techniques and how the knowledge obtained at labelling time can be exploited at synthesis time. We describe target cost and join cost implementation for such a system and describe the outcome of building voices with a number of different sized datasets. We show that, in a competitive evaluation, voices built using this technology compare favourably to other systems.

[140] Jithendra Vepa and Simon King. Subjective evaluation of join cost and smoothing methods for unit selection speech synthesis. IEEE Transactions on Speech and Audio Processing, 14(5):1763-1771, September 2006. [ bib | .pdf ]
In unit selection-based concatenative speech synthesis, join cost (also known as concatenation cost), which measures how well two units can be joined together, is one of the main criteria for selecting appropriate units from the inventory. Usually, some form of local parameter smoothing is also needed to disguise the remaining discontinuities. This paper presents a subjective evaluation of three join cost functions and three smoothing methods. We describe the design and performance of a listening test. The three join cost functions were taken from our previous study, where we proposed join cost functions derived from spectral distances, which have good correlations with perceptual scores obtained for a range of concatenation discontinuities. This evaluation allows us to further validate their ability to predict concatenation discontinuities. The units for synthesis stimuli are obtained from a state-of-the-art unit selection text-to-speech system: rVoice from Rhetorical Systems Ltd. In this paper, we report listeners' preferences for each join cost in combination with each smoothing method.

[141] J. Frankel and S. King. Observation process adaptation for linear dynamic models. Speech Communication, 48(9):1192-1199, September 2006. [ bib | .ps | .pdf ]
This work introduces two methods for adapting the observation process parameters of linear dynamic models (LDM) or other linear-Gaussian models. The first method uses the expectation-maximization (EM) algorithm to estimate transforms for location and covariance parameters, and the second uses a generalized EM (GEM) approach which reduces the computation involved in making updates from O(p^6) to O(p^3), where p is the feature dimension. We present the results of speaker adaptation on TIMIT phone classification and recognition experiments with relative error reductions of up to 6%. Importantly, we find minimal differences in the results from EM and GEM. We therefore propose that the GEM approach be applied to adaptation of hidden Markov models which use non-diagonal covariances. We provide the necessary update equations.

[142] R. Clark, K. Richmond, V. Strom, and S. King. Multisyn voices for the Blizzard Challenge 2006. In Proc. Blizzard Challenge Workshop (Interspeech Satellite), Pittsburgh, USA, September 2006. (http://festvox.org/blizzard/blizzard2006.html). [ bib | .pdf ]
This paper describes the process of building unit selection voices for the Festival Multisyn engine using the ATR dataset provided for the Blizzard Challenge 2006. We begin by discussing recent improvements that we have made to the Multisyn voice building process, prompted by our participation in the Blizzard Challenge 2006. We then go on to discuss our interpretation of the results observed. Finally, we conclude with some comments and suggestions for the formulation of future Blizzard Challenges.

[143] Robert A. J. Clark and Simon King. Joint prosodic and segmental unit selection speech synthesis. In Proc. Interspeech 2006, Pittsburgh, USA, September 2006. [ bib | .ps | .pdf ]
We describe a unit selection technique for text-to-speech synthesis which jointly searches the space of possible diphone sequences and the space of possible prosodic unit sequences in order to produce synthetic speech with more natural prosody. We demonstrate that this search, although currently computationally expensive, can achieve improved intonation compared to a baseline in which only the space of possible diphone sequences is searched. We discuss ways in which the search could be made sufficiently efficient for use in a real-time system.

[144] Simon King. Handling variation in speech and language processing. In Keith Brown, editor, Encyclopedia of Language and Linguistics. Elsevier, 2nd edition, 2006. [ bib ]
[145] Simon King. Language variation in speech technologies. In Keith Brown, editor, Encyclopedia of Language and Linguistics. Elsevier, 2nd edition, 2006. [ bib ]
[146] Volker Strom, Robert Clark, and Simon King. Expressive prosody for unit-selection speech synthesis. In Proc. Interspeech, Pittsburgh, 2006. [ bib | .ps | .pdf ]
Current unit selection speech synthesis voices cannot produce emphasis or interrogative contours because of a lack of the necessary prosodic variation in the recorded speech database. A method of recording script design is proposed which addresses this shortcoming. Appropriate components were added to the target cost function of the Festival Multisyn engine, and a perceptual evaluation showed a clear preference over the baseline system.

[147] Robert A.J. Clark, Korin Richmond, and Simon King. Multisyn voices from ARCTIC data for the Blizzard challenge. In Proc. Interspeech 2005, September 2005. [ bib | .pdf ]
This paper describes the process of building unit selection voices for the Festival Multisyn engine using four ARCTIC datasets, as part of the Blizzard evaluation challenge. The build process is almost entirely automatic, with very little need for human intervention. We discuss the difference in the evaluation results for each voice and evaluate the suitability of the ARCTIC datasets for building this type of voice.

[148] C. Mayo, R. A. J. Clark, and S. King. Multidimensional scaling of listener responses to synthetic speech. In Proc. Interspeech 2005, Lisbon, Portugal, September 2005. [ bib | .pdf ]
[149] J. Frankel and S. King. A hybrid ANN/DBN approach to articulatory feature recognition. In Proc. Eurospeech, Lisbon, September 2005. [ bib | .ps | .pdf ]
Artificial neural networks (ANN) have proven to be well suited to the task of articulatory feature (AF) recognition. Previous studies have taken a cascaded approach where separate ANNs are trained for each feature group, making the assumption that features are statistically independent. We address this by using ANNs to provide virtual evidence to a dynamic Bayesian network (DBN). This gives a hybrid ANN/DBN model and allows modelling of inter-feature dependencies. We demonstrate significant increases in AF recognition accuracy from modelling dependencies between features, and present the results of embedded training experiments in which a set of asynchronous feature changes are learned. Furthermore, we report on the application of a Viterbi training scheme in which we alternate between realigning the AF training labels and retraining the ANNs.

[150] Alexander Gutkin and Simon King. Inductive String Template-Based Learning of Spoken Language. In Hugo Gamboa and Ana Fred, editors, Proc. 5th International Workshop on Pattern Recognition in Information Systems (PRIS-2005), In conjunction with the 7th International Conference on Enterprise Information Systems (ICEIS-2005), pages 43-51, Miami, USA, May 2005. INSTICC Press. [ bib | .ps.gz | .pdf ]
This paper deals with the formulation of an alternative structural approach to the speech recognition problem. In this approach, we require both the representation and the learning algorithms defined on it to be linguistically meaningful, which allows the speech recognition system to discover the nature of the linguistic classes of speech patterns corresponding to the speech waveforms. We briefly discuss the current formalisms and propose an alternative: a phonologically inspired, string-based inductive speech representation, defined within an analytical framework specifically designed to address the issues of class and object representation. We also present the results of phoneme classification experiments conducted on the TIMIT corpus of continuous speech.

[151] Alexander Gutkin and Simon King. Detection of Symbolic Gestural Events in Articulatory Data for Use in Structural Representations of Continuous Speech. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-05), volume I, pages 885-888, Philadelphia, PA, USA, March 2005. IEEE Signal Processing Society Press. [ bib | .ps.gz | .pdf ]
One of the crucial issues which often needs to be addressed in structural approaches to speech representation is the choice of fundamental symbolic units of representation. In this paper, a physiologically inspired methodology for defining these symbolic atomic units in terms of primitive articulatory events is proposed. It is shown how the atomic articulatory events (gestures) can be detected directly in the articulatory data. An algorithm for evaluating the reliability of the articulatory events is described and promising results of the experiments conducted on MOCHA articulatory database are presented.

[152] Simon King, Chris Bartels, and Jeff Bilmes. Svitchboard 1: Small vocabulary tasks from switchboard 1. In Proc. Interspeech 2005, Lisbon, Portugal, 2005. [ bib | .pdf ]
We present a conversational telephone speech data set designed to support research on novel acoustic models. Small vocabulary tasks from 10 words up to 500 words are defined using subsets of the Switchboard-1 corpus; each task has a completely closed vocabulary (an OOV rate of 0%). We justify the need for these tasks, describe the algorithm for selecting them from a large corpus, give a statistical analysis of the data and present baseline whole-word hidden Markov model recognition results. The goal of the paper is to define a common data set and to encourage other researchers to use it.
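The closed-vocabulary property can be illustrated with a simplified selection sketch (this is not the paper's actual selection algorithm; the toy utterances and vocabulary size below are invented): keeping only utterances whose every word lies within a chosen small vocabulary guarantees a 0% OOV rate by construction.

```python
from collections import Counter

def select_closed_vocab(utterances, vocab_size):
    """Keep utterances made up entirely of the vocab_size most frequent words."""
    counts = Counter(w for utt in utterances for w in utt)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    selected = [utt for utt in utterances if all(w in vocab for w in utt)]
    return vocab, selected

# Toy corpus: only utterances fully covered by the top-3 words survive.
utts = [["yeah"], ["oh", "yeah"], ["i", "see"], ["oh", "i", "know"]]
vocab, selected = select_closed_vocab(utts, 3)
```

Any utterance containing an out-of-vocabulary word is discarded outright, so every task defined this way has a completely closed vocabulary.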

[153] Olga Goubanova and Simon King. Predicting consonant duration with Bayesian belief networks. In Proc. Interspeech 2005, Lisbon, Portugal, 2005. [ bib | .pdf ]
Consonant duration is influenced by a number of linguistic factors, such as the consonant's identity, within-word position, stress level of the previous and following vowels, phrasal position of the word containing the target consonant, its syllabic position, and the identity of the previous and following segments. In our work, consonant duration is predicted from a Bayesian belief network (BN) consisting of discrete nodes for the linguistic factors and a single continuous node for the consonant's duration. Interactions between factors are represented as conditional dependency arcs in this graphical model. Given the parameters of the belief network, the duration of each consonant in the test set is then predicted as the value with the maximum probability. We compare the results of the belief network model with those of sums-of-products (SoP) and classification and regression tree (CART) models using the same data. In terms of RMS error, our BN model performs better than both the CART and SoP models. In terms of the correlation coefficient, our BN model performs better than the SoP model, and no worse than the CART model. In addition, the Bayesian model reliably predicts consonant duration in cases of missing or hidden linguistic factors.
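The handling of missing factors can be sketched with a toy conditional table (the factors, values and mean durations below are invented for illustration; the paper's network has more factors and learned parameters): when a parent factor is unobserved, the prediction averages over its values, weighted by their prior probabilities.

```python
def predict_duration(table, priors, stress=None, position=None):
    """Mean duration (ms) from a (stress, position) -> mean table;
    marginalise over any unobserved factor using its prior distribution."""
    stresses = {stress: 1.0} if stress is not None else priors["stress"]
    positions = {position: 1.0} if position is not None else priors["position"]
    total = sum(ws * wp * table[(s, p)]
                for s, ws in stresses.items()
                for p, wp in positions.items())
    mass = sum(ws * wp for ws in stresses.values() for wp in positions.values())
    return total / mass

table = {("stressed", "initial"): 95.0, ("stressed", "final"): 120.0,
         ("unstressed", "initial"): 70.0, ("unstressed", "final"): 90.0}
priors = {"stress": {"stressed": 0.4, "unstressed": 0.6},
          "position": {"initial": 0.5, "final": 0.5}}
```

With both factors observed the prediction is a table lookup; with stress hidden, the "final" prediction becomes 0.4 x 120 + 0.6 x 90 = 102 ms, which is how a BN can still produce a sensible duration when a linguistic factor is missing.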

[154] M. Wester, J. Frankel, and S. King. Asynchronous articulatory feature recognition using dynamic Bayesian networks. In Proc. IEICI Beyond HMM Workshop, Kyoto, December 2004. [ bib | .ps | .pdf ]
This paper builds on previous work where dynamic Bayesian networks (DBN) were proposed as a model for articulatory feature recognition. Using DBNs makes it possible to model the dependencies between features, an addition to previous approaches which was found to improve feature recognition performance. The DBN results were promising, with accuracy close to that of artificial neural networks (ANNs). However, the system was trained on canonical labels, leading to an overly strong set of constraints on feature co-occurrence. In this study, we describe an embedded training scheme which learns a set of data-driven asynchronous feature changes where supported in the data. Using a subset of the OGI Numbers corpus, we describe articulatory feature recognition experiments using both canonically-trained and asynchronous DBNs. Performance using DBNs is found to exceed that of ANNs trained on an identical task, giving a higher recognition accuracy. Furthermore, inter-feature dependencies result in a more structured model, giving rise to fewer feature combinations in the recognition output. In addition to an empirical evaluation of this modelling approach, we give a qualitative analysis, comparing asynchrony found through our data-driven methods to the asynchrony which may be expected on the basis of linguistic knowledge.

[155] Yoshinori Shiga and Simon King. Source-filter separation for articulation-to-speech synthesis. In Proc. ICSLP, Jeju, Korea, October 2004. [ bib | .ps | .pdf ]
In this paper we examine a method for separating out the vocal-tract filter response from the voice source characteristic using a large articulatory database. The method realises such separation for voiced speech using an iterative approximation procedure under the assumption that the speech production process is a linear system composed of a voice source and a vocal-tract filter, and that each of the components is controlled independently by different sets of factors. Experimental results show that the spectral variation is evidently influenced by the fundamental frequency or the power of speech, and that the tendency of the variation may be related closely to speaker identity. The method enables independent control over the voice source characteristic in our articulation-to-speech synthesis.

[156] Jithendra Vepa and Simon King. Subjective evaluation of join cost functions used in unit selection speech synthesis. In Proc. 8th International Conference on Spoken Language Processing (ICSLP), Jeju, Korea, October 2004. [ bib | .pdf ]
In our previous papers, we have proposed join cost functions derived from spectral distances, which have good correlations with perceptual scores obtained for a range of concatenation discontinuities. To further validate their ability to predict concatenation discontinuities, we have chosen the best three spectral distances and evaluated them subjectively in a listening test. The unit sequences for synthesis stimuli are obtained from a state-of-the-art unit selection text-to-speech system: rVoice from Rhetorical Systems Ltd. In this paper, we report listeners' preferences for each of the three join cost functions.

[157] Yoshinori Shiga and Simon King. Estimating detailed spectral envelopes using articulatory clustering. In Proc. ICSLP, Jeju, Korea, October 2004. [ bib | .ps | .pdf ]
This paper presents an articulatory-acoustic mapping where detailed spectral envelopes are estimated. During the estimation, the harmonics of a range of F0 values are derived from the spectra of multiple voiced speech signals vocalized with similar articulator settings. The envelope formed by these harmonics is represented by a cepstrum, which is computed by fitting the peaks of all the harmonics based on the weighted least squares method in the frequency domain. The experimental result shows that the spectral envelopes are estimated with the highest accuracy when the cepstral order is 48-64 for a female speaker, which suggests that representing the real response of the vocal tract requires high-quefrency elements that conventional speech synthesis methods are forced to discard in order to eliminate the pitch component of speech.

[158] Alexander Gutkin and Simon King. Phone classification in pseudo-Euclidean vector spaces. In Proc. 8th International Conference on Spoken Language Processing (ICSLP), volume II, pages 1453-1457, Jeju Island, Korea, October 2004. [ bib | .ps.gz | .pdf ]
Recently we have proposed a structural framework for modelling speech, which is based on patterns of phonological distinctive features, a linguistically well-motivated alternative to standard vector-space acoustic models like HMMs. This framework gives considerable representational freedom by working with features that have explicit linguistic interpretation, but at the expense of the ability to apply the wide range of analytical decision algorithms available in vector spaces, restricting oneself to more computationally expensive and less-developed symbolic metric tools. In this paper we show that a dissimilarity-based distance-preserving transition from the original structural representation to a corresponding pseudo-Euclidean vector space is possible. Promising results of phone classification experiments conducted on the TIMIT database are reported.

[159] J. Frankel, M. Wester, and S. King. Articulatory feature recognition using dynamic Bayesian networks. In Proc. ICSLP, September 2004. [ bib | .ps | .pdf ]
This paper describes the use of dynamic Bayesian networks for the task of articulatory feature recognition. We show that by modeling the dependencies between a set of 6 multi-leveled articulatory features, recognition accuracy is increased over an equivalent system in which features are considered independent. Results are compared to those found using artificial neural networks on an identical task.

[160] Alexander Gutkin and Simon King. Structural Representation of Speech for Phonetic Classification. In Proc. 17th International Conference on Pattern Recognition (ICPR), volume 3, pages 438-441, Cambridge, UK, August 2004. IEEE Computer Society Press. [ bib | .ps.gz | .pdf ]
This paper explores the issues involved in using symbolic metric algorithms for automatic speech recognition (ASR), via a structural representation of speech. This representation is based on a set of phonological distinctive features which is a linguistically well-motivated alternative to the “beads-on-a-string” view of speech that is standard in current ASR systems. We report the promising results of phoneme classification experiments conducted on a standard continuous speech task.

[161] J. Vepa and S. King. Subjective evaluation of join cost and smoothing methods. In Proc. 5th ISCA speech synthesis workshop, Pittsburgh, USA, June 2004. [ bib | .pdf ]
In our previous papers, we have proposed join cost functions derived from spectral distances, which have good correlations with perceptual scores obtained for a range of concatenation discontinuities. To further validate their ability to predict concatenation discontinuities, we have chosen the best three spectral distances and evaluated them subjectively in a listening test. The units for synthesis stimuli are obtained from a state-of-the-art unit selection text-to-speech system: `rVoice' from Rhetorical Systems Ltd. We also compared three different smoothing methods in this listening test. In this paper, we report listeners' preferences for each join cost in combination with each smoothing method.

[162] Yoshinori Shiga and Simon King. Accurate spectral envelope estimation for articulation-to-speech synthesis. In Proc. 5th ISCA Speech Synthesis Workshop, pages 19-24, CMU, Pittsburgh, USA, June 2004. [ bib | .ps | .pdf ]
This paper introduces a novel articulatory-acoustic mapping in which detailed spectral envelopes are estimated based on the cepstrum, inclusive of the high-quefrency elements which are discarded in conventional speech synthesis to eliminate the pitch component of speech. For this estimation, the method deals with the harmonics of multiple voiced-speech spectra so that several sets of harmonics can be obtained at various pitch frequencies to form a spectral envelope. The experimental result shows that the method estimates spectral envelopes with the highest accuracy when the cepstral order is 48-64, which suggests that the higher order coefficients are required to represent detailed envelopes reflecting the real vocal-tract responses.

[163] Jithendra Vepa and Simon King. Join cost for unit selection speech synthesis. In Abeer Alwan and Shri Narayanan, editors, Speech Synthesis. Prentice Hall, 2004. [ bib | .ps ]
[164] Robert A.J. Clark, Korin Richmond, and Simon King. Festival 2 - build your own general purpose unit selection speech synthesiser. In Proc. 5th ISCA workshop on speech synthesis, 2004. [ bib | .ps | .pdf ]
This paper describes version 2 of the Festival speech synthesis system. Festival 2 provides a development environment for concatenative speech synthesis, and now includes a general purpose unit selection speech synthesis engine. We discuss various aspects of unit selection speech synthesis, focusing on the research issues that relate to voice design and the automation of the voice development process.

[165] Ben Gillett and Simon King. Transforming F0 contours. In Proc. Eurospeech, Geneva, September 2003. [ bib | .pdf ]
Voice transformation is the process of transforming the characteristics of speech uttered by a source speaker, such that a listener would believe the speech was uttered by a target speaker. Training F0 contour generation models for speech synthesis requires a large corpus of speech. If it were possible to adapt the F0 contour of one speaker to sound like that of another speaker, using a small, easily obtainable parameter set, this would be extremely valuable. We present a new method for the transformation of F0 contours from one speaker to another based on a small linguistically motivated parameter set. The system performs a piecewise linear mapping using these parameters. A perceptual experiment clearly demonstrates that the presented system is at least as good as an existing technique for all speaker pairs, and that in many cases it is much better, and almost as good as using the target F0 contour.
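A minimal sketch of such a piecewise linear mapping (the anchor values below are invented; the paper's parameter set is linguistically motivated): corresponding F0 anchors for the two speakers, e.g. minimum, median and maximum, define line segments that carry source-speaker F0 values into the target speaker's range.

```python
def make_f0_map(src_anchors, tgt_anchors):
    """Build a piecewise linear F0 mapping from matched anchor lists (Hz)."""
    def f0_map(f0):
        # Find the source segment containing f0 and interpolate linearly.
        for (s0, s1), (t0, t1) in zip(zip(src_anchors, src_anchors[1:]),
                                      zip(tgt_anchors, tgt_anchors[1:])):
            if s0 <= f0 <= s1:
                return t0 + (f0 - s0) * (t1 - t0) / (s1 - s0)
        # Clip values outside the source speaker's observed range.
        return tgt_anchors[0] if f0 < src_anchors[0] else tgt_anchors[-1]
    return f0_map

# Invented anchors: source (min, median, max) -> target (min, median, max).
f0_map = make_f0_map([80.0, 120.0, 200.0], [150.0, 210.0, 320.0])
```

Because only a handful of anchor values per speaker are needed, such a mapping can be estimated from far less target-speaker data than a full contour-generation model would require.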

[166] Yoshinori Shiga and Simon King. Estimating the spectral envelope of voiced speech using multi-frame analysis. In Proc. Eurospeech-2003, volume 3, pages 1737-1740, Geneva, Switzerland, September 2003. [ bib | .ps | .pdf ]
This paper proposes a novel approach for estimating the spectral envelope of voiced speech independently of its harmonic structure. Because of the quasi-periodicity of voiced speech, its spectrum indicates harmonic structure and only has energy at frequencies corresponding to integral multiples of F0. It is hence impossible to identify transfer characteristics between the adjacent harmonics. In order to resolve this problem, Multi-frame Analysis (MFA) is introduced. The MFA estimates a spectral envelope using many portions of speech which are vocalised using the same vocal-tract shape. Since each of the portions usually has a different F0 and ensuing different harmonic structure, a number of harmonics can be obtained at various frequencies to form a spectral envelope. The method thereby gives a closer approximation to the vocal-tract transfer function.
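The core idea can be sketched as follows (an illustration with an invented stand-in envelope, not the paper's estimation procedure): harmonics sit only at integer multiples of F0, but pooling frames spoken with the same vocal-tract shape at different F0 values samples the envelope much more densely than any single frame can.

```python
def pool_harmonics(frames, fmax):
    """frames: list of (f0, amplitude_function) pairs.
    Returns sorted (frequency, amplitude) samples from all harmonics."""
    samples = []
    for f0, amp in frames:
        k = 1
        while k * f0 <= fmax:           # harmonics at k*F0 only
            samples.append((k * f0, amp(k * f0)))
            k += 1
    return sorted(samples)

# Stand-in for the (unknown) vocal-tract response; invented for illustration.
envelope = lambda f: 1.0 / (1.0 + (f / 3000.0) ** 2)

# Three frames with the same articulation but slightly different F0.
pooled = pool_harmonics([(200.0, envelope), (230.0, envelope), (260.0, envelope)],
                        4000.0)
```

A single 200 Hz frame yields only 20 envelope samples below 4 kHz; pooling the three frames yields 52, filling in frequencies between the harmonics of any one frame.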

[167] James Horlock and Simon King. Named entity extraction from word lattices. In Proc. Eurospeech, Geneva, September 2003. [ bib | .pdf ]
We present a method for named entity extraction from word lattices produced by a speech recogniser. Previous work by others on named entity extraction from speech has used either a manual transcript or 1-best recogniser output. We describe how a single Viterbi search can recover both the named entity sequence and the corresponding word sequence from a word lattice, and further that it is possible to trade off an increase in word error rate for improved named entity extraction.

[168] James Horlock and Simon King. Discriminative methods for improving named entity extraction on speech data. In Proc. Eurospeech, Geneva, September 2003. [ bib | .pdf ]
In this paper we present a method of discriminatively training language models for spoken language understanding; we show improvements in named entity F-scores on speech data using these improved language models. A comparison between theoretical probabilities associated with manual markup and the actual probabilities of output markup is used to identify probabilities requiring adjustment. We present results which support our hypothesis that improvements in F-scores are possible by using either previously used training data or held out development data to improve discrimination amongst a set of N-gram language models.

[169] Ben Gillett and Simon King. Transforming voice quality. In Proc. Eurospeech, Geneva, September 2003. [ bib | .pdf ]
Voice transformation is the process of transforming the characteristics of speech uttered by a source speaker, such that a listener would believe the speech was uttered by a target speaker. In this paper we address the problem of transforming voice quality. We do not attempt to transform prosody. Our system has two main parts corresponding to the two components of the source-filter model of speech production. The first component transforms the spectral envelope as represented by a linear prediction model. The transformation is achieved using a Gaussian mixture model, which is trained on aligned speech from source and target speakers. The second part of the system predicts the spectral detail from the transformed linear prediction coefficients. A novel approach is proposed, which is based on a classifier and residual codebooks. On the basis of a number of performance metrics it outperforms existing systems.

[170] Yoshinori Shiga and Simon King. Estimation of voice source and vocal tract characteristics based on multi-frame analysis. In Proc. Eurospeech, volume 3, pages 1749-1752, Geneva, Switzerland, September 2003. [ bib | .ps | .pdf ]
This paper presents a new approach for estimating voice source and vocal tract filter characteristics of voiced speech. When it is required to know the transfer function of a system in signal processing, the input and output of the system are experimentally observed and used to calculate the function. However, in the case of the source-filter separation we deal with in this paper, only the output (speech) is observed, and the characteristics of the system (vocal tract) and the input (voice source) must simultaneously be estimated. Hence the estimation becomes extremely difficult, and the problem is usually solved approximately using oversimplified models. We demonstrate that these characteristics are separable under the assumption that they are independently controlled by different factors. The separation is realised using an iterative approximation along with the Multi-frame Analysis method, which we have proposed to find spectral envelopes of voiced speech with minimum interference of the harmonic structure.

[171] K. Richmond, S. King, and P. Taylor. Modelling the uncertainty in recovering articulation from acoustics. Computer Speech and Language, 17:153-172, 2003. [ bib | .pdf ]
This paper presents an experimental comparison of the performance of the multilayer perceptron (MLP) with that of the mixture density network (MDN) for an acoustic-to-articulatory mapping task. A corpus of acoustic-articulatory data recorded by electromagnetic articulography (EMA) for a single speaker was used as training and test data for this purpose. In theory, the MDN is able to provide a richer, more flexible description of the target variables in response to a given input vector than the least-squares trained MLP. Our results show that the mean likelihoods of the target articulatory parameters for an unseen test set were indeed consistently higher with the MDN than with the MLP. The increase ranged from approximately 3% to 22%, depending on the articulatory channel in question. On the basis of these results, we argue that using a more flexible description of the target domain, such as that offered by the MDN, can prove beneficial when modelling the acoustic-to-articulatory mapping.
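The advantage of a mixture output can be seen in a toy one-dimensional example (the numbers are invented; the paper works with real EMA channels): for a bimodal articulatory target, a single Gaussian centred between the modes, which is effectively what a least-squares-trained MLP provides, assigns lower likelihood than the two-component mixture an MDN can output.

```python
import math

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture(x, components):
    """Mixture density; components is a list of (weight, mean, variance)."""
    return sum(w * gauss(x, mu, var) for w, mu, var in components)

target = 1.0                             # articulator position near one mode
mlp_like = gauss(target, 0.0, 1.0)       # single Gaussian collapsed between modes
mdn_like = mixture(target, [(0.5, -1.0, 0.1), (0.5, 1.0, 0.1)])
```

The mixture places probability mass on both plausible articulator positions, so the observed target scores a higher likelihood, mirroring the consistently higher test-set likelihoods the paper reports for the MDN.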

[172] Christophe Van Bael and Simon King. An accent-independent lexicon for automatic speech recognition. In Proc. ICPhS, pages 1165-1168, 2003. [ bib | .pdf ]
Recent work at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh has developed an accent-independent lexicon for speech synthesis (the Unisyn project). The main purpose of this lexicon is to avoid the problems and cost of writing a new lexicon for every new accent needed for synthesis. Only recently, a first attempt has been made to use the Keyword Lexicon for automatic speech recognition.

[173] J. Vepa and S. King. Kalman-filter based join cost for unit-selection speech synthesis. In Proc. Eurospeech, Geneva, Switzerland, 2003. [ bib | .pdf ]
We introduce a new method for computing join cost in unit-selection speech synthesis which uses a linear dynamical model (also known as a Kalman filter) to model line spectral frequency trajectories. The model uses an underlying subspace in which it makes smooth, continuous trajectories. This subspace can be seen as an analogy for underlying articulator movement. Once trained, the model can be used to measure how well concatenated speech segments join together. The objective join cost is based on the error between model predictions and actual observations. We report correlations between this measure and mean listener scores obtained from a perceptual listening experiment. Our experiments use a state-of-the art unit-selection text-to-speech system: `rVoice' from Rhetorical Systems Ltd.
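A toy version of the prediction-error idea (one-dimensional with invented numbers; the paper uses a full linear dynamic model over line spectral frequency vectors): track a state along the end of the left unit, predict forward across the join, and charge the right unit for how far its opening frames fall from the prediction.

```python
def join_cost(left_tail, right_head, a=0.9):
    """Squared prediction error of a trivial 1-D dynamic model x' = a*x
    run over the end of the left unit and continued across the join."""
    x = left_tail[0]
    for y in left_tail[1:]:
        x = 0.5 * (a * x) + 0.5 * y   # crude blend of prediction and observation
    cost = 0.0
    for y in right_head:
        x = a * x                      # predict forward, no more updates
        cost += (y - x) ** 2           # penalise deviation of the right unit
    return cost

# A continuation consistent with the model is cheap; a jump is expensive.
smooth = join_cost([1.0, 0.95, 0.9], [0.8, 0.72])
abrupt = join_cost([1.0, 0.95, 0.9], [0.2, 0.1])
```

A candidate join whose right-hand unit continues the trajectory the model expects incurs a small cost, which is the behaviour the paper correlates with listeners' perceptual scores.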

[174] Simon King. Dependence and independence in automatic speech recognition and synthesis. Journal of Phonetics, 31(3-4):407-411, 2003. [ bib | .pdf ]
A short review paper.

[175] J. Vepa, S. King, and P. Taylor. Objective distance measures for spectral discontinuities in concatenative speech synthesis. In Proc. ICSLP, Denver, USA, September 2002. [ bib | .pdf ]
In unit selection based concatenative speech synthesis systems, `join cost', which measures how well two units can be joined together, is one of the main criteria for selecting appropriate units from the inventory. The ideal join cost will measure `perceived' discontinuity, based on easily measurable spectral properties of the units being joined, in order to ensure smooth and natural-sounding synthetic speech. In this paper we report a perceptual experiment conducted to measure the correlation between `subjective' human perception and various `objective' spectrally-based measures proposed in the literature. Our experiments used a state-of-the-art unit-selection text-to-speech system: `rVoice' from Rhetorical Systems Ltd.

[176] J. Vepa, S. King, and P. Taylor. New objective distance measures for spectral discontinuities in concatenative speech synthesis. In Proc. IEEE 2002 workshop on speech synthesis, Santa Monica, USA, September 2002. [ bib | .pdf ]
The quality of unit selection based concatenative speech synthesis mainly depends on how well two successive units can be joined together to minimise the audible discontinuities. The objective measure of discontinuity used when selecting units is known as the `join cost'. The ideal join cost will measure `perceived' discontinuity, based on easily measurable spectral properties of the units being joined, in order to ensure smooth and natural-sounding synthetic speech. In this paper we describe a perceptual experiment conducted to measure the correlation between `subjective' human perception and various `objective' spectrally-based measures proposed in the literature. Also we report new objective distance measures derived from various distance metrics based on these spectral features, which have good correlation with human perception to concatenation discontinuities. Our experiments used a state-of-the art unit-selection text-to-speech system: `rVoice' from Rhetorical Systems Ltd.

[177] Jesper Salomon, Simon King, and Miles Osborne. Framewise phone classification using support vector machines. In Proceedings International Conference on Spoken Language Processing, Denver, 2002. [ bib | .ps | .pdf ]
We describe the use of Support Vector Machines for phonetic classification on the TIMIT corpus. Unlike previous work, in which entire phonemes are classified, our system operates in a framewise manner and is intended for use as the front-end of a hybrid system similar to ABBOT. We therefore avoid the problems of classifying variable-length vectors. Our frame-level phone classification accuracy on the complete TIMIT test set is competitive with other results from the literature. In addition, we address the serious problem of scaling Support Vector Machines by using the Kernel Fisher Discriminant.

[178] J. Frankel and S. King. ASR - articulatory speech recognition. In Proc. Eurospeech, pages 599-602, Aalborg, Denmark, September 2001. [ bib | .ps | .pdf ]
In this paper we report recent work on a speech recognition system using a combination of acoustic and articulatory features as input. Linear dynamic models are used to capture the trajectories which characterize each segment type. We describe classification and recognition tasks for systems based on acoustic data in conjunction with both real and automatically recovered articulatory parameters.

[179] J. Frankel and S. King. Speech recognition in the articulatory domain: investigating an alternative to acoustic HMMs. In Proc. Workshop on Innovations in Speech Processing, April 2001. [ bib | .ps | .pdf ]
We describe a speech recognition system which uses a combination of acoustic and articulatory features as input. Linear dynamic models capture the trajectories which characterize each segment type. We describe classification and recognition tasks for systems based on acoustic data in conjunction with both real and automatically recovered articulatory parameters.

[180] J. Frankel, K. Richmond, S. King, and P. Taylor. An automatic speech recognition system using neural networks and linear dynamic models to recover and model articulatory traces. In Proc. ICSLP, 2000. [ bib | .ps | .pdf ]
In this paper we describe a speech recognition system using linear dynamic models and articulatory features. Experiments are reported in which measured articulation from the MOCHA corpus has been used, along with those where the articulatory parameters are estimated from the speech signal using a recurrent neural network.

[181] S. King, P. Taylor, J. Frankel, and K. Richmond. Speech recognition via phonetically-featured syllables. In PHONUS, volume 5, pages 15-34, Institute of Phonetics, University of the Saarland, 2000. [ bib | .ps | .pdf ]
We describe recent work on two new automatic speech recognition systems. The first part of this paper describes the components of a system based on phonological features (which we call EspressoA) in which the values of these features are estimated from the speech signal before being used as the basis for recognition. In the second part of the paper, another system (which we call EspressoB) is described in which articulatory parameters are used instead of phonological features and a linear dynamical system model is used to perform recognition from automatically estimated values of these articulatory parameters.

[182] Simon King and Paul Taylor. Detection of phonological features in continuous speech using neural networks. Computer Speech and Language, 14(4):333-353, 2000. [ bib | .ps | .pdf ]
We report work on the first component of a two-stage speech recognition architecture based on phonological features rather than phones. The paper reports experiments on three phonological feature systems: 1) the Sound Pattern of English (SPE) system, which uses binary features; 2) a multi-valued (MV) feature system, which uses traditional phonetic categories such as manner and place; and 3) Government Phonology (GP), which uses a set of structured primes. All experiments used recurrent neural networks to perform feature detection. In these networks the input layer is a standard framewise cepstral representation, and the output layer represents the values of the features. The system effectively produces a representation of the most likely phonological features for each input frame. All experiments were carried out on the TIMIT speaker-independent database. The networks performed well in all cases, with the average accuracy for a single feature ranging from 86 to 93 percent. We describe these experiments in detail, and discuss the justification and potential advantages of using phonological features rather than phones as the basis of speech recognition.
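A shape-level sketch of the architecture described, assuming an Elman-style recurrent net; the layer sizes are illustrative and the weights are random rather than trained:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyElmanRNN:
    """Minimal Elman-style recurrent net: one cepstral frame in per
    time step, one sigmoid output per phonological feature, so each
    output is a per-frame probability that the feature is 'on'.
    Sizes (13 cepstra, 12 binary features) are assumptions."""
    def __init__(self, n_in=13, n_hidden=32, n_feats=12, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0, 0.1, (n_hidden, n_in))
        self.W_rec = rng.normal(0, 0.1, (n_hidden, n_hidden))
        self.W_out = rng.normal(0, 0.1, (n_feats, n_hidden))

    def forward(self, frames):
        h = np.zeros(self.W_rec.shape[0])
        outputs = []
        for x in frames:                      # one cepstral frame per step
            h = np.tanh(self.W_in @ x + self.W_rec @ h)
            outputs.append(sigmoid(self.W_out @ h))
        return np.array(outputs)              # (T, n_feats), values in (0, 1)
```

The second-stage recogniser would then decode these per-frame feature probabilities into phones or words.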

[183] Simon King and Alan Wrench. Dynamical system modelling of articulator movement. In Proc. ICPhS 99, pages 2259-2262, San Francisco, August 1999. [ bib | .ps | .pdf ]
We describe the modelling of articulatory movements using (hidden) dynamical system models trained on Electro-Magnetic Articulograph (EMA) data. These models can be used for automatic speech recognition and to give insights into articulatory behaviour. They belong to a class of continuous-state Markov models, which we believe can offer improved performance over conventional Hidden Markov Models (HMMs) by better accounting for the continuous nature of the underlying speech production process - that is, the movements of the articulators. To assess the performance of our models, a simple speech recognition task was used, on which the models show promising results.
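A toy sketch of such a continuous-state model, assuming a two-dimensional position/velocity state and a Kalman filter for evaluation; the parameter values are illustrative, not estimates from EMA data:

```python
import numpy as np

# Linear dynamical system: a hidden continuous state x_t evolves
# linearly with noise, and observations y_t (e.g. EMA coil positions)
# are a noisy linear projection of it. Each segment type would get
# its own parameters; these matrices are illustrative only.
F = np.array([[1.0, 0.1], [0.0, 0.9]])   # state transition (pos, vel)
H = np.array([[1.0, 0.0]])               # observe position only
Q = 0.01 * np.eye(2)                     # state noise covariance
R = np.array([[0.05]])                   # observation noise covariance

def simulate(T, rng):
    """Draw an observation trajectory from the generative model."""
    x, ys = np.zeros(2), []
    for _ in range(T):
        x = F @ x + rng.multivariate_normal(np.zeros(2), Q)
        ys.append(H @ x + rng.multivariate_normal(np.zeros(1), R))
    return np.array(ys)

def kalman_loglik(ys):
    """Log-likelihood of an observation sequence under the model;
    comparing this across segment-specific models gives a simple
    classifier."""
    x, P, ll = np.zeros(2), np.eye(2), 0.0
    for y in ys:
        x, P = F @ x, F @ P @ F.T + Q           # predict
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        innov = y - H @ x
        ll += -0.5 * (innov @ np.linalg.inv(S) @ innov
                      + np.log(np.linalg.det(2 * np.pi * S)))
        x = x + K @ innov                       # update
        P = (np.eye(2) - K @ H) @ P
    return float(ll)
```

Unlike an HMM's piecewise-constant hidden state, the state here moves continuously, mirroring the smooth movement of the articulators.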

[184] Simon King, Todd Stephenson, Stephen Isard, Paul Taylor, and Alex Strachan. Speech recognition via phonetically featured syllables. In Proc. ICSLP '98, pages 1031-1034, Sydney, Australia, December 1998. [ bib | .ps | .pdf ]
We describe a speech recogniser which uses a speech production-motivated phonetic-feature description of speech. We argue that this is a natural way to describe the speech signal and offers an efficient intermediate parameterisation for use in speech recognition. We also propose to model this description at the syllable rather than phone level. The ultimate goal of this work is to generate syllable models whose parameters explicitly describe the trajectories of the phonetic features of the syllable. We hope to move away from Hidden Markov Models (HMMs) of context-dependent phone units. As a step towards this, we present a preliminary system which consists of two parts: recognition of the phonetic features from the speech signal using a neural network; and decoding of the feature-based description into phonemes using HMMs.

[185] Paul A. Taylor, S. King, S. D. Isard, and H. Wright. Intonation and dialogue context as constraints for speech recognition. Language and Speech, 41(3):493-512, 1998. [ bib | .ps | .pdf ]
[186] Simon King. Using Information Above the Word Level for Automatic Speech Recognition. PhD thesis, University of Edinburgh, 1998. [ bib | .ps | .pdf ]
This thesis introduces a general method for using information at the utterance level and across utterances for automatic speech recognition. The method involves classification of utterances into types. Using constraints at the utterance level via this classification method allows information sources to be exploited which cannot necessarily be used directly for word recognition. The classification power of three sources of information is investigated: the language model in the speech recogniser, dialogue context and intonation. The method is applied to a challenging task: the recognition of spontaneous dialogue speech. The results show success in automatic utterance type classification, and subsequent word error rate reduction over a baseline system, when all three information sources are probabilistically combined.
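A minimal sketch of combining independent information sources over utterance types in a naive-Bayes fashion, assuming independence of the sources given the type; the utterance-type names and probabilities are invented for illustration:

```python
import numpy as np

def combine_sources(prior, likelihoods):
    """Multiply a prior over utterance types by each source's
    likelihood for every type, then renormalise.

    prior: dict type -> P(type)
    likelihoods: list of dicts type -> P(evidence | type), one per
    source (e.g. language model score, dialogue context, intonation).
    """
    types = list(prior)
    post = np.array([prior[t] for t in types], dtype=float)
    for lik in likelihoods:
        post *= np.array([lik[t] for t in types])
    post /= post.sum()
    return dict(zip(types, post))
```

The winning type can then select a type-specific language model for a second recognition pass, which is how an utterance-level constraint feeds back into word recognition.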

[187] Simon King, Thomas Portele, and Florian Höfer. Speech synthesis using non-uniform units in the Verbmobil project. In Proc. Eurospeech 97, volume 2, pages 569-572, Rhodes, Greece, September 1997. [ bib | .ps | .pdf ]
We describe a concatenative speech synthesiser for British English which uses the HADIFIX inventory structure originally developed for German by Portele. An inventory of non-uniform units was investigated with the aim of improving segmental quality compared to diphones. A combination of soft (diphone) and hard concatenation was used, which allowed a dramatic reduction in inventory size. We also present a unit selection algorithm which selects an optimum sequence of units from this inventory for a given phoneme sequence. The work described is part of the concept-to-speech synthesiser for the language and speech project Verbmobil which is funded by the German Ministry of Science (BMBF).

[188] Simon King. Final report for Verbmobil Teilprojekt 4.4. Technical Report ISSN 1434-8845, IKP, Universität Bonn, January 1997. Verbmobil-Report 195 available at http://verbmobil.dfki.de. [ bib ]
Final report for Verbmobil English speech synthesis

[189] Paul A. Taylor, Simon King, Stephen Isard, Helen Wright, and Jacqueline Kowtko. Using intonation to constrain language models in speech recognition. In Proc. Eurospeech'97, Rhodes, 1997. [ bib | .pdf ]
This paper describes a method for using intonation to reduce word error rate in a speech recognition system designed to recognise spontaneous dialogue speech. We use a form of dialogue analysis based on the theory of conversational games. Different move types under this analysis conform to different language models. Different move types are also characterised by different intonational tunes. Our overall recognition strategy is first to predict from intonation the type of game move that a test utterance represents, and then to use a bigram language model for that type of move during recognition.

[190] Simon King. Users Manual for Verbmobil Teilprojekt 4.4. IKP, Universität Bonn, October 1996. [ bib ]
Verbmobil English synthesiser users manual

[191] Simon King. Inventory design for Verbmobil Teilprojekt 4.4. Technical report, IKP, Universität Bonn, October 1996. [ bib ]
Inventory design for Verbmobil English speech synthesis

[192] Paul A. Taylor, Hiroshi Shimodaira, Stephen Isard, Simon King, and Jacqueline Kowtko. Using prosodic information to constrain language models for spoken dialogue. In Proc. ICSLP '96, Philadelphia, 1996. [ bib | .ps | .pdf ]
We present work intended to improve speech recognition performance for computer dialogue by taking into account the way that dialogue context and intonational tune interact to limit the possibilities for what an utterance might be. We report here on the extra constraint achieved in a bigram language model expressed in terms of entropy by using separate submodels for different sorts of dialogue acts and trying to predict which submodel to apply by analysis of the intonation of the sentence being recognised.
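A toy illustration of the entropy comparison, assuming maximum-likelihood bigram estimates; the two-sentence "corpora" are invented, but they show why a move-specific submodel sees a narrower distribution than one model over the mixed material:

```python
import math
from collections import Counter

def bigram_entropy(corpus_sentences):
    """Per-word entropy (bits) of a maximum-likelihood bigram model
    estimated on, and measured against, the same token sequence - a
    rough stand-in for comparing one global model with
    move-specific submodels."""
    tokens = []
    for s in corpus_sentences:
        tokens += ["<s>"] + s.split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])   # counts of conditioning words
    H, N = 0.0, 0
    for (w1, w2), c in bigrams.items():
        p = c / unigrams[w1]
        H -= c * math.log2(p)
        N += c
    return H / N

# Two hypothetical move types with distinct phrasing.
queries = ["do you see it", "do you have it"]
replies = ["yes i do", "yes i have it"]
mixed = bigram_entropy(queries + replies)
split = 0.5 * (bigram_entropy(queries) + bigram_entropy(replies))
```

Here `split` comes out lower than `mixed`: conditioning the language model on the predicted move type reduces the uncertainty the recogniser must resolve.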

[193] Stephen Isard, Simon King, Paul A. Taylor, and Jacqueline Kowtko. Prosodic information in a speech recognition system intended for dialogue. In IEEE Workshop on speech recognition, Snowbird, Utah, 1995. [ bib ]
We report on an automatic speech recognition system intended for use in dialogue, whose original aspect is its use of prosodic information for two different purposes. The first is to improve the word level accuracy of the system. The second is to constrain the language model applied to a given utterance by taking into account the way that dialogue context and intonational tune interact to limit the possibilities for what an utterance might be.