The Centre for Speech Technology Research, The University of Edinburgh

Publications by Junichi Yamagishi

[1] John Dines, Hui Liang, Lakshmi Saheer, Matthew Gibson, William Byrne, Keiichiro Oura, Keiichi Tokuda, Junichi Yamagishi, Simon King, Mirjam Wester, Teemu Hirsimäki, Reima Karhila, and Mikko Kurimo. Personalising speech-to-speech translation: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. Computer Speech and Language, 27(2):420-437, February 2013. [ bib | DOI | http ]
In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation, in which we employ an HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions including the need for better evaluation metrics.

Keywords: Speech-to-speech translation, Cross-lingual speaker adaptation, HMM-based speech synthesis, Speaker adaptation, Voice conversion
[2] Z. Ling, K. Richmond, and J. Yamagishi. Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression. Audio, Speech, and Language Processing, IEEE Transactions on, 21(1):207-219, 2013. [ bib | DOI ]
In previous work we proposed a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into a hidden Markov model (HMM) based parametric speech synthesiser. In this method, a unified acoustic-articulatory model is trained, and context-dependent linear transforms are used to model the dependency between the two feature streams. In this paper, we go significantly further and propose a feature-space-switched multiple regression HMM to improve the performance of articulatory control. A multiple regression HMM (MRHMM) is adopted to model the distribution of acoustic features, with articulatory features used as exogenous explanatory variables. A separate Gaussian mixture model (GMM) is introduced to model the articulatory space, and articulatory-to-acoustic regression matrices are trained for each component of this GMM, instead of for the context-dependent states in the HMM. Furthermore, we propose a task-specific context feature tailoring method to ensure compatibility between state context features and articulatory features that are manipulated at synthesis time. The proposed method is evaluated on two tasks, using a speech database with acoustic waveforms and articulatory movements recorded in parallel by electromagnetic articulography (EMA). In a vowel identity modification task, the new method achieves better performance when reconstructing target vowels by varying articulatory inputs than our previous approach. A second vowel creation task shows our new method is highly effective at producing a new vowel from appropriate articulatory representations which, even though no acoustic samples for this vowel are present in the training data, is shown to sound highly natural.
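
As an illustration of the switched regression at the core of this method, a minimal numpy sketch follows (the dimensions, toy parameters and hard mixture assignment are our own simplifications, not the paper's actual configuration): an articulatory frame is assigned to one component of the articulatory-space GMM, and that component's regression matrix maps the frame onto an offset of the state's acoustic mean.

    import numpy as np

    # Toy sizes: 6-dim articulatory (EMA) frame, 4-dim acoustic mean, 3 mixtures.
    rng = np.random.default_rng(0)
    M, DA, DX = 3, 4, 6
    weights = np.full(M, 1.0 / M)                 # GMM over the articulatory space
    means = rng.normal(size=(M, DX))
    variances = np.ones((M, DX))                  # diagonal covariances
    A = rng.normal(scale=0.1, size=(M, DA, DX))   # per-component regression matrices
    mu_state = rng.normal(size=DA)                # context-dependent state mean

    def log_gauss_diag(x, mean, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

    def acoustic_mean(x):
        """Mean of the acoustic Gaussian given an articulatory frame x."""
        resp = np.log(weights) + log_gauss_diag(x, means, variances)
        m = int(np.argmax(resp))                  # hard assignment to one component
        return mu_state + A[m] @ x                # switched linear regression

    print(acoustic_mean(rng.normal(size=DX)))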

[3] P. L. De Leon, M. Pucher, J. Yamagishi, I. Hernaez, and I. Saratxaga. Evaluation of speaker verification security and detection of HMM-based synthetic speech. Audio, Speech, and Language Processing, IEEE Transactions on, 20(8):2280-2290, October 2012. [ bib | DOI ]
In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model-universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.
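
The relative phase shift idea can be pictured as the phase of each harmonic measured against the phase predicted for it by the fundamental. A minimal sketch, assuming F0 is known and using a crude single-bin DFT as the harmonic phase estimator (both simplifications are ours, not the paper's analysis method):

    import numpy as np

    def relative_phase_shift(frame, f0, fs, n_harm=10):
        """Phase of each harmonic relative to the fundamental, wrapped to (-pi, pi]."""
        t = np.arange(len(frame)) / fs
        rps, phi1 = [], 0.0
        for k in range(1, n_harm + 1):
            z = np.sum(frame * np.exp(-2j * np.pi * k * f0 * t))  # single-bin DFT
            phi = np.angle(z)
            if k == 1:
                phi1 = phi
            rps.append(np.angle(np.exp(1j * (phi - k * phi1))))   # wrapped difference
        return np.array(rps)

    fs = 16000
    t = np.arange(1024) / fs
    frame = np.cos(2 * np.pi * 120 * t) + 0.5 * np.cos(2 * np.pi * 240 * t + 1.0)
    print(relative_phase_shift(frame, 120.0, fs, n_harm=3))  # approx [0.0, 1.0, ...]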

[4] Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi. Vowel creation by articulatory control in HMM-based parametric speech synthesis. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]
This paper presents a method to produce a new vowel by articulatory control in hidden Markov model (HMM) based parametric speech synthesis. A multiple regression HMM (MRHMM) is adopted to model the distribution of acoustic features, with articulatory features used as external auxiliary variables. The dependency between acoustic and articulatory features is modelled by a group of linear transforms that are either estimated context-dependently or determined by the distribution of articulatory features. Vowel identity is removed from the set of context features used to ensure compatibility between the context-dependent model parameters and the articulatory features of a new vowel. At synthesis time, acoustic features are predicted according to the input articulatory features as well as context information. With an appropriate articulatory feature sequence, a new vowel can be generated even when it does not exist in the training set. Experimental results show this method is effective in creating the English vowel /2/ by articulatory control without using any acoustic samples of this vowel.

Keywords: Speech synthesis, articulatory features, multiple-regression hidden Markov model
[5] Phillip L. De Leon, Bryan Stewart, and Junichi Yamagishi. Synthetic speech discrimination using pitch pattern statistics derived from image analysis. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib ]
In this paper, we extend the work of Ogihara et al. to discriminate between human and synthetic speech using features based on pitch patterns. As previously demonstrated, significant differences in pitch patterns between human and synthetic speech can be leveraged to classify speech as being human or synthetic in origin. We propose using mean pitch stability, mean pitch stability range, and jitter as features extracted after image analysis of pitch patterns. We have observed that for synthetic speech, these features lie in a small and distinct space as compared to human speech and have modeled them with a multivariate Gaussian distribution. Our classifier is trained using synthetic speech collected from the 2008 and 2011 Blizzard Challenge along with Festival pre-built voices and human speech from the NIST2002 corpus. We evaluate the classifier on a much larger corpus than previously studied, using human speech from the Switchboard corpus, synthetic speech from the Resource Management corpus, and synthetic speech generated from Festival trained on the Wall Street Journal corpus. Results show 98% accuracy in correctly classifying human speech and 96% accuracy in correctly classifying synthetic speech.
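
Because the pitch-pattern features are modelled with a multivariate Gaussian, the classification step reduces to a distance test against the tight synthetic-speech cluster. A minimal sketch with invented feature values and an arbitrary threshold (the real features come from image analysis of pitch patterns, which is not reproduced here):

    import numpy as np

    # Hypothetical per-utterance feature vectors:
    # [mean pitch stability, mean pitch stability range, jitter].
    rng = np.random.default_rng(1)
    synth_train = rng.normal([0.9, 0.1, 0.01], 0.02, size=(200, 3))  # toy data

    mu = synth_train.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(synth_train, rowvar=False))

    def mahalanobis2(x):
        d = x - mu
        return float(d @ inv_cov @ d)

    # Flag an utterance as synthetic if it falls inside the tight synthetic cluster.
    x = np.array([0.91, 0.11, 0.012])
    print("synthetic" if mahalanobis2(x) < 9.0 else "human")  # threshold is arbitrary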

[6] J. Lorenzo, B. Martinez, R. Barra-Chicote, V. Lopez-Ludena, J. Ferreiros, J. Yamagishi, and J.M. Montero. Towards an unsupervised speaking style voice building framework: Multi-style speaker diarization. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib ]
Current text-to-speech systems are developed using studio-recorded speech in a neutral style or based on acted emotions. However, the proliferation of media sharing sites would allow the development of a new generation of speech-based systems which could cope with spontaneous and styled speech. This paper proposes an architecture to deal with realistic recordings and carries out some experiments on unsupervised speaker diarization. In order to maximize the speaker purity of the clusters while keeping a high speaker coverage, the paper evaluates the F-measure of a diarization module, achieving high scores (>85%), especially when the clusters are longer than 30 seconds, even for the more spontaneous and expressive styles (such as talk shows or sports).

[7] Rasmus Dall, Christophe Veaux, Junichi Yamagishi, and Simon King. Analysis of speaker clustering strategies for HMM-based speech synthesis. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]
This paper describes a method for speaker clustering, with the application of building average voice models for speaker-adaptive HMM-based speech synthesis that are a good basis for adapting to specific target speakers. Our main hypothesis is that using perceptually similar speakers to build the average voice model will be better than using unselected speakers, even if the amount of data available from perceptually similar speakers is smaller. We measure the perceived similarities among a group of 30 female speakers in a listening test and then apply multiple linear regression to automatically predict these listener judgements of speaker similarity and thus to identify similar speakers automatically. We then compare a variety of average voice models trained on either speakers who were perceptually judged to be similar to the target speaker, or speakers selected by the multiple linear regression, or a large global set of unselected speakers. We find that the average voice model trained on perceptually similar speakers provides better performance than the global model, even though the latter is trained on more data, confirming our main hypothesis. However, the average voice model using speakers selected automatically by the multiple linear regression does not reach the same level of performance.

[8] Jaime Lorenzo-Trueba, Roberto Barra-Chicote, Tuomo Raitio, Nicolas Obin, Paavo Alku, Junichi Yamagishi, and Juan M. Montero. Towards glottal source controllability in expressive speech synthesis. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib ]
In order to obtain more human-like sounding human-machine interfaces, we must first be able to give them expressive capabilities in the way of emotional and stylistic features, so as to adapt them closely to the intended task. If we want to replicate those features it is not enough to merely replicate the prosodic information of fundamental frequency and speaking rhythm. The proposed additional layer is the modification of the glottal model, for which we make use of the GlottHMM parameters. This paper analyzes the viability of such an approach by verifying that the expressive nuances are captured by the aforementioned features, obtaining 95% classification accuracy for speaking styles and 82% for emotions. We also analyze the effect of speaker bias and recording environment on the source modeling in order to quantify possible problems when analyzing multi-speaker databases. Finally we propose a speaking-style separation for Spanish based on prosodic features and check its perceptual significance.

[9] C. Valentini-Botinhao, J. Yamagishi, and S. King. Speech intelligibility enhancement for HMM-based synthetic speech in noise. In Proc. SAPA Workshop, Portland, USA, September 2012. [ bib | .pdf ]
It is possible to increase the intelligibility of speech in noise by enhancing the clean speech signal. In this paper we demonstrate the effects of modifying the spectral envelope of synthetic speech according to the environmental noise. To achieve this, we modify Mel cepstral coefficients according to an intelligibility measure that accounts for glimpses of speech in noise: the Glimpse Proportion measure. We evaluate this method against a baseline synthetic voice trained only with normal speech and a topline voice trained with Lombard speech, as well as natural speech. The intelligibility of these voices was measured when mixed with speech-shaped noise and with a competing speaker at three different levels. The Lombard voices, both natural and synthetic, were more intelligible than the normal voices in all conditions. For speech-shaped noise, the proposed modified voice was as intelligible as the Lombard synthetic voice without requiring any recordings of Lombard speech, which are hard to obtain. However, in the case of competing talker noise, the Lombard synthetic voice was more intelligible than the proposed modified voice.
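
The Glimpse Proportion measure used here counts the time-frequency cells in which speech remains audible above the noise. A minimal sketch, substituting a plain STFT for the auditory front end of the original glimpse model (the 3 dB threshold and analysis settings are our assumptions):

    import numpy as np

    def glimpse_proportion(speech, noise, thr_db=3.0, nfft=512, hop=160):
        """Fraction of time-frequency cells where speech exceeds noise by thr_db."""
        def stft_power(x):
            win = np.hanning(nfft)
            frames = [x[i:i + nfft] * win for i in range(0, len(x) - nfft, hop)]
            return np.abs(np.fft.rfft(frames, axis=1)) ** 2
        s, n = stft_power(speech), stft_power(noise)
        local_snr = 10.0 * np.log10((s + 1e-12) / (n + 1e-12))
        return float(np.mean(local_snr > thr_db))

    fs = 16000
    t = np.arange(fs) / fs
    speech = np.sin(2 * np.pi * 500 * t)
    noise = 0.1 * np.random.default_rng(2).normal(size=fs)
    print(glimpse_proportion(speech, noise))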

[10] C. Valentini-Botinhao, J. Yamagishi, and S. King. Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. In Proc. Interspeech, Portland, USA, September 2012. [ bib ]
We propose a method that modifies the Mel cepstral coefficients of HMM-generated synthetic speech in order to increase the intelligibility of the generated speech when heard by a listener in the presence of a known noise. This method is based on an approximation we previously proposed for the Glimpse Proportion measure. Here we show how to update the Mel cepstral coefficients using this measure as an optimization criterion and how to control the amount of distortion by limiting the frequency resolution of the modifications. To evaluate the method we built eight different voices from normal read-text speech data from a male speaker. Some voices were also built from Lombard speech data produced by the same speaker. Listening experiments with speech-shaped noise and with a single competing talker indicate that our method significantly improves intelligibility when compared to unmodified synthetic speech. The voices built from Lombard speech outperformed the proposed method particularly for the competing talker case. However, compared to a voice using only the spectral parameters from Lombard speech, the proposed method obtains similar or higher performance.

[11] Zhenhua Ling, Korin Richmond, and Junichi Yamagishi. Vowel creation by articulatory control in HMM-based parametric speech synthesis. In Proc. The Listening Talker Workshop, page 72, Edinburgh, UK, May 2012. [ bib | .pdf ]
[12] C. Valentini-Botinhao, J. Yamagishi, and S. King. Using an intelligibility measure to create noise robust cepstral coefficients for HMM-based speech synthesis. In Proc. LISTA Workshop, Edinburgh, UK, May 2012. [ bib | .pdf ]
[13] C. Valentini-Botinhao, R. Maia, J. Yamagishi, S. King, and H. Zen. Cepstral analysis based on the Glimpse proportion measure for improving the intelligibility of HMM-based synthetic speech in noise. In Proc. ICASSP, pages 3997-4000, Kyoto, Japan, March 2012. [ bib | DOI | .pdf ]
In this paper we introduce a new cepstral coefficient extraction method based on an intelligibility measure for speech in noise, the Glimpse Proportion measure. This new method aims to increase the intelligibility of speech in noise by modifying the clean speech, and has applications in scenarios such as public announcement and car navigation systems. We first explain how the Glimpse Proportion measure operates and further show how we approximated it to integrate it into an existing spectral envelope parameter extraction method commonly used in the HMM-based speech synthesis framework. We then demonstrate how this new method changes the modelled spectrum according to the characteristics of the noise and show results for a listening test with vocoded and HMM-based synthetic speech. The test indicates that the proposed method can significantly improve intelligibility of synthetic speech in speech shaped noise.

[14] L. Saheer, J. Yamagishi, P.N. Garner, and J. Dines. Combining vocal tract length normalization with hierarchical linear transformations. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4493-4496, March 2012. [ bib | DOI ]
Recent research has demonstrated the effectiveness of vocal tract length normalization (VTLN) as a rapid adaptation technique for statistical parametric speech synthesis. VTLN produces speech with naturalness preferable to that of MLLR-based adaptation techniques, being much closer in quality to that generated by the original average voice model. However, with only a single parameter, VTLN captures very few speaker-specific characteristics when compared to linear transform based adaptation techniques. This paper proposes that the merits of VTLN can be combined with those of linear transform based adaptation in a hierarchical Bayesian framework, where VTLN is used as the prior information. A novel technique for propagating the gender information from the VTLN prior through constrained structural maximum a posteriori linear regression (CSMAPLR) adaptation is presented. Experiments show that the resulting transformation improves speech quality, with better naturalness and intelligibility, and improves speaker similarity.

Keywords: CSMAPLR adaptation;MLLR based adaptation technique;constrained structural maximum a posteriori linear regression;hierarchical Bayesian framework;hierarchical linear transformation;intelligibility;rapid adaptation technique;speaker similarity;statistical parametric speech synthesis;vocal tract length normalization;Bayes methods;speech intelligibility;
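
For intuition, the single VTLN parameter mentioned above controls an all-pass bilinear frequency warp. A minimal sketch of that warp, applied by resampling a magnitude spectrum (the hierarchical CSMAPLR machinery and the Bayesian use of VTLN as a prior are not reproduced here):

    import numpy as np

    def bilinear_warp(omega, alpha):
        """All-pass frequency warp used in VTLN; alpha is the single warping factor."""
        return omega + 2.0 * np.arctan(alpha * np.sin(omega) /
                                       (1.0 - alpha * np.cos(omega)))

    def warp_spectrum(mag, alpha):
        """Warp a magnitude spectrum by resampling it on the warped frequency axis."""
        omega = np.linspace(0.0, np.pi, len(mag))
        return np.interp(bilinear_warp(omega, alpha), omega, mag)

    spec = np.abs(np.fft.rfft(np.random.default_rng(3).normal(size=510)))
    print(warp_spectrum(spec, alpha=0.05)[:5])  # small positive alpha compresses the axis
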
[15] Sebastian Andersson, Junichi Yamagishi, and Robert A.J. Clark. Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis. Speech Communication, 54(2):175-188, 2012. [ bib | DOI | http ]
Spontaneous conversational speech has many characteristics that are currently not modelled well by HMM-based speech synthesis, and in order to build synthetic voices that can give an impression of someone partaking in a conversation, we need to utilise data that exhibits more of the speech phenomena associated with conversations than the more generally used carefully read aloud sentences. In this paper we show that synthetic voices built with HMM-based speech synthesis techniques from conversational speech data preserved segmental and prosodic characteristics of frequent conversational speech phenomena. An analysis of an evaluation investigating the perception of quality and speaking style of HMM-based voices confirms that speech with conversational characteristics is instrumental for listeners to perceive successful integration of conversational speech phenomena in synthetic speech. The achieved synthetic speech quality provides an encouraging start for the continued use of conversational speech in HMM-based speech synthesis.

Keywords: Speech synthesis, HMM, Conversation, Spontaneous speech, Filled pauses, Discourse marker
[17] Keiichiro Oura, Junichi Yamagishi, Mirjam Wester, Simon King, and Keiichi Tokuda. Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping. Speech Communication, 54(6):703-714, 2012. [ bib | DOI | http ]
In the EMIME project, we developed a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrated two techniques into a single architecture: unsupervised adaptation for HMM-based TTS using word-based large-vocabulary continuous speech recognition, and cross-lingual speaker adaptation (CLSA) for HMM-based TTS. The CLSA is based on a state-level transform mapping learned using minimum Kullback-Leibler divergence between pairs of HMM states in the input and output languages. Thus, an unsupervised cross-lingual speaker adaptation system was developed. End-to-end speech-to-speech translation systems for four languages (English, Finnish, Mandarin, and Japanese) were constructed within this framework. In this paper, the English-to-Japanese adaptation is evaluated. Listening tests demonstrate that adapted voices sound more similar to a target speaker than average voices and that differences between supervised and unsupervised cross-lingual speaker adaptation are small. Calculating the KLD state-mapping on only the first 10 mel-cepstral coefficients leads to huge savings in computational costs, without any detrimental effect on the quality of the synthetic speech.

Keywords: HMM-based speech synthesis, Unsupervised speaker adaptation, Cross-lingual speaker adaptation, Speech-to-speech translation
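
A minimal sketch of the KLD-based state mapping described above, assuming diagonal-covariance Gaussian state distributions and using only the first 10 mel-cepstral coefficients as the paper suggests (the state parameters below are random toy values):

    import numpy as np

    def kld_diag(mu_p, var_p, mu_q, var_q):
        """KL divergence between diagonal Gaussians N(mu_p, var_p) || N(mu_q, var_q)."""
        return 0.5 * np.sum(np.log(var_q / var_p)
                            + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

    def build_state_mapping(states_in, states_out, dims=10):
        """Map each input-language state to its minimum-KLD output-language state."""
        mapping = []
        for mu_p, var_p in states_in:
            klds = [kld_diag(mu_p[:dims], var_p[:dims], mu_q[:dims], var_q[:dims])
                    for mu_q, var_q in states_out]
            mapping.append(int(np.argmin(klds)))
        return mapping

    rng = np.random.default_rng(4)
    mk = lambda: (rng.normal(size=40), np.abs(rng.normal(size=40)) + 0.1)
    states_en = [mk() for _ in range(5)]  # toy input-language states
    states_ja = [mk() for _ in range(7)]  # toy output-language states
    print(build_state_mapping(states_en, states_ja))
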
[18] Kei Hashimoto, Junichi Yamagishi, William Byrne, Simon King, and Keiichi Tokuda. Impacts of machine translation and speech synthesis on speech-to-speech translation. Speech Communication, 54(7):857-866, 2012. [ bib | DOI | http ]
This paper analyzes the impacts of machine translation and speech synthesis on speech-to-speech translation systems. A typical speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques have been proposed for integration of speech recognition and machine translation. However, corresponding techniques have not yet been considered for speech synthesis. The focus of the current work is machine translation and speech synthesis, and we present a subjective evaluation designed to analyze their impact on speech-to-speech translation. The results of these analyses show that the naturalness and intelligibility of the synthesized speech are strongly affected by the fluency of the translated sentences. In addition, several features were found to correlate well with the average fluency of the translated sentences and the average naturalness of the synthesized speech.

Keywords: Speech-to-speech translation, Machine translation, Speech synthesis, Subjective evaluation
[19] Junichi Yamagishi, Christophe Veaux, Simon King, and Steve Renals. Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction. Acoustical Science and Technology, 33(1):1-5, 2012. [ bib | http ]
[20] Sarah Creer, Stuart Cunningham, Phil Green, and Junichi Yamagishi. Building personalised synthetic voices for individuals with severe speech impairment. Computer Speech and Language, 2012. (in press). [ bib | DOI | http ]
For individuals with severe speech impairment accurate spoken communication can be difficult and require considerable effort. Some may choose to use a voice output communication aid (or VOCA) to support their spoken communication needs. A VOCA typically takes input from the user through a keyboard or switch-based interface and produces spoken output using either synthesised or recorded speech. The type and number of synthetic voices that can be accessed with a VOCA is often limited and this has been implicated as a factor for rejection of the devices. Therefore, there is a need to be able to provide voices that are more appropriate and acceptable for users. This paper reports on a study that utilises recent advances in speech synthesis to produce personalised synthetic voices for 3 speakers with mild to severe dysarthria, one of the most common speech disorders. Using a statistical parametric approach to synthesis, an average voice trained on data from several unimpaired speakers was adapted using recordings of the impaired speech of 3 dysarthric speakers. By careful selection of the speech data and the model parameters, several exemplar voices were produced for each speaker. A qualitative evaluation was conducted with the speakers and with listeners who were familiar with the speaker. The evaluation showed that for one of the 3 speakers a voice could be created which conveyed many of his personal characteristics, such as regional identity, sex and age.

Keywords: Speech synthesis, Augmentative and alternative communication, Disordered speech, Voice output communication aid
[21] Oliver Watts, Junichi Yamagishi, and Simon King. Unsupervised continuous-valued word features for phrase-break prediction without a part-of-speech tagger. In Proc. Interspeech, pages 2157-2160, Florence, Italy, August 2011. [ bib | .pdf ]
Part of speech (POS) tags are foremost among the features conventionally used to predict intonational phrase-breaks for text to speech (TTS) conversion. The construction of such systems therefore presupposes the availability of a POS tagger for the relevant language, or of a corpus manually tagged with POS. However, such tools and resources are not available in the majority of the world’s languages, and manually labelling text with POS tags is an expensive and time-consuming process. We therefore propose the use of continuous-valued features that summarise the distributional characteristics of word types as surrogates for POS features. Importantly, such features are obtained in an unsupervised manner from an untagged text corpus. We present results on the phrase-break prediction task, where use of the features closes the gap in performance between a baseline system (using only basic punctuation-related features) and a topline system (incorporating a state-of-the-art POS tagger).

[22] Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King. Can objective measures predict the intelligibility of modified HMM-based synthetic speech in noise? In Proc. Interspeech, August 2011. [ bib | .pdf ]
Synthetic speech can be modified to improve intelligibility in noise. In order to perform modifications automatically, it would be useful to have an objective measure that could predict the intelligibility of modified synthetic speech for human listeners. We analysed the impact on intelligibility – and on how well objective measures predict it – when we separately modify speaking rate, fundamental frequency, line spectral pairs and spectral peaks. Shifting LSPs can increase intelligibility for human listeners; other modifications had weaker effects. Among the objective measures we evaluated, the Dau model and the Glimpse proportion were the best predictors of human performance.

[23] Ming Lei, Junichi Yamagishi, Korin Richmond, Zhen-Hua Ling, Simon King, and Li-Rong Dai. Formant-controlled HMM-based speech synthesis. In Proc. Interspeech, pages 2777-2780, Florence, Italy, August 2011. [ bib | .pdf ]
This paper proposes a novel framework that enables us to manipulate and control formants in HMM-based speech synthesis. In this framework, the dependency between formants and spectral features is modelled by piecewise linear transforms; formant parameters are effectively mapped by these to the means of Gaussian distributions over the spectral synthesis parameters. The spectral envelope features generated under the influence of formants in this way may then be passed to high-quality vocoders to generate the speech waveform. This provides two major advantages over conventional frameworks. First, we can achieve spectral modification by changing formants only in those parts where we want control, whereas the user must specify all formants manually in conventional formant synthesisers (e.g. Klatt). Second, this can produce high-quality speech. Our results show the proposed method can control vowels in the synthesized speech by manipulating F1 and F2 without any degradation in synthesis quality.
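
The proposed mapping can be pictured as one linear transform per context (decision-tree leaf) taking formant values to the means of the Gaussians over spectral parameters. A minimal sketch with invented dimensions and random transforms (the paper estimates these piecewise linear transforms from data; the log-Hz input domain below is our assumption):

    import numpy as np

    rng = np.random.default_rng(5)
    D_SPEC, D_FORM = 24, 2                # toy: 24 spectral params, F1 and F2

    # One linear transform per leaf of a context decision tree (here: 2 leaves).
    A = rng.normal(scale=0.01, size=(2, D_SPEC, D_FORM))
    b = rng.normal(size=(2, D_SPEC))

    def spectral_mean(formants_hz, leaf):
        """Gaussian mean over spectral parameters implied by the formant values."""
        f = np.log(np.asarray(formants_hz))  # log-Hz input domain (our assumption)
        return A[leaf] @ f + b[leaf]

    base = spectral_mean([700.0, 1200.0], leaf=0)     # an /a/-like formant target
    shifted = spectral_mean([300.0, 2300.0], leaf=0)  # pushed towards /i/
    print(np.linalg.norm(shifted - base))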

[24] Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi. Feature-space transform tying in unified acoustic-articulatory modelling of articulatory control of HMM-based speech synthesis. In Proc. Interspeech, pages 117-120, Florence, Italy, August 2011. [ bib | .pdf ]
In previous work, we have proposed a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into hidden Markov model (HMM) based parametric speech synthesis. A unified acoustic-articulatory model was trained and a piecewise linear transform was adopted to describe the dependency between these two feature streams. The transform matrices were trained for each HMM state and were tied based on each state's context. In this paper, an improved acoustic-articulatory modelling method is proposed. A Gaussian mixture model (GMM) is introduced to model the articulatory space and the cross-stream transform matrices are trained for each Gaussian mixture instead of context-dependently. This means the dependency relationship can vary with the change of articulatory features flexibly. Our results show this method improves the effectiveness of control over vowel quality by modifying articulatory trajectories without degrading naturalness.

[25] S. Andraszewicz, J. Yamagishi, and S. King. Vocal attractiveness of statistical speech synthesisers. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5368-5371, May 2011. [ bib | DOI ]
Our previous analysis of speaker-adaptive HMM-based speech synthesis methods suggested that there are two possible reasons why average voices can obtain higher subjective scores than any individual adapted voice: 1) model adaptation degrades speech quality proportionally to the distance 'moved' by the transforms, and 2) psychoacoustic effects relating to the attractiveness of the voice. This paper is a follow-on from that analysis and aims to separate these effects out. Our latest perceptual experiments focus on attractiveness, using average voices and speaker-dependent voices without model transformation, and show that using several speakers to create a voice improves smoothness (measured by Harmonics-to-Noise Ratio), reduces the distance of the final voice from the average voice in the log F0-F1 space, and hence makes it more attractive at the segmental level. However, this is weakened or overridden at supra-segmental or sentence levels.

Keywords: speaker-adaptive HMM-based speech synthesis methods;speaker-dependent voices;statistical speech synthesisers;vocal attractiveness;hidden Markov models;speaker recognition;speech synthesis;
[26] P.L. De Leon, I. Hernaez, I. Saratxaga, M. Pucher, and J. Yamagishi. Detection of synthetic speech for the problem of imposture. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 4844-4847, May 2011. [ bib | DOI ]
In this paper, we present new results from our research into the vulnerability of a speaker verification (SV) system to synthetic speech. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, and both GMM-UBM and support vector machine (SVM) SV systems. Using 283 speakers from the Wall-Street Journal (WSJ) corpus, our SV systems have a 0.35% EER. When tested with synthetic speech generated from speaker models derived from the WSJ corpus, over 91% of matched claims are accepted. We propose the use of relative phase shift (RPS) in order to detect synthetic speech and develop a GMM-based synthetic speech classifier (SSC). Using the SSC, we are able to correctly classify human speech in 95% of cases, significantly reducing the vulnerability.

Keywords: EER;GMM-UBM;GMM-based synthetic speech classifier;HMM-based speech synthesizer;RPS;SSC;SV system;WSJ corpus;Wall-Street Journal corpus;relative phase shift;speaker verification system;support vector machine;hidden Markov models;speaker recognition;speech synthesis;support vector machines;
[27] Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King. Evaluation of objective measures for intelligibility prediction of HMM-based synthetic speech in noise. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5112-5115, May 2011. [ bib | DOI | .pdf ]
In this paper we evaluate four objective measures of speech with regard to intelligibility prediction of synthesized speech in diverse noisy situations. We evaluated three intelligibility measures, the Dau measure, the glimpse proportion and the Speech Intelligibility Index (SII), and a quality measure, the Perceptual Evaluation of Speech Quality (PESQ). For the generation of synthesized speech we used a state-of-the-art HMM-based speech synthesis system. The noisy conditions comprised four additive noises. The measures were compared with subjective intelligibility scores obtained in listening tests. The results show the Dau and the glimpse measures to be the best predictors of intelligibility, with correlations of around 0.83 to subjective scores. All measures gave less accurate predictions of intelligibility for synthetic speech than have previously been found for natural speech; in particular the SII measure. In additional experiments, we processed the synthesized speech by an ideal binary mask before adding noise. The glimpse measure gave the most accurate intelligibility predictions in this situation.

[28] J.P. Cabral, S. Renals, J. Yamagishi, and K. Richmond. HMM-based speech synthesiser using the LF-model of the glottal source. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 4704-4707, May 2011. [ bib | DOI | .pdf ]
A major factor which causes a deterioration in speech quality in HMM-based speech synthesis is the use of a simple delta pulse signal to generate the excitation of voiced speech. This paper sets out a new approach to using an acoustic glottal source model in HMM-based synthesisers instead of the traditional pulse signal. The goal is to improve speech quality and to better model and transform voice characteristics. We have found the new method decreases buzziness and also improves prosodic modelling. A perceptual evaluation has supported this finding by showing a 55.6% preference for the new system, as against the baseline. This improvement, while not being as significant as we had initially expected, does encourage us to work on developing the proposed speech synthesiser further.

[29] K. Hashimoto, J. Yamagishi, W. Byrne, S. King, and K. Tokuda. An analysis of machine translation and speech synthesis in speech-to-speech translation system. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5108-5111, May 2011. [ bib | DOI ]
This paper provides an analysis of the impacts of machine translation and speech synthesis on speech-to-speech translation systems. The speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques for integration of speech recognition and machine translation have been proposed. However, speech synthesis has not yet been considered. Therefore, in this paper, we focus on machine translation and speech synthesis, and report a subjective evaluation to analyze the impact of each component. The results of these analyses show that the naturalness and intelligibility of synthesized speech are strongly affected by the fluency of the translated sentences.

Keywords: machine translation;speech recognition;speech synthesis;speech-to-speech translation system;
[30] T. Raitio, A. Suni, J. Yamagishi, H. Pulakka, J. Nurminen, M. Vainio, and P. Alku. HMM-based speech synthesis utilizing glottal inverse filtering. IEEE Transactions on Audio, Speech and Language Processing, 19(1):153-165, January 2011. [ bib | DOI ]
This paper describes a hidden Markov model (HMM)-based speech synthesizer that utilizes glottal inverse filtering for generating natural sounding synthetic speech. In the proposed method, speech is first decomposed into the glottal source signal and the model of the vocal tract filter through glottal inverse filtering, and thus parametrized into excitation and spectral features. The source and filter features are modeled individually in the framework of HMM and generated in the synthesis stage according to the text input. The glottal excitation is synthesized through interpolating and concatenating natural glottal flow pulses, and the excitation signal is further modified according to the spectrum of the desired voice source characteristics. Speech is synthesized by filtering the reconstructed source signal with the vocal tract filter. Experiments show that the proposed system is capable of generating natural sounding speech, and the quality is clearly better compared to two HMM-based speech synthesis systems based on widely used vocoder techniques.

Keywords: Glottal inverse filtering , hidden Markov model (HMM) , speech synthesis
[31] J. Dines, J. Yamagishi, and S. King. Measuring the gap between HMM-based ASR and TTS. IEEE Journal of Selected Topics in Signal Processing, 2011. (in press). [ bib | DOI ]
The EMIME European project is conducting research in the development of technologies for mobile, personalised speech-to-speech translation systems. The hidden Markov model (HMM) is being used as the underlying technology in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components, thus, the investigation of unified statistical modelling approaches has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper we present results and analysis of a series of experiments that have been conducted on English ASR and TTS systems measuring their performance with respect to phone set and lexicon; acoustic feature type and dimensionality; HMM topology; and speaker adaptation. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance often demands diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of such unified modelling approaches.

Keywords: Acoustics, Adaptation model, Context modeling, Hidden Markov models, Speech, Speech recognition, Training, speech recognition, speech synthesis, unified models
[32] Adriana Stan, Junichi Yamagishi, Simon King, and Matthew Aylett. The Romanian speech synthesis (RSS) corpus: Building a high quality HMM-based speech synthesis system using a high sampling rate. Speech Communication, 53(3):442-450, 2011. [ bib | DOI | http ]
This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called “RSS”, along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of these are now freely available for academic use in order to promote Romanian speech technology research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker and was recorded using multiple microphones at 96 kHz sampling frequency in a hemianechoic chamber. The details of the new Romanian text processor we have developed are also given. Using the database, we then revisit some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. As we demonstrate using perceptual tests, these configuration choices can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling frequencies can offer enhanced feature extraction and improved speaker similarity for HMM-based speech synthesis.

Keywords: Speech synthesis, HTS, Romanian, HMMs, Sampling frequency, Auditory scale
[33] Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi. An analysis of HMM-based prediction of articulatory movements. Speech Communication, 52(10):834-846, October 2010. [ bib | DOI ]
This paper presents an investigation into predicting the movement of a speaker's mouth from text input using hidden Markov models (HMM). A corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), is used to train HMMs. To predict articulatory movements for input text, a suitable model sequence is selected and a maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. Unified acoustic-articulatory HMMs are introduced to integrate acoustic features when an acoustic signal is also provided with the input text. Several aspects of this method are analyzed in this paper, including the effectiveness of context-dependent modeling, the role of supplementary acoustic input, and the appropriateness of certain model structures for the unified acoustic-articulatory models. When text is the sole input, we find that fully context-dependent models significantly outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm and an average correlation coefficient of 0.600. When both text and acoustic features are given as input to the system, the difference between the performance of quinphone models and fully context-dependent models is no longer significant. The best performance overall is achieved using unified acoustic-articulatory quinphone HMMs with separate clustering of acoustic and articulatory model parameters, a synchronous-state sequence, and a dependent-feature model structure, with an RMS error of 0.900 mm and a correlation coefficient of 0.855 on average. Finally, we also apply the same quinphone HMMs to the acoustic-articulatory, or inversion, mapping problem, where only acoustic input is available. An average root mean square (RMS) error of 1.076 mm and an average correlation coefficient of 0.812 are achieved. Taken together, our results demonstrate how text and acoustic inputs both contribute to the prediction of articulatory movements in the method used.

Keywords: Hidden Markov model; Articulatory features; Parameter generation
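
The MLPG algorithm referred to throughout generates a smooth static trajectory from per-frame Gaussians over static and delta features by solving W' Sigma^-1 W c = W' Sigma^-1 mu. A minimal single-dimension sketch (the delta window and toy values are ours):

    import numpy as np

    def mlpg(mu, var, delta_win=(-0.5, 0.0, 0.5)):
        """ML parameter generation for one static dimension.

        mu, var: (T, 2) arrays holding [static, delta] means/variances per frame.
        """
        T = mu.shape[0]
        W = np.zeros((2 * T, T))
        for t in range(T):
            W[2 * t, t] = 1.0                        # static row
            for k, w in zip((-1, 0, 1), delta_win):  # delta row
                if 0 <= t + k < T:
                    W[2 * t + 1, t + k] = w
        prec = 1.0 / var.reshape(-1)                 # diagonal Sigma^-1
        A = W.T @ (prec[:, None] * W)
        b = W.T @ (prec * mu.reshape(-1))
        return np.linalg.solve(A, b)                 # smooth static trajectory

    T = 8
    mu = np.stack([np.linspace(0, 1, T), np.full(T, 1.0 / (T - 1))], axis=1)
    var = np.full((T, 2), 0.1)
    print(mlpg(mu, var))
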
[34] Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi. HMM-based text-to-articulatory-movement prediction and analysis of critical articulators. In Proc. Interspeech, pages 2194-2197, Makuhari, Japan, September 2010. [ bib | .pdf ]
In this paper we present a method to predict the movement of a speaker's mouth from text input using hidden Markov models (HMM). We have used a corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), to train HMMs. To predict articulatory movements from text, a suitable model sequence is selected and the maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. In our experiments, we find that fully context-dependent models outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm when state durations are predicted from text, and 0.872 mm when natural state durations are used. Finally, we go on to analyze the prediction error for different EMA dimensions and phone types. We find a clear pattern emerges that the movements of so-called critical articulators can be predicted more accurately than the average performance.

Keywords: Hidden Markov model, articulatory features, parameter generation, critical articulators
[35] Yong Guan, Jilei Tian, Yi-Jian Wu, Junichi Yamagishi, and Jani Nurminen. A unified and automatic approach of Mandarin HTS system. In Proc. SSW7, Kyoto, Japan, September 2010. [ bib | .pdf ]
Keywords: HTS, speech synthesis, mandarin
[36] João Cabral, Steve Renals, Korin Richmond, and Junichi Yamagishi. Transforming voice source parameters in a HMM-based speech synthesiser with glottal post-filtering. In Proc. 7th ISCA Speech Synthesis Workshop (SSW7), pages 365-370, NICT/ATR, Kyoto, Japan, September 2010. [ bib | .pdf ]
Control over voice quality, e.g. breathy and tense voice, is important for speech synthesis applications. For example, transformations can be used to modify aspects of the voice related to speaker's identity and to improve expressiveness. However, it is hard to modify voice characteristics of the synthetic speech without degrading speech quality. State-of-the-art statistical speech synthesisers, in particular, do not typically allow control over parameters of the glottal source, which are strongly correlated with voice quality. Consequently, the control of voice characteristics in these systems is limited. In contrast, the HMM-based speech synthesiser proposed in this paper uses an acoustic glottal source model. The system passes the glottal signal through a whitening filter to obtain the excitation of voiced sounds. This technique, called glottal post-filtering, allows the voice characteristics of the synthetic speech to be transformed by modifying the source model parameters. We evaluated the proposed synthesiser in a perceptual experiment, in terms of speech naturalness, intelligibility, and similarity to the original speaker's voice. The results show that it performed as well as an HMM-based synthesiser which generates the speech signal with a commonly used high-quality speech vocoder.

Keywords: HMM-based speech synthesis, voice quality, glottal post-filter
[37] Oliver Watts, Junichi Yamagishi, and Simon King. The role of higher-level linguistic features in HMM-based speech synthesis. In Proc. Interspeech, pages 841-844, Makuhari, Japan, September 2010. [ bib | .pdf ]
We analyse the contribution of higher-level elements of the linguistic specification of a data-driven speech synthesiser to the naturalness of the synthetic speech which it generates. The system is trained using various subsets of the full feature-set, in which features relating to syntactic category, intonational phrase boundary, pitch accent and boundary tones are selectively removed. Utterances synthesised by the different configurations of the system are then compared in a subjective evaluation of their naturalness. The work presented forms background analysis for an ongoing set of experiments in performing text-to-speech (TTS) conversion based on shallow features: features that can be trivially extracted from text. By building a range of systems, each assuming the availability of a different level of linguistic annotation, we obtain benchmarks for our on-going work.

[38] Junichi Yamagishi, Oliver Watts, Simon King, and Bela Usabaev. Roles of the average voice in speaker-adaptive HMM-based speech synthesis. In Proc. Interspeech, pages 418-421, Makuhari, Japan, September 2010. [ bib | .pdf ]
In speaker-adaptive HMM-based speech synthesis, there are typically a few speakers for which the output synthetic speech sounds worse than that of other speakers, despite having the same amount of adaptation data from within the same corpus. This paper investigates these fluctuations in quality and concludes that as mel-cepstral distance from the average voice becomes larger, the MOS naturalness scores generally become worse. Although this negative correlation is not that strong, it suggests a way to improve the training and adaptation strategies. We also draw comparisons between our findings and the work of other researchers regarding “vocal attractiveness.”

Keywords: speech synthesis, HMM, average voice, speaker adaptation
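
The mel-cepstral distance used above is conventionally reported as mel-cepstral distortion in dB. A minimal sketch of that conventional computation (the cepstra here are random stand-ins for average-voice and adapted-voice parameters):

    import numpy as np

    def mel_cepstral_distortion(c_ref, c_test):
        """Frame-averaged mel-cepstral distortion in dB, excluding c0."""
        diff = c_ref[:, 1:] - c_test[:, 1:]
        return float(np.mean((10.0 / np.log(10.0))
                             * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

    rng = np.random.default_rng(6)
    avg_voice = rng.normal(size=(100, 25))                        # toy mel-cepstra
    adapted = avg_voice + rng.normal(scale=0.05, size=(100, 25))  # nearby voice
    print(mel_cepstral_distortion(avg_voice, adapted))
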
[39] Mirjam Wester, John Dines, Matthew Gibson, Hui Liang, Yi-Jian Wu, Lakshmi Saheer, Simon King, Keiichiro Oura, Philip N. Garner, William Byrne, Yong Guan, Teemu Hirsimäki, Reima Karhila, Mikko Kurimo, Matt Shannon, Sayaka Shiota, Jilei Tian, Keiichi Tokuda, and Junichi Yamagishi. Speaker adaptation and the evaluation of speaker similarity in the EMIME speech-to-speech translation project. In Proc. of 7th ISCA Speech Synthesis Workshop, Kyoto, Japan, September 2010. [ bib | .pdf ]
This paper provides an overview of speaker adaptation research carried out in the EMIME speech-to-speech translation (S2ST) project. We focus on how speaker adaptation transforms can be learned from speech in one language and applied to the acoustic models of another language. The adaptation is transferred across languages and/or from recognition models to synthesis models. The various approaches investigated can all be viewed as a process in which a mapping is defined in terms of either acoustic model states or linguistic units. The mapping is used to transfer either speech data or adaptation transforms between the two models. Because the success of speaker adaptation in text-to-speech synthesis is measured by judging speaker similarity, we also discuss issues concerning evaluation of speaker similarity in an S2ST scenario.

[40] Michael Pucher, Dietmar Schabus, and Junichi Yamagishi. Synthesis of fast speech with interpolation of adapted HSMMs and its evaluation by blind and sighted listeners. In Proc. Interspeech, pages 2186-2189, Makuhari, Japan, September 2010. [ bib | .pdf ]
In this paper we evaluate a method for generating synthetic speech at high speaking rates based on the interpolation of hidden semi-Markov models (HSMMs) trained on speech data recorded at normal and fast speaking rates. The subjective evaluation was carried out with both blind listeners, who are used to very fast speaking rates, and sighted listeners. We show that we can achieve a better intelligibility rate and higher voice quality with this method compared to standard HSMM-based duration modeling. We also evaluate duration modeling with the interpolation of all the acoustic features including not only duration but also spectral and F0 models. An analysis of the mean squared error (MSE) of standard HSMM-based duration modeling for fast speech identifies problematic linguistic contexts for duration modeling.

Keywords: speech synthesis, fast speech, hidden semi-Markov model
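
The interpolation evaluated here amounts to a weighted combination of corresponding parameters of the normal-rate and fast-rate models. A minimal sketch over a dictionary of toy parameter vectors (a real HSMM interpolation operates on the full model set, including duration, spectral and F0 streams):

    import numpy as np

    def interpolate_models(params_normal, params_fast, lam):
        """Linear interpolation of corresponding model parameters.

        lam = 0 reproduces the normal-rate voice, lam = 1 the fast-rate voice.
        """
        return {k: (1.0 - lam) * params_normal[k] + lam * params_fast[k]
                for k in params_normal}

    normal = {"dur_mean": np.array([9.0, 14.0, 11.0]),   # toy state durations (frames)
              "lf0_mean": np.array([4.70, 4.80, 4.75])}  # toy log-F0 means
    fast = {"dur_mean": np.array([5.0, 8.0, 6.0]),
            "lf0_mean": np.array([4.72, 4.83, 4.78])}
    print(interpolate_models(normal, fast, lam=0.5)["dur_mean"])  # [7.0, 11.0, 8.5]
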
[41] Sebastian Andersson, Junichi Yamagishi, and Robert Clark. Utilising spontaneous conversational speech in HMM-based speech synthesis. In The 7th ISCA Tutorial and Research Workshop on Speech Synthesis, September 2010. [ bib | .pdf ]
Spontaneous conversational speech has many characteristics that are currently not well modelled in unit selection and HMM-based speech synthesis. But in order to build synthetic voices more suitable for interaction we need data that exhibits more conversational characteristics than the generally used read aloud sentences. In this paper we show how carefully selected utterances from a spontaneous conversation were instrumental in building HMM-based synthetic voices with more natural sounding conversational characteristics than a voice based on carefully read aloud sentences. We also investigated a style blending technique as a solution to the inherent problem of phonetic coverage in spontaneous speech data. But the lack of an appropriate representation of spontaneous speech phenomena probably contributed to results showing that we could not yet compete with the speech quality achieved for grammatical sentences.

[42] Oliver Watts, Junichi Yamagishi, and Simon King. Letter-based speech synthesis. In Proc. Speech Synthesis Workshop 2010, pages 317-322, Nara, Japan, September 2010. [ bib | .pdf ]
Initial attempts at performing text-to-speech conversion based on standard orthographic units are presented, forming part of a larger scheme of training TTS systems on features that can be trivially extracted from text. We evaluate the possibility of using the technique of decision-tree-based context clustering conventionally used in HMM-based systems for parameter-tying to handle letter-to-sound conversion. We present the application of a method of compound-feature discovery to corpus-based speech synthesis. Finally, an evaluation of intelligibility of letter-based systems and more conventional phoneme-based systems is presented.

[43] Mikko Kurimo, William Byrne, John Dines, Philip N. Garner, Matthew Gibson, Yong Guan, Teemu Hirsimäki, Reima Karhila, Simon King, Hui Liang, Keiichiro Oura, Lakshmi Saheer, Matt Shannon, Sayaka Shiota, Jilei Tian, Keiichi Tokuda, Mirjam Wester, Yi-Jian Wu, and Junichi Yamagishi. Personalising speech-to-speech translation in the EMIME project. In Proc. of the ACL 2010 System Demonstrations, Uppsala, Sweden, July 2010. [ bib | .pdf ]
In the EMIME project we have studied unsupervised cross-lingual speaker adaptation. We have employed an HMM statistical framework for both speech recognition and synthesis which provides transformation mechanisms to adapt the synthesized voice in TTS (text-to-speech) using the recognized voice in ASR (automatic speech recognition). An important application for this research is personalised speech-to-speech translation that will use the voice of the speaker in the input language to utter the translated sentences in the output language. In mobile environments this enhances the users' interaction across language barriers by making the output speech sound more like the original speaker's way of speaking, even if she or he could not speak the output language.

[44] O. Watts, J. Yamagishi, S. King, and K. Berkling. Synthesis of child speech with HMM adaptation and voice conversion. Audio, Speech, and Language Processing, IEEE Transactions on, 18(5):1005-1016, July 2010. [ bib | DOI | .pdf ]
The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesizer from that data. We chose to build a statistical parametric synthesizer using the hidden Markov model (HMM)-based system HTS, as this technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. Six different configurations of the synthesizer were compared, using both speaker-dependent and speaker-adaptive modeling techniques, and using varying amounts of data. For comparison with HMM adaptation, techniques from voice conversion were used to transform existing synthesizers to the characteristics of the target speaker. Speaker-adaptive voices generally outperformed child speaker-dependent voices in the evaluation. HMM adaptation outperformed voice conversion style techniques when using the full target speaker corpus; with fewer adaptation data, however, no significant listener preference for either HMM adaptation or voice conversion methods was found.

Keywords: HMM adaptation techniques;child speech synthesis;hidden Markov model;speaker adaptive modeling technique;speaker dependent technique;speaker-adaptive voice;statistical parametric synthesizer;target speaker corpus;voice conversion;hidden Markov models;speech synthesis;
[45] J. Yamagishi, B. Usabaev, S. King, O. Watts, J. Dines, J. Tian, R. Hu, Y. Guan, K. Oura, K. Tokuda, R. Karhila, and M. Kurimo. Thousands of voices for HMM-based speech synthesis - analysis and application of TTS systems built on various ASR corpora. IEEE Transactions on Audio, Speech and Language Processing, 18(5):984-1004, July 2010. [ bib | DOI ]
In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.

Keywords: Automatic speech recognition (ASR), H Triple S (HTS), SPEECON database, WSJ database, average voice, hidden Markov model (HMM)-based speech synthesis, speaker adaptation, speech synthesis, voice conversion
[46] R. Barra-Chicote, J. Yamagishi, S. King, J.M. Montero, and J. Macias-Guarasa. Analysis of statistical parametric and unit-selection speech synthesis systems applied to emotional speech. Speech Communication, 52(5):394-404, May 2010. [ bib | DOI ]
We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests to evaluate speech quality, emotion identification rates and emotional strength were used for the six emotions which we recorded - happiness, sadness, anger, surprise, fear, disgust. For the HMM-based method, we evaluated spectral and source components separately and identified which components contribute to which emotion. Our analysis shows that, although the HMM method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions having context-dependent prosodic patterns. Whilst synthetic speech produced using the unit selection method has better emotional strength scores than the HMM-based method, the HMM-based method has the ability to manipulate the emotional strength. For emotions that are characterized by both spectral and prosodic components, synthetic speech using unit selection methods was more accurately identified by listeners. For emotions mainly characterized by prosodic components, HMM-based synthetic speech was more accurately identified. This finding differs from previous results regarding listener judgements of speaker similarity for neutral speech. We conclude that unit selection methods require improvements to prosodic modeling and that HMM-based methods require improvements to spectral modeling for emotional speech. Certain emotions cannot be reproduced well by either method.

Keywords: Emotional speech synthesis; HMM-based synthesis; Unit selection
[47] P. L. De Leon, V. R. Apsingekar, M. Pucher, and J. Yamagishi. Revisiting the security of speaker verification systems against imposture using synthetic speech. In Proc. ICASSP 2010, Dallas, Texas, USA, 2010. [ bib | .pdf ]
[48] P.L. De Leon, M. Pucher, and J. Yamagishi. Evaluation of the vulnerability of speaker verification to synthetic speech. In Proc. Odyssey (The speaker and language recognition workshop) 2010, Brno, Czech Republic, 2010. [ bib | .pdf ]
[49] Keiichiro Oura, Keiichi Tokuda, Junichi Yamagishi, Mirjam Wester, and Simon King. Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. In Proc. ICASSP, volume I, pages 4954-4957, 2010. [ bib | .pdf ]
In the EMIME project, we are developing a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrate two techniques, unsupervised adaptation for HMM-based TTS using a word-based large-vocabulary continuous speech recognizer and cross-lingual speaker adaptation for HMM-based TTS, into a single architecture. Thus, an unsupervised cross-lingual speaker adaptation system can be developed. Listening tests show very promising results, demonstrating that adapted voices sound similar to the target speaker and that differences between supervised and unsupervised cross-lingual speaker adaptation are small.

[50] Michael Pucher, Dietmar Schabus, Junichi Yamagishi, Friedrich Neubarth, and Volker Strom. Modeling and interpolation of Austrian German and Viennese dialect in HMM-based speech synthesis. Speech Communication, 52(2):164-179, 2010. [ bib | DOI ]
An HMM-based speech synthesis framework is applied to both Standard Austrian German and a Viennese dialectal variety, and several training strategies for multi-dialect modeling, such as dialect clustering and dialect-adaptive training, are investigated. To bridge the gap between processing on the level of HMMs and on the linguistic level, we add phonological transformations to the HMM interpolation and apply them to dialect interpolation. The crucial steps are to employ several formalized phonological rules between Austrian German and Viennese dialect as constraints for the HMM interpolation. We verify the effectiveness of this strategy in a number of perceptual evaluations. Since the HMM space used is acoustic rather than articulatory, evaluation results vary somewhat between the phonological rules. In general, however, we obtained good evaluation results, which show that listeners can perceive both continuous and categorical changes of dialect varieties when phonological transformations are employed as switching rules in the HMM interpolation.
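
The interpolation operation at the heart of this entry can be illustrated compactly. The sketch below is illustrative only (Gaussian state statistics with diagonal covariance are assumed; names, dimensions and the convex variance combination are not taken from the paper): the state statistics of an Austrian German model and a Viennese model are mixed with a weight alpha, which is the operation the phonological switching rules constrain.

    import numpy as np

    def interpolate_state(mean_a, var_a, mean_b, var_b, alpha):
        # convex combination of two Gaussian state distributions
        # (alpha = 1 -> pure model A, alpha = 0 -> pure model B)
        mean = alpha * mean_a + (1.0 - alpha) * mean_b
        var = alpha * var_a + (1.0 - alpha) * var_b
        return mean, var

    # toy statistics for one state of each dialect model
    mean_at, var_at = np.array([1.0, 0.5, -0.2]), np.ones(3)
    mean_vd, var_vd = np.array([0.6, 0.9, 0.1]), np.full(3, 1.5)
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, *interpolate_state(mean_at, var_at, mean_vd, var_vd, alpha))

Sweeping alpha continuously between 0 and 1 is what produces the intermediate dialect varieties evaluated in the paper's listening tests.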

[51] J. Yamagishi and S. King. Simple methods for improving speaker-similarity of HMM-based speech synthesis. In Proc. ICASSP 2010, Dallas, Texas, USA, 2010. [ bib | .pdf ]
[52] Heiga Zen, Keiichiro Oura, Takashi Nose, Junichi Yamagishi, Shinji Sako, Tomoki Toda, Takashi Masuko, Alan W. Black, and Keiichi Tokuda. Recent development of the HMM-based speech synthesis system (HTS). In Proc. 2009 Asia-Pacific Signal and Information Processing Association (APSIPA), Sapporo, Japan, October 2009. [ bib | .pdf ]
A statistical parametric approach to speech synthesis based on hidden Markov models (HMMs) has grown in popularity over the last few years. In this approach, spectrum, excitation, and duration of speech are simultaneously modeled by context-dependent HMMs, and speech waveforms are generated from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named “HMM-based speech synthesis system (HTS)” to provide a research and development toolkit for statistical parametric speech synthesis. This paper describes recent developments of HTS in detail, as well as future release plans.

[53] J. Sebastian Andersson, Joao P. Cabral, Leonardo Badino, Junichi Yamagishi, and Robert A.J. Clark. Glottal source and prosodic prominence modelling in HMM-based speech synthesis for the Blizzard Challenge 2009. In The Blizzard Challenge 2009, Edinburgh, U.K., September 2009. [ bib | .pdf ]
This paper describes the CSTR entry for the Blizzard Challenge 2009. The work focused on modifying two parts of the Nitech 2005 HTS speech synthesis system to improve naturalness and contextual appropriateness. The first part incorporated an implementation of the Liljencrants-Fant (LF) glottal source model. The second part focused on improving the synthesis of prosodic prominence, including emphasis, through context-dependent phonemes. Emphasis was assigned to the synthesised test sentences based on a handful of theory-based rules. The two parts (LF-model and prosodic prominence) were not combined and hence were evaluated separately. The results on naturalness for the LF-model showed that it is not yet perceived as being as natural as the benchmark HTS system for neutral speech. The results for the prosodic prominence modelling showed that it was perceived as being as contextually appropriate as the benchmark HTS system, despite a low naturalness score. The Blizzard Challenge evaluation has provided valuable information on the status of our work, and continued work will begin with analysing why our modifications resulted in reduced naturalness compared to the benchmark HTS system.

[54] Oliver Watts, Junichi Yamagishi, Simon King, and Kay Berkling. HMM adaptation and voice conversion for the synthesis of child speech: A comparison. In Proc. Interspeech 2009, pages 2627-2630, Brighton, U.K., September 2009. [ bib | .pdf ]
This study compares two different methodologies for producing data-driven synthesis of child speech from existing systems that have been trained on the speech of adults. On one hand, an existing statistical parametric synthesiser is transformed using model adaptation techniques, informed by linguistic and prosodic knowledge, to the speaker characteristics of a child speaker. This is compared with the application of voice conversion techniques to convert the output of an existing waveform concatenation synthesiser with no explicit linguistic or prosodic knowledge. In a subjective evaluation of the similarity of synthetic speech to natural speech from the target speaker, the HMM-based systems evaluated are generally preferred, although this is at least in part due to the higher dimensional acoustic features supported by these techniques.

[55] Leonardo Badino, J. Sebastian Andersson, Junichi Yamagishi, and Robert A.J. Clark. Identification of contrast and its emphatic realization in HMM-based speech synthesis. In Proc. Interspeech 2009, Brighton, U.K., September 2009. [ bib | .PDF ]
The work presented in this paper proposes to identify contrast in the form of contrastive word pairs and to signal it prosodically with emphatic accents in a text-to-speech (TTS) application using a hidden-Markov-model (HMM) based speech synthesis system. We first describe a novel method to automatically detect contrastive word pairs using textual features only, and report its performance on a corpus of spontaneous conversations in English. Subsequently we describe the set of features selected to train an HMM-based speech synthesis system with which we attempt to properly control prosodic prominence, including emphasis. Results from a large-scale perceptual test show that in the majority of cases listeners judge emphatic contrastive word pairs to be as acceptable as their non-emphatic counterparts, while emphasis on non-contrastive pairs is almost never acceptable.

[56] Junichi Yamagishi, Mike Lincoln, Simon King, John Dines, Matthew Gibson, Jilei Tian, and Yong Guan. Analysis of unsupervised and noise-robust speaker-adaptive HMM-based speech synthesis systems toward a unified ASR and TTS framework. In Proc. Interspeech 2009, Brighton, U.K., September 2009. [ bib ]
For the 2009 Blizzard Challenge we have built an unsupervised version of the HTS-2008 speaker-adaptive HMM-based speech synthesis system for English, and a noise-robust version of the system for Mandarin. They are designed from a multidisciplinary application point of view, in that we attempt to integrate the components of the TTS system with other technologies such as ASR. All the average voice models are trained exclusively from recognized, publicly available ASR databases. Multi-pass LVCSR and confidence scores calculated from confusion networks are used for the unsupervised systems, and noisy data recorded in cars or public spaces is used for the noise-robust system. We believe the developed systems form solid benchmarks and provide good connections to the ASR field. This paper describes the development of the systems and reports the results and analysis of their evaluation.

[57] J. Dines, J. Yamagishi, and S. King. Measuring the gap between HMM-based ASR and TTS. In Proc. Interspeech, pages 1391-1394, Brighton, U.K., September 2009. [ bib ]
The EMIME European project is conducting research in the development of technologies for mobile, personalised speech-to-speech translation systems. The hidden Markov model is being used as the underlying technology in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components, thus, the investigation of unified statistical modelling approaches has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper we present results and analysis of a series of experiments that have been conducted on English ASR and TTS systems, measuring their performance with respect to phone set and lexicon, acoustic feature type and dimensionality and HMM topology. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance often demands diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of such unified modelling approaches.

[58] J. Yamagishi, Bela Usabaev, Simon King, Oliver Watts, John Dines, Jilei Tian, Rile Hu, Yong Guan, Keiichiro Oura, Keiichi Tokuda, Reima Karhila, and Mikko Kurimo. Thousands of voices for HMM-based speech synthesis. In Proc. Interspeech, pages 420-423, Brighton, U.K., September 2009. [ bib | http ]
Our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper we show thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal databases (WSJ0/WSJ1/WSJCAM0), Resource Management, Globalphone and Speecon. We report some perceptual evaluation results and outline the outstanding issues.

[59] Z. Ling, K. Richmond, J. Yamagishi, and R. Wang. Integrating articulatory features into HMM-based parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing, 17(6):1171-1185, August 2009. IEEE SPS 2010 Young Author Best Paper Award. [ bib | DOI ]
This paper presents an investigation of ways to integrate articulatory features into Hidden Markov Model (HMM)-based parametric speech synthesis, primarily with the aim of improving the performance of acoustic parameter generation. The joint distribution of acoustic and articulatory features is estimated during training and is then used for parameter generation at synthesis time in conjunction with a maximum-likelihood criterion. Different model structures are explored to allow the articulatory features to influence acoustic modeling: model clustering, state synchrony and cross-stream feature dependency. The results of objective evaluation show that the accuracy of acoustic parameter prediction can be improved when shared clustering and asynchronous-state model structures are adopted for combined acoustic and articulatory features. More significantly, our experiments demonstrate that modeling the dependency between these two feature streams can make speech synthesis more flexible. The characteristics of synthetic speech can be easily controlled by modifying generated articulatory features as part of the process of acoustic parameter generation.
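
The cross-stream dependency idea in this entry admits a very small toy illustration. In the sketch below, everything (dimensions, matrices, values) is an assumed placeholder rather than the paper's model: the acoustic mean of a state is shifted by a linear regression on the articulatory features, so changing the generated articulation also changes the generated acoustics.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(scale=0.1, size=(4, 3))   # articulatory-to-acoustic regression
    mu = rng.normal(size=4)                  # state-dependent acoustic mean

    def acoustic_mean(articulatory):
        # cross-stream dependency: acoustics shift linearly with articulation
        return mu + A @ articulatory

    x = np.array([0.2, -0.1, 0.4])           # toy articulator positions
    print(acoustic_mean(x))
    print(acoustic_mean(x + np.array([0.3, 0.0, 0.0])))  # altered articulation

This is what makes the control flexible: at generation time one can edit the articulatory trajectory and let the regression carry the change into the acoustic parameters.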

[60] J. Cabral, S. Renals, K. Richmond, and J. Yamagishi. HMM-based speech synthesis with an acoustic glottal source model. In Proc. The First Young Researchers Workshop in Speech Technology, April 2009. [ bib | .pdf ]
A major cause of degradation of speech quality in HMM-based speech synthesis is the use of a simple delta pulse signal to generate the excitation of voiced speech. This paper describes a new approach to using an acoustic glottal source model in HMM-based synthesisers. The goal is to improve speech quality and parametric flexibility to better model and transform voice characteristics.

[61] Sarah Creer, Phil Green, Stuart Cunningham, and Junichi Yamagishi. Building personalised synthesised voices for individuals with dysarthria using the HTS toolkit. In John W. Mullennix and Steven E. Stern, editors, Computer Synthesized Speech Technologies: Tools for Aiding Impairment. IGI Global, 1st edition, 2009. in press. [ bib ]
When the speech of an individual becomes unintelligible due to a neurological disorder, a synthesized voice can replace that of the individual. To fully replace all functions of human speech communication (communication of information, maintenance of social relationships, and display of identity), the voice must be intelligible, natural-sounding and retain the vocal identity of the speaker. For speakers with dysarthria, achieving this output with minimal data recordings and deteriorating speech is difficult. An alternative is to use hidden Markov models (HMMs), which require much less speech data than concatenative methods, to adapt a robust statistical model of speech towards the speaker characteristics captured in the data recorded by the individual. This chapter implements this technique using the HTS toolkit to build personalized synthetic voices for two individuals with dysarthria. An evaluation of the voices by the participants themselves suggests that this technique shows promise for building and reconstructing personalized voices for individuals with dysarthria once deterioration has begun.

[62] Matthew P. Aylett, Simon King, and Junichi Yamagishi. Speech synthesis without a phone inventory. In Interspeech, pages 2087-2090, 2009. [ bib | .pdf ]
In speech synthesis the unit inventory is decided using phonological and phonetic expertise. This process is resource-intensive and potentially sub-optimal. In this paper we investigate how acoustic clustering, together with lexicon constraints, can be used to build a self-organised inventory. Six English speech synthesis systems were built using two frameworks, unit selection and parametric HTS, for three inventory conditions: 1) a traditional phone set, 2) a system using orthographic units, and 3) a self-organised inventory. A listening test showed a strong preference for the classic system, and for the orthographic system over the self-organised system. Results also varied by letter-to-sound complexity and database coverage. This suggests the self-organised approach failed to generalise pronunciation, as well as introducing noise above and beyond that caused by orthographic-sound mismatch.

[63] Junichi Yamagishi, Takashi Nose, Heiga Zen, Zhenhua Ling, Tomoki Toda, Keiichi Tokuda, Simon King, and Steve Renals. Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Transactions on Audio, Speech and Language Processing, 17(6):1208-1230, 2009. [ bib | http | www: ]
This paper describes a speaker-adaptive HMM-based speech synthesis system. The new system, called “HTS-2007,” employs speaker adaptation (CSMAPLR+MAP), feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in our previous systems. Subjective evaluation results show that the new system generates significantly better quality synthetic speech than speaker-dependent approaches with realistic amounts of speech data, and that it bears comparison with speaker-dependent approaches even when large amounts of speech data are available. In addition, a comparison study with several speech synthesis techniques shows the new system is very robust: It is able to build voices from less-than-ideal speech data and synthesize good-quality speech even for out-of-domain sentences.

[64] R. Barra-Chicote, J. Yamagishi, J.M. Montero, S. King, S. Lutfi, and J. Macias-Guarasa. Generacion de una voz sintetica en Castellano basada en HSMM para la Evaluacion Albayzin 2008: conversion texto a voz. In V Jornadas en Tecnologia del Habla, pages 115-118, November 2008. (in Spanish). [ bib | .pdf ]
[65] Oliver Watts, Junichi Yamagishi, Kay Berkling, and Simon King. HMM-based synthesis of child speech. In Proc. of The 1st Workshop on Child, Computer and Interaction (ICMI'08 post-conference workshop), Crete, Greece, October 2008. [ bib | .pdf ]
The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesiser from that data. Because only limited data can be collected, and the domain of that data is constrained, it is difficult to obtain the type of phonetically-balanced corpus usually used in speech synthesis. As a consequence, building a synthesiser from this data is difficult. Concatenative synthesisers are not robust to corpora with many missing units (as is likely when the corpus content is not carefully designed), so we chose to build a statistical parametric synthesiser using the HMM-based system HTS. This technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. We compared six different configurations of the synthesiser, using both speaker-dependent and speaker-adaptive modelling techniques, and using varying amounts of data. The output from these systems was evaluated alongside natural and vocoded speech, in a Blizzard-style listening test.

[66] Zhen-Hua Ling, Korin Richmond, Junichi Yamagishi, and Ren-Hua Wang. Articulatory control of HMM-based parametric speech synthesis driven by phonetic knowledge. In Proc. Interspeech, pages 573-576, Brisbane, Australia, September 2008. [ bib | .PDF ]
This paper presents a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into a Hidden Markov Model (HMM)-based parametric speech synthesis system. In contrast to model adaptation and interpolation approaches for speaking style control, this method is driven by phonetic knowledge, and target speech samples are not required. The joint distribution of parallel acoustic and articulatory features considering cross-stream feature dependency is estimated. At synthesis time, acoustic and articulatory features are generated simultaneously based on the maximum-likelihood criterion. The synthetic speech can be controlled flexibly by modifying the generated articulatory features according to arbitrary phonetic rules in the parameter generation process. Our experiments show that the proposed method is effective in both changing the overall character of synthesized speech and in controlling the quality of a specific vowel.

[67] Matthew P. Aylett and Junichi Yamagishi. Combining statistical parametric speech synthesis and unit-selection for automatic voice cloning. In Proc. LangTech 2008, Brisbane, Australia, September 2008. [ bib | .pdf ]
The ability to use the recorded audio of a subject's voice to produce an open-domain synthesis system has generated much interest both in academic research and in commercial speech technology. The ability to produce synthetic versions of a subject's voice has potential commercial applications, such as virtual celebrity actors, or potential clinical applications, such as offering a synthetic replacement voice in the case of a laryngectomy. Recent developments in HMM-based speech synthesis have shown it is possible to produce synthetic voices from quite small amounts of speech data. However, mimicking the depth and variation of a speaker's prosody as well as synthesising natural voice quality is still a challenging research problem. In contrast, unit-selection systems have shown it is possible to strongly retain the character of the voice, but only with sufficient original source material; often this runs into hours and may require significant manual checking and labelling. In this paper we present two state-of-the-art systems: an HMM-based system, HTS-2007, developed by CSTR and Nagoya Institute of Technology, and a commercial unit-selection system, CereVoice, developed by CereProc. Both systems have been used to mimic the voice of George W. Bush (43rd president of the United States) using freely available audio from the web. In addition we present a hybrid system which combines both technologies. We demonstrate examples of synthetic voices created from 10, 40 and 210 minutes of randomly selected speech. We then discuss the underlying problems associated with voice cloning using found audio, and the scalability of our solution.

[68] Junichi Yamagishi, Zhenhua Ling, and Simon King. Robustness of HMM-based speech synthesis. In Proc. Interspeech 2008, pages 581-584, Brisbane, Australia, September 2008. [ bib | .pdf ]
As speech synthesis techniques become more advanced, we are able to consider building high-quality voices from data collected outside the usual highly-controlled recording studio environment. This presents new challenges that are not present in conventional text-to-speech synthesis: the available speech data are not perfectly clean, the recording conditions are not consistent, and/or the phonetic balance of the material is not ideal. Although a clear picture of the performance of various speech synthesis techniques (e.g., concatenative, HMM-based or hybrid) under good conditions is provided by the Blizzard Challenge, it is not well understood how robust these algorithms are to less favourable conditions. In this paper, we analyse the performance of several speech synthesis methods under such conditions. This is, as far as we know, a new research topic: “Robust speech synthesis.” As a consequence of our investigations, we propose a new robust training method for HMM-based speech synthesis for use with speech data collected in unfavourable conditions.

[69] J. Cabral, S. Renals, K. Richmond, and J. Yamagishi. Glottal spectral separation for parametric speech synthesis. In Proc. Interspeech, pages 1829-1832, Brisbane, Australia, September 2008. [ bib | .PDF ]
[70] Gregor Hofer, Junichi Yamagishi, and Hiroshi Shimodaira. Speech-driven lip motion generation with a trajectory HMM. In Proc. Interspeech 2008, pages 2314-2317, Brisbane, Australia, September 2008. [ bib | .pdf ]
Automatic speech animation remains a challenging problem that can be described as finding the optimal sequence of animation parameter configurations given some speech. In this paper we present a novel technique to automatically synthesise lip motion trajectories from a speech signal. The developed system predicts lip motion units from the speech signal and generates animation trajectories automatically employing a trajectory hidden Markov model. Using the MLE criterion, its parameter generation algorithm produces the optimal smooth motion trajectories that are used to drive control points on the lips directly. Additionally, experiments were carried out to find a suitable model unit that produces the most accurate results. Finally, a perceptual evaluation was conducted, which showed that the developed motion units perform better than phonemes.
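
The MLE parameter generation step mentioned above can be made concrete in a few lines. Under simplifying assumptions (one-dimensional features, a single backward-difference delta window, identity precision), the smooth trajectory c that maximises the likelihood of the static-plus-delta means mu solves the normal equations c = (W' D W)^{-1} W' D mu with D = Sigma^{-1}. The sketch below is a toy version of this standard computation, not the paper's system:

    import numpy as np

    T = 5
    W = np.zeros((2 * T, T))          # rows: [static c_t; delta c_t - c_{t-1}]
    for t in range(T):
        W[2 * t, t] = 1.0             # static coefficient
        W[2 * t + 1, t] = 1.0         # delta coefficient
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    mu = np.zeros(2 * T)
    mu[0::2] = [0.0, 1.0, 1.0, 0.0, 0.0]   # static means; delta means left at 0
    D = np.eye(2 * T)                       # precision matrix (identity for brevity)
    c = np.linalg.solve(W.T @ D @ W, W.T @ D @ mu)
    print(c)                                # a smoothed trajectory through the means

Because the delta constraints penalise frame-to-frame jumps, the solved trajectory passes smoothly through the state means instead of stepping between them, which is what makes the generated lip motion continuous.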

[71] Simon King, Keiichi Tokuda, Heiga Zen, and Junichi Yamagishi. Unsupervised adaptation for HMM-based speech synthesis. In Proc. Interspeech, pages 1869-1872, Brisbane, Australia, September 2008. [ bib | .PDF ]
It is now possible to synthesise speech using HMMs with a comparable quality to unit-selection techniques. Generating speech from a model has many potential advantages over concatenating waveforms. The most exciting is model adaptation. It has been shown that supervised speaker adaptation can yield high-quality synthetic voices with an order of magnitude less data than required to train a speaker-dependent model or to build a basic unit-selection system. Such supervised methods require labelled adaptation data for the target speaker. In this paper, we introduce a method capable of unsupervised adaptation, using only speech from the target speaker without any labelling.

[72] Junichi Yamagishi, Heiga Zen, Yi-Jian Wu, Tomoki Toda, and Keiichi Tokuda. The HTS-2008 system: Yet another evaluation of the speaker-adaptive HMM-based speech synthesis system in the 2008 Blizzard Challenge. In Proc. Blizzard Challenge 2008, Brisbane, Australia, September 2008. [ bib | .pdf ]
For the 2008 Blizzard Challenge, we used the same speaker-adaptive approach to HMM-based speech synthesis that was used in the HTS entry to the 2007 challenge, but an improved system was built in which the multi-accented English average voice model was trained on 41 hours of speech data with high-order mel-cepstral analysis using an efficient forward-backward algorithm for the HSMM. The listener evaluation scores for the synthetic speech generated from this system were much better than in 2007: the system had the equal best naturalness on the small English data set and the equal best intelligibility on both small and large data sets for English, and had the equal best naturalness on the Mandarin data. In fact, the English system was found to be as intelligible as human speech.

[73] Junichi Yamagishi, Hisashi Kawai, and Takao Kobayashi. Phone duration modeling using gradient tree boosting. Speech Communication, 50(5):405-415, May 2008. [ bib | DOI ]
In text-to-speech synthesis systems, phone duration influences the quality and naturalness of synthetic speech. In this study, we incorporate an ensemble learning technique called gradient tree boosting into phone duration modeling as an alternative to the conventional approach using regression trees, and objectively evaluate the prediction accuracy for Japanese, Mandarin, and English phone durations. The gradient tree boosting algorithm is a meta-algorithm over regression trees: it iteratively builds a regression tree from the residuals and outputs a weighted sum of the regression trees. Our evaluation results show that, compared to regression trees or other techniques related to them, the gradient tree boosting algorithm can substantially and robustly improve the predictive accuracy of phone duration regardless of language, speaker, or domain.
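
A minimal sketch of the gradient tree boosting loop for the squared-error case described above: each new regression tree is fitted to the residuals of the current ensemble and added with a shrinkage weight, so the final predictor is a weighted sum of trees. The features and duration targets here are synthetic placeholders, not the paper's corpora:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))                  # stand-ins for context features
    y = 40 * X[:, 0] + 20 * np.sin(X[:, 1]) + rng.normal(scale=5, size=500)  # ms

    def boost(X, y, n_trees=50, lr=0.1, depth=3):
        pred = np.full_like(y, y.mean())           # initial constant model
        trees = []
        for _ in range(n_trees):
            tree = DecisionTreeRegressor(max_depth=depth)
            tree.fit(X, y - pred)                  # fit the current residuals
            pred = pred + lr * tree.predict(X)     # add tree with shrinkage weight
            trees.append(tree)
        return trees, pred

    trees, pred = boost(X, y)
    print("training RMSE (ms):", np.sqrt(np.mean((y - pred) ** 2)))

The shrinkage weight lr trades off training-set fit against robustness, which is in line with the paper's finding that boosting degrades more gracefully than a single regression tree.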

[74] Junichi Yamagishi, Takashi Nose, Heiga Zen, Tomoki Toda, and Keiichi Tokuda. Performance evaluation of the speaker-independent HMM-based speech synthesis system "HTS-2007" for the Blizzard Challenge 2007. In Proc. ICASSP 2008, pages 3957-3960, Las Vegas, U.S.A, April 2008. [ bib | DOI ]
This paper describes a speaker-independent/adaptive HMM-based speech synthesis system developed for the Blizzard Challenge 2007. The new system, named HTS-2007, employs speaker adaptation (CSMAPLR+MAP), feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in our previous systems. Subjective evaluation results show that the new system generates significantly better quality synthetic speech than that of speaker-dependent approaches with realistic amounts of speech data, and that it bears comparison with speaker-dependent approaches even when large amounts of speech data are available.

[75] Junichi Yamagishi, Takao Kobayashi, Yuji Nakano, Katsumi Ogata, and Juri Isogai. Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Transactions on Audio, Speech and Language Processing, 2008. In press. [ bib ]
In this paper we analyze the effects of several factors and configuration choices encountered during training and model construction when we want to obtain better and more stable adaptation in HMM-based speech synthesis. We then propose a new adaptation algorithm called constrained structural maximum a posteriori linear regression (CSMAPLR), whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms. Here we investigate several major aspects of speaker adaptation: initial models, transform functions, estimation criteria, and the sensitivity of several linear regression adaptation algorithms. Analyzing the effect of the initial model, we compare speaker-dependent models, gender-independent models, and the simultaneous use of the gender-dependent models with the single use of the gender-dependent models. Analyzing the effect of the transform functions, we compare the transform function for only mean vectors with that for mean vectors and covariance matrices. Analyzing the effect of the estimation criteria, we compare the ML criterion with a robust estimation criterion called structural MAP. We evaluate the sensitivity of several thresholds for the piecewise linear regression algorithms and take up methods combining MAP adaptation with the linear regression algorithms. We incorporate these adaptation algorithms into our speech synthesis system and present several subjective and objective evaluation results showing the utility and effectiveness of these algorithms in speaker adaptation for HMM-based speech synthesis.
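
The combination of MAP adaptation with linear regression mentioned above can be illustrated schematically. In the toy sketch below (notation and weights are assumptions, not the published algorithm), the linearly transformed mean is interpolated with the target speaker's data statistics, weighted by the state occupancy gamma and a prior weight tau, so states with little adaptation data stay close to the transformed prior:

    import numpy as np

    def map_smooth(transformed_mean, data_mean, gamma, tau=50.0):
        # MAP-style interpolation between the transformed prior mean and the
        # target-speaker statistics, weighted by occupancy gamma and prior tau
        return (tau * transformed_mean + gamma * data_mean) / (tau + gamma)

    mu_lr = np.array([1.0, -0.5])     # state mean after a linear-regression transform
    x_bar = np.array([1.3, -0.2])     # target-speaker sample mean for this state
    for gamma in (5.0, 50.0, 500.0):  # scarce vs. plentiful adaptation data
        print(gamma, map_smooth(mu_lr, x_bar, gamma))

As gamma grows, the adapted mean moves from the regression-transformed prior towards the target speaker's own statistics, which is the stabilising behaviour the analysis in the paper motivates.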

[76] Takashi Nose, Junichi Yamagishi, and Takao Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Information and Systems, E90-D(9):1406-1413, September 2007. [ bib | http ]
This paper describes a technique for controlling the degree of expressivity of a desired emotional expression and/or speaking style of synthesized speech in an HMM-based speech synthesis framework. With this technique, multiple emotional expressions and speaking styles of speech are modeled in a single model by using a multiple-regression hidden semi-Markov model (MRHSMM). A set of control parameters, called the style vector, is defined, and each speech synthesis unit is modeled by using the MRHSMM, in which mean parameters of the state output and duration distributions are expressed by multiple regression of the style vector. In the synthesis stage, the mean parameters of the synthesis units are modified by transforming an arbitrarily given style vector that corresponds to a point in a low-dimensional space, called style space, each of whose coordinates represents a certain specific speaking style or emotion of speech. The results of subjective evaluation tests show that style and its intensity can be controlled by changing the style vector.
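
A small sketch of the multiple-regression idea, with invented dimensions and style labels (none of these values come from the paper): each state mean is a linear function of the style vector plus a bias term, so scaling a coordinate of the vector strengthens or weakens the corresponding style.

    import numpy as np

    rng = np.random.default_rng(1)
    D, S = 5, 2                      # mean dimension, number of style axes
    H = rng.normal(size=(D, S + 1))  # per-state regression matrix (incl. bias)

    def state_mean(style_vector):
        # mean = H [style; 1], i.e. multiple regression on the style vector
        return H @ np.append(style_vector, 1.0)

    print(state_mean(np.array([0.0, 0.0])))   # neutral style
    print(state_mean(np.array([1.0, 0.0])))   # one style at normal strength
    print(state_mean(np.array([2.0, 0.0])))   # same style, exaggerated intensity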

[77] Junichi Yamagishi, Takao Kobayashi, Steve Renals, Simon King, Heiga Zen, Tomoki Toda, and Keiichi Tokuda. Improved average-voice-based speech synthesis using gender-mixed modeling and a parameter generation algorithm considering GV. In Proc. 6th ISCA Workshop on Speech Synthesis (SSW-6), August 2007. [ bib | .pdf ]
For constructing a speech synthesis system which can achieve diverse voices, we have been developing a speaker-independent approach to HMM-based speech synthesis in which statistical average voice models are adapted to a target speaker using a small amount of speech data. In this paper, we incorporate the high-quality speech vocoding method STRAIGHT and a parameter generation algorithm with global variance into the system to improve the quality of the synthetic speech. Furthermore, we introduce a feature-space speaker adaptive training algorithm and a gender-mixed modeling technique for conducting further normalization of the average voice model. We build an English text-to-speech system using these techniques and show the performance of the system.
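
The global variance (GV) component can be illustrated with a deliberately crude stand-in. The published algorithm maximises a likelihood that includes a global-variance term; the sketch below merely rescales an over-smoothed trajectory so that its per-utterance variance matches a target GV, which conveys the intended effect without the actual optimisation:

    import numpy as np

    def match_global_variance(traj, target_var):
        m = traj.mean(axis=0)
        scale = np.sqrt(target_var / traj.var(axis=0))
        return m + (traj - m) * scale

    # an "over-smoothed" toy trajectory with too little variance
    traj = np.random.default_rng(2).normal(scale=0.5, size=(200, 1))
    fixed = match_global_variance(traj, target_var=np.array([1.0]))
    print(traj.var(axis=0), fixed.var(axis=0))   # variance now matches the target

Restoring the utterance-level variance counteracts the muffled quality of over-smoothed ML-generated parameters, which is the motivation for including GV in the generation criterion.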

[78] Toshio Hirai, Junichi Yamagishi, and Seiichi Tenpaku. Utilization of an HMM-based feature generation module in 5 ms segment concatenative speech synthesis. In Proc. 6th ISCA Workshop on Speech Synthesis (SSW-6), August 2007. [ bib ]
If a concatenative speech synthesis system uses shorter speech segments, it has greater potential to generate natural speech, because the concatenation variation becomes greater. Recently, a synthesis approach was proposed in which very short (5 ms) segments are used. This paper describes the implementation of an HMM-based feature generation module, which has the advantage of modularity, in such a very-short-segment concatenative synthesis system, and reports a synthesis experiment.

[79] Heiga Zen, Takashi Nose, Junichi Yamagishi, Shinji Sako, Takashi Masuko, Alan Black, and Keiichi Tokuda. The HMM-based speech synthesis system (HTS) version 2.0. In Proc. 6th ISCA Workshop on Speech Synthesis (SSW-6), August 2007. [ bib ]
A statistical parametric speech synthesis system based on hidden Markov models (HMMs) has grown in popularity over the last few years. This system simultaneously models spectrum, excitation, and duration of speech using context-dependent HMMs and generates speech waveforms from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named HMM-based speech synthesis system (HTS) to provide a research and development platform for the speech synthesis community. In December 2006, HTS version 2.0 was released. This version includes a number of new features which are useful for both speech synthesis researchers and developers. This paper describes HTS version 2.0 in detail, as well as future release plans.

[80] Makoto Tachibana, Keigo Kawashima, Junichi Yamagishi, and Takao Kobayashi. Performance evaluation of HMM-based style classification with a small amount of training data. In Proc. Interspeech 2007, August 2007. [ bib ]
This paper describes a classification technique for emotional expressions and speaking styles of speech using only a small amount of training data from a target speaker. We model spectral and fundamental frequency (F0) features simultaneously using a multi-space probability distribution HMM (MSD-HMM), and adapt a speaker-independent neutral-style model to a given target speaker's style model using a small amount of data with MSD-MLLR, an extension of MLLR to MSD-HMMs. We perform classification experiments on professional narrators' speech and non-professional speakers' speech, and evaluate the performance of the proposed technique by comparing it with other commonly used classifiers. We show that the proposed technique gives better results than the other classifiers when using only a few sentences of the target speaker's style data.

[81] Junichi Yamagishi, Heiga Zen, Tomoki Toda, and Keiichi Tokuda. Speaker-independent HMM-based speech synthesis system - HTS-2007 system for the Blizzard Challenge 2007. In Proc. Blizzard Challenge 2007, August 2007. [ bib | .pdf ]
This paper describes an HMM-based speech synthesis system developed by the HTS working group for the Blizzard Challenge 2007. To further explore the potential of HMM-based speech synthesis, we incorporate new features in our conventional system which underpin a speaker-independent approach: speaker adaptation techniques; adaptive training for HSMMs; and full covariance modeling using the CSMAPLR transforms.

[82] K. Richmond, V. Strom, R. Clark, J. Yamagishi, and S. Fitt. Festival Multisyn voices for the 2007 Blizzard Challenge. In Proc. Blizzard Challenge Workshop (in Proc. SSW6), Bonn, Germany, August 2007. [ bib | .pdf ]
This paper describes selected aspects of the Festival Multisyn entry to the Blizzard Challenge 2007. We provide an overview of the process of building the three required voices from the speech data provided. This paper focuses on new features of Multisyn which are currently under development and which have been employed in the system used for this Blizzard Challenge. These differences are the application of a more flexible phonetic lattice representation during forced alignment labelling and the use of a pitch accent target cost component. Finally, we also examine aspects of the speech data provided for this year's Blizzard Challenge and raise certain issues for discussion concerning the aim of comparing voices made with differing subsets of the data provided.

[83] Junichi Yamagishi and Takao Kobayashi. Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. IEICE Trans. Information and Systems, E90-D(2):533-543, February 2007. [ bib ]
In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. For simultaneous adaptation of spectrum, F0 and phone duration within the HMM framework, we need to transform not only the state output distributions corresponding to spectrum and F0 but also the duration distributions corresponding to phone duration. However, it is not straightforward to adapt the state duration because the original HMM does not have explicit duration distributions. Therefore, we utilize the framework of the hidden semi-Markov model (HSMM), which is an HMM having explicit state duration distributions, and we apply an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system, and show their effectiveness from the results of subjective and objective evaluation tests.
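
A schematic sketch of the simultaneous transformation described above (not the published HSMM algorithm itself): one linear transform adapts a state output mean and another adapts the explicit state duration mean, so spectrum, F0 and phone duration statistics are converted together. All matrices and values below are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(3)
    W_out = np.eye(4) * 1.05                     # transform for output means
    b_out = rng.normal(scale=0.1, size=4)
    w_dur, b_dur = 1.2, 0.5                      # transform for duration means

    def adapt_state(out_mean, dur_mean):
        return W_out @ out_mean + b_out, w_dur * dur_mean + b_dur

    out_mean = rng.normal(size=4)                # average-voice output mean
    dur_mean = 5.0                               # average-voice duration (frames)
    print(adapt_state(out_mean, dur_mean))

The key point of the HSMM framework is that the duration distribution is explicit, so it can be given its own transform; a plain HMM has no such distribution to adapt.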

[84] J. Yamagishi, T. Kobayashi, M. Tachibana, K. Ogata, and Y. Nakano. Model adaptation approach to speech synthesis with diverse voices and styles. In Proc. ICASSP, pages 1233-1236, 2007. [ bib ]
In human computer interaction and dialogue systems, it is often desirable for text-to-speech synthesis to be able to generate natural-sounding speech with an arbitrary speaker's voice and with varying speaking styles and/or emotional expressions. We have developed an average-voice-based speech synthesis method using statistical average voice models and model adaptation techniques for this purpose. In this paper, we describe an overview of the speech synthesis system and show the current performance with several experimental results.

[85] Gregor Hofer, Hiroshi Shimodaira, and Junichi Yamagishi. Speech-driven head motion synthesis based on a trajectory model. Poster at Siggraph 2007, 2007. [ bib | .pdf ]
[86] Gregor Hofer, Hiroshi Shimodaira, and Junichi Yamagishi. Lip motion synthesis using a context dependent trajectory hidden Markov model. Poster at SCA 2007, 2007. [ bib | .pdf ]
[87] J. Cabral, S. Renals, K. Richmond, and J. Yamagishi. Towards an improved modeling of the glottal source in statistical parametric speech synthesis. In Proc. 6th ISCA Workshop on Speech Synthesis (SSW-6), Bonn, Germany, 2007. [ bib | .pdf ]
This paper proposes the use of the Liljencrants-Fant model (LF-model) to represent the glottal source signal in HMM-based speech synthesis systems. These systems generally use a pulse train to model the periodicity of the excitation signal of voiced speech. However, this model produces a strong and uniform harmonic structure throughout the spectrum of the excitation, which makes the synthetic speech sound buzzy. The use of a mixed-band excitation and phase manipulation reduces this effect, but it can result in degradation of the speech quality if the noise component is not weighted carefully. In turn, the LF-waveform has a decaying spectrum at higher frequencies, which is more similar to the real glottal source excitation signal. We conducted a perceptual experiment to test the hypothesis that the LF-model can perform as well as or better than the pulse train in an HMM-based speech synthesizer. In the synthesis, we used the mean values of the LF-parameters, calculated from measurements of the recorded speech. The result of this study is important not only for the potential improvement in speech quality of this type of system, but also because the LF-model can be used to model many characteristics of the glottal source, such as voice quality, which are important for voice transformation and the generation of expressive speech.
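
A simplified rendering of the LF-model waveform may help make the decaying-spectrum argument concrete: the glottal flow derivative is an exponentially growing sinusoid up to the main excitation instant te, followed by an exponential return phase. In the sketch below the open-phase growth rate a is set by hand instead of being solved from the LF area-balance condition, so this is a qualitative illustration only; all parameter values are assumptions.

    import numpy as np

    def lf_waveform(fs=16000, f0=100.0, tp=0.4, te=0.55, ta=0.03, Ee=1.0, a=60.0):
        T0 = 1.0 / f0                    # pitch period; tp, te, ta given relative to T0
        tp, te, ta = tp * T0, te * T0, ta * T0
        t = np.arange(int(fs * T0)) / fs
        wg = np.pi / tp                  # open-phase sinusoid frequency
        # solve eps*ta = 1 - exp(-eps*(T0 - te)) by fixed-point iteration
        eps = 1.0 / ta
        for _ in range(50):
            eps = (1.0 - np.exp(-eps * (T0 - te))) / ta
        E0 = -Ee / (np.exp(a * te) * np.sin(wg * te))   # continuity: e(te) = -Ee
        return np.where(
            t <= te,
            E0 * np.exp(a * t) * np.sin(wg * t),        # open phase
            -(Ee / (eps * ta)) * (np.exp(-eps * (t - te)) - np.exp(-eps * (T0 - te))),
        )

    pulse = lf_waveform()
    print(pulse.min(), pulse.max())

The smooth return phase is what gives the LF excitation its spectral roll-off at high frequencies, in contrast to the flat spectrum of a delta pulse train.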

[88] Hisashi Kawai, Tomoki Toda, Junichi Yamagishi, Toshio Hirai, Jinfu Ni, Nobuyuki Nishizawa, Minoru Tsuzaki, and Keiichi Tokuda. Ximera: a concatenative speech synthesis system with large scale corpora. IEICE Trans. Information and Systems, J89-D-II(12):2688-2698, December 2006. [ bib ]