The Centre for Speech Technology Research, The university of Edinburgh

Publications by Cassia Valentini-Botinhao

[1] Felipe Espic, Cassia Valentini-Botinhao, and Simon King. Direct modelling of magnitude and phase spectra for statistical parametric speech synthesis. In Proc. Interspeech, Stochohlm, Sweden, August 2017. [ bib | .PDF ]
We propose a simple new representation for the FFT spectrum tailored to statistical parametric speech synthesis. It consists of four feature streams that describe magnitude, phase and fundamental frequency using real numbers. The proposed feature extraction method does not attempt to decompose the speech structure (e.g., into source+filter or harmonics+noise). By avoiding the simplifications inherent in decomposition, we can dramatically reduce the “phasiness” and “buzziness” typical of most vocoders. The method uses simple and computationally cheap operations and can operate at a lower frame rate than the 200 frames-per-second typical in many systems. It avoids heuristics and methods requiring approximate or iterative solutions, including phase unwrapping. Two DNN-based acoustic models were built - from male and female speech data - using the Merlin toolkit. Subjective comparisons were made with a state-of-the-art baseline, using the STRAIGHT vocoder. In all variants tested, and for both male and female voices, the proposed method substantially outperformed the baseline. We provide source code to enable our complete system to be replicated.

[2] Michael Pucher, Bettina Zillinger, Markus Toman, Dietmar Schabus, Cassia Valentini-Botinhao, Junichi Yamagishi, Erich Schmid, and Thomas Woltron. Influence of speaker familiarity on blind and visually impaired children and young adults perception of synthetic voices. Computer Speech and Language, 46:179-195, June 2017. [ bib | DOI ]
In this paper we evaluate how speaker familiarity influences the engagement times and performance of blind children and young adults when playing audio games made with different synthetic voices. We also show how speaker familiarity influences speaker and synthetic speech recognition. For the first experiment we develop synthetic voices of school children, their teachers and of speakers that are unfamiliar to them and use each of these voices to create variants of two audio games: a memory game and a labyrinth game. Results show that pupils have significantly longer engagement times and better performance when playing games that use synthetic voices built with their own voices. These findings can be used to improve the design of audio games and lecture books for blind and visually impaired children and young adults. In the second experiment we show that blind children and young adults are better in recognising synthetic voices than their visually impaired companions. We also show that the average familiarity with a speaker and the similarity between a speaker’s synthetic and natural voice are correlated to the speaker’s synthetic voice recognition rate.

[3] Cassia Valentini-Botinhao and Junichi Yamagishi. Speech intelligibility in cars: the effect of speaking style, noise and listener age. In Interspeech, 2017. [ bib | .pdf ]
Intelligibility of speech in noise becomes lower as the listeners age increases, even when no apparent hearing impairment is present. The losses are, however, different depending on the nature of the noise and the characteristics of the voice. In this paper we investigate the effect that age, noise type and speaking style have on the intelligibility of speech reproduced by car loudspeakers. Using a binaural mannequin we recorded a variety of voices and speaking styles played from the audio system of a car while driving in different conditions. We used this material to create a listening test where participants were asked to transcribe what they could hear and recruited groups of young and older adults to take part in it. We found that intelligibility scores of older participants were lower for the competing speaker and background music conditions. Results also indicate that clear and Lombard speech was more intelligible than plain speech for both age groups. A mixed effect model revealed that the largest effect was the noise condition, followed by sentence type, speaking style, voice, age group and pure tone average.

[4] Jaime Lorenzo-Trueba, Cassia Valentini-Botinhao, Gustav Henter, and Junichi Yamagishi. Misperceptions of the emotional content of natural and vocoded speech in a car. In Interspeech, 2017. [ bib | .pdf ]
This paper analyzes a) how often listeners interpret the emotional content of an utterance incorrectly when listening to vocoded or natural speech in adverse conditions; b) which noise conditions cause the most misperceptions; and c) which group of listeners misinterpret emotions the most. The long-term goal is to construct new emotional speech synthesizers that adapt to the environment and to the listener. We performed a large-scale listening test where over 400 listeners between the ages of 21 and 72 assessed natural and vocoded acted emotional speech stimuli. The stimuli had been artificially degraded using a room impulse response recorded in a car and various in-car noise types recorded in a real car. Experimental results show that the recognition rates for emotions and perceived emotional strength degrade as signal-to-noise ratio decreases. Interestingly, misperceptions seem to be more pronounced for negative and lowarousal emotions such as calmness or anger, while positive emotions such as happiness appear to be more robust to noise. An ANOVA analysis of listener meta-data further revealed that gender and age also influenced results, with elderly male listeners most likely to incorrectly identify emotions.

[5] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi. Speech enhancement for a noise-robust text-to-speech synthesis system using deep recurrent neural networks. In Interspeech, pages 352-356. ISCA, September 2016. [ bib | DOI | .pdf ]
Quality of text-to-speech voices built from noisy recordings is diminished. In order to improve it we propose the use of a recurrent neural network to enhance acoustic parameters prior to training. We trained a deep recurrent neural network using a parallel database of noisy and clean acoustics parameters as input and output of the network. The database consisted of multiple speakers and diverse noise conditions. We investigated using text-derived features as an additional input of the network. We processed a noisy database of two other speakers using this network and used its output to train an HMM acoustic text-to-synthesis model for each voice. Listening experiment results showed that the voice built with enhanced parameters was ranked significantly higher than the ones trained with noisy speech and speech that has been enhanced using a conventional enhancement system. The text-derived features improved results only for the female voice, where it was ranked as highly as a voice trained with clean speech.

[6] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. In Proceedings of 9th ISCA Speech Synthesis Workshop, pages 159-165, September 2016. [ bib | .pdf ]
The quality of text-to-speech (TTS) voices built from noisy speech is compromised. Enhancing the speech data before training has been shown to improve quality but voices built with clean speech are still preferred. In this paper we investigate two different approaches for speech enhancement to train TTS systems. In both approaches we train a recursive neural network (RNN) to map acoustic features extracted from noisy speech to features describing clean speech. The enhanced data is then used to train the TTS acoustic model. In one approach we use the features conventionally employed to train TTS acoustic models, i.e Mel cepstral (MCEP) coefficients, aperiodicity values and fundamental frequency (F0). In the other approach, following conventional speech enhancement methods, we train an RNN using only the MCEP coefficients extracted from the magnitude spectrum. The enhanced MCEP features and the phase extracted from noisy speech are combined to reconstruct the waveform which is then used to extract acoustic features to train the TTS system. We show that the second approach results in larger MCEP distortion but smaller F0 errors. Subjective evaluation shows that synthetic voices trained with data enhanced with this method were rated higher and with similar to scores to voices trained with clean speech.

[7] Felipe Espic, Cassia Valentini-Botinhao, Zhizheng Wu, and Simon King. Waveform generation based on signal reshaping for statistical parametric speech synthesis. In Proc. Interspeech, pages 2263-2267, San Francisco, CA, USA, September 2016. [ bib | .PDF ]
We propose a new paradigm of waveform generation for Statistical Parametric Speech Synthesis that is based on neither source-filter separation nor sinusoidal modelling. We suggest that one of the main problems of current vocoding techniques is that they perform an extreme decomposition of the speech signal into source and filter, which is an underlying cause of “buzziness”, “musical artifacts”, or “muffled sound” in the synthetic speech. The proposed method avoids making unnecessary assumptions and decompositions as far as possible, and uses only the spectral envelope and F0 as parameters. Prerecorded speech is used as a base signal, which is “reshaped” to match the acoustic specification predicted by the statistical model, without any source-filter decomposition. A detailed description of the method is presented, including implementation details and adjustments. Subjective listening test evaluations of complete DNN-based text-to-speech systems were conducted for two voices: one female and one male. The results show that the proposed method tends to outperform the state-of-theart standard vocoder STRAIGHT, whilst using fewer acoustic parameters.

[8] Rasmus Dall, Sandrine Brognaux, Korin Richmond, Cassia Valentini-Botinhao, Gustav Eje Henter, Julia Hirschberg, and Junichi Yamagishi. Testing the consistency assumption: pronunciation variant forced alignment in read and spontaneous speech synthesis. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5155-5159, March 2016. [ bib | .pdf ]
Forced alignment for speech synthesis traditionally aligns a phoneme sequence predetermined by the front-end text processing system. This sequence is not altered during alignment, i.e., it is forced, despite possibly being faulty. The consistency assumption is the assumption that these mistakes do not degrade models, as long as the mistakes are consistent across training and synthesis. We present evidence that in the alignment of both standard read prompts and spontaneous speech this phoneme sequence is often wrong, and that this is likely to have a negative impact on acoustic models. A lattice-based forced alignment system allowing for pronunciation variation is implemented, resulting in improved phoneme identity accuracy for both types of speech. A perceptual evaluation of HMM-based voices showed that spontaneous models trained on this improved alignment also improved standard synthesis, despite breaking the consistency assumption.

Keywords: speech synthesis, TTS, forced alignment, HMM
[9] Yan Tang, Martin Cooke, and Cassia Valentini-Botinhao. Evaluating the predictions of objective intelligibility metrics for modified and synthetic speech. Computer Speech & Language, 35:73 - 92, 2016. [ bib | DOI ]
Several modification algorithms that alter natural or synthetic speech with the goal of improving intelligibility in noise have been proposed recently. A key requirement of many modification techniques is the ability to predict intelligibility, both offline during algorithm development, and online, in order to determine the optimal modification for the current noise context. While existing objective intelligibility metrics (OIMs) have good predictive power for unmodified natural speech in stationary and fluctuating noise, little is known about their effectiveness for other forms of speech. The current study evaluated how well seven OIMs predict listener responses in three large datasets of modified and synthetic speech which together represent 396 combinations of speech modification, masker type and signal-to-noise ratio. The chief finding is a clear reduction in predictive power for most OIMs when faced with modified and synthetic speech. Modifications introducing durational changes are particularly harmful to intelligibility predictors. OIMs that measure masked audibility tend to over-estimate intelligibility in the presence of fluctuating maskers relative to stationary maskers, while OIMs that estimate the distortion caused by the masker to a clean speech prototype exhibit the reverse pattern.

[10] Adriana Stan, Cassia Valentini-Botinhao, Bogdan Orza, and Mircea Giurgiu. Blind speech segmentation using spectrogram image-based features and mel cepstral coefficients. In SLT, pages 597-602. IEEE, 2016. [ bib | DOI | .pdf ]
This paper introduces a novel method for blind speech segmentation at a phone level based on image processing. We consider the spectrogram of the waveform of an utterance as an image and hypothesize that its striping defects, i.e. discontinuities, appear due to phone boundaries. Using a simple image destriping algorithm these discontinuities are found. To discover phone transitions which are not as salient in the image, we compute spectral changes derived from the time evolution of Mel cepstral parametrisation of speech. These so called image-based and acoustic features are then combined to form a mixed probability function, whose values indicate the likelihood of a phone boundary being located at the corresponding time frame. The method is completely unsupervised and achieves an accuracy of 75.59% at a -3.26% over segmentation rate, yielding an F-measure of 0.76 and an 0.80 R-value on the TIMIT dataset.

[11] Cassia Valentini-Botinhao, Markus Toman, Michael Pucher, Dietmar Schabus, and Junichi Yamagishi. Intelligibility of time-compressed synthetic speech: Compression method and speaking style. Speech Communication, October 2015. [ bib | DOI ]
We present a series of intelligibility experiments performed on natural and synthetic speech time-compressed at a range of rates and analyze the effect of speech corpus and compression method on the intelligibility scores of sighted and blind individuals. Particularly we are interested in comparing linear and non-linear compression methods applied to normal and fast speech of different speakers. We recorded English and German language voice talents reading prompts at a normal and a fast rate. To create synthetic voices we trained a statistical parametric speech synthesis system based on the normal and the fast data of each speaker. We compared three compression methods: scaling the variance of the state duration model, interpolating the duration models of the fast and the normal voices, and applying a linear compression method to the generated speech waveform. Word recognition results for the English voices show that generating speech at a normal speaking rate and then applying linear compression resulted in the most intelligible speech at all tested rates. A similar result was found when evaluating the intelligibility of the natural speech corpus. For the German voices, interpolation was found to be better at moderate speaking rates but the linear method was again more successful at very high rates, particularly when applied to the fast data. Phonemic level annotation of the normal and fast databases showed that the German speaker was able to reproduce speech at a fast rate with fewer deletion and substitution errors compared to the English speaker, supporting the intelligibility benefits observed when compressing his fast speech. This shows that the use of fast speech data to create faster synthetic voices does not necessarily lead to more intelligible voices as results are highly dependent on how successful the speaker was at speaking fast while maintaining intelligibility. Linear compression applied to normal rate speech can more reliably provide higher intelligibility, particularly at ultra fast rates.

[12] C. Valentini-Botinhao, Z. Wu, and S. King. Towards minimum perceptual error training for DNN-based speech synthesis. In Proc. Interspeech, Dresden, Germany, September 2015. [ bib | .pdf ]
We propose to use a perceptually-oriented domain to improve the quality of text-to-speech generated by deep neural networks (DNNs). We train a DNN that predicts the parameters required for speech reconstruction but whose cost function is calculated in another domain. In this paper, to represent this perceptual domain we extract an approximated version of the Spectro-Temporal Excitation Pattern that was originally proposed as part of a model of hearing speech in noise. We train DNNs that predict band aperiodicity, fundamental frequency and Mel cepstral coefficients and compare generated speech when the spectral cost function is defined in the Mel cepstral, warped log spectrum or perceptual domains. Objective results indicate that the perceptual domain system achieves the highest quality.

[13] M. Pucher, M. Toman, D. Schabus, C. Valentini-Botinhao, J. Yamagishi, B. Zillinger, and E Schmid. Influence of speaker familiarity on blind and visually impaired children's perception of synthetic voices in audio games. In Proc. Interspeech, Dresden, Germany, September 2015. [ bib | .pdf ]
In this paper we evaluate how speaker familiarity influences the engagement times and performance of blind school children when playing audio games made with different synthetic voices. We developed synthetic voices of school children, their teachers and of speakers that were unfamiliar to them and used each of these voices to create variants of two audio games: a memory game and a labyrinth game. Results show that pupils had significantly longer engagement times and better performance when playing games that used synthetic voices built with their own voices. This result was observed even though the children reported not recognising the synthetic voice as their own after the experiment was over. These findings could be used to improve the design of audio games and lecture books for blind and visually impaired children.

[14] Mirjam Wester, Cassia Valentini-Botinhao, and Gustav Eje Henter. Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations. In Proc. Interspeech, pages 3476-3480, Dresden, September 2015. [ bib | .pdf ]
Tallying the numbers of listeners that took part in subjective evaluations of synthetic speech at Interspeech 2014 showed that in more than 60% of papers conclusions are based on listening tests with less than 20 listeners. Our analysis of Blizzard 2013 data shows that for a MOS test measuring naturalness a stable level of significance is only reached when more than 30 listeners are used. In this paper, we set out a list of guidelines, i.e., a checklist for carrying out meaningful subjective evaluations. We further illustrate the importance of sentence coverage and number of listeners by presenting changes to rank order and number of significant pairs by re-analysing data from the Blizzard Challenge 2013.

[15] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King. Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. In Proc. ICASSP, pages 4460-4464, Brisbane, Australia, April 2015. [ bib | .pdf ]
Deep neural networks (DNNs) use a cascade of hidden representations to enable the learning of complex mappings from input to output features. They are able to learn the complex mapping from textbased linguistic features to speech acoustic features, and so perform text-to-speech synthesis. Recent results suggest that DNNs can produce more natural synthetic speech than conventional HMM-based statistical parametric systems. In this paper, we show that the hidden representation used within a DNN can be improved through the use of Multi-Task Learning, and that stacking multiple frames of hidden layer activations (stacked bottleneck features) also leads to improvements. Experimental results confirmed the effectiveness of the proposed methods, and in listening tests we find that stacked bottleneck features in particular offer a significant improvement over both a baseline DNN and a benchmark HMM system.

[16] B. Uria, I. Murray, S. Renals, C. Valentini-Botinhao, and J. Bridle. Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE. In Proc. ICASSP, pages 4465-4469, Brisbane, Australia, April 2015. [ bib | .pdf ]
Given a transcription, sampling from a good model of acoustic feature trajectories should result in plausible realizations of an utterance. However, samples from current probabilistic speech synthesis systems result in low quality synthetic speech. Henter et al. have demonstrated the need to capture the dependencies between acoustic features conditioned on the phonetic labels in order to obtain high quality synthetic speech. These dependencies are often ignored in neural network based acoustic models. We tackle this deficiency by introducing a probabilistic neural network model of acoustic trajectories, trajectory RNADE, able to capture these dependencies.

[17] Ling-Hui Chen, T. Raitio, C. Valentini-Botinhao, Z. Ling, and J. Yamagishi. A deep generative architecture for postfiltering in statistical parametric speech synthesis. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 23(11):2003-2014, 2015. [ bib | DOI ]
The generated speech of hidden Markov model (HMM)-based statistical parametric speech synthesis still sounds muffled. One cause of this degradation in speech quality may be the loss of fine spectral structures. In this paper, we propose to use a deep generative architecture, a deep neural network (DNN) generatively trained, as a postfilter. The network models the conditional probability of the spectrum of natural speech given that of synthetic speech to compensate for such gap between synthetic and natural speech. The proposed probabilistic postfilter is generatively trained by cascading two restricted Boltzmann machines (RBMs) or deep belief networks (DBNs) with one bidirectional associative memory (BAM). We devised two types of DNN postfilters: one operating in the mel-cepstral domain and the other in the higher dimensional spectral domain. We compare these two new data-driven postfilters with other types of postfilters that are currently used in speech synthesis: a fixed mel-cepstral based postfilter, the global variance based parameter generation, and the modulation spectrum-based enhancement. Subjective evaluations using the synthetic voices of a male and female speaker confirmed that the proposed DNN-based postfilter in the spectral domain significantly improved the segmental quality of synthetic speech compared to that with conventional methods.

Keywords: HMM;deep generative architecture;modulation spectrum;postfilter;segmental quality;speech synthesis
[18] Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King. Deep neural network employing multi-task learning and stacked bottleneck features for speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015. [ bib | .pdf ]
[19] Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King. Intelligibility enhancement of speech in noise. In Proceedings of the Institute of Acoustics, volume 36 Pt. 2, pages 96-103, Birmingham, UK, October 2014. [ bib | .pdf ]
To maintain communication success, humans change the way they speak and hear according to many factors, like the age, gender, native language and social relationship between talker and listener. Other factors are dictated by how communication takes place, such as environmental factors like an active competing speaker or limitations on the communication channel. As in natural interaction, we expect to communicate with and use synthetic voices that can also adapt to different listening scenarios and keep the level of intelligibility high. Research in speech technology needs to account for this to change the way we transmit, store and artificially generate speech accordingly.

[20] C. Valentini-Botinhao and M. Wester. Using linguistic predictability and the Lombard effect to increase the intelligibility of synthetic speech in noise. In Proc. Interspeech, pages 2063-2067, Singapore, September 2014. [ bib | .pdf ]
In order to predict which words in a sentence are harder to understand in noise it is necessary to consider not only audibility but also semantic or linguistic information. This paper focuses on using linguistic predictability to inform an intelligibility enhancement method that uses Lombard-adapted synthetic speech to modify low predictable words in Speech Perception in Noise (SPIN) test sentences. Word intelligibility in the presence of speech-shaped noise was measured using plain, Lombard and a combination of the two synthetic voices. The findings show that the Lombard voice increases intelligibility in noise but the intelligibility gap between words in a high and low predictable context still remains. Using a Lombard voice when a word is unpredictable is a good strategy, but if a word is predictable from its context the Lombard benefit only occurs when other words in the sentence are also modified.

[21] L.-H. Chen, T. Raitio, C. Valentini-Botinhao, J. Yamagishi, and Z.-H. Ling. DNN-Based Stochastic Postfilter for HMM-Based Speech Synthesis. In Proc. Interspeech, pages 1954-1958, Singapore, September 2014. [ bib | .pdf ]
In this paper we propose a deep neural network to model the conditional probability of the spectral differences between natural and synthetic speech. This allows us to reconstruct the spectral fine structures in speech generated by HMMs. We compared the new stochastic data-driven postfilter with global variance based parameter generation and modulation spectrum enhancement. Our results confirm that the proposed method significantly improves the segmental quality of synthetic speech compared to the conventional methods.

[22] C. Valentini-Botinhao, M. Toman, M. Pucher, D. Schabus, and J. Yamagishi. Intelligibility Analysis of Fast Synthesized Speech. In Proc. Interspeech, pages 2922-2926, Singapore, September 2014. [ bib | .pdf ]
In this paper we analyse the effect of speech corpus and compression method on the intelligibility of synthesized speech at fast rates. We recorded English and German language voice talents at a normal and a fast speaking rate and trained an HSMM-based synthesis system based on the normal and the fast data of each speaker. We compared three compression methods: scaling the variance of the state duration model, interpolating the duration models of the fast and the normal voices, and applying a linear compression method to generated speech. Word recognition results for the English voices show that generating speech at normal speaking rate and then applying linear compression resulted in the most intelligible speech at all tested rates. A similar result was found when evaluating the intelligibility of the natural speech corpus. For the German voices, interpolation was found to be better at moderate speaking rates but the linear method was again more successful at very high rates, for both blind and sighted participants. These results indicate that using fast speech data does not necessarily create more intelligible voices and that linear compression can more reliably provide higher intelligibility, particularly at higher rates.

[23] C. Valentini-Botinhao, J. Yamagishi, S. King, and R. Maia. Intelligibility enhancement of HMM-generated speech in additive noise by modifying mel cepstral coefficients to increase the glimpse proportion. Computer Speech and Language, 28(2):665-686, 2014. [ bib | DOI | .pdf ]
This paper describes speech intelligibility enhancement for hidden Markov model (HMM) generated synthetic speech in noise. We present a method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech. We update these coefficients such that the Glimpse Proportion – an objective measure of the intelligibility of speech in noise – increases, while keeping the speech energy fixed. An acoustic analysis reveals that the modified speech is boosted in the region 1-4kHz, particularly for vowels, nasals and approximants. Results from listening tests employing speech-shaped noise show that the modified speech is as intelligible as a synthetic voice trained on plain speech whose duration, Mel cepstral coefficients and excitation signal parameters have been adapted to Lombard speech from the same speaker. Our proposed method does not require these additional recordings of Lombard speech. In the presence of a competing talker, both modification and adaptation of spectral coefficients give more modest gains.

[24] C. Valentini-Botinhao, J. Yamagishi, S. King, and Y. Stylianou. Combining perceptually-motivated spectral shaping with loudness and duration modification for intelligibility enhancement of HMM-based synthetic speech in noise. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]
[25] M. Cooke, C. Mayo, and C. Valentini-Botinhao. Intelligibility-enhancing speech modifications: the Hurricane Challenge. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]
[26] Cassia Valentini-Botinhao, Mirjam Wester, Junichi Yamagishi, and Simon King. Using neighbourhood density and selective SNR boosting to increase the intelligibility of synthetic speech in noise. In 8th ISCA Workshop on Speech Synthesis, pages 133-138, Barcelona, Spain, August 2013. [ bib | .pdf ]
Motivated by the fact that words are not equally confusable, we explore the idea of using word-level intelligibility predictions to selectively boost the harder-to-understand words in a sentence, aiming to improve overall intelligibility in the presence of noise. First, the intelligibility of a set of words from dense and sparse phonetic neighbourhoods was evaluated in isolation. The resulting intelligibility scores were used to inform two sentencelevel experiments. In the first experiment the signal-to-noise ratio of one word was boosted to the detriment of another word. Sentence intelligibility did not generally improve. The intelligibility of words in isolation and in a sentence were found to be significantly different, both in clean and in noisy conditions. For the second experiment, one word was selectively boosted while slightly attenuating all other words in the sentence. This strategy was successful for words that were poorly recognised in that particular context. However, a reliable predictor of word-in-context intelligibility remains elusive, since this involves – as our results indicate – semantic, syntactic and acoustic information about the word and the sentence.

[27] C. Valentini-Botinhao, E. Godoy, Y. Stylianou, B. Sauert, S. King, and J. Yamagishi. Improving intelligibility in noise of HMM-generated speech via noise-dependent and -independent methods. In Proc. ICASSP, Vancouver, Canada, May 2013. [ bib | .pdf ]
[28] Cassia Valentini-Botinhao. Intelligibility enhancement of synthetic speech in noise. PhD thesis, University of Edinburgh, 2013. [ bib | .pdf ]
Speech technology can facilitate human-machine interaction and create new communication interfaces. Text-To-Speech (TTS) systems provide speech output for dialogue, notification and reading applications as well as personalized voices for people that have lost the use of their own. TTS systems are built to produce synthetic voices that should sound as natural, expressive and intelligible as possible and if necessary be similar to a particular speaker. Although naturalness is an important requirement, providing the correct information in adverse conditions can be crucial to certain applications. Speech that adapts or reacts to different listening conditions can in turn be more expressive and natural. In this work we focus on enhancing the intelligibility of TTS voices in additive noise. For that we adopt the statistical parametric paradigm for TTS in the shape of a hidden Markov model (HMM-) based speech synthesis system that allows for flexible enhancement strategies. Little is known about which human speech production mechanisms actually increase intelligibility in noise and how the choice of mechanism relates to noise type, so we approached the problem from another perspective: using mathematical models for hearing speech in noise. To find which models are better at predicting intelligibility of TTS in noise we performed listening evaluations to collect subjective intelligibility scores which we then compared to the models’ predictions. In these evaluations we observed that modifications performed on the spectral envelope of speech can increase intelligibility significantly, particularly if the strength of the modification depends on the noise and its level. We used these findings to inform the decision of which of the models to use when automatically modifying the spectral envelope of the speech according to the noise. We devised two methods, both involving cepstral coefficient modifications. The first was applied during extraction while training the acoustic models and the other when generating a voice using pre-trained TTS models. The latter has the advantage of being able to address fluctuating noise. To increase intelligibility of synthetic speech at generation time we proposed a method for Mel cepstral coefficient modification based on the glimpse proportion measure, the most promising of the models of speech intelligibility that we evaluated. An extensive series of listening experiments demonstrated that this method brings significant intelligibility gains to TTS voices while not requiring additional recordings of clear or Lombard speech. To further improve intelligibility we combined our method with noise-independent enhancement approaches based on the acoustics of highly intelligible speech. This combined solution was as effective for stationary noise as for the challenging competing speaker scenario, obtaining up to 4dB of equivalent intensity gain. Finally, we proposed an extension to the speech enhancement paradigm to account for not only energetic masking of signals but also for linguistic confusability of words in sentences. We found that word level confusability, a challenging value to predict, can be used as an additional prior to increase intelligibility even for simple enhancement methods like energy reallocation between words. These findings motivate further research into solutions that can tackle the effect of energetic masking on the auditory system as well as on higher levels of processing.

[29] Y. Tang, M. Cooke, and C. Valentini-Botinhao. A distortion-weighted glimpse-based intelligibility metric for modified and synthetic speech. In Proc. SPIN, 2013. [ bib | .pdf ]
[30] M. Cooke, C. Mayo, C. Valentini-Botinhao, Y. Stylianou, B. Sauert, and Y. Tang. Evaluating the intelligibility benefit of speech modifications in known noise conditions. Speech Communication, 55:572-585, 2013. [ bib | .pdf ]
The use of live and recorded speech is widespread in applications where correct message reception is important. Furthermore, the deployment of synthetic speech in such applications is growing. Modifications to natural and synthetic speech have therefore been proposed which aim at improving intelligibility in noise. The current study compares the benefits of speech modification algorithms in a large-scale speech intelligibility evaluation and quantifies the equivalent intensity change, defined as the amount in decibels that unmodified speech would need to be adjusted by in order to achieve the same intelligibility as modified speech. Listeners identified keywords in phonetically-balanced sentences representing ten different types of speech: plain and Lombard speech, five types of modified speech, and three forms of synthetic speech. Sentences were masked by either a stationary or a competing speech masker. Modification methods varied in the manner and degree to which they exploited estimates of the masking noise. The best-performing modifications led to equivalent intensity changes of around 5 dB in moderate and high noise levels for the stationary masker, and 3-4 dB in the presence of competing speech. These gains exceed those produced by Lombard speech. Synthetic speech in noise was always less intelligible than plain natural speech, but modified synthetic speech reduced this deficit by a significant amount.

[31] C. Valentini-Botinhao, J. Yamagishi, and S. King. Evaluating speech intelligibility enhancement for HMM-based synthetic speech in noise. In Proc. Sapa Workshop, Portland, USA, September 2012. [ bib | .pdf ]
It is possible to increase the intelligibility of speech in noise by enhancing the clean speech signal. In this paper we demonstrate the effects of modifying the spectral envelope of synthetic speech according to the environmental noise. To achieve this, we modify Mel cepstral coefficients according to an intelligibility measure that accounts for glimpses of speech in noise: the Glimpse Proportion measure. We evaluate this method against a baseline synthetic voice trained only with normal speech and a topline voice trained with Lombard speech, as well as natural speech. The intelligibility of these voices was measured when mixed with speech-shaped noise and with a competing speaker at three different levels. The Lombard voices, both natural and synthetic, were more intelligible than the normal voices in all conditions. For speech-shaped noise, the proposed modified voice was as intelligible as the Lombard synthetic voice without requiring any recordings of Lombard speech, which are hard to obtain. However, in the case of competing talker noise, the Lombard synthetic voice was more intelligible than the proposed modified voice.

[32] C. Valentini-Botinhao, S. Degenkolb-Weyers, A. Maier, E. Noeth, U. Eysholdt, and T. Bocklet. Automatic detection of sigmatism in children. In Proc. WOCCI, Portland, USA, September 2012. [ bib | .pdf ]
We propose in this paper an automatic system to detect sigmatism from the speech signal. Sigmatism occurs when the tongue is positioned incorrectly during articulation of sibilant phones like /s/ and /z/. For our task we extracted various sets of features from speech: Mel frequency cepstral coefficients, energies in specific bandwidths of the spectral envelope, and the so-called supervectors, which are the parameters of an adapted speaker model. We then trained several classifiers on a speech database of German adults simulating three different types of sigmatism. Recognition results were calculated at a phone, word and speaker level for both the simulated database and for a database of pathological speakers. For the simulated database, we achieved recognition rates of up to 86%, 87% and 94% at a phone, word and speaker level. The best classifier was then integrated as part of a Java applet that allows patients to record their own speech, either by pronouncing isolated phones, a specific word or a list of words, and provides them with a feedback whether the sibilant phones are being correctly pronounced.

[33] C. Valentini-Botinhao, J. Yamagishi, and S. King. Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. In Proc. Interspeech, Portland, USA, September 2012. [ bib ]
We propose a method that modifies the Mel cepstral coefficients of HMM-generated synthetic speech in order to increase the intelligibility of the generated speech when heard by a listener in the presence of a known noise. This method is based on an approximation we previously proposed for the Glimpse Proportion measure. Here we show how to update the Mel cepstral coefficients using this measure as an optimization criterion and how to control the amount of distortion by limiting the frequency resolution of the modifications. To evaluate the method we built eight different voices from normal read-text speech data from a male speaker. Some voices were also built from Lombard speech data produced by the same speaker. Listening experiments with speech-shaped noise and with a single competing talker indicate that our method significantly improves intelligibility when compared to unmodified synthetic speech. The voices built from Lombard speech outperformed the proposed method particularly for the competing talker case. However, compared to a voice using only the spectral parameters from Lombard speech, the proposed method obtains similar or higher performance.

[34] C. Valentini-Botinhao, J. Yamagishi, and S. King. Using an intelligibility measure to create noise robust cepstral coefficients for HMM-based speech synthesis. In Proc. LISTA Workshop, Edinburgh, UK, May 2012. [ bib | .pdf ]
[35] C. Valentini-Botinhao, R. Maia, J. Yamagishi, S. King, and H. Zen. Cepstral analysis based on the Glimpse proportion measure for improving the intelligibility of HMM-based synthetic speech in noise. In Proc. ICASSP, pages 3997-4000, Kyoto, Japan, March 2012. [ bib | DOI | .pdf ]
In this paper we introduce a new cepstral coefficient extraction method based on an intelligibility measure for speech in noise, the Glimpse Proportion measure. This new method aims to increase the intelligibility of speech in noise by modifying the clean speech, and has applications in scenarios such as public announcement and car navigation systems. We first explain how the Glimpse Proportion measure operates and further show how we approximated it to integrate it into an existing spectral envelope parameter extraction method commonly used in the HMM-based speech synthesis framework. We then demonstrate how this new method changes the modelled spectrum according to the characteristics of the noise and show results for a listening test with vocoded and HMM-based synthetic speech. The test indicates that the proposed method can significantly improve intelligibility of synthetic speech in speech shaped noise.

[36] Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King. Can objective measures predict the intelligibility of modified HMM-based synthetic speech in noise? In Proc. Interspeech, August 2011. [ bib | .pdf ]
Synthetic speech can be modified to improve intelligibility in noise. In order to perform modifications automatically, it would be useful to have an objective measure that could predict the intelligibility of modified synthetic speech for human listeners. We analysed the impact on intelligibility – and on how well objective measures predict it – when we separately modify speaking rate, fundamental frequency, line spectral pairs and spectral peaks. Shifting LSPs can increase intelligibility for human listeners; other modifications had weaker effects. Among the objective measures we evaluated, the Dau model and the Glimpse proportion were the best predictors of human performance.

[37] Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King. Evaluation of objective measures for intelligibility prediction of HMM-based synthetic speech in noise. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5112-5115, May 2011. [ bib | DOI | .pdf ]
In this paper we evaluate four objective measures of speech with regards to intelligibility prediction of synthesized speech in diverse noisy situations. We evaluated three intelligibility measures, the Dau measure, the glimpse proportion and the Speech Intelligibility Index (SII) and a quality measure, the Perceptual Evaluation of Speech Quality (PESQ). For the generation of synthesized speech we used a state of the art HMM-based speech synthesis system. The noisy conditions comprised four additive noises. The measures were compared with subjective intelligibility scores obtained in listening tests. The results show the Dau and the glimpse measures to be the best predictors of intelligibility, with correlations of around 0.83 to subjective scores. All measures gave less accurate predictions of intelligibility for synthetic speech than have previously been found for natural speech; in particular the SII measure. In additional experiments, we processed the synthesized speech by an ideal binary mask before adding noise. The Glimpse measure gave the most accurate intelligibility predictions in this situation.