The Centre for Speech Technology Research, The University of Edinburgh

Publications by Cassie Mayo

[1] Rosie Kay, Oliver Watts, Roberto Barra-Chicote, and Cassie Mayo. Knowledge versus data in TTS: evaluation of a continuum of synthesis systems. In INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015, pages 3335-3339, 2015. [ bib | .pdf ]
Grapheme-based models have been proposed for both ASR and TTS as a way of circumventing the lack of expert-compiled pronunciation lexicons in under-resourced languages. It is a common observation that this should work well in languages employing orthographies with a transparent letter-to-phoneme relationship, such as Spanish. Our experience has shown, however, that there is still a significant difference in intelligibility between grapheme-based systems and conventional ones for this language. This paper explores the contribution of different levels of linguistic annotation to system intelligibility, and the trade-off between those levels and the quantity of data used for training. Ten systems spaced across these two continua of knowledge and data were subjectively evaluated for intelligibility.

[2] Gustav Eje Henter, Thomas Merritt, Matt Shannon, Catherine Mayo, and Simon King. Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech. In Proc. Interspeech, volume 15, pages 1504-1508, September 2014. [ bib | .pdf ]
Acoustic models used for statistical parametric speech synthesis typically incorporate many modelling assumptions. It is an open question to what extent these assumptions limit the naturalness of synthesised speech. To investigate this question, we recorded a speech corpus where each prompt was read aloud multiple times. By combining speech parameter trajectories extracted from different repetitions, we were able to quantify the perceptual effects of certain commonly used modelling assumptions. Subjective listening tests show that taking the source and filter parameters to be conditionally independent, or using diagonal covariance matrices, significantly limits the naturalness that can be achieved. Our experimental results also demonstrate the shortcomings of mean-based parameter generation.

Keywords: speech synthesis, acoustic modelling, stream independence, diagonal covariance matrices, repeated speech
[3] Mirjam Wester and Cassie Mayo. Accent rating by native and non-native listeners. In Proceedings of ICASSP, pages 7749-7753, Florence, Italy, May 2014. [ bib | .pdf ]
This study investigates the influence of listener native language with respect to talker native language on perception of degree of foreign accent in English. Listeners from native English, Finnish, German and Mandarin backgrounds rated the accentedness of native English, Finnish, German and Mandarin talkers producing a controlled set of English sentences. Results indicate that non-native listeners, like native listeners, are able to classify non-native talkers as foreign-accented, and native talkers as unaccented. However, while non-native talkers received higher accentedness ratings than native talkers from all listener groups, non-native listeners judged talkers with non-native accents less harshly than did native English listeners. Similarly, non-native listeners assigned higher degrees of foreign accent to native English talkers than did native English listeners. It seems that non-native listeners give accentedness ratings that are less extreme, or closer to the centre of the rating scale in both directions, than those used by native listeners.

[4] M. Cooke, C. Mayo, and C. Valentini-Botinhao. Intelligibility-enhancing speech modifications: the Hurricane Challenge. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]
[5] M. Cooke, C. Mayo, C. Valentini-Botinhao, Y. Stylianou, B. Sauert, and Y. Tang. Evaluating the intelligibility benefit of speech modifications in known noise conditions. Speech Communication, 55:572-585, 2013. [ bib | .pdf ]
The use of live and recorded speech is widespread in applications where correct message reception is important. Furthermore, the deployment of synthetic speech in such applications is growing. Modifications to natural and synthetic speech have therefore been proposed which aim at improving intelligibility in noise. The current study compares the benefits of speech modification algorithms in a large-scale speech intelligibility evaluation and quantifies the equivalent intensity change, defined as the amount in decibels that unmodified speech would need to be adjusted by in order to achieve the same intelligibility as modified speech. Listeners identified keywords in phonetically-balanced sentences representing ten different types of speech: plain and Lombard speech, five types of modified speech, and three forms of synthetic speech. Sentences were masked by either a stationary or a competing speech masker. Modification methods varied in the manner and degree to which they exploited estimates of the masking noise. The best-performing modifications led to equivalent intensity changes of around 5 dB in moderate and high noise levels for the stationary masker, and 3-4 dB in the presence of competing speech. These gains exceed those produced by Lombard speech. Synthetic speech in noise was always less intelligible than plain natural speech, but modified synthetic speech reduced this deficit by a significant amount.

[6] Elizabeth Godoy, Catherine Mayo, and Yannis Stylianou. Linking loudness increases in normal and Lombard speech to decreasing vowel formant separation. In Proc. Interspeech, 2013. [ bib | .pdf ]
The increased vocal effort associated with the Lombard reflex produces speech that is perceived as louder and judged to be more intelligible in noise than normal speech. Previous work illustrates that, on average, Lombard increases in loudness result from boosting spectral energy in a frequency band spanning the range of formants F1-F3, particularly for voiced speech. Observing additionally that increases in loudness across spoken sentences are spectro-temporally localized, the goal of this work is to further isolate these regions of maximal loudness by linking them to specific formant trends, explicitly considering here the vowel formant separation. For both normal and Lombard speech, this work illustrates that, as loudness increases in frequency bands containing formants (e.g. F1-F2 or F2-F3), the observed separation between formant frequencies decreases. From a production standpoint, these results seem to highlight a physiological trait associated with how humans increase the loudness of their speech, namely moving vocal tract resonances closer together. Particularly, for Lombard speech, this phenomenon is exaggerated: that is, the Lombard speech is louder and formants in corresponding spectro-temporal regions are even closer together.

[7] Catherine Mayo, Fiona Gibbon, and Robert A. J. Clark. Phonetically trained and untrained adults' transcription of place of articulation for intervocalic lingual stops with intermediate acoustic cues. Journal of Speech, Language and Hearing Research, 56:779-791, 2013. [ bib | DOI ]
Purpose: In this study, the authors aimed to investigate how listener training and the presence of intermediate acoustic cues influence transcription variability for conflicting cue speech stimuli. Method: Twenty listeners with training in transcribing disordered speech, and 26 untrained listeners, were asked to make forced-choice labeling decisions for synthetic vowel–consonant–vowel (VCV) sequences "a doe" and "a go". Both the VC and CV transitions in these stimuli ranged through intermediate positions, from appropriate for /d/ to appropriate for /g/. Results: Both trained and untrained listeners gave more weight to the CV transitions than to the VC transitions. However, listener behavior was not uniform: The results showed a high level of inter- and intratranscriber inconsistency, with untrained listeners showing a nonsignificant tendency to be more influenced than trained listeners by CV transitions. Conclusions: Listeners do not assign consistent categorical labels to the type of intermediate, conflicting transitional cues that were present in the stimuli used in the current study and that are also present in disordered articulations. Although listener inconsistency in assigning labels to intermediate productions is not increased as a result of phonetic training, neither is it reduced by such training.

Keywords: speech perception, intermediate acoustic cues, phonetic transcription, multilevel logistic regression
[8] C. Mayo, V. Aubanel, and M. Cooke. Effect of prosodic changes on speech intelligibility. In Proc. Interspeech, Portland, OR, USA, 2012. [ bib ]
[9] M. Koutsogiannaki, M. Pettinato, C. Mayo, V. Kandia, and Y. Stylianou. Can modified casual speech reach the intelligibility of clear speech? In Proc. Interspeech, Portland, OR, USA, 2012. [ bib ]
[10] V. Aubanel, M. Cooke, E. Foster, M. L. Garcia-Lecumberri, and C. Mayo. Effects of the availability of visual information and presence of competing conversations on speech production. In Proc. Interspeech, Portland, OR, USA, 2012. [ bib ]
[11] C. Mayo, R. A. J. Clark, and S. King. Listeners' weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis. Speech Communication, 53(3):311-326, 2011. [ bib | DOI ]
The quality of current commercial speech synthesis systems is now so high that system improvements are being made at subtle sub- and supra-segmental levels. Human perceptual evaluation of such subtle improvements requires a highly sophisticated level of perceptual attention to specific acoustic characteristics or cues. However, it is not well understood what acoustic cues listeners attend to by default when asked to evaluate synthetic speech. It may, therefore, be potentially quite difficult to design an evaluation method that allows listeners to concentrate on only one dimension of the signal, while ignoring others that are perceptually more important to them. The aim of the current study was to determine which acoustic characteristics of unit-selection synthetic speech are most salient to listeners when evaluating the naturalness of such speech. This study made use of multidimensional scaling techniques to analyse listeners' pairwise comparisons of synthetic speech sentences. Results indicate that listeners place a great deal of perceptual importance on the presence of artifacts and discontinuities in the speech, somewhat less importance on aspects of segmental quality, and very little importance on stress/intonation appropriateness. These relative differences in importance will impact on listeners' ability to attend to these different acoustic characteristics of synthetic speech, and should therefore be taken into account when designing appropriate methods of synthetic speech evaluation.

Keywords: Speech synthesis; Evaluation; Speech perception; Acoustic cue weighting; Multidimensional scaling
[12] Vasilis Karaiskos, Simon King, Robert A. J. Clark, and Catherine Mayo. The Blizzard Challenge 2008. In Proc. Blizzard Challenge Workshop, Brisbane, Australia, September 2008. [ bib | .pdf ]
The Blizzard Challenge 2008 was the fourth annual Blizzard Challenge. This year, participants were asked to build two voices from a UK English corpus and one voice from a Mandarin Chinese corpus. This is the first time that a language other than English has been included and also the first time that a large UK English corpus has been available. In addition, the English corpus contained somewhat more expressive speech than that found in corpora used in previous Blizzard Challenges. To assist participants with limited resources or limited experience in UK-accented English or Mandarin, unaligned labels were provided for both corpora and for the test sentences. Participants could use the provided labels or create their own. An accent-specific pronunciation dictionary was also available for the English speaker. A set of test sentences was released to participants, who were given a limited time in which to synthesise them and submit the synthetic speech. An online listening test was conducted, to evaluate naturalness, intelligibility and degree of similarity to the original speaker.

Keywords: Blizzard
[13] F. Gibbon and C. Mayo. Adults' perception of conflicting acoustic cues associated with EPG-defined undifferentiated gestures. In 4th International EPG Symposium, Edinburgh, UK, 2008. [ bib ]
[14] Robert A. J. Clark, Monika Podsiadlo, Mark Fraser, Catherine Mayo, and Simon King. Statistical analysis of the Blizzard Challenge 2007 listening test results. In Proc. Blizzard 2007 (in Proc. Sixth ISCA Workshop on Speech Synthesis), Bonn, Germany, August 2007. [ bib | .pdf ]
Blizzard 2007 is the third Blizzard Challenge, in which participants build voices from a common dataset. A large listening test is conducted which allows comparison of systems in terms of naturalness and intelligibility. New sections were added to the listening test for 2007 to test the perceived similarity of the speaker's identity between natural and synthetic speech. In this paper, we present the results of the listening test and the subsequent statistical analysis.

Keywords: Blizzard
[15] C. Mayo, R. A. J. Clark, and S. King. Multidimensional scaling of listener responses to synthetic speech. In Proc. Interspeech 2005, Lisbon, Portugal, September 2005. [ bib | .pdf ]
[16] C. Mayo and A. Turk. The influence of spectral distinctiveness on acoustic cue weighting in children's and adults' speech perception. Journal of the Acoustical Society of America, 118:1730-1741, 2005. [ bib | .pdf ]
[17] C. Mayo and A. Turk. No available theories currently explain all adult-child cue weighting differences. In Proc. ISCA Workshop on Plasticity in Speech Perception, London, UK, 2005. [ bib | .pdf ]
[18] C. Mayo and A. Turk. The development of perceptual cue weighting within and across monosyllabic words. In LabPhon 9, University of Illinois at Urbana-Champaign, 2004. [ bib ]
[19] C. Mayo and A. Turk. Adult-child differences in acoustic cue weighting are influenced by segmental context: Children are not always perceptually biased towards transitions. Journal of the Acoustical Society of America, 115:3184-3194, 2004. [ bib | .pdf ]
[20] C. Mayo and A. Turk. Is the development of cue weighting strategies in children's speech perception context-dependent? In XVth International Congress of Phonetic Sciences, Barcelona, 2003. [ bib | .pdf ]
[21] C. Mayo, J. Scobbie, N. Hewlett, and D. Waters. The influence of phonemic awareness development on acoustic cue weighting in children's speech perception. Journal of Speech, Language and Hearing Research, 46:1184-1196, 2003. [ bib | .pdf ]
[22] C. Mayo, A. Turk, and J. Watson. Development of cue weighting strategies in children's speech perception. In Proceedings of TIPS: Temporal Integration in the Perception of Speech, Aix-en-Provence, 2002. [ bib ]
[23] C. Mayo, A. Turk, and J. Watson. Flexibility of acoustic cue weighting in children's speech perception. Journal of the Acoustical Society of America, 109:2313, 2001. [ bib | .pdf ]
[24] C. Mayo. The relationship between phonemic awareness and cue weighting in speech perception: longitudinal and cross-sectional child studies. PhD thesis, Queen Margaret University College, 2000. [ bib | .pdf ]
[25] C. Mayo. Perceptual weighting and phonemic awareness in pre-reading and early-reading children. In XIVth International Congress of Phonetic Sciences, San Francisco, 1999. [ bib | .pdf ]
[26] C. Mayo. The development of phonemic awareness and perceptual weighting in relation to early and later literacy acquisition. In 20th Annual Child Phonology Conference, Bangor, Wales, 1999. [ bib ]
[27] C. Mayo. The developmental relationship between perceptual weighting and phonemic awareness. In LabPhon 6, University of York, UK, 1998. [ bib ]
[28] C. Mayo. A longitudinal study of perceptual weighting and phonemic awareness. In Chicago Linguistics Society 34, 1998. [ bib ]
[29] C. Mayo, M. Aylett, and D. R. Ladd. Prosodic transcription of Glasgow English: an evaluation study of GlaToBI. In Intonation: Theory, Models and Applications, 1997. [ bib | .pdf ]