The Centre for Speech Technology Research, The university of Edinburgh

Publications by Volker Strom

[1] Michael Pucher, Friedrich Neubarth, and Volker Strom. Optimizing phonetic encoding for Viennese unit selection speech synthesis. In A. Esposito et al., editor, COST 2102 Int. Training School 2009, LNCS, Heidelberg, 2010. Springer-Verlag. [ bib | .ps | .pdf ]
While developing lexical resources for a particular language variety (Viennese), we experimented with a set of 5 different phonetic encodings, termed phone sets, used for unit selection speech synthesis. We started with a very rich phone set based on phonological considerations and covering as much phonetic variability as possible, which was then reduced to smaller sets by applying transformation rules that map or merge phone symbols. The optimal trade-off was found measuring the phone error rates of automatically learnt grapheme-to-phone rules and by a perceptual evaluation of 27 representative synthesized sentences. Further, we describe a method to semi-automatically enlarge the lexical resources for the target language variety using a lexicon base for Standard Austrian German.

[2] Michael Pucher, Friedrich Neubarth, Volker Strom, Sylvia Moosmüller, Gregor Hofer, Christian Kranzler, Gudrun Schuchmann, and Dietmar Schabus. Resources for speech synthesis of viennese varieties. In Proc. Int. Conf. on Language Resources and Evaluation, LREC'10, Malta, 2010. European Language Resources Association (ELRA). [ bib | .ps | .pdf ]
This paper describes our work on developing corpora of three varieties of Viennese for unit selection speech synthesis. The synthetic voices for Viennese varieties, implemented with the open domain unit selection speech synthesis engine Multisyn of Festival will also be released within Festival. The paper especially focuses on two questions: how we selected the appropriate speakers and how we obtained the text sources needed for the recording of these non-standard varieties. Regarding the first one, it turned out that working with a ‘prototypical’ professional speaker was much more preferable than striving for authenticity. In addition, we give a brief outline about the differences between the Austrian standard and its dialectal varieties and how we solved certain technical problems that are related to these differences. In particular, the specific set of phones applicable to each variety had to be determined by applying various constraints. Since such a set does not serve any descriptive purposes but rather is influencing the quality of speech synthesis, a careful design of such a (in most cases reduced) set was an important task.

[3] Volker Strom and Simon King. A classifier-based target cost for unit selection speech synthesis trained on perceptual data. In Proc. Interspeech, Makuhari, Japan, 2010. [ bib | .ps | .pdf ]
Our goal is to automatically learn a PERCEPTUALLY-optimal target cost function for a unit selection speech synthesiser. The approach we take here is to train a classifier on human perceptual judgements of synthetic speech. The output of the classifier is used to make a simple three-way distinction rather than to estimate a continuously-valued cost. In order to collect the necessary perceptual data, we synthesised 145,137 short sentences with the usual target cost switched off, so that the search was driven by the join cost only. We then selected the 7200 sentences with the best joins and asked 60 listeners to judge them, providing their ratings for each syllable. From this, we derived a rating for each demiphone. Using as input the same context features employed in our conventional target cost function, we trained a classifier on these human perceptual ratings. We synthesised two sets of test sentences with both our standard target cost and the new target cost based on the classifier. A/B preference tests showed that the classifier-based target cost, which was learned completely automatically from modest amounts of perceptual data, is almost as good as our carefully- and expertly-tuned standard target cost.

[4] Michael Pucher, Dietmar Schabus, Junichi Yamagishi, Friedrich Neubarth, and Volker Strom. Modeling and interpolation of Austrian German and Viennese dialect in HMM-based speech synthesis. Speech Communication, 52(2):164-179, 2010. [ bib | DOI ]
An HMM-based speech synthesis framework is applied to both Standard Austrian German and a Viennese dialectal variety and several training strategies for multi-dialect modeling such as dialect clustering and dialect-adaptive training are investigated. For bridging the gap between processing on the level of HMMs and on the linguistic level, we add phonological transformations to the HMM interpolation and apply them to dialect interpolation. The crucial steps are to employ several formalized phonological rules between Austrian German and Viennese dialect as constraints for the HMM interpolation. We verify the effectiveness of this strategy in a number of perceptual evaluations. Since the HMM space used is not articulatory but acoustic space, there are some variations in evaluation results between the phonological rules. However, in general we obtained good evaluation results which show that listeners can perceive both continuous and categorical changes of dialect varieties by using phonological transformations employed as switching rules in the HMM interpolation.

[5] Volker Strom and Simon King. Investigating Festival's target cost function using perceptual experiments. In Proc. Interspeech, Brisbane, 2008. [ bib | .ps | .pdf ]
We describe an investigation of the target cost used in the Festival unit selection speech synthesis system. Our ultimate goal is to automatically learn a perceptually optimal target cost function. In this study, we investigated the behaviour of the target cost for one segment type. The target cost is based on counting the mismatches in several context features. A carrier sentence (“My name is Roger”) was synthesised using all 147,820 possible combinations of the diphones /n_ei/ and /ei_m/. 92 representative versions were selected and presented to listeners as 460 pairwise comparisons. The listeners' preference votes were used to analyse the behaviour of the target cost, with respect to the values of its component linguistic context features.

[6] Leonardo Badino, Robert A.J. Clark, and Volker Strom. Including pitch accent optionality in unit selection text-to-speech synthesis. In Proc. Interspeech, Brisbane, 2008. [ bib | .ps | .pdf ]
A significant variability in pitch accent placement is found when comparing the patterns of prosodic prominence realized by different English speakers reading the same sentences. In this paper we describe a simple approach to incorporate this variability to synthesize prosodic prominence in unit selection text-to-speech synthesis. The main motivation of our approach is that by taking into account the variability of accent placements we enlarge the set of prosodically acceptable speech units, thus increasing the chances of selecting a good quality sequence of units, both in prosodic and segmental terms. Results on a large scale perceptual test show the benefits of our approach and indicate directions for further improvements.

[7] Volker Strom, Ani Nenkova, Robert Clark, Yolanda Vazquez-Alvarez, Jason Brenier, Simon King, and Dan Jurafsky. Modelling prominence and emphasis improves unit-selection synthesis. In Proc. Interspeech 2007, Antwerp, Belgium, August 2007. [ bib | .pdf ]
We describe the results of large scale perception experiments showing improvements in synthesising two distinct kinds of prominence: standard pitch-accent and strong emphatic accents. Previously prominence assignment has been mainly evaluated by computing accuracy on a prominence-labelled test set. By contrast we integrated an automatic pitch-accent classifier into the unit selection target cost and showed that listeners preferred these synthesised sentences. We also describe an improved recording script for collecting emphatic accents, and show that generating emphatic accents leads to further improvements in the fiction genre over incorporating pitch accent only. Finally, we show differences in the effects of prominence between child-directed speech and news and fiction genres. Index Terms: speech synthesis, prosody, prominence, pitch accent, unit selection

[8] K. Richmond, V. Strom, R. Clark, J. Yamagishi, and S. Fitt. Festival multisyn voices for the 2007 blizzard challenge. In Proc. Blizzard Challenge Workshop (in Proc. SSW6), Bonn, Germany, August 2007. [ bib | .pdf ]
This paper describes selected aspects of the Festival Multisyn entry to the Blizzard Challenge 2007. We provide an overview of the process of building the three required voices from the speech data provided. This paper focuses on new features of Multisyn which are currently under development and which have been employed in the system used for this Blizzard Challenge. These differences are the application of a more flexible phonetic lattice representation during forced alignment labelling and the use of a pitch accent target cost component. Finally, we also examine aspects of the speech data provided for this year's Blizzard Challenge and raise certain issues for discussion concerning the aim of comparing voices made with differing subsets of the data provided.

[9] R. Clark, K. Richmond, V. Strom, and S. King. Multisyn voices for the Blizzard Challenge 2006. In Proc. Blizzard Challenge Workshop (Interspeech Satellite), Pittsburgh, USA, September 2006. ( [ bib | .pdf ]
This paper describes the process of building unit selection voices for the Festival Multisyn engine using the ATR dataset provided for the Blizzard Challenge 2006. We begin by discussing recent improvements that we have made to the Multisyn voice building process, prompted by our participation in the Blizzard Challenge 2006. We then go on to discuss our interpretation of the results observed. Finally, we conclude with some comments and suggestions for the formulation of future Blizzard Challenges.

[10] Volker Strom, Robert Clark, and Simon King. Expressive prosody for unit-selection speech synthesis. In Proc. Interspeech, Pittsburgh, 2006. [ bib | .ps | .pdf ]
Current unit selection speech synthesis voices cannot produce emphasis or interrogative contours because of a lack of the necessary prosodic variation in the recorded speech database. A method of recording script design is proposed which addresses this shortcoming. Appropriate components were added to the target cost function of the Festival Multisyn engine, and a perceptual evaluation showed a clear preference over the baseline system.

[11] H. P. Graf, E. Cosatto, V. Strom, and F. J. Huang. Visual prosody: Facial movements accompanying speech. In Proc Fifth Int. Conf. Automatic Face and Gesture Recognition, pages 397-401, 2002. [ bib | .ps | .pdf ]
As we articulate speech, we usually move the head and exhibit various facial expressions. This visual aspect of speech aids understanding and helps communicating additional information, such as the speaker's mood. In this paper we analyze quantitatively head and facial movements that accompany speech and investigate how they relate to the text's prosodic structure. We recorded several hours of speech and measured the locations of the speaker's main facial features as well as their head poses. The text was evaluated with a prosody prediction tool, identifying phrase boundaries and pitch accents. Characteristic for most speakers are simple motion patterns that are repeatedly applied in synchrony with the main prosodic events. Direction and strength of head movements vary widely from one speaker to another, yet their timing is typically well synchronized with the spoken text. Understanding quantitatively the correlations between head movements and spoken text is important for synthesizing photo-realistic talking heads. Talking heads appear much more engaging when they exhibit realistic motion patterns.

[12] V. Strom. From text to speech without ToBI. In Proc. ICSLP, Denver, 2002. [ bib | .ps | .pdf ]
A new method for predicting prosodic parameters, i.e. phone durations and F0 targets, from preprocessed text is presented. The prosody model comprises a set of CARTs, which are learned from a large database of labeled speech. This database need not be annotated with Tone and Break Indices (ToBI labels). Instead, a simpler symbolic prosodic description is created by a bootstrapping method. The method had been applied to one Spanish and two German speakers. For the German voices, two listening tests showed a significant preference for the new method over a more traditional approach of prosody prediction, based on hand-crafted rules.

[13] Juergen Schroeter, Alistair Conkie, Ann Syrdal, Mark Beutnagel, Matthias Jilka, Volker Strom, Yeon-Jun Kim, Hong-Goo Kang, and David Kapilow. A perspective on the next challanges for TTS. In IEEE 2002 Workshop in Speech Synthesis, pages 11-13, Santa Monica, CA, 2002. [ bib | .ps | .pdf ]
The quality of speech synthesis has come a long way since Homer Dudley's “Vocoder” in 1939. In fact, with the wide-spread use of unit-selection synthesizers, the naturalness of the synthesized speech is now high enough to pass the Turing test for short utterances, such as prompts. Therefore, it seems valid to ask the question “what are the next challenges for TTS Research?” This paper tries to identify unresoved issues, the solution of which would greatly enhance the state of the art in TTS.

[14] Ann K. Syrdal, Colin W. Wightman, Alistair Conkie, Yannis Stylianou, Mark Beutnagel, Juergen Schroeter, Volker Strom, and Ki-Seung Lee. Corpus-based techniques in the at&t nextgen synthesis system. In Proc. Int. Conf. on Spoken Language Processing, Beijing, 2000. [ bib | .ps | .pdf ]
The AT&T text-to-speech (TTS) synthesis system has been used as a framework for experimenting with a perceptually-guided data-driven approach to speech synthesis, with a primary focus on data-driven elements in the "back end". Statistical training techniques applied to a large corpus are used to make decisions about predicted speech events and selected speech inventory units. Our recent advances in automatic phonetic and prosodic labelling and a new faster harmonic plus noise model (HMM) and unit preselection implementations have significantly improved TTS quality and speeded up both development time and runtime.

[15] V. Strom and H. Heine. Utilizing prosody for unconstrained morpheme recognition. In Proc. European Conf. on Speech Communication and Technology, Budapest, 1999. [ bib | .ps | .pdf ]
Speech recognition systems for languages with a rich inflectional morphology (like German) suffer from the limitations of a word-based full-form lexicon. Although the morphological and acoustical knowledge about words is coded implicitly within the lexicon entries (which are usually closely related to the orthography of the language at hand) this knowledge is usually not explicitly available for other tasks (e.g. detecting OOV words, prosodic analysis). This paper presents an HMM-based `word' recognizer that uses morpheme-like units on the string level for recognizing spontaneous German conversational speech (Verbmobil corpus). The system has no explicit word knowledge but uses a morpheme-bigram to capture the German word and sentence structure to some extent. The morpheme recognizer is tightly coupled with a prosodic classifier in order to compensate for some of the additional ambiguity introduced by using morphemes instead of words.

[16] Günther Görz, Jörg Spilker, Volker Strom, and Hans Weber. Architectural considerations for conversational systems - the verbmobil/intarc experience. proceedings of First International Workshop on Human Computer Conversation, cs.CL/9907021, 1999. [ bib | .ps | .pdf ]
The paper describes the speech to speech translation system INTARC, developed during the first phase of the Verbmobil project. The general design goals of the INTARC system architecture were time synchronous processing as well as incrementality and interactivity as a means to achieve a higher degree of robustness and scalability. Interactivity means that in addition to the bottom-up (in terms of processing levels) data flow the ability to process top-down restrictions considering the same signal segment for all processing levels. The construction of INTARC 2.0, which has been operational since fall 1996, followed an engineering approach focussing on the integration of symbolic (linguistic) and stochastic (recognition) techniques which led to a generalization of the concept of a “one pass” beam search.

[17] V. Strom. Automatische Erkennung von Satzmodus, Akzentuierung und Phrasengrenzen. PhD thesis, University of Bonn, 1998. [ bib | .ps | .pdf ]
[18] V. Strom, A. Elsner, G. Görz, W. Hess, W. Kasper, A. Klein, H.U. Krieger, J. Spilker, and H. Weber. On the use of prosody in a speech-to-speech translator. In Proc. European Conf. on Speech Communication and Technology, Rhodes, 1997. [ bib | .ps | .pdf ]
In this paper a speech-to-speech translator from German to English is presented. Beside the traditional processing steps it takes advantage of acoustically detected prosodic phrase boundaries and focus. The prosodic phrase boundaries reduce search space during syntactic parsing and rule out analysis trees during semantic parsing. The prosodic focus faciliates a “shallow” translation based on the best word chain in cases where the deep analysis fails.

[19] V. Strom and C. Widera. What's in the “pure” prosody? In Proc. ICSLP, Philadelphia, 1996. [ bib | .ps | .pdf ]
Detectors for accents and phrase boundaries have been developed which derive prosodic features from the speech signal and its fundamental frequency to support other modules of a speech understanding system in an early analysis stage, or in cases where no word hypotheses are available. The detectors' underlying Gaussian distribution classifiers were trained with 50 minutes and tested with 30 minutes of spontaneous speech, yielding recognition rates of 74% for accents and 86% for phrase boundaries. Since this material was prosodically hand labelled, the question was, which labels for phrase boundaries and accentuation were only guided by syntactic or semantic knowledge, and which ones are really prosodically marked. Therefore a small test subset has been resynthesized in such a way that comprehensibility was lost, but the prosodic characteristics were kept. This subset has been re-labelled by 11 listeners with nearly the same accuracy as the detectors.

[20] W. Hess, A. Batliner, A. Kießling, R. Kompe, E. Nöth, A. Petzold, M. Reyelt, and V. Strom. Prosodic modules for speech recognition and understanding in VERBMOBIL. In Norio Higuchi Yoshinori Sagisaka, Nick Campbell, editor, Computing Prosody, pages Part IV, Chapter 23, pp. 363 - 383. Springer-Verlag, New York, 1995. [ bib | .ps | .pdf ]
[21] V. Strom. Detection of accents, phrase boundaries and sentence modality in German with prosodic features. In Proc. European Conf. on Speech Communication and Technology, volume 3, pages 2039-2041, Madrid, 1995. [ bib | .ps | .pdf ]
In this paper detectors for accents, phrase boundaries, and sentence modality are described which derive prosodic features only from the speech signal and its fundamental frequency to support other modules of a speech understanding system in an early analysis stage, or in cases where no word hypotheses are available. A new method for interpolating and decomposing the fundamental frequency is suggested. The detectors' underlying Gaussian distribution classifiers were trained and tested with approximately 50 minutes of spontaneous speech, yielding recognition rates of 78 percent for accents, 81 percent for phrase boundaries, and 85 percent for sentence modality.

[22] H. Niemann, J. Denzler, B. Kahles, R. Kompe, A. Kießling, E. Nöth, and V. Strom. Pitch determination considering laryngealization effects in spoken dialogs. In Proc. Int. Conf. on Neuronal Networks, volume 7, pages 4457-4461, Orlando, 1994. [ bib | .ps | .pdf ]
A frequent phenomen in spoken dialogs of the information seeking type are short elliptic utterances whose mood (declarative or interrogative) can only be distinguished by intonation. The main acoustic evidence is conveyed by the fundamental frequency or F0 contour. Many algorithms for F0 determination have been reported in the literature. A common problem are irregularities of speech known as laryngealizations. This article describes an approach based on neuronal network techniques for the improved determination of fundamental frequency. First, an improved version of our neuronal network algorithm for reconstruction of the voice source signal (glottis signal) is presented. Second, the reconstructed voice source signal is used as input to another neuronal network destinguishing the three classes 'voiceless', 'voiced-non-laryngealized', and 'voiced-laryngealized'. Third, the results are used to improve an existing F0 algorithm. Results of this approach are presented and discussed in the context of the application in a spoken dialog system.