The Centre for Speech Technology Research, The University of Edinburgh

Publications by Mirjam Wester

[1] John Dines, Hui Liang, Lakshmi Saheer, Matthew Gibson, William Byrne, Keiichiro Oura, Keiichi Tokuda, Junichi Yamagishi, Simon King, Mirjam Wester, Teemu Hirsimäki, Reima Karhila, and Mikko Kurimo. Personalising speech-to-speech translation: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. Computer Speech and Language, 27(2):420-437, February 2013. [ bib | DOI | http ]
In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation in which we employ an HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios, and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions, including the need for better evaluation metrics.

Keywords: Speech-to-speech translation, Cross-lingual speaker adaptation, HMM-based speech synthesis, Speaker adaptation, Voice conversion
[2] Mirjam Wester. Talker discrimination across languages. Speech Communication, 54:781-790, 2012. [ bib | DOI | .pdf ]
This study investigated the extent to which listeners are able to discriminate between bilingual talkers in three language pairs: English–German, English–Finnish and English–Mandarin. Native English listeners were presented with two sentences spoken by bilingual talkers and were asked to judge whether they thought the sentences were spoken by the same person. Equal numbers of cross-language and matched-language trials were presented. The results show that native English listeners are able to carry out this task well, achieving percent correct levels well above chance for all three language pairs. Previous research has shown this for English–German; this study shows that listeners also extend this ability to Finnish and Mandarin, languages that are quite distinct from English in terms of genetic and phonetic similarity. However, listeners are significantly less accurate on cross-language talker trials (English–foreign) than on matched-language trials (English–English and foreign–foreign). Understanding listeners' behaviour in cross-language talker discrimination using natural speech is the first step in developing principled evaluation techniques for synthesis systems in which the goal is for the synthesised voice to sound like the original speaker, for instance in speech-to-speech translation systems, voice conversion and voice reconstruction.

[3] Martin Cooke, Maria Luisa García Lecumberri, Yan Tang, and Mirjam Wester. Do non-native listeners benefit from speech modifications designed to promote intelligibility for native listeners? In Proceedings of The Listening Talker Workshop, page 59, 2012. http://listening-talker.org/workshop/programme.html. [ bib ]
[4] Keiichiro Oura, Junichi Yamagishi, Mirjam Wester, Simon King, and Keiichi Tokuda. Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping. Speech Communication, 54(6):703-714, 2012. [ bib | DOI | http ]
In the EMIME project, we developed a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrated two techniques into a single architecture: unsupervised adaptation for HMM-based TTS using word-based large-vocabulary continuous speech recognition, and cross-lingual speaker adaptation (CLSA) for HMM-based TTS. The CLSA is based on a state-level transform mapping learned using minimum Kullback-Leibler divergence between pairs of HMM states in the input and output languages. Thus, an unsupervised cross-lingual speaker adaptation system was developed. End-to-end speech-to-speech translation systems for four languages (English, Finnish, Mandarin, and Japanese) were constructed within this framework. In this paper, the English-to-Japanese adaptation is evaluated. Listening tests demonstrate that adapted voices sound more similar to a target speaker than average voices and that differences between supervised and unsupervised cross-lingual speaker adaptation are small. Calculating the KLD state-mapping on only the first 10 mel-cepstral coefficients leads to huge savings in computational costs, without any detrimental effect on the quality of the synthetic speech.

Keywords: HMM-based speech synthesis, Unsupervised speaker adaptation, Cross-lingual speaker adaptation, Speech-to-speech translation
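A minimal sketch of the state-mapping idea described in [4], assuming diagonal-covariance Gaussian state distributions: each output-language HMM state is mapped to the input-language state with the smallest symmetric Kullback-Leibler divergence, computed on the first 10 mel-cepstral coefficients only. The function names, array shapes and toy data are illustrative and are not taken from the EMIME implementation.

    # Sketch of KLD-based transform mapping between HMM states (cf. [4]).
    import numpy as np

    def symmetric_kld_diag(mu_p, var_p, mu_q, var_q):
        """Symmetric KLD between two diagonal-covariance Gaussians."""
        kl_pq = 0.5 * np.sum(np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
        kl_qp = 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
        return kl_pq + kl_qp

    def kld_state_mapping(out_means, out_vars, in_means, in_vars, n_ceps=10):
        """Map every output-language state to the index of the minimum-KLD
        input-language state, using only the first n_ceps dimensions."""
        mapping = []
        for mu_o, var_o in zip(out_means, out_vars):
            klds = [symmetric_kld_diag(mu_o[:n_ceps], var_o[:n_ceps],
                                       mu_i[:n_ceps], var_i[:n_ceps])
                    for mu_i, var_i in zip(in_means, in_vars)]
            mapping.append(int(np.argmin(klds)))
        return mapping

    # Toy usage: 2 output-language states, 3 input-language states, 40-dim features.
    rng = np.random.default_rng(0)
    out_means, out_vars = rng.normal(size=(2, 40)), rng.uniform(0.5, 2.0, size=(2, 40))
    in_means, in_vars = rng.normal(size=(3, 40)), rng.uniform(0.5, 2.0, size=(3, 40))
    print(kld_state_mapping(out_means, out_vars, in_means, in_vars))
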
[5] Leonardo Badino, Robert A.J. Clark, and Mirjam Wester. Towards hierarchical prosodic prominence generation in TTS synthesis. In Proc. Interspeech, Portland, USA, 2012. [ bib | .pdf ]
[6] Reima Karhila and Mirjam Wester. Rapid adaptation of foreign-accented HMM-based speech synthesis. In Proc. Interspeech, Florence, Italy, 2011. [ bib | .pdf ]
This paper presents findings of listeners’ perception of speaker identity in synthetic speech. Specifically, we investigated what the effect is on the perceived identity of a speaker when using differently accented average voice models and limited amounts (five and fifteen sentences) of a speaker’s data to create the synthetic stimuli. A speaker discrimination task was used to measure speaker identity. Native English listeners were presented with natural and synthetic speech stimuli in English and were asked to decide whether they thought the sentences were spoken by the same person or not. An accent rating task was also carried out to measure the perceived accents of the synthetic speech stimuli. The results show that listeners, for the most part, perform as well at speaker discrimination when the stimuli have been created using five or fifteen adaptation sentences as when using 105 sentences. Furthermore, the accent of the average voice model does not affect listeners’ speaker discrimination performance even though the accent rating task shows listeners are perceiving different accents in the synthetic stimuli. Listeners do not base their speaker similarity decisions on perceived accent.

[7] Mirjam Wester and Hui Liang. Cross-lingual speaker discrimination using natural and synthetic speech. In Proc. Interspeech, Florence, Italy, 2011. [ bib | .pdf ]
This paper describes speaker discrimination experiments in which native English listeners were presented with either natural speech stimuli in English and Mandarin, synthetic speech stimuli in English and Mandarin, or natural Mandarin speech and synthetic English speech stimuli. In each experiment, listeners were asked to decide whether they thought the sentences were spoken by the same person or not. We found that the results for Mandarin/English speaker discrimination are very similar to results found in previous work on German/English and Finnish/English speaker discrimination. We conclude from this and previous work that listeners are able to identify speakers across languages and they are able to identify speakers across speech types, but the combination of these two factors leads to a speaker discrimination task which is too difficult for listeners to perform successfully, given the quality of across-language speaker adapted speech synthesis at present.

[8] Mirjam Wester and Reima Karhila. Speaker similarity evaluation of foreign-accented speech synthesis using HMM-based speaker adaptation. In Proc. ICASSP, pages 5372-5375, Prague, Czech Republic, 2011. [ bib | .pdf ]
This paper describes a speaker discrimination experiment in which native English listeners were presented with natural and synthetic speech stimuli in English and were asked to judge whether they thought the sentences were spoken by the same person or not. The natural speech consisted of recordings of Finnish speakers speaking English. The synthetic stimuli were created using adaptation data from the same Finnish speakers. Two average voice models were compared: one trained on Finnish-accented English and the other on American-accented English. The experiments illustrate that listeners perform well at speaker discrimination when the stimuli are both natural or both synthetic, but when the speech types are crossed performance drops significantly. We also found that the type of accent in the average voice model had no effect on the listeners’ speaker discrimination performance.

[9] Mirjam Wester and Hui Liang. The EMIME Mandarin Bilingual Database. Technical Report EDI-INF-RR-1396, The University of Edinburgh, 2011. [ bib | .pdf ]
This paper describes the collection of a bilingual database of Mandarin/English data. In addition, the accents of the talkers in the database have been rated. English and Mandarin listeners assessed the English and Mandarin talkers' degree of foreign accent in English.

[10] Mirjam Wester. Cross-lingual talker discrimination. In Proc. of Interspeech, Makuhari, Japan, September 2010. [ bib | .pdf ]
This paper describes a talker discrimination experiment in which native English listeners were presented with two sentences spoken by bilingual talkers (English/German and English/Finnish) and were asked to judge whether they thought the sentences were spoken by the same person or not. Equal numbers of cross-lingual and matched-language trials were presented. The experiments showed that listeners are able to complete this task well: they can discriminate between talkers significantly better than chance. However, listeners are significantly less accurate on cross-lingual talker trials than on matched-language pairs. No significant differences were found on this task between German and Finnish. Bias (B″) and sensitivity (A′) values are presented to analyse the listeners' behaviour in more detail. The results are promising for the evaluation of EMIME, a project covering speech-to-speech translation with speaker adaptation.
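For reference, the non-parametric sensitivity (A′) and bias (B″) measures mentioned above can be computed from a hit rate H (proportion of "same talker" responses on same-talker trials) and a false-alarm rate F ("same talker" responses on different-talker trials). The sketch below uses the standard Grier formulas; the paper's exact computation may differ in detail.

    # Non-parametric sensitivity A' and bias B'' (Grier's formulas).
    def a_prime(h, f):
        if h == f:
            return 0.5
        if h > f:
            return 0.5 + ((h - f) * (1.0 + h - f)) / (4.0 * h * (1.0 - f))
        return 0.5 - ((f - h) * (1.0 + f - h)) / (4.0 * f * (1.0 - h))

    def b_double_prime(h, f):
        num = h * (1.0 - h) - f * (1.0 - f)
        den = h * (1.0 - h) + f * (1.0 - f)
        return 0.0 if den == 0 else num / den

    # Example (illustrative values): 90% hits, 30% false alarms.
    print(a_prime(0.9, 0.3), b_double_prime(0.9, 0.3))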

[11] Mirjam Wester, John Dines, Matthew Gibson, Hui Liang, Yi-Jian Wu, Lakshmi Saheer, Simon King, Keiichiro Oura, Philip N. Garner, William Byrne, Yong Guan, Teemu Hirsimäki, Reima Karhila, Mikko Kurimo, Matt Shannon, Sayaka Shiota, Jilei Tian, Keiichi Tokuda, and Junichi Yamagishi. Speaker adaptation and the evaluation of speaker similarity in the EMIME speech-to-speech translation project. In Proc. of 7th ISCA Speech Synthesis Workshop, Kyoto, Japan, September 2010. [ bib | .pdf ]
This paper provides an overview of speaker adaptation research carried out in the EMIME speech-to-speech translation (S2ST) project. We focus on how speaker adaptation transforms can be learned from speech in one language and applied to the acoustic models of another language. The adaptation is transferred across languages and/or from recognition models to synthesis models. The various approaches investigated can all be viewed as a process in which a mapping is defined in terms of either acoustic model states or linguistic units. The mapping is used to transfer either speech data or adaptation transforms between the two models. Because the success of speaker adaptation in text-to-speech synthesis is measured by judging speaker similarity, we also discuss issues concerning evaluation of speaker similarity in an S2ST scenario.

[12] Mikko Kurimo, William Byrne, John Dines, Philip N. Garner, Matthew Gibson, Yong Guan, Teemu Hirsimäki, Reima Karhila, Simon King, Hui Liang, Keiichiro Oura, Lakshmi Saheer, Matt Shannon, Sayaka Shiota, Jilei Tian, Keiichi Tokuda, Mirjam Wester, Yi-Jian Wu, and Junichi Yamagishi. Personalising speech-to-speech translation in the EMIME project. In Proc. of the ACL 2010 System Demonstrations, Uppsala, Sweden, July 2010. [ bib | .pdf ]
In the EMIME project we have studied unsupervised cross-lingual speaker adaptation. We have employed an HMM statistical framework for both speech recognition and synthesis which provides transformation mechanisms to adapt the synthesized voice in TTS (text-to-speech) using the recognized voice in ASR (automatic speech recognition). An important application for this research is personalised speech-to-speech translation that will use the voice of the speaker in the input language to utter the translated sentences in the output language. In mobile environments this enhances the users' interaction across language barriers by making the output speech sound more like the original speaker's way of speaking, even if she or he could not speak the output language.

[13] M. Wester. The EMIME Bilingual Database. Technical Report EDI-INF-RR-1388, The University of Edinburgh, 2010. [ bib | .pdf ]
This paper describes the collection of a bilingual database of Finnish/English and German/English data. In addition, the accents of the talkers in the database have been rated. English, German and Finnish listeners assessed the English, German and Finnish talkers' degree of foreign accent in English. Native English listeners showed higher inter-listener agreement than non-native listeners. Further analyses showed that non-native listeners judged Finnish and German female talkers to be significantly less accented than English listeners did. German males were judged less accented by Finnish listeners than by English and German listeners, and there was no difference between listener groups in how they judged the accent of Finnish males. Finally, all English talkers were judged more accented by non-native listeners than by native English listeners.

[14] Keiichiro Oura, Keiichi Tokuda, Junichi Yamagishi, Mirjam Wester, and Simon King. Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. In Proc. of ICASSP, volume I, pages 4954-4957, 2010. [ bib | .pdf ]
In the EMIME project, we are developing a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrate two techniques, unsupervised adaptation for HMM-based TTS using a word-based large-vocabulary continuous speech recognizer and cross-lingual speaker adaptation for HMM-based TTS, into a single architecture. Thus, an unsupervised cross-lingual speaker adaptation system can be developed. Listening tests show very promising results, demonstrating that adapted voices sound similar to the target speaker and that differences between supervised and unsupervised cross-lingual speaker adaptation are small.

[15] J. Frankel, M. Wester, and S. King. Articulatory feature recognition using dynamic Bayesian networks. Computer Speech & Language, 21(4):620-640, October 2007. [ bib | .pdf ]
We describe a dynamic Bayesian network for articulatory feature recognition. The model is intended to be a component of a speech recognizer that avoids the problems of conventional “beads-on-a-string” phoneme-based models. We demonstrate that the model gives superior recognition of articulatory features from the speech signal compared with a state-of-the-art neural network system. We also introduce a training algorithm that offers two major advances: it does not require time-aligned feature labels and it allows the model to learn a set of asynchronous feature changes in a data-driven manner.

[16] S. King, J. Frankel, K. Livescu, E. McDermott, K. Richmond, and M. Wester. Speech production knowledge in automatic speech recognition. Journal of the Acoustical Society of America, 121(2):723-742, February 2007. [ bib | .pdf ]
Although much is known about how speech is produced, and research into speech production has resulted in measured articulatory data, feature systems of different kinds and numerous models, speech production knowledge is almost totally ignored in current mainstream approaches to automatic speech recognition. Representations of speech production allow simple explanations for many phenomena observed in speech which cannot be easily analyzed from either acoustic signal or phonetic transcription alone. In this article, we provide a survey of a growing body of work in which such representations are used to improve automatic speech recognition.

[17] S. Chang, M. Wester, and S. Greenberg. An elitist approach to automatic articulatory-acoustic feature classification for phonetic characterization of spoken language. Speech Communication, 47:290-311, 2005. [ bib | .pdf ]
A novel framework for automatic articulatory-acoustic feature extraction has been developed for enhancing the accuracy of place- and manner-of-articulation classification in spoken language. The "elitist" approach provides a principled means of selecting frames for which multi-layer perceptron (MLP) neural-network classifiers are highly confident. Using this method it is possible to achieve a frame-level accuracy of 93% on "elitist" frames for manner classification on a corpus of American English sentences passed through a telephone network (NTIMIT). Place-of-articulation information is extracted for each manner class independently, resulting in an appreciable gain in place-feature classification relative to performance for a manner-independent system. A comparable enhancement in classification performance for the elitist approach is evidenced when applied to a Dutch corpus of quasi-spontaneous telephone interactions (VIOS). The elitist framework provides a potential means of automatically annotating a corpus at the phonetic level without recourse to a word-level transcript and could thus be of utility for developing training materials for automatic speech recognition and speech synthesis applications, as well as aiding the empirical study of spoken language.
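The frame-selection step at the core of the elitist approach can be illustrated with a short sketch: keep only the frames whose maximum MLP posterior exceeds a confidence threshold, then score classification on that subset. The threshold value, array names and toy data below are illustrative assumptions, not values from the paper.

    # Sketch of "elitist" frame selection by classifier confidence (cf. [17]).
    import numpy as np

    def elitist_accuracy(posteriors, labels, threshold=0.9):
        """Accuracy over the high-confidence ('elitist') frames only,
        plus the fraction of frames retained."""
        confidence = posteriors.max(axis=1)
        keep = confidence >= threshold
        if not np.any(keep):
            return float("nan"), 0.0
        predicted = posteriors[keep].argmax(axis=1)
        accuracy = float(np.mean(predicted == labels[keep]))
        return accuracy, float(np.mean(keep))

    # Toy usage: 5 frames, 3 manner classes.
    post = np.array([[0.95, 0.03, 0.02],
                     [0.40, 0.35, 0.25],
                     [0.10, 0.85, 0.05],
                     [0.33, 0.33, 0.34],
                     [0.05, 0.02, 0.93]])
    lab = np.array([0, 1, 1, 2, 2])
    print(elitist_accuracy(post, lab, threshold=0.8))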

[18] M. Wester, J. Frankel, and S. King. Asynchronous articulatory feature recognition using dynamic Bayesian networks. In Proc. IEICI Beyond HMM Workshop, Kyoto, December 2004. [ bib | .ps | .pdf ]
This paper builds on previous work where dynamic Bayesian networks (DBN) were proposed as a model for articulatory feature recognition. Using DBNs makes it possible to model the dependencies between features, an addition to previous approaches which was found to improve feature recognition performance. The DBN results were promising, giving close to the accuracy of artificial neural nets (ANNs). However, the system was trained on canonical labels, leading to an overly strong set of constraints on feature co-occurrence. In this study, we describe an embedded training scheme which learns a set of data-driven asynchronous feature changes where supported in the data. Using a subset of the OGI Numbers corpus, we describe articulatory feature recognition experiments using both canonically-trained and asynchronous DBNs. Performance using DBNs is found to exceed that of ANNs trained on an identical task, giving a higher recognition accuracy. Furthermore, inter-feature dependencies result in a more structured model, giving rise to fewer feature combinations in the recognition output. In addition to an empirical evaluation of this modelling approach, we give a qualitative analysis, comparing asynchrony found through our data-driven methods to the asynchrony which may be expected on the basis of linguistic knowledge.

[19] J. Frankel, M. Wester, and S. King. Articulatory feature recognition using dynamic Bayesian networks. In Proc. ICSLP, September 2004. [ bib | .ps | .pdf ]
This paper describes the use of dynamic Bayesian networks for the task of articulatory feature recognition. We show that by modeling the dependencies between a set of 6 multi-leveled articulatory features, recognition accuracy is increased over an equivalent system in which features are considered independent. Results are compared to those found using artificial neural networks on an identical task.

[20] Alexander Gutkin, David Gay, Lev Goldfarb, and Mirjam Wester. On the Articulatory Representation of Speech within the Evolving Transformation System Formalism. In Lev Goldfarb, editor, Pattern Representation and the Future of Pattern Recognition (Proc. Satellite Workshop of 17th International Conference on Pattern Recognition), pages 57-76, Cambridge, UK, August 2004. [ bib | .ps.gz | .pdf ]
This paper deals with the formulation of an alternative, structural, approach to the speech representation and recognition problem. In this approach, we require both the representation and the learning algorithms to be linguistically meaningful and to naturally represent the linguistic data at hand. This allows the speech recognition system to discover the emergent combinatorial structure of the linguistic classes. The proposed approach is developed within the ETS formalism, the first formalism in applied mathematics specifically designed to address the issues of class and object/event representation. We present an initial application of ETS to the articulatory modelling of speech based on elementary physiological gestures that can be reliably represented as the ETS primitives. We discuss the advantages of this gestural approach over prevalent methods and its promising potential to mathematical modelling and representation in linguistics.

[21] J. Sturm, J. M. Kessens, M. Wester, F. de Wet, E. Sanders, and H. Strik. Automatic transcription of football commentaries in the MUMIS project. In Proc. Eurospeech '03, pages -, 2003. [ bib | .pdf ]
This paper describes experiments carried out to automatically transcribe football commentaries in Dutch, English and German for multimedia indexing. Our results show that the high levels of stadium noise in the material create a task that is extremely difficult for conventional ASR. The baseline WERs vary from 83% to 94% for the three languages investigated. Employing state-of-the-art noise robustness techniques leads to relative reductions of 9-10% WER. Application-specific words such as players' names are recognized correctly in about 50% of cases. Although this result is substantially better than the overall result, it is inadequate. Much better results can be obtained if the football commentaries are recorded separately from the stadium noise. This would make the automatic transcriptions more useful for multimedia indexing.

[22] M. Wester. Syllable classification using articulatory-acoustic features. In Proc. of Eurospeech '03, pages -, Geneva, 2003. [ bib | .pdf ]
This paper investigates the use of articulatory-acoustic features for the classification of syllables in TIMIT. The main motivation for this study is to circumvent the “beads-on-a-string” problem, i.e. the assumption that words can be described as a simple concatenation of phones. Posterior probabilities for articulatory-acoustic features are obtained from artificial neural nets and are used to classify speech within the scope of syllables instead of phones. This gives the opportunity to account for asynchronous feature changes, exploiting the strengths of the articulatory-acoustic features, instead of losing the potential by reverting to phones.

[23] M. Wester. Pronunciation modeling for ASR - knowledge-based and data-derived methods. Computer Speech and Language, 17:69-85, 2003. [ bib | .pdf ]
This article focuses on modeling pronunciation variation in two different ways: data-derived and knowledge-based. The knowledge-based approach consists of using phonological rules to generate variants. The data-derived approach consists of performing phone recognition, followed by smoothing using decision trees (D-trees) to alleviate some of the errors in the phone recognition. Using phonological rules led to a small improvement in WER; a data-derived approach in which the phone recognition was smoothed using D-trees prior to lexicon generation led to larger improvements compared to the baseline. The lexicon was employed in two different recognition systems: a hybrid HMM/ANN system and an HMM-based system, to ascertain whether pronunciation variation was truly being modeled. This proved to be the case as no significant differences were found between the results obtained with the two systems. Furthermore, we found that 10% of variants generated by the phonological rules were also found using phone recognition, and this increased to 28% when the phone recognition output was smoothed by using D-trees. This indicates that the D-trees generalize beyond what has been seen in the training material, whereas when the phone recognition approach is employed directly, unseen pronunciations cannot be predicted. In addition, we propose a metric to measure confusability in the lexicon. Using this confusion metric to prune variants results in roughly the same improvement as using the D-tree method.
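The knowledge-based side of this comparison, generating pronunciation variants by optional rule application, can be sketched as follows. The example simply allows optional deletion of schwa /@/ and /n/ in a phone string, ignoring the contextual conditions a real rule set would impose; the rules and phone symbols are illustrative, not the ones used in the paper.

    # Sketch: generate lexicon variants from optional deletion rules (cf. [23]).
    from itertools import combinations

    def apply_optional_deletions(phones, deletable):
        """Return all variants obtained by deleting any subset of the
        positions whose phone is in `deletable` (canonical form included)."""
        positions = [i for i, p in enumerate(phones) if p in deletable]
        variants = set()
        for r in range(len(positions) + 1):
            for subset in combinations(positions, r):
                variants.add(tuple(p for i, p in enumerate(phones) if i not in subset))
        return sorted(variants)

    # Toy usage: Dutch "lopen" /l o: p @ n/ with optional schwa and /n/ deletion.
    for variant in apply_optional_deletions(["l", "o:", "p", "@", "n"], {"@", "n"}):
        print(" ".join(variant))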

[24] M. Wester, J.M. Kessens, and H. Strik. Goal-directed ASR in a multimedia indexing and searching environment (MUMIS). In Proc. of ICSLP, pages 1993-1996, Denver, 2002. [ bib | .pdf ]
This paper describes the contribution of automatic speech recognition (ASR) within the framework of MUMIS (Multimedia Indexing and Searching Environment). The domain is football commentaries. The initial results of carrying out ASR on Dutch and English football commentaries are presented. We found that overall word error rates are high, but application specific words are recognized reasonably well. The difficulty of the ASR task is greatly increased by the high levels of noise present in the material.

[25] Mirjam Wester. Pronunciation Variation Modeling for Dutch Automatic Speech Recognition. PhD thesis, University of Nijmegen, 2002. [ bib | .pdf ]
This thesis consists of an introductory review to pronunciation variation modeling, followed by four papers in which the PhD research is described.

[26] M. Wester, J. M. Kessens, C. Cucchiarini, and H. Strik. Obtaining phonetic transcriptions: a comparison between expert listeners and a continuous speech recognizer. Language and Speech, 44(3):377-403, 2001. [ bib | .pdf ]
In this article, we address the issue of using a continuous speech recognition tool to obtain phonetic or phonological representations of speech. Two experiments were carried out in which the performance of a continuous speech recognizer (CSR) was compared to the performance of expert listeners in a task of judging whether a number of prespecified phones had been realized in an utterance. In the first experiment, nine expert listeners and the CSR carried out exactly the same task: deciding whether a segment was present or not in 467 cases. In the second experiment, we expanded on the first experiment by focusing on two phonological processes: schwa-deletion and schwa-insertion. The results of these experiments show that significant differences in performance were found between the CSR and the listeners, but also between individual listeners. Although some of these differences appeared to be statistically significant, their magnitude is such that they may very well be acceptable depending on what the transcriptions are needed for. In other words, although the CSR is not infallible, it makes it possible to explore large datasets, which might outweigh the errors introduced by the mistakes the CSR makes. For these reasons, we can conclude that the CSR can be used instead of a listener to carry out this type of task: deciding whether a phone is present or not.

[27] S. Chang, S. Greenberg, and M. Wester. An elitist approach to articulatory-acoustic feature classification. In Proc. of Eurospeech '01, pages 1729-1733, Aalborg, 2001. [ bib | .pdf ]
A novel framework for automatic articulatory-acoustic feature extraction has been developed for enhancing the accuracy of place- and manner-of-articulation classification in spoken language. The elitist approach focuses on frames for which neural network (MLP) classifiers are highly confident, and discards the rest. Using this method, it is possible to achieve a frame-level accuracy of 93% for manner information on a corpus of American English sentences passed through a telephone network (NTIMIT). Place information is extracted for each manner class independently, resulting in an appreciable gain in place-feature classification relative to performance for a manner- independent system. The elitist framework provides a potential means of automatically annotating a corpus at the phonetic level without recourse to a word-level transcript and could thus be of utility for developing training materials for automatic speech recognition and speech synthesis applications, as well as aid the empirical study of spoken language.

[28] M. Wester, S. Greenberg, and S. Chang. A Dutch treatment of an elitist approach to articulatory-acoustic feature classification. In Proc. of Eurospeech '01, pages 1729-1732, Aalborg, 2001. [ bib | .pdf ]
A novel approach to articulatory-acoustic feature extraction has been developed for enhancing the accuracy of classification associated with place and manner of articulation information. This elitist approach is tested on a corpus of spontaneous Dutch using two different systems, one trained on a subset of the same corpus, the other trained on a corpus from a different language (American English). The feature dimensions, voicing and manner of articulation transfer relatively well between the two languages. However, place information transfers less well. Manner-specific training can be used to improve classification of articulatory place information.

[29] J.M. Kessens, M. Wester, and H. Strik. Automatic detection and verification of Dutch phonological rules. In PHONUS 5: Proceedings of the "Workshop on Phonetics and Phonology in ASR", pages 117-128, Saarbruecken, 2000. [ bib | .pdf ]
In this paper, we propose two methods for automatically obtaining hypotheses about pronunciation variation. To this end, we used two different approaches in which we employed a continuous speech recognizer to derive this information from the speech signal. For the first method, the output of a phone recognition was compared to a reference transcription in order to obtain hypotheses about pronunciation variation. Since phone recognition contains errors, we used forced recognition in order to exclude unreliable hypotheses. For the second method, forced recognition was also used, but the hypotheses about the deletion of phones were not constrained beforehand. This was achieved by allowing each phone to be deleted. After forced recognition, we selected the most frequently applied rules as the set of deletion rules. Since previous research showed that forced recognition is a reliable tool for testing hypotheses about pronunciation variation, we can expect that this will also hold for the hypotheses about pronunciation variation which we found using each of the two methods. Another reason for expecting the rule hypotheses to be reliable is that we found that 37-53% of the rules are related to Dutch phonological processes that have been described in the literature.

[30] M. Wester, J.M. Kessens, and H. Strik. Pronunciation variation in ASR: Which variation to model? In Proc. of ICSLP '00, volume IV, pages 488-491, Beijing, 2000. [ bib | .pdf ]
This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling within-word and cross-word pronunciation variation. A relative improvement of 8.8% in WER was found compared to baseline system performance. However, as WERs do not reveal the full effect of modeling pronunciation variation, we performed a detailed analysis of the differences in recognition results that occur due to modeling pronunciation variation and found that indeed a lot of the differences in recognition results are not reflected in the error rates. Furthermore, error analysis revealed that testing sets of variants in isolation does not predict their behavior in combination. However, these results appeared to be corpus dependent.

[31] M. Wester and E. Fosler-Lussier. A comparison of data-derived and knowledge-based modeling of pronunciation variation. In Proc. of ICSLP '00, volume I, pages 270-273, Beijing, 2000. [ bib | .pdf ]
This paper focuses on modeling pronunciation variation in two different ways: data-derived and knowledge-based. The knowledge-based approach consists of using phonological rules to generate variants. The data-derived approach consists of performing phone recognition, followed by various pruning and smoothing methods to alleviate some of the errors in the phone recognition. Using phonological rules led to a small improvement in WER; whereas, using a data-derived approach in which the phone recognition was smoothed using simple decision trees (d-trees) prior to lexicon generation led to a significant improvement compared to the baseline. Furthermore, we found that 10% of variants generated by the phonological rules were also found using phone recognition, and this increased to 23% when the phone recognition output was smoothed by using d-trees. In addition, we propose a metric to measure confusability in the lexicon and we found that employing this confusion metric to prune variants results in roughly the same improvement as using the d-tree method.

[32] M. Wester, J.M. Kessens, and H. Strik. Using Dutch phonological rules to model pronunciation variation in ASR. In Phonus 5: proceedings of the "workshop on phonetics and phonology in ASR", pages 105-116, Saarbruecken, 2000. [ bib | .pdf ]
In this paper, we describe how the performance of a continuous speech recognizer for Dutch has been improved by modeling within-word and cross-word pronunciation variation. Within-word variants were automatically generated by applying five phonological rules to the words in the lexicon. Cross-word pronunciation variation was modeled by adding multi-words and their variants to the lexicon. The best results were obtained when the cross-word method was combined with the within-word method: a relative improvement of 8.8% in the WER was found compared to baseline system performance. We also describe an error analysis that was carried out to investigate whether rules in isolation can predict the performance of rules in combination.

[33] J.M. Kessens, M. Wester, and H. Strik. Improving the performance of a Dutch CSR by modeling within-word and cross-word pronunciation variation. Speech Communication, 29:193-207, 1999. [ bib | .pdf ]
This article describes how the performance of a Dutch continuous speech recognizer was improved by modeling pronunciation variation. We propose a general procedure for modeling pronunciation variation. In short, it consists of adding pronunciation variants to the lexicon, retraining phone models and using language models to which the pronunciation variants have been added. First, within-word pronunciation variants were generated by applying a set of five optional phonological rules to the words in the baseline lexicon. Next, a limited number of cross-word processes were modeled, using two different methods. In the first approach, cross-word processes were modeled by directly adding the cross-word variants to the lexicon, and in the second approach this was done by using multi-words. Finally, the combination of the within-word method with the two cross-word methods was tested. The word error rate (WER) measured for the baseline system was 12.75%. Compared to the baseline, a small but statistically significant improvement of 0.68% in WER was measured for the within-word method, whereas both cross-word methods in isolation led to small, non-significant improvements. The combination of the within-word method and cross-word method 2 led to the best result: an absolute improvement of 1.12% in WER was found compared to the baseline, which is a relative improvement of 8.8% in WER.
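As a quick check of the arithmetic, the reported 8.8% relative improvement follows from the 1.12% absolute WER reduction measured against the 12.75% baseline, under the usual definition of relative improvement:

    # Relative WER improvement = absolute reduction / baseline WER (cf. [33]).
    baseline_wer = 12.75      # % (baseline system)
    absolute_gain = 1.12      # % absolute WER reduction (best combination)
    relative_gain = 100.0 * absolute_gain / baseline_wer
    print(f"{relative_gain:.1f}% relative")   # -> 8.8% relative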

[34] J.M. Kessens, M. Wester, and H. Strik. Modeling within-word and cross-word pronunciation variation to improve the performance of a Dutch CSR. In Proc. of ICPhS '99, pages 1665-1668, San Francisco, 1999. [ bib | .pdf ]
This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling within-word and cross-word pronunciation variation. Within-word variants were automatically generated by applying five phonological rules to the words in the lexicon. For the within-word method, a significant improvement is found compared to the baseline. Cross-word pronunciation variation was modeled using two different methods: 1) adding cross-word variants directly to the lexicon, 2) only adding multi-words and their variants to the lexicon. Overall, cross-word method 2 leads to better results than cross-word method 1. The best results were obtained when cross-word method 2 was combined with the within-word method: a relative improvement of 8.8% WER was found compared to the baseline.

[35] M. Wester and J.M. Kessens. Comparison between expert listeners and continuous speech recognizers in selecting pronunciation variants. In Proc. of ICPhS '99, pages 723-726, San Francisco, 1999. [ bib | .pdf ]
In this paper, the performance of an automatic transcription tool is evaluated. The transcription tool is a continuous speech recognizer (CSR) which can be used to select pronunciation variants (i.e. detect insertions and deletions of phones). The performance of the CSR was compared to a reference transcription based on the judgments of expert listeners. We investigated to what extent the degree of agreement between the listeners and the CSR was affected by employing various sets of phone models (PMs). Overall, the PMs perform more similarly to the listeners when pronunciation variation is modeled. However, the various sets of PMs lead to different results for insertion and deletion processes. Furthermore, we found that to a certain degree, word error rates can be used to predict which set of PMs to use in the transcription tool.

[36] M. Wester, J.M. Kessens, C. Cucchiarini, and H. Strik. Selection of pronunciation variants in spontaneous speech: Comparing the performance of man and machine. In Proc. of the ESCA Workshop on the Sound Patterns of Spontaneous Speech: Production and Perception, pages 157-160, Aix-en-Provence, 1998. [ bib | .pdf ]
In this paper, the performance of an automatic transcription tool is evaluated. The transcription tool is a continuous speech recognizer (CSR) running in forced recognition mode. For the evaluation, the performance of the CSR was compared to that of nine expert listeners. Man and machine carried out exactly the same task: deciding whether a segment was present or not in 467 cases. The performance of the CSR turned out to be comparable to that of the experts.

[37] M. Wester, J.M. Kessens, and H. Strik. Modeling pronunciation variation for a Dutch CSR: testing three methods. In Proc. ICSLP '98, pages 2535-2538, Sydney, 1998. [ bib | .pdf ]
This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling pronunciation variation. We used three methods to model pronunciation variation. First, within-word variation was dealt with. Phonological rules were applied to the words in the lexicon, thus automatically generating pronunciation variants. Secondly, cross-word pronunciation variation was modeled using two different approaches. The first approach was to model cross-word processes by adding the variants as separate words to the lexicon and in the second approach this was done by using multi-words. For each of the methods, recognition experiments were carried out. A significant improvement was found for modeling within-word variation. Furthermore, modeling cross-word processes using multi-words leads to significantly better results than modeling them using separate words in the lexicon.

[38] M. Wester, J.M. Kessens, and H. Strik. Improving the performance of a Dutch CSR by modeling pronunciation variation. In Proc. of the Workshop Modeling Pronunciation Variation for Automatic Speech Recognition, pages 145-150, Kerkrade, 1998. [ bib | .pdf ]
This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling pronunciation variation. We used three methods in order to model pronunciation variation. First, within-word variation was dealt with. Phonological rules were applied to the words in the lexicon, thus automatically generating pronunciation variants. Secondly, cross-word pronunciation variation was accounted for by adding multi-words and their variants to the lexicon. Thirdly, probabilities of pronunciation variants were incorporated in the language model (LM), and thresholds were used to choose which pronunciation variants to add to the LMs. For each of the methods, recognition experiments were carried out. A significant improvement in error rates was measured.

[39] M. Wester, J.M. Kessens, and H. Strik. Two automatic approaches for analyzing the frequency of connected speech processes in Dutch. In Proc. ICSLP Student Day '98, pages 3351-3356, Sydney, 1998. [ bib | .pdf ]
This paper describes two automatic approaches used to study connected speech processes (CSPs) in Dutch. The first approach was from a linguistic point of view - the top-down method. This method can be used for verification of hypotheses about CSPs. The second approach - the bottom-up method - uses a constrained phone recognizer to generate phone transcriptions. An alignment was carried out between the two transcriptions and a reference transcription. A comparison between the two methods showed that 68% agreement was achieved on the CSPs. Although phone accuracy is only 63%, the bottom-up approach is useful for studying CSPs. From the data generated using the bottom-up method, indications of which CSPs are present in the material can be found. These indications can be used to generate hypotheses which can then be tested using the top-down method.

[40] M. Wester. Automatic classification of voice quality: Comparing regression models and hidden Markov models. In Proc. of VOICEDATA98, Symposium on Databases in Voice Quality Research and Education, pages 92-97, Utrecht, 1998. [ bib | .pdf ]
In this paper, two methods for automatically classifying voice quality are compared: regression analysis and hidden Markov models (HMMs). The findings of this research show that HMMs can be used to classify voice quality. The HMMs performed better than the regression models in classifying breathiness and overall degree of deviance, and the two methods showed similar results on the roughness scale. However, the results are not spectacular. This is mainly due to the type of material that was available and the number of listeners who assessed the material. Nonetheless, I argue in this paper that these findings are interesting because they are a promising step towards developing a system for classifying voice quality.

[41] J.M. Kessens, M. Wester, C. Cucchiarini, and H. Strik. The selection of pronunciation variants: Comparing the performance of man and machine. In Proc. of ICSLP '98, pages 2715-2718, Sydney, 1998. [ bib | .pdf ]
In this paper the performance of an automatic transcription tool is evaluated. The transcription tool is a Continuous Speech Recognizer (CSR) running in forced recognition mode. For evaluation the performance of the CSR was compared to that of nine expert listeners. Both man and the machine carried out exactly the same task: deciding whether a segment was present or not in 467 cases. It turned out that the performance of the CSR is comparable to that of the experts.

[42] J.M. Kessens and M. Wester. Improving recognition performance by modelling pronunciation variation. In Proc. of the CLS opening Academic Year '97 '98, pages 1-20, Nijmegen, 1997. [ bib | .pdf ]
This paper describes a method for improving the performance of a continuous speech recognizer by modelling pronunciation variation. Although the improvements obtained with this method are small, they are in line with those reported by other authors. A series of experiments was carried out to model pronunciation variation. In the first set of experiments word internal pronunciation variation was modelled by applying a set of four phonological rules to the words in the lexicon. In the second set of experiments, variation across word boundaries was also modelled. The results obtained with both methods are presented in detail. Furthermore, statistics are given on the application of the four phonological rules on the training database. We will explain why the improvements obtained with this method are small and how we intend to increase the improvements in our future research.

[43] M. Wester, J.M. Kessens, C. Cucchiarini, and H. Strik. Modelling pronunciation variation: some preliminary results. In Proc. of the Dept. of Language & Speech, pages 127-137, Nijmegen, 1997. [ bib | .pdf ]
In this paper we describe a method for improving the performance of a continuous speech recognizer by modelling pronunciation variation. Although the results obtained with this method are in line with those reported by other authors, the magnitude of the improvements is very small. In looking for possible explanations for these results, we computed various sorts of statistics about the material. Since these data proved to be very useful in understanding the effects of our method, they are discussed in this paper. Moreover, on the basis of these statistics we discuss how the system can be improved in the future.

[44] J.M. Kessens, M. Wester, C. Cucchiarini, and H. Strik. Testing a method for modelling pronunciation variation. In Proceedings of the COST workshop, pages 37-40, Rhodos, 1997. [ bib | .pdf ]
In this paper we describe a method for improving the performance of a continuous speech recognizer by modelling pronunciation variation. Although the results obtained with this method are in line with those reported by other authors, the magnitude of the improvements is very small. In looking for possible explanations for these results, we computed various sorts of statistics about the material. Since these data proved to be very useful in understanding the effects of our method, they are discussed in this paper. Moreover, on the basis of these statistics we discuss how the system can be improved in the future.