Hiroshi Shimodaira, Jun Rokui, and Mitsuru Nakai. Improving The Generalization Performance Of The MCE/GPD Learning. In ICSLP'98, Australia, December 1998. [ bib | .pdf ]

A novel method to prevent the over-fitting effect and improve the generalization performance of the Minimum Classification Error (MCE) / Generalized Probabilistic Descent (GPD) learning is proposed. The MCE/GPD method, which is one of the newest discriminative-learning approaches proposed by Katagiri and Juang in 1992, results in better recognition performance in various areas of pattern recognition than the maximum-likelihood (ML) based approach where a posteriori probabilities are estimated. Despite its superiority in recognition performance, it still suffers from the problem of over-fitting to the training samples, as do other learning algorithms. In the present study, a regularization technique is applied to the MCE method to overcome this problem. Feed-forward neural networks are employed as a recognition platform to evaluate the recognition performance of the proposed method. Recognition experiments are conducted on several kinds of datasets. The proposed method shows better generalization performance than the original one.

Simon King, Todd Stephenson, Stephen Isard, Paul Taylor, and Alex Strachan. Speech recognition via phonetically featured syllables. In Proc. ICSLP '98, pages 1031-1034, Sydney, Australia, December 1998. [ bib | .ps | .pdf ]

We describe a speech recogniser which uses a speech production-motivated phonetic-feature description of speech. We argue that this is a natural way to describe the speech signal and offers an efficient intermediate parameterisation for use in speech recognition. We also propose to model this description at the syllable rather than phone level. The ultimate goal of this work is to generate syllable models whose parameters explicitly describe the trajectories of the phonetic features of the syllable. We hope to move away from Hidden Markov Models (HMMs) of context-dependent phone units. As a step towards this, we present a preliminary system which consists of two parts: recognition of the phonetic features from the speech signal using a neural network; and decoding of the feature-based description into phonemes using HMMs.

Mitsuru Nakai and Hiroshi Shimodaira. The Use of F0 Reliability Function for Prosodic Command Analysis on F0 Contour Generation Model. In Proc. ICSLP'98, December 1998. [ bib | .pdf ]

Sue Fitt and Steve Isard. Representing the environments for phonological processes in an accent-independent lexicon for synthesis of English. In Proc. ICSLP 1998, volume 3, pages 847-850, Sydney, Australia, December 1998. [ bib | .ps | .pdf ]

This paper reports on work developing an accent-independent lexicon for use in synthesising speech in English. Lexica which use phonemic transcriptions are only suitable for one accent, and developing a lexicon for a new accent is a long and laborious process. Potential solutions to this problem include the use of conversion rules to generate lexica of regional pronunciations from standard accents and encoding of regional variation by means of keywords. The latter proposal forms the basis of the current work. However, even if we use a keyword system for lexical transcription there are a number of remaining theoretical and methodological problems if we are to synthesise and recognise accents to a high degree of accuracy; these problems are discussed in the following paper.

Kanad Keeni, Kenji Nakayama, and Hiroshi Shimodaira. Automatic Generation of Initial Weights and Target Outputs of Multi-layer Neural Networks and its Application to Pattern Classification. In International Conference on Neural Information Processing (ICONIP'98), pages 1622-1625, October 1998. [ bib ]

Jun Rokui and Hiroshi Shimodaira. Modified Minimum Classification Error Learning and Its Application to Neural Networks. In ICONIP'98, Kitakyushu, Japan, October 1998. [ bib ]

Eiji Iida, Hiroshi Shimodaira, Susumu Kunifuji, and Masayuki Kimura. A system to Perform Human Problem Solving. In The 5th International Conference on Soft Computing and Information / Intelligent Systems (IIZUKA'98), October 1998. [ bib ]

Kanad Keeni, Kenji Nakayama, and Hiroshi Shimodaira. Automatic Generation of Initial Weights and Estimation of Hidden Units for Pattern Classification Using Neural Networks. In 14th International Conference on Pattern Recognition (ICPR'98), pages 1568-1571, August 1998. [ bib ]

Eiji Iida, Susumu Kunifuji, Hiroshi Shimodaira, and Masayuki Kimura. A Scale-Down Solution of N^2-1 Puzzle. Trans. IEICE(D-I), J81-D-I(6):604-614, June 1998. (in Japanese). [ bib ]

Kanad Keeni, Hiroshi Shimodaira, Kenji Nakayama, and Kazunori Kotani. On Parameter Initialization of Multi-layer Feed-forward Neural Networks for Pattern Recognition. In International Conference on Computational Linguistics, Speech and Document Processing (ICCLSDP-'98), Calcutta, India, pages D8-12, February 1998. [ bib ]

Janet Hitzeman and Massimo Poesio. Long distance pronominalization and global focus. In COLING-ACL '98, volume 1, pages 550-556, Montreal, Quebec, Canada, 1998. [ bib | .ps | .pdf ]

D. Abberley, S. Renals, and G. Cook. Retrieval of broadcast news documents with the THISL system. In Proc. IEEE ICASSP, pages 3781-3784, Seattle, 1998. [ bib | .ps.gz | .pdf ]

This paper describes a spoken document retrieval system, combining the Abbot large vocabulary continuous speech recognition (LVCSR) system developed by Cambridge University, Sheffield University and SoftSound, and the PRISE information retrieval engine developed by NIST. The system was constructed to enable us to participate in the TREC 6 Spoken Document Retrieval experimental evaluation. Our key aims in this work were to produce a complete system for the SDR task, to investigate the effect of a word error rate of 30-50% on retrieval performance and to investigate the integration of LVCSR and word spotting in a retrieval task.

Paul A. Taylor, S. King, S. D. Isard, and H. Wright. Intonation and dialogue context as constraints for speech recognition. Language and Speech, 41(3):493-512, 1998. [ bib | .ps | .pdf ]

S. Renals and D. Abberley. The THISL spoken document retrieval system. In Proc. 14th Twente Workshop on Language Technology, pages 129-140, 1998. [ bib | .ps.gz | .pdf ]

THISL is an ESPRIT Long Term Research Project focused on the development and construction of a system to retrieve items from an archive of television and radio news broadcasts. In this paper we outline our spoken document retrieval system based on the Abbot speech recognizer and a text retrieval system based on Okapi term-weighting. The system has been evaluated as part of the TREC-6 and TREC-7 spoken document retrieval evaluations and we report on the results of the TREC-7 evaluation based on a document collection of 100 hours of North American broadcast news.

M. Carreira-Perpiñán and S. Renals. Experimental evaluation of latent variable models for dimensionality reduction. In IEEE Proc. Neural Networks for Signal Processing, volume 8, pages 165-173, Cambridge, 1998. [ bib | .ps.gz | .pdf ]

We use electropalatographic (EPG) data as a test bed for dimensionality reduction methods based on latent variable modelling, in which an underlying lower dimension representation is inferred directly from the data. Several models (and mixtures of them) are investigated, including factor analysis and the generative topographic mapping (GTM). Experiments indicate that nonlinear latent variable modelling reveals a low-dimensional structure in the data inaccessible to the investigated linear models.

C. Mayo. The developmental relationship between perceptual weighting and phonemic awareness. In LabPhon 6, University of York, UK, 1998. [ bib ]

M. Wester, J.M. Kessens, C. Cucchiarini, and H. Strik. Selection of pronunciation variants in spontaneous speech: Comparing the performance of man and machine. In Proc. ESCA Workshop on the Sound Patterns of Spontaneous Speech: Production and Perception, pages 157-160, Aix-en-Provence, 1998. [ bib | .pdf ]

Tae-Yeoub Jang, Minsuck Song, and Kiyeong Lee. Disambiguation of Korean utterances using automatic intonation recognition. In Proceedings of ICSLP98, volume 3, pages 603-606, Sydney, Australia, 1998. [ bib | .ps | .pdf ]

Helen Wright. Automatic utterance type detection using suprasegmental features. In ICSLP'98, volume 4, page 1403, Sydney, Australia, 1998. [ bib | .ps | .pdf ]

Richard Sproat, Andrew Hunt, Mari Ostendorf, Paul Taylor, Alan Black, and Kevin Lenzo. Sable: a standard for TTS markup. In Third ESCA workshop on speech synthesis, pages 27-30, Jenolan Caves, Blue Mountains, Australia, 1998. [ bib | .ps | .pdf ]

Ann Syrdal, Gregor Moehler, Kurt Dusterhoff, Alistair Conkie, and Alan W Black. Three methods of intonation modeling. In 3rd ESCA Workshop on Speech Synthesis, pages 305-310, Jenolan Caves, 1998. [ bib | .ps | .pdf ]

Michael O'Donnell, Alistair Knott, Janet Hitzeman, and Hua Cheng. Integrating referring and informing in NP planning. In Coling-ACL Workshop on the Computational Treatment of Nominals, Montreal, Quebec, Canada, 1998. [ bib | .ps | .pdf ]

Paul Taylor and Alan Black. Assigning phrase breaks from part of speech sequences. Computer Speech and Language, 12:99-117, 1998. [ bib | .ps | .pdf ]

Paul A Taylor. The Tilt intonation model. In ICSLP98, Sydney, 1998. [ bib | .ps | .pdf ]

J. Barker, G. Williams, and S. Renals. Acoustic confidence measures for segmenting broadcast news. In Proc. ICSLP, pages 2719-2722, Sydney, 1998. [ bib | .ps.gz | .pdf ]

In this paper we define an acoustic confidence measure based on the estimates of local posterior probabilities produced by a HMM/ANN large vocabulary continuous speech recognition system. We use this measure to segment continuous audio into regions where it is and is not appropriate to expend recognition effort. The segmentation is computationally inexpensive and provides reductions in both overall word error rate and decoding time. The technique is evaluated using material from the Broadcast News corpus.

Paul A Taylor, Alan Black, and Richard Caley. The architecture of the festival speech synthesis system. In The Third ESCA Workshop in Speech Synthesis, pages 147-151, Jenolan Caves, Australia, 1998. [ bib | .ps | .pdf ]

K. Dusterhoff. An investigation into the effectiveness of sub-syllable acoustics in automatic intonation analysis. In Proceedings of University of Edinburgh Linguistics/Applied Linguistics Postgraduate Conference, 1998. [ bib | .ps | .pdf ]

D. Abberley, S. Renals, G. Cook, and T. Robinson. The 1997 THISL spoken document retrieval system. In Proc. Sixth Text Retrieval Conference (TREC-6), pages 747-752, 1998. [ bib | .ps.gz | .pdf ]

The THISL spoken document retrieval system is based on the Abbot Large Vocabulary Continuous Speech Recognition (LVCSR) system developed by Cambridge University, Sheffield University and SoftSound, and uses PRISE (NIST) for indexing and retrieval. We participated in full SDR mode. Our approach was to transcribe the spoken documents at the word level using Abbot, indexing the resulting text transcriptions using PRISE. The LVCSR system uses a recurrent network-based acoustic model (with no adaptation to different conditions) trained on the 50 hour Broadcast News training set, a 65,000 word vocabulary and a trigram language model derived from Broadcast News text. Words in queries which were out-of-vocabulary (OOV) were word spotted at query time (utilizing the posterior phone probabilities output by the acoustic model), added to the transcriptions of the relevant documents and the collection was then re-indexed. We generated pronunciations at run-time for OOV words using the Festival TTS system (University of Edinburgh).

M. Lincoln, S.J. Cox, and S. Ringland. A comparison of two unsupervised approaches to accent identification. In Int. Conf. on Spoken Language Processing, pages 109-112, Sydney, 1998. [ bib | .pdf ]

The ability to automatically identify a speaker's accent would be very useful for a speech recognition system as it would enable the system to use both a pronunciation dictionary and speech models specific to the accent, techniques which have been shown to improve accuracy. Here, we describe some experiments in unsupervised accent classification. Two techniques have been investigated to classify British- and American-accented speech: an acoustic approach, in which we analyse the pattern of usage of the distributions in the recogniser by a speaker to decide on his most probable accent, and a high-level approach in which we use a phonotactic model for classification of the accent. Results show that both techniques give excellent performance on this task which is maintained when testing is done on data from an independent dataset.

Simon King. Using Information Above the Word Level for Automatic Speech Recognition. PhD thesis, University of Edinburgh, 1998. [ bib | .ps | .pdf ]

This thesis introduces a general method for using information at the utterance level and across utterances for automatic speech recognition. The method involves classification of utterances into types. Using constraints at the utterance level via this classification method allows information sources to be exploited which cannot necessarily be used directly for word recognition. The classification power of three sources of information is investigated: the language model in the speech recogniser, dialogue context and intonation. The method is applied to a challenging task: the recognition of spontaneous dialogue speech. The results show success in automatic utterance type classification, and subsequent word error rate reduction over a baseline system, when all three information sources are probabilistically combined.

G. Williams and S. Renals. Confidence measures derived from an acceptor HMM. In Proc. ICSLP, pages 831-834, Sydney, 1998. [ bib | .ps.gz | .pdf ]

In this paper we define a number of confidence measures derived from an acceptor HMM and evaluate their performance for the task of utterance verification using the North American Business News (NAB) and Broadcast News (BN) corpora. Results are presented for decodings made at both the word and phone level which show the relative profitability of rejection provided by the diverse set of confidence measures. The results indicate that language model dependent confidence measures have reduced performance on BN data relative to that for the more grammatically constrained NAB data. An explanation linking the observations that rejection is more profitable for noisy acoustics, for a reduced vocabulary and at the phone level is also given.

M. Wester, J.M. Kessens, and H. Strik. Modeling pronunciation variation for a Dutch CSR: testing three methods. In Proc. ICSLP '98, pages 2535-2538, Sydney, 1998. [ bib | .pdf ]

This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling pronunciation variation. We used three methods to model pronunciation variation. First, within-word variation was dealt with. Phonological rules were applied to the words in the lexicon, thus automatically generating pronunciation variants. Secondly, cross-word pronunciation variation was modeled using two different approaches. The first approach was to model cross-word processes by adding the variants as separate words to the lexicon and in the second approach this was done by using multi-words. For each of the methods, recognition experiments were carried out. A significant improvement was found for modeling within-word variation. Furthermore, modeling cross-word processes using multi-words leads to significantly better results than modeling them using separate words in the lexicon.

M. Wester, J.M. Kessens, and H. Strik. Improving the performance of a Dutch CSR by modeling pronunciation variation. In Proc. Workshop Modeling Pronunciation Variation for Automatic Speech Recognition, pages 145-150, Kerkrade, 1998. [ bib | .pdf ]

This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling pronunciation variation. We used three methods in order to model pronunciation variation. First, within-word variation was dealt with. Phonological rules were applied to the words in the lexicon, thus automatically generating pronunciation variants. Secondly, cross-word pronunciation variation was accounted for by adding multi-words and their variants to the lexicon. Thirdly, probabilities of pronunciation variants were incorporated in the language model (LM), and thresholds were used to choose which pronunciation variants to add to the LMs. For each of the methods, recognition experiments were carried out. A significant improvement in error rates was measured.

Sue Fitt. Processing unfamiliar words - a study in the perception and production of native and foreign placenames. PhD thesis, The Centre for Speech Technology Research, Edinburgh University, 1998. [ bib | .ps | .pdf ]

This thesis sets out to examine some of the linguistic processes which take place when speakers are faced with unfamiliar and potentially foreign place names, and the possible psycholinguistic origins of these processes. It is concluded that lexical networks are used to map from input to output, and that phonological rule-based models do not fully account for the data. Previous studies of nativisation have tended to catalogue the phonological and spelling changes which have taken place in historical examples, and explanations have generally been limited to comparison of details of the borrowed and borrowing languages, rather than being set in a solid linguistic framework describing the ways in which speakers and readers process words. There have been psycholinguistic studies of unfamiliar words, but these have generally ignored the foreign dimension, and have been limited in scope. Traditional linguistic work, meanwhile, focuses on descriptions, either abstract or more related to mental processes, of the language that we know and use every day. Studies of foreign language learning also have a rather different focus from the current work, as they examine what happens when we attempt, over a period of time, to acquire new sounds, vocabulary and grammar. This study takes an experimental approach to nativisation, presenting Edinburgh secondary school pupils with a series of unfamiliar spoken and written European town names, and asking them to reproduce the names either in writing or speech, along with a judgement of origin. The resulting pronunciations and spellings are examined for accuracy, errors and changes, both in perception and production. Different explanations of the output are considered, and it is concluded that models which apply a set of linguistic rules to the input in order to generate an output cannot account for the variety of data produced. Lexicon-based models, on the other hand, using activation of known words or word-sets, and analogy with word-parts, are more able to explain both the details of individual responses and the variety of responses across subjects.

M. Wester, J.M. Kessens, and H. Strik. Two automatic approaches for analyzing the frequency of connected speech processes in Dutch. In Proc. ICSLP Student Day '98, pages 3351-3356, Sydney, 1998. [ bib | .pdf ]

This paper describes two automatic approaches used to study connected speech processes (CSPs) in Dutch. The first approach was from a linguistic point of view - the top-down method. This method can be used for verification of hypotheses about CSPs. The second approach - the bottom-up method - uses a constrained phone recognizer to generate phone transcriptions. An alignment was carried out between the two transcriptions and a reference transcription. A comparison between the two methods showed that 68% agreement was achieved on the CSPs. Although phone accuracy is only 63%, the bottom-up approach is useful for studying CSPs. From the data generated using the bottom-up method, indications of which CSPs are present in the material can be found. These indications can be used to generate hypotheses which can then be tested using the top-down method.

Hiroshi Shimodaira, Jun Rokui, and Mitsuru Nakai. Modified Minimum Classification Error Learning and Its Application to Neural Networks. In 2nd International Workshop on Statistical Techniques in Pattern Recognition (SPR'98), Sydney, Australia, 1998. [ bib | .pdf ]

A novel method to improve the generalization performance of the Minimum Classification Error (MCE) / Generalized Probabilistic Descent (GPD) learning is proposed. The MCE/GPD learning proposed by Juang and Katagiri in 1992 results in better recognition performance than the maximum-likelihood (ML) based learning in various areas of pattern recognition. Despite its superiority in recognition performance, it still suffers, like other learning algorithms, from the problem of “over-fitting” to the training samples. In the present study, a regularization technique is applied to the MCE learning to overcome this problem. Feed-forward neural networks are employed as a recognition platform to evaluate the recognition performance of the proposed method. Recognition experiments are conducted on several kinds of data sets.

Janet Hitzeman, Alan W. Black, Paul Taylor, Chris Mellish, and Jon Oberlander. On the use of automatically generated discourse-level information in a concept-to-speech synthesis system. In ICSLP98, volume 6, pages 2763-2768, Sydney, Australia, 1998. [ bib | .ps | .pdf ]

Briony Williams. Levels of annotation for a Welsh speech database for phonetic research. In Workshop on Language Resources for European Minority Languages, Granada, Spain, May 27 1998. [ bib | .ps | .pdf ]

Laurence Molloy and Stephen Isard. Suprasegmental duration modelling with elastic constraints in automatic speech recognition. In ICSLP, volume 7, pages 2975-2978, Sydney, Australia, 1998. [ bib | .ps | .pdf ]

V. Strom. Automatische Erkennung von Satzmodus, Akzentuierung und Phrasengrenzen. PhD thesis, University of Bonn, 1998. [ bib | .ps | .pdf ]

Vincent Pagel, Kevin Lenzo, and Alan W Black. Letter to sound rules for accented lexicon compression. In ICSLP98, volume 5, pages 2015-2020, 1998. [ bib | .ps | .pdf ]

Yoshinori Shiga, Hiroshi Matsuura, and Tsuneo Nitta. Segmental duration control based on an articulatory model. In Proc. ICSLP, volume 5, pages 2035-2038, 1998. [ bib | .ps | .pdf ]

This paper proposes a new method that determines segmental duration for text-to-speech conversion based on the movement of articulatory organs which compose an articulatory model. The articulatory model comprises four time-variable articulatory parameters representing the conditions of articulatory organs whose physical restriction seems to significantly influence the segmental duration. The parameters are controlled according to an input sequence of phonetic symbols, following which segmental duration is determined based on the variation of the articulatory parameters. The proposed method is evaluated through an experiment using a Japanese speech database that consists of 150 phonetically balanced sentences. The results indicate that the mean square error of predicted segmental duration is approximately 15 ms for the closed set and 15-17 ms for the open set. The error is within 20 ms, the level of acceptability for distortion of segmental duration without loss of naturalness, and hence the method is proved to effectively predict segmental duration.

M. Carreira-Perpiñán and S. Renals. Dimensionality reduction of electropalatographic data using latent variable models. Speech Communication, 26:259-282, 1998. [ bib | .ps.gz | .pdf ]

We consider the problem of obtaining a reduced dimension representation of electropalatographic (EPG) data. An unsupervised learning approach based on latent variable modelling is adopted, in which an underlying lower dimension representation is inferred directly from the data. Several latent variable models are investigated, including factor analysis and the generative topographic mapping (GTM). Experiments were carried out using a subset of the EUR-ACCOR database, and the results indicate that these automatic methods capture important, adaptive structure in the EPG data. Nonlinear latent variable modelling clearly outperforms the investigated linear models in terms of log-likelihood and reconstruction error and suggests a substantially smaller intrinsic dimensionality for the EPG data than that claimed by previous studies. A two-dimensional representation is produced with applications to speech therapy, language learning and articulatory dynamics.

Andreas Stolcke, E. Shriberg, R. Bates, P. Taylor, K. Ries, D. Jurafsky, N. Coccaro, R. Martin, M. Meteer, and C. Van Ess-Dykema. Dialog act modelling for conversational speech. In AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, 1998. [ bib | .ps | .pdf ]

G. Williams and S. Renals. Confidence measures for evaluating pronunciation models. In ESCA Workshop on Modeling pronunciation variation for automatic speech recognition, pages 151-155, Kerkrade, Netherlands, 1998. [ bib | .ps.gz | .pdf ]

In this paper, we investigate the use of confidence measures for the evaluation of pronunciation models and the employment of these evaluations in an automatic baseform learning process. The confidence measures and pronunciation models are obtained from the Abbot hybrid Hidden Markov Model/Artificial Neural Network Large Vocabulary Continuous Speech Recognition system. Experiments were carried out for a number of baseform learning schemes using the ARPA North American Business News and the Broadcast News corpora from which it was found that a confidence measure based scheme provided the largest reduction in Word Error Rate.

C. Mayo. A longitudinal study of perceptual weighting and phonemic awareness. In Chicago Linguistics Society 34, 1998. [ bib ]

M. Wester. Automatic classification of voice quality: Comparing regression models and hidden Markov models. In Proc. VOICEDATA98, Symposium on Databases in Voice Quality Research and Education, pages 92-97, Utrecht, 1998. [ bib | .pdf ]

In this paper, two methods for automatically classifying voice quality are compared: regression analysis and hidden Markov models (HMMs). The findings of this research show that HMMs can be used to classify voice quality. The HMMs performed better than the regression models in classifying breathiness and overall degree of deviance, and the two methods showed similar results on the roughness scale. However, the results are not spectacular. This is mainly due to the type of material that was available and the number of listeners who assessed the material. Nonetheless, I argue in this paper that these findings are interesting because they are a promising step towards developing a system for classifying voice quality.

Alan W Black, Kevin Lenzo, and Vincent Pagel. Issues in building general letter to sound rules. In The Third ESCA Workshop in Speech Synthesis, pages 77-80, 1998. [ bib | .ps | .pdf ]

Elizabeth Shriberg, R. Bates, P. Taylor, A. Stolcke, K. Ries, D. Jurafsky, N. Coccaro, R. Martin, M. Meteer, and C. Van Ess-Dykema. Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech, 41(3-4), 1998. [ bib | .ps | .pdf ]

Briony Williams. The phonetic manifestation of stress in Welsh. 1998. [ bib | .ps | .pdf ]

J.M. Kessens, M. Wester, C. Cucchiarini, and H. Strik. The selection of pronunciation variants: Comparing the performance of man and machine. In Proc. ICSLP '98, pages 2715-2718, Sydney, 1998. [ bib | .pdf ]

In this paper the performance of an automatic transcription tool is evaluated. The transcription tool is a Continuous Speech Recognizer (CSR) running in forced recognition mode. For evaluation the performance of the CSR was compared to that of nine expert listeners. Both man and the machine carried out exactly the same task: deciding whether a segment was present or not in 467 cases. It turned out that the performance of the CSR is comparable to that of the experts.

Richard Sproat, Andrew Hunt, Mari Ostendorf, Paul Taylor, Alan Black, and Kevin Lenzo. Sable: a standard for TTS markup. In ICSLP98, volume 5, pages 1719-1724, Sydney, Australia, 1998. [ bib | .ps | .pdf ]