Mitsuru Nakai, Harald Singer, Yoshinori Sagisaka, and Hiroshi Shimodaira. Accent Phrase Segmentation Based on F0 Templates Using a Superpositional Prosodic Model. Trans. IEICE (D-II), J80-D-II(10):2605-2614, October 1997. (in Japanese). [ bib ]

Sue Fitt. The generation of regional pronunciations of English for speech synthesis. In Proc. Eurospeech 1997, Rhodes, Greece, September 1997. [ bib | .ps | .pdf ]

Most speech synthesisers and recognisers for English currently use pronunciation lexicons in standard British or American accents, but as use of speech technology grows there will be more demand for the incorporation of regional accents. This paper describes the use of rules to transform existing lexicons of standard British and American pronunciations to a set of regional British and American accents. The paper briefly discusses some features of the regional accents in the project, and the framework used for generating pronunciations. Certain theoretical and practical problems are highlighted; for some of these, solutions are suggested, but it is shown that some difficulties cannot be resolved by automatic rules. However, although the method described cannot produce phonetic transcriptions with 100% accuracy, it is more accurate than using letter-to-sound rules, and faster than producing transcriptions by hand.

Simon King, Thomas Portele, and Florian Höfer. Speech synthesis using non-uniform units in the Verbmobil project. In Proc. Eurospeech 97, volume 2, pages 569-572, Rhodes, Greece, September 1997. [ bib | .ps | .pdf ]

We describe a concatenative speech synthesiser for British English which uses the HADIFIX inventory structure originally developed for German by Portele. An inventory of non-uniform units was investigated with the aim of improving segmental quality compared to diphones. A combination of soft (diphone) and hard concatenation was used, which allowed a dramatic reduction in inventory size. We also present a unit selection algorithm which selects an optimum sequence of units from this inventory for a given phoneme sequence. The work described is part of the concept-to-speech synthesiser for the language and speech project Verbmobil which is funded by the German Ministry of Science (BMBF).

Hiroshi Shimodaira, Mitsuru Nakai, and Akihiro Kumata. Restoration of Pitch Pattern of Speech Based on a Pitch Generation Model. In Proc. EuroSpeech'97, pages 512-524, September 1997. [ bib | .pdf ]

In this paper a model-based approach for restoring a continuous fundamental frequency (F0) contour from the noisy output of an F0 extractor is investigated. In contrast to the conventional pitch trackers based on numerical curve-fitting, the proposed method estimates the continuous F0 pattern with a quantitative pitch generation model of the kind often used for synthesizing F0 contours from prosodic event commands. An inverse filtering technique is introduced for obtaining the initial candidates of the prosodic commands. In order to find the optimal command sequence from these candidates efficiently, a beam-search algorithm and an N-best technique are employed. Preliminary experiments for a male speaker of the ATR B-set database showed promising results both in quality of the restored pattern and estimation of the prosodic events.

Mitsuru Nakai and Hiroshi Shimodaira. On Representation of Fundamental Frequency of Speech for Prosody Analysis Using Reliability Function. In Proc. EuroSpeech'97, pages 243-246, September 1997. [ bib | .pdf ]

K. Richmond. A proposal for the compartmental modelling of stellate cells in the anteroventral cochlear nucleus, using realistic auditory nerve inputs. Master's thesis, Centre for Cognitive Science, University of Edinburgh, September 1997. [ bib ]

K. Richmond, A. Smith, and E. Amitay. Detecting subject boundaries within text: A language-independent statistical approach. In Proc. The Second Conference on Empirical Methods in Natural Language Processing, pages 47-54, Brown University, Providence, USA, August 1997. [ bib | .ps | .pdf ]

We describe here an algorithm for detecting subject boundaries within text based on a statistical lexical similarity measure. Hearst has already tackled this problem with good results (Hearst, 1994). One of her main assumptions is that a change in subject is accompanied by a change in vocabulary. Using this assumption, but by introducing a new measure of word significance, we have been able to build a robust and reliable algorithm which exhibits improved accuracy without sacrificing language independency.

Kanad Keeni, Hiroshi Shimodaira, and Kenji Nakayama. On Distributed Representation of Output Layer for Recognizing Japanese Kana Characters Using Neural Networks. In Proceedings of the 4th International Conference on Document Analysis and Recognition, ICDAR'97, pages 600-603, July 1997. Ulm, Germany. [ bib ]

Tu Bao Ho, Nguyen Trong Dung, Hiroshi Shimodaira, and Masayuki Kimura. An Interactive-Graphic Environment for Discovering and Using Conceptual Knowledge. In 7th European-Japanese Conference on Information Modelling and Knowledge Bases, pages 327-343, May 1997. [ bib ]

Kanad Keeni and Hiroshi Shimodaira. On Representation of Output Layer for Recognizing Japanese Kana Characters Using Neural Networks. In Proc. the 17th International Conference on Computer Processing of Oriental Languages, pages 305-308, April 1997. Baptist University, Kowloon Tong, Hong Kong. [ bib ]

Briony J. Williams and Stephen Isard. A keyvowel approach to the synthesis of regional accents of English. In Eurospeech 97, Rhodes, Greece, 1997. [ bib | .ps | .pdf ]

Robert A. J. Clark. Language acquisition and implication for language change: A computational model. In Proceedings of the GALA 97 Conference on Language Acquisition, pages 322-326, 1997. [ bib | .ps | .pdf ]

J.M. Kessens and M. Wester. Improving recognition performance by modelling pronunciation variation. In Proc. CLS opening Academic Year '97 '98, pages 1-20, Nijmegen, 1997. [ bib | .pdf ]

This paper describes a method for improving the performance of a continuous speech recognizer by modelling pronunciation variation. Although the improvements obtained with this method are small, they are in line with those reported by other authors. A series of experiments was carried out to model pronunciation variation. In the first set of experiments word internal pronunciation variation was modelled by applying a set of four phonological rules to the words in the lexicon. In the second set of experiments, variation across word boundaries was also modelled. The results obtained with both methods are presented in detail. Furthermore, statistics are given on the application of the four phonological rules on the training database. We will explain why the improvements obtained with this method are small and how we intend to increase the improvements in our future research.

Janet Hitzeman. Semantic partition and the ambiguity of temporal adverbials. Journal of Natural Language Semantics, 5:87-100, 1997. [ bib | .ps | .pdf ]

J. Hennebert, C. Ris, H. Bourlard, S. Renals, and N. Morgan. Estimation of global posteriors and forward-backward training of hybrid HMM/ANN systems. In Proc. Eurospeech, pages 1951-1954, Rhodes, 1997. [ bib | .ps.gz | .pdf ]

The results of our research presented in this paper are two-fold. First, an estimation of global posteriors is formalized in the framework of hybrid HMM/ANN systems. It is shown that hybrid HMM/ANN systems, in which the ANN part estimates local posteriors, can be used to model global posteriors. This formalization provides us with a clear theory in which both REMAP and “classical” Viterbi trained hybrid systems are unified. Second, a new forward-backward training of hybrid HMM/ANN systems is derived from the previous formulation. Comparisons of performance between Viterbi and forward-backward hybrid systems are presented and discussed.

M. Huckvale, C. Benoit, C. Bowerman, A. Eriksson, M. Rosner, M. Tatham, and Briony J. Williams. Opportunities for computer-aided instruction in phonetics and speech communication provided by the internet. In Eurospeech 97, Rhodes, Greece, 1997. [ bib | .ps | .pdf ]

Jean Carletta, Amy Isard, Stephen Isard, Jacqueline C. Kowtko, Gwyneth Doherty-Sneddon, and Anne H. Anderson. The reliability of a dialogue structure coding scheme. Computational Linguistics, 23(1):13-31, 1997. [ bib | .ps | .pdf ]

M. Wester, J.M. Kessens, C. Cucchiarini, and H. Strik. Modelling pronunciation variation: some preliminary results. In Proc. Dept. of Language & Speech, pages 127-137, Nijmegen, 1997. [ bib | .pdf ]

In this paper we describe a method for improving the performance of a continuous speech recognizer by modelling pronunciation variation. Although the results obtained with this method are in line with those reported by other authors, the magnitude of the improvements is very small. In looking for possible explanations for these results, we computed various sorts of statistics about the material. Since these data proved to be very useful in understanding the effects of our method, they are discussed in this paper. Moreover, on the basis of these statistics we discuss how the system can be improved in the future.

Janet Hitzeman, Chris Mellish, and Jon Oberlander. Generation of museum web pages: The intelligent labelling explorer. Archives and Museum Informatics, 11:107-115, 1997. [ bib | .ps | .pdf ]

Alan W. Black and Paul A. Taylor. Assigning phrase breaks from part-of-speech sequences. In Eurospeech97, volume 2, pages 995-998, Rhodes, Greece, 1997. [ bib | .ps | .pdf ]

G. Williams and S. Renals. Confidence measures for hybrid HMM/ANN speech recognition. In Proc. Eurospeech, pages 1955-1958, Rhodes, 1997. [ bib | .ps.gz | .pdf ]

In this paper we introduce four acoustic confidence measures which are derived from the output of a hybrid HMM/ANN large vocabulary continuous speech recognition system. These confidence measures, based on local posterior probability estimates computed by an ANN, are evaluated at both phone and word levels, using the North American Business News corpus.

Simon King. Final report for Verbmobil Teilprojekt 4.4. Technical Report ISSN 1434-8845, IKP, Universitaet Bonn, January 1997. Verbmobil-Report 195 available at http://verbmobil.dfki.de. [ bib ]

Final report for Verbmobil English speech synthesis

M. Lincoln, S.J. Cox, and S. Ringland. A fast method of speaker normalisation using formant estimation. In 5th European Conference on Speech Communication and Technology, pages 2095-2098, Rhodes, 1997. [ bib | .pdf ]

It has recently been shown that normalisation of vocal tract length can significantly increase recognition accuracy in speaker independent automatic speech recognition systems. An inherent difficulty with this technique is in automatically estimating the normalisation parameter from a new speaker's speech, and previous techniques have typically relied on an exhaustive search to estimate this parameter. In this paper, we present a method of normalising utterances by a linear warping of the mel filter bank channels in which the normalisation parameter is estimated by fitting formant estimates to a probabilistic model. This method is fast, computationally inexpensive and requires only a limited amount of data for estimation. It generates normalisations which are close to those which would be found by an exhaustive search. The normalisation is applied to a phoneme recognition task using the TIMIT database and results show a useful improvement over an un-normalised speaker independent system.

R. Sproat, Paul A. Taylor, M. Tanenblatt, and Amy Isard. A markup language for text-to-speech synthesis. In Eurospeech 97, 1997. [ bib | .ps | .pdf ]

Y. Gotoh and S. Renals. Document space models using latent semantic analysis. In Proc. Eurospeech, pages 1443-1446, Rhodes, 1997. [ bib | .ps.gz | .pdf ]

In this paper, an approach for constructing mixture language models (LMs) based on some notion of semantics is discussed. To this end, a technique known as latent semantic analysis (LSA) is used. The approach encapsulates corpus-derived semantic information and is able to model the varying style of the text. Using such information, the corpus texts are clustered in an unsupervised manner and mixture LMs are automatically created. This work builds on previous work in the field of information retrieval which was recently applied by Bellegarda et al. to the problem of clustering words by semantic categories. The principal contribution of this work is to characterize the document space resulting from the LSA modeling and to demonstrate the approach for mixture LM application. Comparison is made between manual and automatic clustering in order to elucidate how the semantic information is expressed in the space. It is shown that, using semantic information, mixture LMs perform better than a conventional single LM with a slight increase in computational cost.

Mitsuru Nakai, Harald Singer, Yoshinori Sagisaka, and Hiroshi Shimodaira. Accent Phrase Segmentation by F0 Clustering Using Superpositional Modeling, pages 343-360. Springer, January 1997. [ bib ]

Alan W. Black and Paul A. Taylor. Automatically clustering similar units for unit selection in speech synthesis. In Eurospeech97, volume 2, pages 601-604, Rhodes, Greece, 1997. [ bib | .ps | .pdf ]

C. Mayo, M. Aylett, and D. R. Ladd. Prosodic transcription of Glasgow English: an evaluation study of GlaToBI. In Intonation: Theory, Models and Applications, 1997. [ bib | .pdf ]

B. L. Karlsen, G. J. Brown, M. Cooke, P. Green, and S. Renals. Analysis of a simultaneous speaker sound corpus. In D. F. Rosenthal and H. G. Okuno, editors, Computational Auditory Scene Analysis, pages 321-334. Lawrence Erlbaum Associates, 1997. [ bib ]

Sukeyasu Kanno and Hiroshi Shimodaira. Voiced Sound Detection under Nonstationary and Heavy Noisy Environment Using the Prediction Error of Low-Frequency Spectrum. Trans. IEICE(D-II), J80-D-II(1):26-35, January 1997. (in Japanese). [ bib ]

Alan W. Black and Paul A. Taylor. The Festival Speech Synthesis System: System documentation. Technical Report HCRC/TR-83, Human Communication Research Centre, University of Edinburgh, Scotland, UK, 1997. Available at http://www.cstr.ed.ac.uk/projects/festival.html. [ bib ]

B Williams. Computer-Aided Learning and Use of the Internet: Speech Sciences Education (section of chapter). 1997. [ bib ]

Beth Ann Hockey, Deborah Rossen-Knill, Beverly Spejewski, Matthew Stone, and Stephen Isard. Can you predict responses to yes/no questions? yes, no, and stuff. In Eurospeech '97, pages 2267-2270, 1997. [ bib ]

Jacqueline Kowtko. The function of intonation in spontaneous and read dialogue. In Proceedings of the XIIIth International Congress of Phonetic Sciences, volume 2, pages 286-289, Stockholm, Sweden, 1997. [ bib ]

Helen Wright and Paul A. Taylor. Modelling intonational structure using hidden Markov models. In ESCA workshop on Intonation: Theory Models and Applications, Athens, Greece, 1997. [ bib | .ps | .pdf ]

Kurt Dusterhoff and Alan W. Black. Generating f0 contours for speech synthesis using the tilt intonation theory. In Proc. ESCA Workshop on Intonation, pages 107-110, Athens, Greece, 1997. [ bib | .ps | .pdf ]

V. Strom, A. Elsner, G. Görz, W. Hess, W. Kasper, A. Klein, H.U. Krieger, J. Spilker, and H. Weber. On the use of prosody in a speech-to-speech translator. In Proc. European Conf. on Speech Communication and Technology, Rhodes, 1997. [ bib | .ps | .pdf ]

In this paper a speech-to-speech translator from German to English is presented. Besides the traditional processing steps, it takes advantage of acoustically detected prosodic phrase boundaries and focus. The prosodic phrase boundaries reduce search space during syntactic parsing and rule out analysis trees during semantic parsing. The prosodic focus facilitates a “shallow” translation based on the best word chain in cases where the deep analysis fails.

Dan Jurafsky, A. Stolcke, E. Shriberg, R. Bates, P. Taylor, K. Ries, N. Coccaro, R. Martin, M. Meteer, and C. Van Ess-Dykema. Automatic detection of discourse structure for speech recognition and understanding. In 1997 IEEE Workshop on Speech Recognition and Understanding, Santa Barbara, 1997. [ bib | .ps | .pdf ]

Alan W. Black. Predicting the intonation of discourse segments from examples in dialogue speech. In Y. Sagisaka, N. Campbell, and N. Higuchi, editors, Computing Prosody, pages 117-128. Springer-Verlag, 1997. [ bib ]

J.M. Kessens, M. Wester, C. Cucchiarini, and H. Strik. Testing a method for modelling pronunciation variation. In Proceedings of the COST workshop, pages 37-40, Rhodos, 1997. [ bib | .pdf ]

In this paper we describe a method for improving the performance of a continuous speech recognizer by modelling pronunciation variation. Although the results obtained with this method are in line with those reported by other authors, the magnitude of the improvements is very small. In looking for possible explanations for these results, we computed various sorts of statistics about the material. Since these data proved to be very useful in understanding the effects of our method, they are discussed in this paper. Moreover, on the basis of these statistics we discuss how the system can be improved in the future.

Paul A. Taylor, Simon King, Stephen Isard, Helen Wright, and Jacqueline Kowtko. Using intonation to constrain language models in speech recognition. In Proc. Eurospeech'97, Rhodes, 1997. [ bib | .pdf ]

This paper describes a method for using intonation to reduce word error rate in a speech recognition system designed to recognise spontaneous dialogue speech. We use a form of dialogue analysis based on the theory of conversational games. Different move types under this analysis conform to different language models. Different move types are also characterised by different intonational tunes. Our overall recognition strategy is first to predict from intonation the type of game move that a test utterance represents, and then to use a bigram language model for that type of move during recognition.

Paul A. Taylor and Amy Isard. SSML: A speech synthesis markup language. Speech Communication, (21):123-133, 1997. [ bib | .ps | .pdf ]

B Williams. Spoken Language Corpus Representation (section of chapter). Longmans, 1997. [ bib ]