Simon King. Users Manual for Verbmobil Teilprojekt 4.4. IKP, Universität Bonn, October 1996. [ bib ]
Verbmobil English synthesiser users manual
Simon King. Inventory design for Verbmobil Teilprojekt 4.4. Technical report, IKP, Universität Bonn, October 1996. [ bib ]
Inventory design for Verbmobil English speech synthesis
Kanad Keeni, Hiroshi Shimodaira, Tetsuro Nishino, and Yasuo Tan. Recognition of Devanagari Characters Using Neural Networks. IEICE, E79-D(5):523-528, May 1996. [ bib ]
Andrew Hunt and Alan W. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In ICASSP-96, volume 1, pages 373-376, Atlanta, Georgia, 1996. [ bib | .ps | .pdf ]
G. Knowles, L. Taylor, and B. Williams. A corpus of formal British English speech. 1996. [ bib ]
K. Dusterhoff. Intone: A prototype intonation analysis system. Master's thesis, Georgetown University, 1996. [ bib ]
S. Renals. Phone deactivation pruning in large vocabulary continuous speech recognition. IEEE Signal Processing Letters, 3:4-6, 1996. [ bib | .ps.gz ]
In this letter we introduce a new pruning strategy for large vocabulary continuous speech recognition based on direct estimates of local posterior phone probabilities. This approach is well suited to hybrid connectionist/hidden Markov model systems. Experiments on the Wall Street Journal task using a 20,000 word vocabulary and a trigram language model have demonstrated that phone deactivation pruning can increase the speed of recognition-time search by up to a factor of 10, with a relative increase in error rate of less than 2%.
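The pruning idea in this abstract can be illustrated with a minimal sketch: at each frame, phones whose estimated posterior falls below a threshold are deactivated and excluded from the search. The function name `active_phones`, the threshold value, and the example posteriors are illustrative assumptions, not details taken from the letter.

```python
def active_phones(posteriors, threshold=1e-4):
    """Return indices of phones kept active for one frame.

    posteriors: per-phone posterior estimates for a single frame
                (e.g. outputs of a connectionist acoustic model).
    threshold:  hypothetical pruning threshold; the value used in the
                letter is not given in the abstract.
    """
    return [i for i, p in enumerate(posteriors) if p >= threshold]


# One frame with four phones: the last one is deactivated.
frame = [0.70, 0.25, 0.04999, 0.00001]
print(active_phones(frame))
```

Raising the threshold deactivates more phones per frame, which shrinks the search at the risk of pruning the correct hypothesis.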
Briony J. Williams. The status of corpora as linguistic data. In G. Knowles, A. Wichmann, and P. Alderson, editors, Working with Speech. London: Longmans, 1996. [ bib ]
B. Pickering, Briony J. Williams, and G. Knowles. Analysis of transcriber differences in the SEC. In Working with Speech. 1996. [ bib ]
Paul A. Taylor, Hiroshi Shimodaira, Stephen Isard, Simon King, and Jacqueline Kowtko. Using prosodic information to constrain language models for spoken dialogue. In Proc. ICSLP '96, Philadelphia, 1996. [ bib | .ps | .pdf ]
We present work intended to improve speech recognition performance for computer dialogue by taking into account the way that dialogue context and intonational tune interact to limit the possibilities for what an utterance might be. We report here on the extra constraint achieved in a bigram language model expressed in terms of entropy by using separate submodels for different sorts of dialogue acts and trying to predict which submodel to apply by analysis of the intonation of the sentence being recognised.
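As a rough illustration of measuring language model constraint in terms of entropy, the sketch below computes the per-token cross-entropy (in bits) of test sentences under a bigram model. The add-one smoothing, the function names, and the toy data are assumptions made for this example; the paper's actual estimation method is not described in the abstract. The sketch also assumes every test word was seen in training.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and left contexts, adding <s> and </s> boundary markers."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def cross_entropy(model, sentences):
    """Average bits per token under add-one smoothed bigram probabilities."""
    bigrams, contexts, vocab = model
    v = len(vocab)
    total_bits, n_tokens = 0.0, 0
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
            total_bits -= math.log2(p)
            n_tokens += 1
    return total_bits / n_tokens


# Toy comparison: a general model versus a submodel trained on one dialogue act.
general = train_bigram([["yes"], ["go", "left"], ["is", "it", "left"]])
reply_submodel = train_bigram([["yes"]])
print(cross_entropy(general, [["yes"]]), cross_entropy(reply_submodel, [["yes"]]))
```

A lower cross-entropy for the submodel on utterances of its own dialogue act is the kind of extra constraint the abstract refers to, provided the correct submodel can be predicted.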
N. Campbell and Alan W. Black. CHATR: a multi-lingual speech re-sequencing synthesis system. In Institute of Electronic, Information and Communication Engineers, Tokyo, 1996. [ bib ]
John McKenna. Tone and initial/final recognition for Mandarin Chinese. Master's thesis, University of Edinburgh, 1996. [ bib | .ps | .pdf ]
Robert A.J. Clark. Internal and external factors affecting language change: A computational model. Master's thesis, University of Edinburgh, 1996. [ bib | .ps | .pdf ]
Sue Fitt. Spelling unfamiliar names. In Proc. International Congress of Onomastic Sciences 1996, 1996. [ bib | .ps | .pdf ]
This paper will examine the written transcription of unfamiliar spoken names. It is well documented that the writing of personal and place names by people who are unfamiliar with the spelling of the name contributes to the evolution of names. The current paper describes a study which examines the processes involved, using experiments in which Scottish subjects are asked to write down unfamiliar spoken British and European town names.
D. Kershaw, T. Robinson, and S. Renals. The 1995 Abbot LVCSR system for multiple unknown microphones. In Proc. ICSLP, pages 1325-1328, Philadelphia PA, 1996. [ bib ]
Briony J. Williams and P. Alderson. Synthesising British English intonation. In Working with Speech. 1996. [ bib ]
Alan W. Black and Andrew Hunt. Generating F0 contours from ToBI labels using linear regression. In ICSLP96, volume 3, pages 1385-1388, Philadelphia, PA, 1996. [ bib ]
Briony J. Williams. The formulation of an intonation transcription system for British English. In G. Knowles, A. Wichmann, and P. Alderson, editors, Working with Speech. London: Longmans, 1996. [ bib ]
T. Robinson, M. Hochberg, and S. Renals. The use of recurrent networks in continuous speech recognition. In C.-H. Lee, K. K. Paliwal, and F. K. Soong, editors, Automatic Speech and Speaker Recognition - Advanced Topics, pages 233-258. Kluwer Academic Publishers, 1996. [ bib | .ps.gz ]
This chapter describes a use of recurrent neural networks (i.e., feedback is incorporated in the computation) as an acoustic model for continuous speech recognition. The form of the recurrent neural network is described, along with an appropriate parameter estimation procedure. For each frame of acoustic data, the recurrent network generates an estimate of the posterior probability of the possible phones given the observed acoustic signal. The posteriors are then converted into scaled likelihoods and used as the observation probabilities within a conventional decoding paradigm (e.g., Viterbi decoding). The advantages of using recurrent networks are that they require a small number of parameters and provide a fast decoding capability (relative to conventional large vocabulary HMM systems).
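The conversion from posteriors to scaled likelihoods mentioned in this abstract is commonly done in hybrid connectionist/HMM systems by dividing each phone posterior by that phone's prior, since p(x|q)/p(x) = P(q|x)/P(q) by Bayes' rule. The sketch below assumes that standard form; the function name, the probability floor, and the example numbers are illustrative and not taken from the chapter.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert per-phone posteriors P(q|x) to log scaled likelihoods.

    Uses log P(q|x) - log P(q), which equals log p(x|q) - log p(x).
    The floor (a hypothetical choice) avoids taking log(0).
    """
    return [
        math.log(max(post, floor)) - math.log(max(prior, floor))
        for post, prior in zip(posteriors, priors)
    ]


# Two phones with equal posteriors: the rarer phone gets the higher
# scaled likelihood, because the frame supports it more than its prior does.
print(scaled_log_likelihoods([0.5, 0.5], [0.25, 0.75]))
```

The term p(x) is the same for every phone at a given frame, so dropping it does not change which path a Viterbi decoder prefers.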
S. Renals and M. Hochberg. Efficient evaluation of the LVCSR search space using the NOWAY decoder. In Proc. IEEE ICASSP, pages 149-152, Atlanta, 1996. [ bib | .ps.gz ]
This work further develops and analyses the large vocabulary continuous speech recognition search strategy reported at ICASSP-95. In particular, the posterior-based phone deactivation pruning approach has been extended to include phone-dependent thresholds and an improved estimate of the least upper bound on the utterance log-probability has been developed. Analysis of the pruning procedures and of the search's interaction with the language model has also been performed. Experiments were carried out using the ARPA North American Business News task with a 20,000 word vocabulary and a trigram language model. As a result of these improvements and analyses, the computational cost of the recognition process performed by the Noway decoder has been substantially reduced.
D. Kershaw, T. Robinson, and S. Renals. The 1995 Abbot hybrid connectionist-HMM large vocabulary recognition system. In Proc. ARPA Spoken Language Technology Conference, pages 93-99, 1996. [ bib ]
Jacqueline Kowtko. The Function of Intonation in Task-Oriented Dialogue. PhD thesis, 1996. [ bib | .ps | .pdf ]
N. Campbell and Alan W. Black. Prosody and the selection of source units for concatenative synthesis. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, editors, Progress in Speech Synthesis, pages 279-282. Springer Verlag, 1996. [ bib ]
K. Dusterhoff. Using computational analysis to determine pitch accent. In Proceedings Computational Linguistics in Montreal, pages 1-4, 1996. [ bib ]
A. Conkie and Stephen D. Isard. Optimal coupling of diphones. In J. P. H. van Santen, R. W. Sproat, J. P. Olive, and J. Hirschberg, editors, Progress in Speech Synthesis. Springer, 1996. [ bib ]
V. Strom and C. Widera. What's in the “pure” prosody? In Proc. ICSLP, Philadelphia, 1996. [ bib | .ps | .pdf ]
Detectors for accents and phrase boundaries have been developed which derive prosodic features from the speech signal and its fundamental frequency to support other modules of a speech understanding system in an early analysis stage, or in cases where no word hypotheses are available. The detectors' underlying Gaussian distribution classifiers were trained with 50 minutes and tested with 30 minutes of spontaneous speech, yielding recognition rates of 74% for accents and 86% for phrase boundaries. Since this material was prosodically hand-labelled, the question arose which labels for phrase boundaries and accentuation were guided only by syntactic or semantic knowledge, and which were really prosodically marked. Therefore a small test subset was resynthesized in such a way that comprehensibility was lost but the prosodic characteristics were kept. This subset was re-labelled by 11 listeners with nearly the same accuracy as the detectors.