Alexander Gutkin. Log-Linear Interpolation of Language Models. MPhil. thesis, Department of Engineering, University of Cambridge, UK, December 2000. [ bib | .ps.gz | .pdf ]
Shigeki Matsuda, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Feature-dependent Allophone Clustering. In Proc. ICSLP2000, pages 413-416, October 2000. [ bib | .pdf ]
We propose a novel method for clustering allophones called Feature-Dependent Allophone Clustering (FD-AC) that determines feature-dependent HMM topology automatically. Existing methods for allophone clustering are based on parameter sharing between allophone models whose feature vector sequences behave similarly. However, the features of the vector sequences do not necessarily all share a common allophone clustering structure. It is considered that the vector sequences can be better modeled by allocating the optimal allophone clustering structure to each feature. In this paper, we propose Feature-Dependent Successive State Splitting (FD-SSS) as an implementation of FD-AC. In speaker-dependent continuous phoneme recognition experiments, HMMs created by FD-SSS reduced the error rates by about 10% compared with conventional HMMs that have a common allophone clustering structure for all the features.
Hiroshi Shimodaira, Toshihiko Akae, Mitsuru Nakai, and Shigeki Sagayama. Jacobian Adaptation of HMM with Initial Model Selection for Noisy Speech Recognition. In Proc. ICSLP2000, pages 1003-1006, October 2000. [ bib | .pdf ]
An extension of Jacobian Adaptation (JA) of HMMs for degraded speech recognition is presented in which an appropriate set of initial models is selected from a number of initial-model sets designed for different noise environments. Based on the first-order Taylor series approximation in the acoustic feature domain, JA adapts the acoustic model parameters trained in the initial noise environment A to the new environment B much faster than PMC, which creates the acoustic models for the target environment from scratch. Despite this advantage over PMC, JA has a theoretical limitation: the change of acoustic parameters from environment A to B should be small in order for the linear approximation to hold. To extend the coverage of JA, the ideas of multiple sets of initial models and their automatic selection scheme are discussed. Speaker-dependent isolated-word recognition experiments are carried out to evaluate the proposed method.
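The first-order idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it works per channel in the log-spectral domain rather than on cepstral HMM parameters, and the function names and numeric values are hypothetical. It shows why the linear update is accurate only when the noise change from environment A to B is small.

```python
import math

def noisy_log_spectrum(speech, noise):
    """Per-channel log-spectrum of speech plus additive noise: log(exp(s) + exp(n))."""
    return [math.log(math.exp(s) + math.exp(n)) for s, n in zip(speech, noise)]

def jacobian_diag(speech, noise):
    """Diagonal of dy/dn for y = log(exp(s) + exp(n)), evaluated at the given noise."""
    return [math.exp(n) / (math.exp(s) + math.exp(n)) for s, n in zip(speech, noise)]

def jacobian_adapt(y_a, jac, noise_a, noise_b):
    """First-order Taylor update: y_b is approximated by y_a + J * (n_b - n_a)."""
    return [y + j * (nb - na) for y, j, na, nb in zip(y_a, jac, noise_a, noise_b)]

# Hypothetical log-spectral values, chosen only for illustration.
speech = [2.0, 1.5, 0.5]
noise_a = [0.0, 0.2, -0.3]          # initial environment A
noise_b_small = [0.1, 0.25, -0.2]   # environment B close to A
noise_b_large = [2.0, 2.2, 1.7]     # environment B far from A

y_a = noisy_log_spectrum(speech, noise_a)
jac = jacobian_diag(speech, noise_a)

def max_abs_error(noise_b):
    """Largest gap between the linear update and the exact recomputation."""
    approx = jacobian_adapt(y_a, jac, noise_a, noise_b)
    exact = noisy_log_spectrum(speech, noise_b)
    return max(abs(a - e) for a, e in zip(approx, exact))

err_small = max_abs_error(noise_b_small)
err_large = max_abs_error(noise_b_large)
print(err_small, err_large)
```

The growing error for the distant environment is the limitation that motivates keeping several initial-model sets and selecting the one closest to the target noise.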
Shigeki Matsuda, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Asynchronous-Transition HMM. In Proc. ICASSP 2000 (Istanbul, Turkey), Vol. II, pages 1001-1004, June 2000. [ bib | .pdf ]
We propose a new class of hidden Markov model (HMM) called asynchronous-transition HMM (AT-HMM). In contrast to conventional HMMs, where hidden state transitions occur simultaneously for all features, the new class of HMM allows state transitions to be asynchronous between individual features, to better model the asynchronous timing of acoustic feature changes. In this paper, we focus on a particular class of AT-HMM with sequential constraints, introducing a concept of “state tying across time”. To maximize the advantage of the new model, we also introduce a feature-wise state tying technique. Speaker-dependent speech recognition experiments demonstrated that the proposed model reduced error rates by more than 30% and 50% in phoneme and isolated word recognition, respectively, compared with conventional HMMs.
Y. Gotoh and S. Renals. Information extraction from broadcast news. Philosophical Transactions of the Royal Society of London, Series A, 358:1295-1310, 2000. [ bib | .ps.gz | .pdf ]
This paper discusses the development of trainable statistical models for extracting content from television and radio news broadcasts. In particular we concentrate on statistical finite state models for identifying proper names and other named entities in broadcast speech. Two models are presented: the first models name class information as a word attribute; the second explicitly models both word-word and class-class transitions. A common n-gram based formulation is used for both models. The task of named entity identification is characterized by relatively sparse training data and issues related to smoothing are discussed. Experiments are reported using the DARPA/NIST Hub-4E evaluation for North American Broadcast News.
J.M. Kessens, M. Wester, and H. Strik. Automatic detection and verification of Dutch phonological rules. In PHONUS 5: Proceedings of the "Workshop on Phonetics and Phonology in ASR", pages 117-128, Saarbruecken, 2000. [ bib | .pdf ]
In this paper, we propose two methods for automatically obtaining hypotheses about pronunciation variation. To this end, we used two different approaches in which we employed a continuous speech recognizer to derive this information from the speech signal. For the first method, the output of a phone recognition was compared to a reference transcription in order to obtain hypotheses about pronunciation variation. Since phone recognition contains errors, we used forced recognition in order to exclude unreliable hypotheses. For the second method, forced recognition was also used, but the hypotheses about the deletion of phones were not constrained beforehand. This was achieved by allowing each phone to be deleted. After forced recognition, we selected the most frequently applied rules as the set of deletion rules. Since previous research showed that forced recognition is a reliable tool for testing hypotheses about pronunciation variation, we can expect that this will also hold for the hypotheses about pronunciation variation which we found using each of the two methods. Another reason for expecting the rule hypotheses to be reliable is that we found that 37-53% of the rules are related to Dutch phonological processes that have been described in the literature.
J.A. Bangham, S.J. Cox, M. Lincoln, I. Marshall, M. Tutt, and M. Wells. Signing for the deaf using virtual humans. In IEE Colloquium on Speech and Language processing for Disabled and Elderly, 2000. [ bib | .pdf ]
Research at Televirtual (Norwich) and the University of East Anglia, funded predominantly by the Independent Television Commission and more recently by the UK Post Office also, has investigated the feasibility of using virtual signing as a communication medium for presenting information to the Deaf. We describe and demonstrate the underlying virtual signer technology, and discuss the language processing techniques and discourse models which have been investigated for information communication in a transaction application in Post Offices, and for presentation of more general textual material in texts such as subtitles accompanying television programmes.
Andreas Stolcke, N. Coccaro, R. Bates, P. Taylor, C. Van Ess-Dykema, K. Ries, Elizabeth Shriberg, D. Jurafsky, R. Martin, and M. Meteer. Dialog act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3), 2000. [ bib | .ps | .pdf ]
Ann K. Syrdal, Colin W. Wightman, Alistair Conkie, Yannis Stylianou, Mark Beutnagel, Juergen Schroeter, Volker Strom, and Ki-Seung Lee. Corpus-based techniques in the AT&T NextGen synthesis system. In Proc. Int. Conf. on Spoken Language Processing, Beijing, 2000. [ bib | .ps | .pdf ]
The AT&T text-to-speech (TTS) synthesis system has been used as a framework for experimenting with a perceptually-guided data-driven approach to speech synthesis, with a primary focus on data-driven elements in the "back end". Statistical training techniques applied to a large corpus are used to make decisions about predicted speech events and selected speech inventory units. Our recent advances in automatic phonetic and prosodic labelling, together with new, faster harmonic plus noise model (HNM) and unit preselection implementations, have significantly improved TTS quality and sped up both development time and runtime.
S. Renals, D. Abberley, D. Kirby, and T. Robinson. Indexing and retrieval of broadcast news. Speech Communication, 32:5-20, 2000. [ bib | .ps.gz | .pdf ]
This paper describes a spoken document retrieval (SDR) system for British and North American Broadcast News. The system is based on a connectionist large vocabulary speech recognizer and a probabilistic information retrieval system. We discuss the development of a realtime Broadcast News speech recognizer, and its integration into an SDR system. Two advances were made for this task: automatic segmentation and statistical query expansion using a secondary corpus. Precision and recall results using the Text Retrieval Conference (TREC) SDR evaluation infrastructure are reported throughout the paper, and we discuss the application of these developments to a large scale SDR task based on an archive of British English broadcast news.
M. Carreira-Perpiñán and S. Renals. Practical identifiability of finite mixtures of multivariate Bernoulli distributions. Neural Computation, 12:141-152, 2000. [ bib | .ps.gz | .pdf ]
The class of finite mixtures of multivariate Bernoulli distributions is known to be nonidentifiable, i.e., different values of the mixture parameters can correspond to exactly the same probability distribution. In principle, this would mean that sample estimates using this model would give rise to different interpretations. We give empirical support to the fact that estimation of this class of mixtures can still produce meaningful results in practice, thus lessening the importance of the identifiability problem. We also show that the EM algorithm is guaranteed to converge to a proper maximum likelihood estimate, owing to a property of the log-likelihood surface. Experiments with synthetic data sets show that an original generating distribution can be estimated from a sample. Experiments with an electropalatography (EPG) data set show important structure in the data.
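The nonidentifiability mentioned in the abstract above can be checked numerically. The following is a minimal sketch, not taken from the paper: the two parameter sets are hypothetical values constructed for illustration, and `mixture_pmf` is an assumed helper name. Both mixtures have two components over two binary variables, differ in their component parameters, and yet define the same distribution.

```python
from itertools import product
import math

def mixture_pmf(weights, probs):
    """Return {x: P(x)} over all binary vectors x for a mixture of multivariate Bernoullis.

    weights: mixing proportions, one per component.
    probs:   per-component tuples of P(x_d = 1) for each dimension d.
    """
    dim = len(probs[0])
    pmf = {}
    for x in product((0, 1), repeat=dim):
        total = 0.0
        for w, p in zip(weights, probs):
            comp = 1.0
            for xi, pi in zip(x, p):
                comp *= pi if xi == 1 else 1.0 - pi
            total += w * comp
        pmf[x] = total
    return pmf

# Two different (hypothetical) parameter settings.
pmf_a = mixture_pmf([0.5, 0.5], [(0.2, 0.2), (0.8, 0.8)])
pmf_b = mixture_pmf([0.5, 0.5], [(0.05, 0.3), (0.95, 0.7)])

# The parameters differ, but every outcome receives the same probability.
same = all(math.isclose(pmf_a[x], pmf_b[x], abs_tol=1e-12) for x in pmf_a)
print(same)
```

With only two binary variables the distribution has three free parameters while the mixture has five, so such collisions are easy to construct. The abstract's point is that estimation can still give meaningful results in practice.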
Paul Taylor. Analysis and synthesis of intonation using the tilt model. Journal of the Acoustical Society of America, 107(3):1697-1714, 2000. [ bib | .ps | .pdf ]
Kurt Dusterhoff. Synthesizing Fundamental Frequency Using Models Automatically Trained from Data. PhD thesis, University of Edinburgh, 2000. [ bib | .ps | .pdf ]
M. Wester, J.M. Kessens, and H. Strik. Pronunciation variation in ASR: Which variation to model? In Proc. of ICSLP '00, volume IV, pages 488-491, Beijing, 2000. [ bib | .pdf ]
This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling within-word and cross-word pronunciation variation. A relative improvement of 8.8% in WER was found compared to baseline system performance. However, as WERs do not reveal the full effect of modeling pronunciation variation, we performed a detailed analysis of the differences in recognition results that occur due to modeling pronunciation variation and found that indeed a lot of the differences in recognition results are not reflected in the error rates. Furthermore, error analysis revealed that testing sets of variants in isolation does not predict their behavior in combination. However, these results appeared to be corpus dependent.
Helen Wright. Modelling Prosodic and Dialogue Information for Automatic Speech Recognition. PhD thesis, University of Edinburgh, 2000. [ bib | .ps | .pdf ]
A. Wrench and K. Richmond. Continuous speech recognition using articulatory data. In Proc. ICSLP 2000, Beijing, China, 2000. [ bib | .ps | .pdf ]
In this paper we show that there is measurable information in the articulatory system which can help to disambiguate the acoustic signal. We measure directly the movement of the lips, tongue, jaw, velum and larynx and parameterise this articulatory feature space using principal components analysis. The parameterisation is developed and evaluated using a speaker dependent phone recognition task on a specially recorded TIMIT corpus of 460 sentences. The results show that there is useful supplementary information contained in the articulatory data which yields a small but significant improvement in phone recognition accuracy of 2%. However, preliminary attempts to estimate the articulatory data from the acoustic signal and use this to supplement the acoustic input have not yielded any significant improvement in phone accuracy.
M. Wester and E. Fosler-Lussier. A comparison of data-derived and knowledge-based modeling of pronunciation variation. In Proc. of ICSLP '00, volume I, pages 270-273, Beijing, 2000. [ bib | .pdf ]
This paper focuses on modeling pronunciation variation in two different ways: data-derived and knowledge-based. The knowledge-based approach consists of using phonological rules to generate variants. The data-derived approach consists of performing phone recognition, followed by various pruning and smoothing methods to alleviate some of the errors in the phone recognition. Using phonological rules led to a small improvement in WER; whereas, using a data-derived approach in which the phone recognition was smoothed using simple decision trees (d-trees) prior to lexicon generation led to a significant improvement compared to the baseline. Furthermore, we found that 10% of variants generated by the phonological rules were also found using phone recognition, and this increased to 23% when the phone recognition output was smoothed by using d-trees. In addition, we propose a metric to measure confusability in the lexicon and we found that employing this confusion metric to prune variants results in roughly the same improvement as using the d-tree method.
K. Koumpis and S. Renals. Transcription and summarization of voicemail speech. In Proc. ICSLP, volume 2, pages 688-691, Beijing, 2000. [ bib | .ps.gz | .pdf ]
This paper describes the development of a system to transcribe and summarize voicemail messages. The results of the research presented in this paper are two-fold. First, a hybrid connectionist approach to the Voicemail transcription task shows that competitive performance can be achieved using a context-independent system with fewer parameters than those based on mixtures of Gaussian likelihoods. Second, an effective and robust combination of statistical with prior knowledge sources for term weighting is used to extract information from the decoder's output in order to deliver summaries to the message recipients via a GSM Short Message Service (SMS) gateway.
Y. Gotoh and S. Renals. Variable word rate n-grams. In Proc IEEE ICASSP, pages 1591-1594, Istanbul, 2000. [ bib | .ps.gz | .pdf ]
The rate of occurrence of words is not uniform but varies from document to document. Despite this observation, parameters for conventional n-gram language models are usually derived using the assumption of a constant word rate. In this paper we investigate the use of a variable word rate assumption, modelled by a Poisson distribution or a continuous mixture of Poissons. We present an approach to estimating the relative frequencies of words or n-grams taking prior information of their occurrences into account. Discounting and smoothing schemes are also considered. Using the Broadcast News task, the approach demonstrates a reduction in perplexity of up to 10%.
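As a rough illustration of the two count models named in the abstract above, the sketch below compares a plain Poisson with a continuous mixture of Poissons. The choice of a gamma mixing density (which gives a negative binomial) is an assumption made here for concreteness, as are the function names and parameter values. None of this is the paper's estimation procedure.

```python
import math

def poisson_pmf(k, rate):
    """P(K = k) for a Poisson with the given rate (constant word rate)."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))

def gamma_poisson_pmf(k, shape, scale):
    """P(K = k) when the Poisson rate is itself Gamma(shape, scale) distributed.

    This continuous mixture of Poissons is a negative binomial with mean shape * scale.
    """
    log_p = (math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
             + k * math.log(scale / (1.0 + scale))
             - shape * math.log(1.0 + scale))
    return math.exp(log_p)

def mean_and_variance(pmf, max_k=400):
    """Moments of a count distribution, truncating the sum at max_k."""
    mean = sum(k * pmf(k) for k in range(max_k))
    second = sum(k * k * pmf(k) for k in range(max_k))
    return mean, second - mean * mean

# Hypothetical setting: both models expect 3 occurrences per document.
pois_mean, pois_var = mean_and_variance(lambda k: poisson_pmf(k, 3.0))
mix_mean, mix_var = mean_and_variance(lambda k: gamma_poisson_pmf(k, 1.5, 2.0))
print(pois_mean, pois_var, mix_mean, mix_var)
```

Both models share the same mean, but the mixture has a larger variance, which is one way to express that a word's rate of occurrence varies from document to document.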
P A Taylor. Concept-to-speech by phonological structure matching. Philosophical Transactions of the Royal Society, Series A, 2000. [ bib | .ps | .pdf ]
J. Frankel, K. Richmond, S. King, and P. Taylor. An automatic speech recognition system using neural networks and linear dynamic models to recover and model articulatory traces. In Proc. ICSLP, 2000. [ bib | .ps | .pdf ]
In this paper we describe a speech recognition system using linear dynamic models and articulatory features. Experiments are reported in which measured articulation from the MOCHA corpus has been used, along with those where the articulatory parameters are estimated from the speech signal using a recurrent neural network.
Y. Gotoh and S. Renals. Sentence boundary detection in broadcast speech transcripts. In ISCA ITRW: ASR2000, pages 228-235, Paris, 2000. [ bib | .ps.gz | .pdf ]
This paper presents an approach to identifying sentence boundaries in broadcast speech transcripts. We describe finite state models that extract sentence boundary information statistically from text and audio sources. An n-gram language model is constructed from a collection of British English news broadcasts and scripts. An alternative model is estimated from pause duration information in speech recogniser outputs aligned with their programme script counterparts. Experimental results show that the pause duration model alone outperforms the language modelling approach and that combining the two models improves performance further: precision and recall scores of over 70% were attained for the task.
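For readers unfamiliar with the scores quoted in the abstract above, the following minimal sketch shows how precision and recall could be computed for boundary detection. It assumes boundaries are represented as sets of word positions, which is a simplification chosen here. The positions are hypothetical and the paper's exact scoring protocol may differ.

```python
def precision_recall(reference, hypothesis):
    """Precision and recall of hypothesised boundary positions against a reference.

    Both arguments are sets of word indices after which a sentence boundary is placed.
    """
    correct = len(reference & hypothesis)
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical boundary positions, for illustration only.
reference = {3, 7, 12, 20}
hypothesis = {3, 7, 15}

precision, recall = precision_recall(reference, hypothesis)
print(precision, recall)
```

Here two of the three hypothesised boundaries are correct and two of the four reference boundaries are found.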
M. Wester, J.M. Kessens, and H. Strik. Using Dutch phonological rules to model pronunciation variation in ASR. In PHONUS 5: Proceedings of the "Workshop on Phonetics and Phonology in ASR", pages 105-116, Saarbruecken, 2000. [ bib | .pdf ]
In this paper, we describe how the performance of a continuous speech recognizer for Dutch has been improved by modeling within-word and cross-word pronunciation variation. Within-word variants were automatically generated by applying five phonological rules to the words in the lexicon. Cross-word pronunciation variation was modeled by adding multi-words and their variants to the lexicon. The best results were obtained when the cross-word method was combined with the within-word method: a relative improvement of 8.8% in the WER was found compared to baseline system performance. We also describe an error analysis that was carried out to investigate whether rules in isolation can predict the performance of rules in combination.
C. Mayo. The relationship between phonemic awareness and cue weighting in speech perception: longitudinal and cross-sectional child studies. PhD thesis, Queen Margaret University College, 2000. [ bib | .pdf ]
O. Goubanova and P. Taylor. Using Bayesian Belief networks for model duration in text-to-speech systems. In CD-ROM Proc. ICSLP 2000, Beijing, China, 2000. [ bib ]
Edmilson Morais, Paul Taylor, and Fabio Violaro. Concatenative text-to-speech synthesis based on prototype waveform interpolation (a time frequency approach). In Proc. ICSLP 2000, Beijing, China, 2000. [ bib | .ps | .pdf ]
S. King, P. Taylor, J. Frankel, and K. Richmond. Speech recognition via phonetically-featured syllables. In PHONUS, volume 5, pages 15-34, Institute of Phonetics, University of the Saarland, 2000. [ bib | .ps | .pdf ]
We describe recent work on two new automatic speech recognition systems. The first part of this paper describes the components of a system based on phonological features (which we call EspressoA) in which the values of these features are estimated from the speech signal before being used as the basis for recognition. In the second part of the paper, another system (which we call EspressoB) is described in which articulatory parameters are used instead of phonological features and a linear dynamical system model is used to perform recognition from automatically estimated values of these articulatory parameters.
Simon King and Paul Taylor. Detection of phonological features in continuous speech using neural networks. Computer Speech and Language, 14(4):333-353, 2000. [ bib | .ps | .pdf ]
We report work on the first component of a two-stage speech recognition architecture based on phonological features rather than phones. The paper reports experiments on three phonological feature systems: 1) the Sound Pattern of English (SPE) system, which uses binary features, 2) a multi-valued (MV) feature system, which uses traditional phonetic categories such as manner, place, etc., and 3) Government Phonology (GP), which uses a set of structured primes. All experiments used recurrent neural networks to perform feature detection. In these networks the input layer is a standard framewise cepstral representation, and the output layer represents the values of the features. The system effectively produces a representation of the most likely phonological features for each input frame. All experiments were carried out on the TIMIT speaker independent database. The networks performed well in all cases, with the average accuracy for a single feature ranging from 86 to 93 percent. We describe these experiments in detail, and discuss the justification and potential advantages of using phonological features rather than phones as the basis of speech recognition.
D. Abberley, S. Renals, D. Ellis, and T. Robinson. The THISL SDR system at TREC-8. In Proc. Eighth Text Retrieval Conference (TREC-8), 2000. [ bib | .ps.gz | .pdf ]
This paper describes the participation of the THISL group in the TREC-8 Spoken Document Retrieval (SDR) track. The THISL SDR system consists of the realtime version of the Abbot large vocabulary speech recognition system and the thislIR text retrieval system. The TREC-8 evaluation assessed SDR performance on a corpus of 500 hours of broadcast news material collected over a five month period. The main test condition involved retrieval of stories defined by manual segmentation of the corpus, in which non-news material, such as commercials, was excluded. An optional test condition required retrieval of the same stories from the unsegmented audio stream. The THISL SDR system participated in both test conditions. The results show that a system such as THISL can produce respectable information retrieval performance on a realistically-sized corpus of unsegmented audio material.