The Centre for Speech Technology Research, The University of Edinburgh

Publications by Peter Bell

[1] P. Bell, M. Gales, P. Lanchantin, X. Liu, Y. Long, S. Renals, P. Swietojanski, and P. Woodland. Transcription of multi-genre media archives using out-of-domain data. In Proc. IEEE Workshop on Spoken Language Technology, Miami, Florida, USA, December 2012. [ bib | .pdf ]
We describe our work on developing a speech recognition system for multi-genre media archives. The high diversity of the data makes this a challenging recognition task, which may benefit from systems trained on a combination of in-domain and out-of-domain data. Working with tandem HMMs, we present Multi-level Adaptive Networks (MLAN), a novel technique for incorporating information from out-of-domain posterior features using deep neural networks. We show that it provides a substantial reduction in WER over other systems, with relative WER reductions of 15% over a PLP baseline, 9% over in-domain tandem features and 8% over the best out-of-domain tandem features.
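As a rough illustration of the tandem idea underlying MLAN, the sketch below appends decorrelated log posteriors from a neural network (here assumed to be trained on out-of-domain data) to in-domain PLP frames before HMM training. The array names, shapes and preprocessing steps are assumptions for the example, not the authors' implementation.

    import numpy as np

    def tandem_features(plp, posteriors, eps=1e-8):
        """Append log-posterior features to PLP frames (illustrative only).

        plp:        (T, D_plp) array of in-domain acoustic features
        posteriors: (T, D_phone) array of frame-level phone posteriors
                    produced by an out-of-domain neural network
        """
        log_post = np.log(posteriors + eps)   # log compression of posteriors
        log_post -= log_post.mean(axis=0)     # crude per-utterance mean removal
        return np.concatenate([plp, log_post], axis=1)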

[2] Adriana Stan, Peter Bell, and Simon King. A grapheme-based method for automatic alignment of speech and text data. In Proc. IEEE Workshop on Spoken Language Technology, Miami, Florida, USA, December 2012. [ bib | .pdf ]
This paper introduces a method for automatic alignment of speech data with unsynchronised, imperfect transcripts, for a domain where no initial acoustic models are available. Using grapheme-based acoustic models, word skip networks and orthographic speech transcripts, we are able to harvest 55% of the speech with a 93% utterance-level accuracy and 99% word accuracy for the produced transcriptions. The work is based on the assumption that there is a high degree of correspondence between the speech and text, and that a full transcription of all of the speech is not required. The method is language independent and the only prior knowledge and resources required are the speech and text transcripts, and a few minor user interventions.
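The word-skip idea can be pictured as a linear decoding graph over the transcript in which every word arc has a parallel skip arc, letting the aligner drop words that were never actually spoken. The sketch below is a minimal illustration under that assumption; the data structure, penalty value and function name are hypothetical.

    def build_skip_network(words, skip_penalty=-5.0):
        """Return arcs as (from_state, to_state, label, log_weight) tuples."""
        arcs = []
        for i, word in enumerate(words):
            arcs.append((i, i + 1, word, 0.0))               # consume the word
            arcs.append((i, i + 1, "<skip>", skip_penalty))  # or skip it
        return arcs

    # Example: arcs for a three-word transcript fragment
    print(build_skip_network(["the", "cat", "sat"]))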

[3] Peter Bell, Myroslava Dzikovska, and Amy Isard. Designing a spoken language interface for a tutorial dialogue system. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]
We describe our work in building a spoken language interface for a tutorial dialogue system. Our goal is to allow natural, unrestricted student interaction with the computer tutor, which has been shown to improve the student's learning gain, but presents challenges for speech recognition and spoken language understanding. We discuss the choice of system components and present the results of development experiments in both acoustic and language modelling for speech recognition in this domain.

[4] Myroslava O. Dzikovska, Peter Bell, Amy Isard, and Johanna D. Moore. Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 471-481, Avignon, France, April 2012. Association for Computational Linguistics. [ bib | http ]
[5] Myroslava Dzikovska, Amy Isard, Peter Bell, Johanna Moore, Natalie Steinhauser, and Gwendolyn Campbell. Beetle II: an adaptable tutorial dialogue system. In Proceedings of the SIGDIAL 2011 Conference, demo session, pages 338-340, Portland, Oregon, June 2011. Association for Computational Linguistics. [ bib | http ]
We present Beetle II, a tutorial dialogue system which accepts unrestricted language input and supports experimentation with different tutorial planning and dialogue strategies. Our first system evaluation compared two tutorial policies and demonstrated that the system can be used to study the impact of different approaches to tutoring. The system is also designed to allow experimentation with a variety of natural language techniques, and discourse and dialogue strategies.

[6] Myroslava Dzikovska, Amy Isard, Peter Bell, Johanna D. Moore, Natalie B. Steinhauser, Gwendolyn E. Campbell, Leanne S. Taylor, Simon Caine, and Charlie Scott. Adaptive intelligent tutorial dialogue in the Beetle II system. In Artificial Intelligence in Education - 15th International Conference (AIED 2011), interactive event, volume 6738 of Lecture Notes in Computer Science, page 621, Auckland, New Zealand, 2011. Springer. [ bib | DOI ]
[7] Dong Wang, Simon King, Joe Frankel, and Peter Bell. Stochastic pronunciation modelling and soft match for out-of-vocabulary spoken term detection. In Proc. ICASSP, Dallas, Texas, USA, March 2010. [ bib | .pdf ]
A major challenge faced by a spoken term detection (STD) system is the detection of out-of-vocabulary (OOV) terms. Although a subword-based STD system is able to detect OOV terms, a performance reduction is always observed compared to in-vocabulary terms. One challenge that OOV terms bring to STD is pronunciation uncertainty. A commonly used approach to this problem is a soft matching procedure; another is the stochastic pronunciation modelling (SPM) proposed by the authors. In this paper we compare these two approaches, and combine them using a discriminative decision strategy. Experimental results demonstrate that SPM and soft match are highly complementary, and that their combination gives a significant performance improvement for OOV term detection.
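As a loose illustration of combining two detection scores with a discriminative classifier, the sketch below fits a logistic regression over (soft-match score, SPM score) pairs; the toy data and feature layout are invented for the example, and the paper's actual decision strategy may differ.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each detection carries a soft-match score and an SPM score; labels mark
    # which detections were true hits (toy data for illustration only).
    X = np.array([[0.9, 0.8], [0.2, 0.7], [0.1, 0.1], [0.8, 0.2]])
    y = np.array([1, 1, 0, 0])

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[0.6, 0.5]])[:, 1])   # combined confidence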

Keywords: confidence estimation, spoken term detection, speech recognition
[8] Peter Bell and Simon King. Diagonal priors for full covariance speech recognition. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Merano, Italy, December 2009. [ bib | DOI | .pdf ]
We investigate the use of full covariance Gaussians for large-vocabulary speech recognition. The large number of parameters gives high modelling power, but when training data is limited, the standard sample covariance matrix is often poorly conditioned, and has high variance. We explain how these problems may be solved by the use of a diagonal covariance smoothing prior, and relate this to the shrinkage estimator, for which the optimal shrinkage parameter may itself be estimated from the training data. We also compare the use of generatively and discriminatively trained priors. Results are presented on a large vocabulary conversational telephone speech recognition task.
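A minimal sketch of the form such diagonal smoothing typically takes (the specific prior and the estimator for the smoothing weight are derived in the paper):

    \hat{\Sigma} = (1 - \lambda)\, S + \lambda\, \mathrm{diag}(S), \qquad 0 \le \lambda \le 1,

where S is the sample covariance matrix and \lambda controls how strongly the estimate is pulled towards its diagonal.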

[9] Dong Wang, Simon King, Joe Frankel, and Peter Bell. Term-dependent confidence for out-of-vocabulary term detection. In Proc. Interspeech, pages 2139-2142, Brighton, UK, September 2009. [ bib | .pdf ]
Within a spoken term detection (STD) system, the decision maker plays an important role in retrieving reliable detections. Most state-of-the-art STD systems make decisions based on a confidence measure that is term-independent, which poses a serious problem for out-of-vocabulary (OOV) term detection. In this paper, we study a term-dependent confidence measure based on confidence normalisation and discriminative modelling, focusing in particular on its remarkable effectiveness for detecting OOV terms. Experimental results indicate that the term-dependent confidence provides a much more significant improvement for OOV terms than for in-vocabulary terms.
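One simple instance of making a confidence measure term-dependent is to normalise detection scores within each term, as in the sketch below; the paper additionally uses discriminative modelling, and the data structures here are invented for illustration.

    from collections import defaultdict
    import statistics

    def normalise_per_term(detections):
        """detections: list of (term, raw_score) pairs; returns z-scored pairs."""
        by_term = defaultdict(list)
        for term, score in detections:
            by_term[term].append(score)
        normed = []
        for term, score in detections:
            scores = by_term[term]
            mu = statistics.mean(scores)
            sigma = statistics.pstdev(scores) or 1.0   # avoid division by zero
            normed.append((term, (score - mu) / sigma))
        return normed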

[10] Peter Bell and Simon King. A shrinkage estimator for speech recognition with full covariance HMMs. In Proc. Interspeech, Brisbane, Australia, September 2008. Shortlisted for best student paper award. [ bib | .pdf ]
We consider the problem of parameter estimation in full-covariance Gaussian mixture systems for automatic speech recognition. Due to the high dimensionality of the acoustic feature vector, the standard sample covariance matrix has high variance and is often poorly conditioned when the amount of training data is limited. We explain how the use of a shrinkage estimator can solve these problems, and derive a formula for the optimal shrinkage intensity. We present results of experiments on a phone recognition task, showing that the estimator gives a performance improvement over a standard full-covariance system.
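A minimal numpy sketch of shrinking a sample covariance towards its diagonal is given below; the shrinkage intensity is left as a free parameter here, whereas the paper derives a data-driven formula for its optimal value.

    import numpy as np

    def shrinkage_covariance(X, lam):
        """X: (N, D) data matrix; lam in [0, 1] is the shrinkage intensity."""
        S = np.cov(X, rowvar=False)        # sample covariance
        target = np.diag(np.diag(S))       # diagonal shrinkage target
        return (1.0 - lam) * S + lam * target

    X = np.random.randn(50, 39)            # e.g. 39-dimensional acoustic frames
    print(np.linalg.cond(np.cov(X, rowvar=False)))           # sample estimate
    print(np.linalg.cond(shrinkage_covariance(X, lam=0.3)))   # smoothed estimate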

[11] Peter Bell and Simon King. Covariance updates for discriminative training by constrained line search. In Proc. Interspeech, Brisbane, Australia, September 2008. [ bib | .pdf ]
We investigate the recent Constrained Line Search algorithm for discriminative training of HMMs and propose an alternative formula for variance update. We compare the method to standard techniques on a phone recognition task.

[12] Peter Bell and Simon King. Sparse Gaussian graphical models for speech recognition. In Proc. Interspeech 2007, Antwerp, Belgium, August 2007. [ bib | .pdf ]
We address the problem of learning the structure of Gaussian graphical models for use in automatic speech recognition, a means of controlling the form of the inverse covariance matrices of such systems. With particular focus on data sparsity issues, we implement a method for imposing graphical model structure on a Gaussian mixture system, using a convex optimisation technique to maximise a penalised likelihood expression. The results of initial experiments on a phone recognition task show a performance improvement over an equivalent full-covariance system.
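For a flavour of the penalised-likelihood approach, the sketch below uses scikit-learn's graphical lasso to estimate a sparse inverse covariance from toy data. This illustrates the general technique of sparsity-inducing penalised likelihood; the specific penalty and optimiser used in the paper may differ.

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    X = np.random.randn(200, 13)            # e.g. 13-dimensional feature vectors
    model = GraphicalLasso(alpha=0.1).fit(X)
    precision = model.precision_            # estimated sparse inverse covariance
    print(np.count_nonzero(np.abs(precision) > 1e-6), "non-zero precision entries")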

[13] Peter Bell, Tina Burrows, and Paul Taylor. Adaptation of prosodic phrasing models. In Proc. Speech Prosody 2006, Dresden, Germany, May 2006. [ bib | .pdf ]
There is considerable variation in the prosodic phrasing of speech between different speakers and speech styles. Due to the time and cost of obtaining large quantities of data to train a model for every variation, it is desirable to develop models that can be adapted to new conditions with a limited amount of training data. We describe a technique for adapting HMM-based phrase boundary prediction models which alters the statistical distribution of prosodic phrase lengths. The adapted models show improved prediction performance across different speakers and types of spoken material.