In this letter we introduce a new pruning strategy for large vocabulary continuous speech recognition based on direct estimates of local posterior phone probabilities. This approach is well suited to hybrid connectionist/hidden Markov model systems. Experiments on the Wall Street Journal task using a 20,000-word vocabulary and a trigram language model have demonstrated that phone deactivation pruning can increase the speed of recognition-time search by up to a factor of 10, with a relative increase in error rate of less than 2%.
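
The pruning idea can be illustrated with a minimal sketch, assuming that a phone is deactivated in a frame whenever its local posterior falls below a fixed threshold. The function names, the threshold value, and the hypothesis structure are hypothetical choices made for illustration and are not taken from the letter.

```python
def active_phones(posteriors, threshold=0.0005):
    """Return the set of phones kept active for one frame.

    posteriors: dict mapping phone label -> local posterior P(phone | acoustics)
    threshold:  hypothetical cutoff; phones below it are deactivated
    """
    return {phone for phone, p in posteriors.items() if p >= threshold}


def prune_hypotheses(hypotheses, posteriors, threshold=0.0005):
    """Drop search hypotheses whose next phone is deactivated in this frame.

    hypotheses: list of (next_phone, score) pairs (an illustrative structure)
    """
    keep = active_phones(posteriors, threshold)
    return [(phone, score) for phone, score in hypotheses if phone in keep]


# Toy posteriors for a single frame (invented values, not experimental data).
frame_posteriors = {"ae": 0.90, "eh": 0.0996, "s": 0.0003, "t": 0.0001}
hyps = [("ae", -12.3), ("s", -11.9), ("eh", -14.0), ("t", -13.1)]
pruned = prune_hypotheses(hyps, frame_posteriors)
```

In this toy frame only the hypotheses extending through "ae" and "eh" survive, so the search does no further work on the other two phones.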

We present work intended to improve speech recognition performance for computer dialogue by taking into account the way that dialogue context and intonational tune interact to limit the possibilities for what an utterance might be. We report the extra constraint, expressed in terms of entropy, that is achieved in a bigram language model by using separate submodels for different sorts of dialogue acts and by predicting which submodel to apply from an analysis of the intonation of the sentence being recognised.
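
The entropy comparison can be sketched as follows, assuming that constraint is measured as average bits per word on test sentences and that the dialogue act of each sentence is already known. In the work itself the act is predicted from intonation. All probabilities, act labels, and names below are invented for illustration.

```python
import math


def sentence_bits(bigram, words):
    """Negative log2 probability of one sentence under a bigram model."""
    bits, prev = 0.0, "<s>"
    for w in words:
        bits -= math.log2(bigram[(prev, w)])
        prev = w
    return bits


# One general bigram model versus one submodel per dialogue act (toy values).
general = {("<s>", "yes"): 0.25, ("<s>", "go"): 0.25, ("go", "left"): 0.5}
submodels = {
    "reply": {("<s>", "yes"): 0.5},
    "instruct": {("<s>", "go"): 0.5, ("go", "left"): 0.5},
}

# Each test item pairs an (assumed known) dialogue act with a word sequence.
test_set = [("reply", ["yes"]), ("instruct", ["go", "left"])]
n_words = sum(len(words) for _, words in test_set)

# Average bits per word with the single general model.
h_general = sum(sentence_bits(general, w) for _, w in test_set) / n_words
# Average bits per word when each sentence is scored by its act's submodel.
h_submodel = sum(sentence_bits(submodels[a], w) for a, w in test_set) / n_words
```

A lower value of `h_submodel` than `h_general` indicates that conditioning on the dialogue act constrains the word sequence more tightly.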

This paper examines the written transcription of unfamiliar spoken names. It is well documented that the writing of personal and place names by people who are unfamiliar with their spelling contributes to the evolution of names. The paper describes a study of the processes involved, using experiments in which Scottish subjects are asked to write down unfamiliar spoken British and European town names.

This chapter describes a use of recurrent neural networks (i.e., feedback is incorporated in the computation) as an acoustic model for continuous speech recognition. The form of the recurrent neural network is described, along with an appropriate parameter estimation procedure. For each frame of acoustic data, the recurrent network generates an estimate of the posterior probability of the possible phones given the observed acoustic signal. The posteriors are then converted into scaled likelihoods and used as the observation probabilities within a conventional decoding paradigm (e.g., Viterbi decoding). The advantages of using recurrent networks are that they require a small number of parameters and provide a fast decoding capability (relative to conventional large vocabulary HMM systems).
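
The conversion from posteriors to scaled likelihoods can be sketched as below, assuming the usual hybrid-system form in which each posterior is divided by the prior of its phone class, so that p(x|q)/p(x) = P(q|x)/P(q). The function name, the flooring constant, and the toy values are assumptions made for illustration.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert per-frame phone posteriors into scaled log-likelihoods.

    posteriors: dict phone -> P(phone | acoustics) from the network
    priors:     dict phone -> P(phone), e.g. relative frequency in training
    floor:      hypothetical lower bound that avoids taking log(0)
    """
    return {
        phone: math.log(max(p, floor)) - math.log(priors[phone])
        for phone, p in posteriors.items()
    }


# Toy frame: both phones are equally probable a posteriori, but "a" is rarer.
frame = {"a": 0.5, "b": 0.5}
phone_priors = {"a": 0.25, "b": 0.5}
scores = scaled_log_likelihoods(frame, phone_priors)
```

Here the rarer phone "a" receives the higher scaled likelihood, since the same posterior is stronger evidence for a class with a smaller prior. A Viterbi decoder would add these values along each path as observation log-probabilities.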

This work further develops and analyses the large vocabulary continuous speech recognition search strategy reported at ICASSP-95. In particular, the posterior-based phone deactivation pruning approach has been extended to include phone-dependent thresholds, and an improved estimate of the least upper bound on the utterance log-probability has been developed. Analysis of the pruning procedures and of the search's interaction with the language model has also been performed. Experiments were carried out using the ARPA North American Business News task with a 20,000-word vocabulary and a trigram language model. As a result of these improvements and analyses, the computational cost of the recognition process performed by the Noway decoder has been substantially reduced.

Detectors for accents and phrase boundaries have been developed which derive prosodic features from the speech signal and its fundamental frequency to support other modules of a speech understanding system in an early analysis stage, or in cases where no word hypotheses are available. The detectors' underlying Gaussian distribution classifiers were trained on 50 minutes and tested on 30 minutes of spontaneous speech, yielding recognition rates of 74% for accents and 86% for phrase boundaries. Since this material was prosodically labelled by hand, the question arose which labels for phrase boundaries and accentuation were guided only by syntactic or semantic knowledge, and which were really prosodically marked. A small test subset was therefore resynthesized in such a way that comprehensibility was lost but the prosodic characteristics were kept. This subset was re-labelled by 11 listeners with nearly the same accuracy as the detectors.