This paper discusses the pronunciation of unfamiliar names, both British and foreign, by native speakers of English. Most studies of people's pronunciations of unfamiliar words or pseudowords are based on English word patterns rather than a cross-language selection, while algorithms for determining the pronunciation of names from a variety of languages do not necessarily tell us how real people behave in such a situation. This paper shows that subjects may use different systems or sub-systems of rules to pronounce unknown names which they perceive to be non-native. If we wish to model human behaviour in novel word pronunciation, we need to take account of the fact that, while native speakers are not experts in all foreign languages, neither are they linguistically naive.

We report on an automatic speech recognition system intended for use in dialogue, whose original aspect is its use of prosodic information for two different purposes. The first is to improve the word level accuracy of the system. The second is to constrain the language model applied to a given utterance by taking into account the way that dialogue context and intonational tune interact to limit the possibilities for what an utterance might be.

In this paper we present a novel, efficient search strategy for large vocabulary continuous speech recognition (LVCSR). The search algorithm, based on stack decoding, uses posterior phone probability estimates to substantially increase its efficiency with minimal effect on accuracy. In particular, the search space is dramatically reduced by phone deactivation pruning, in which phones with a small local posterior probability are deactivated. This approach is particularly well-suited to hybrid connectionist/hidden Markov model systems because posterior phone probabilities are directly computed by the acoustic model. On large vocabulary tasks, using a trigram language model, this approach increased the search speed by an order of magnitude, with 2% or less relative search error. Results from a hybrid system are presented using the Wall Street Journal LVCSR database for a 20,000-word task using a backed-off trigram language model. For this task, our single-pass decoder took around 15 times realtime on an HP735 workstation. At the cost of 7% relative search error, decoding can be sped up to approximately realtime.
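
The idea of phone deactivation pruning can be illustrated with a minimal sketch. The function names, the threshold value, and the rule that a phone hypothesis must stay active over its whole frame span are assumptions made for illustration; the abstract does not specify how the decoder applies the pruning.

```python
def active_phones(posteriors, threshold=1e-3):
    """For each frame, return the set of phone indices that stay active.

    `posteriors` is a list of frames; each frame is a list of posterior
    phone probabilities as estimated by the acoustic model.
    The default threshold is a hypothetical value, not one from the paper.
    """
    return [
        {phone for phone, prob in enumerate(frame) if prob >= threshold}
        for frame in posteriors
    ]


def phone_allowed(active, phone, start, end):
    """Assumed rule: a phone hypothesis covering frames [start, end)
    is kept only if the phone is active in every one of those frames."""
    return all(phone in active[t] for t in range(start, end))


# Toy posteriors for 3 frames and 3 phones (made-up numbers).
toy_posteriors = [
    [0.9000, 0.0995, 0.0005],
    [0.6000, 0.3980, 0.0020],
    [0.0004, 0.9000, 0.0996],
]
toy_active = active_phones(toy_posteriors)
```

In this toy example, phone 0 is deactivated in the last frame and phone 2 in the first, so the search never needs to extend hypotheses through those phone/frame combinations.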

It is well known that recognition performance degrades significantly when moving from a speaker-dependent to a speaker-independent system. Traditional hidden Markov model (HMM) systems have successfully applied speaker-adaptation approaches to reduce this degradation. In this paper we present and evaluate some techniques for speaker-adaptation of a hybrid HMM-artificial neural network (ANN) continuous speech recognition system. These techniques are applied to a well-trained, speaker-independent, hybrid HMM-ANN system and the recognizer parameters are adapted to a new speaker through off-line procedures. The techniques are evaluated on the DARPA RM corpus using varying amounts of adaptation material and different ANN architectures. The results show that speaker-adaptation within the hybrid framework can substantially improve system performance.

ABBOT is the hybrid connectionist-hidden Markov model (HMM) large-vocabulary continuous speech recognition (CSR) system developed at Cambridge University. This system uses a recurrent network to estimate the acoustic observation probabilities within an HMM framework. A major advantage of this approach is that good performance is achieved using context-independent acoustic models with far fewer parameters than comparable HMM systems. This paper presents substantial performance improvements gained from new approaches to connectionist model combination and phone-duration modeling. Additional capability has also been achieved by extending the decoder to handle larger vocabulary tasks (20,000 words and greater) with a trigram language model. This paper describes the recent modifications to the system and reports experimental results for various test and development sets from the November 1992, 1993, and 1994 ARPA evaluations of spoken language systems.
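
As a rough illustration of how a network can supply observation probabilities inside an HMM, the sketch below uses the conversion commonly applied in hybrid systems: dividing each posterior phone probability by the phone's prior to obtain a scaled likelihood. That ABBOT uses exactly this form is an assumption, and the function name, the floor value, and the numbers are hypothetical.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert per-phone posteriors P(phone | acoustics) for one frame into
    scaled log-likelihoods log(P(phone | acoustics) / P(phone)).

    By Bayes' rule this equals log p(acoustics | phone) up to a term that is
    the same for every phone, so it can stand in for the HMM observation
    score. `floor` (an assumed value) guards against log(0).
    """
    return [
        math.log(max(post, floor)) - math.log(max(prior, floor))
        for post, prior in zip(posteriors, priors)
    ]


# Made-up numbers for a single frame and three phones.
frame_scores = scaled_log_likelihoods(
    posteriors=[0.50, 0.25, 0.25],
    priors=[0.25, 0.25, 0.50],
)
```

Here the first phone is twice as likely as its prior suggests, the second matches its prior, and the third is half as likely, so the scores are log 2, 0, and -log 2 respectively.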

This paper describes detectors for accents, phrase boundaries, and sentence modality that derive prosodic features only from the speech signal and its fundamental frequency. They are intended to support other modules of a speech understanding system in an early analysis stage, or in cases where no word hypotheses are available. A new method for interpolating and decomposing the fundamental frequency is suggested. The detectors' underlying Gaussian distribution classifiers were trained and tested with approximately 50 minutes of spontaneous speech, yielding recognition rates of 78 percent for accents, 81 percent for phrase boundaries, and 85 percent for sentence modality.
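
A Gaussian distribution classifier of the general kind mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes diagonal covariances, and the two-dimensional toy features, the class labels, and the function names are all hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussians(features, labels, var_floor=1e-6):
    """Estimate a diagonal-covariance Gaussian and a log prior per class."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)
    total = len(features)
    models = {}
    for y, rows in grouped.items():
        n, dims = len(rows), len(rows[0])
        mean = [sum(r[d] for r in rows) / n for d in range(dims)]
        # Floor the variance so a constant feature cannot cause division by zero.
        var = [
            max(sum((r[d] - mean[d]) ** 2 for r in rows) / n, var_floor)
            for d in range(dims)
        ]
        models[y] = (mean, var, math.log(n / total))
    return models


def classify(models, x):
    """Return the class with the highest log-likelihood plus log prior."""
    def score(model):
        mean, var, log_prior = model
        log_lik = sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mean, var)
        )
        return log_lik + log_prior
    return max(models, key=lambda y: score(models[y]))


# Toy prosodic feature vectors (made-up values, e.g. F0 range and energy).
toy_features = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3],
                [3.0, 1.1], [3.2, 0.9], [2.9, 1.0]]
toy_labels = ["unaccented"] * 3 + ["accented"] * 3
toy_models = fit_gaussians(toy_features, toy_labels)
```

Calling `classify(toy_models, [3.1, 1.0])` then assigns a new feature vector to whichever class Gaussian explains it best; the same structure would apply to phrase-boundary or sentence-modality labels.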