Intonation in Dialogue for Speech Recognition
A crucial consideration in speech recognition is how to constrain the set of possibilities for what might have been said. The fewer the possibilities, the better the chances of picking out the right one. In most current speech recognition systems, the main sources of constraint are the lexicon (the speaker is assumed to be producing a sequence of words from a known stock) and the grammar (some sequences of words are assumed to be impossible or highly unlikely). This project aims to improve the accuracy of the speech recogniser by adding two further sources of constraint: restrictions on the way that dialogue acts can follow one another, and correlations between types of dialogue act and prosody.
In a dialogue, there is a strong connection between the intonation of an utterance and the "conversational move" that the utterance constitutes. By considering what moves are possible at a given point in a dialogue, and which ones would go together with the intonation of the current utterance, it should be possible to narrow down the number of possible utterances and thereby improve recognition accuracy. A very simple case would be one in which a yes/no question had been asked, and a brief reply uttered, low in the speaker's pitch range with little pitch movement. The chances are very good that the reply amounts to yes or no. An objection to the question or a refusal to answer would be more intonationally marked.
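The yes/no example above can be read as a simple Bayesian combination of two knowledge sources: which moves are likely after the previous move, and which intonation each move tends to carry. The sketch below is illustrative only. The move labels, the two intonation classes (`low_flat`, `marked`) and all probabilities are hypothetical values chosen for the example, not figures from the project.

```python
# Hypothetical numbers for illustration only; none are taken from the project.

# P(move | previous move was a yes/no question): the dialogue constraint.
MOVE_PRIOR_AFTER_YN_QUESTION = {
    "reply_y": 0.45,
    "reply_n": 0.35,
    "objection": 0.10,
    "refusal": 0.10,
}

# P(intonation class | move): the prosodic correlation.
INTONATION_LIKELIHOOD = {
    "reply_y": {"low_flat": 0.8, "marked": 0.2},
    "reply_n": {"low_flat": 0.8, "marked": 0.2},
    "objection": {"low_flat": 0.2, "marked": 0.8},
    "refusal": {"low_flat": 0.2, "marked": 0.8},
}


def move_posterior(prior, likelihood, intonation):
    """Return P(move | dialogue context, intonation) by Bayes' rule."""
    unnormalised = {
        move: prior[move] * likelihood[move][intonation] for move in prior
    }
    total = sum(unnormalised.values())
    return {move: score / total for move, score in unnormalised.items()}


# A brief reply, low in the pitch range with little pitch movement.
posterior = move_posterior(
    MOVE_PRIOR_AFTER_YN_QUESTION, INTONATION_LIKELIHOOD, "low_flat"
)
for move, probability in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{move}: {probability:.2f}")
```

With these made-up numbers, observing flat, low intonation concentrates almost all of the probability on the two reply moves, which is the narrowing effect the paragraph describes.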
Past work looked at how to build a basic acoustic/phonetic intonation analyser, and at how various intonational patterns relate to dialogue moves. This project concentrates on improving these components and integrating them so as to guide the speech recogniser's grammar.
- Steve Isard (PI)
- Simon King
- Jacqueline Kowtko
- Paul Taylor
Current Progress
So far we have built:
- A speaker-independent speech recogniser for the Map Task.
- An HMM-based, bottom-up intonation recognition system.
- An intonation analysis system which uses the Tilt theory of intonation to provide a parameterised intonational tune description for a phrase.
- A system which predicts the most suitable game move for a given utterance and then selects a bigram language model which has been trained on data of this type.
Future Work
- The baseline performance of the recogniser should improve as we train on more data and use context-dependent phones.
- We will develop a better architecture for the system to avoid some of the hard decisions currently being made.
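The move-specific language model selection listed under current progress can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the move labels, the toy training sentences, the add-one smoothing and the function names `train_bigram` and `select_language_model` are all hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Train an add-one-smoothed bigram model; return a log-probability scorer."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    vocab_size = len(vocab) + 1  # one extra slot for unseen words

    def log_prob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log(
                (bigram_counts[(prev, cur)] + 1)
                / (context_counts[prev] + vocab_size)
            )
            for prev, cur in zip(tokens, tokens[1:])
        )

    return log_prob


# Toy training data per move type (hypothetical, for illustration only).
TRAINING = {
    "reply_y": ["yes", "yes i do", "yeah"],
    "instruct": ["go left", "go round the lake", "go down to the mill"],
}
MODELS = {move: train_bigram(sents) for move, sents in TRAINING.items()}


def select_language_model(move_scores):
    """Pick the single most probable move and return its bigram model."""
    best_move = max(move_scores, key=move_scores.get)
    return best_move, MODELS[best_move]


# Move scores would come from the move predictor; these values are made up.
move, lm = select_language_model({"reply_y": 0.7, "instruct": 0.3})
print(move, lm("yes"), lm("go left"))
```

Committing to a single best move, as `select_language_model` does here, is one example of a hard decision. A softer alternative, offered only as a possibility, would be to weight each move's model by the move's probability rather than choosing one.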
This work was funded by the Engineering and Physical Sciences Research Council. Duration: October 1993 - March 1997