ESPRESSO I:
Phonetically-featured syllables for speech recognition
The first grant for the Espresso project ran from October 1997 to September 1999.
Introduction
The use of syllables as the unit for recognition, rather than the ubiquitous phone, has two advantages:
- syllables exhibit far less context dependence than phones
- co-articulation can be modelled by allowing phonetic (or other) features to overlap within syllables.
The second property means that groups of features (such as voicing, frication, etc.) are not rigidly aligned at phone boundaries, and can spread into neighbouring segments. In traditional phone-based recognisers, this is accommodated by context-dependent units (typically triphones). We believe that a syllable model can represent this feature overlap in a more natural way by assigning feature values to whole syllables rather than to phones. One possible scheme might be to describe a syllable as having a voiced coda, nasal onset, and so on.
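To make the idea of syllable-level feature assignment concrete, here is a minimal sketch in Python. It is purely illustrative: the class name `SyllableFeatures`, the onset/nucleus/coda split, and the feature labels are assumptions chosen for the example, not the representation used in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyllableFeatures:
    """Hypothetical syllable description: features attach to syllable
    positions (onset, nucleus, coda) rather than to individual phones."""
    onset: frozenset = frozenset()
    nucleus: frozenset = frozenset()
    coda: frozenset = frozenset()

    def has(self, position, feature):
        """True if `feature` is present at `position` ('onset', 'nucleus' or 'coda')."""
        return feature in getattr(self, position)

    def spanning(self):
        """Features present at every position, i.e. spread over the whole syllable."""
        return self.onset & self.nucleus & self.coda


# Illustrative labels for a syllable such as "man": nasal onset and coda,
# with voicing running through all three positions.
man = SyllableFeatures(
    onset=frozenset({"nasal", "voiced"}),
    nucleus=frozenset({"voiced"}),
    coda=frozenset({"nasal", "voiced"}),
)
print(man.has("coda", "voiced"), sorted(man.spanning()))
```

In a description of this kind, a feature that spreads across segment boundaries is simply recorded at more than one position, so no context-dependent phone inventory is needed to express it.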
Personnel
Summary of findings
Experimental work used the TIMIT database, which has high quality word and phone-level labels.
Feature detection using
- hidden Markov models
- neural networks
Two feature systems have been investigated: Chomsky and Halle's binary system from "The Sound Pattern of English" (SPE) and a multivalued system adapted from work by Kirchhoff (see the ICSLP '98 paper for details). Both systems can be successfully recognised by neural networks (NNs) and, less successfully, by HMMs.
Phone and syllable recognition using HMMs
The multivalued system described above has been used in recognition experiments with HMMs. Models were trained to recognise either phones or syllables from the NN output. The results of these experiments, and of the feature detection experiments, are reported in the ICSLP '98 paper.
Syllable classification using simple trajectory templates
A new syllable model is under development. A simple version has been investigated and shows promise. It represents the changing feature values over the course of the syllable by polynomial trajectories. Training these models from segmented data involves a straightforward least-squares solution of linear equations. Classification experiments (i.e. with segmented test data) have demonstrated the potential of this approach.
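The following is a minimal sketch of how polynomial trajectory templates of this kind might be trained and used for classification, written in Python with NumPy. Several things here are assumptions rather than details of the project's model: the function names (`design_matrix`, `fit_template`, `template_error`, `classify`), the quadratic default order, the normalisation of each syllable's duration to the interval [0, 1], and the minimum mean-squared-error decision rule.

```python
import numpy as np


def design_matrix(n_frames, order):
    """Polynomial basis (1, t, t^2, ...) over normalised time t in [0, 1]."""
    t = np.linspace(0.0, 1.0, n_frames)
    return np.vander(t, order + 1, increasing=True)


def fit_template(segments, order=2):
    """Least-squares polynomial trajectory for one syllable class.

    Each segment is an (n_frames, n_features) array of feature values;
    segments may differ in length because time is normalised.
    Returns coefficients of shape (order + 1, n_features).
    """
    A = np.vstack([design_matrix(len(s), order) for s in segments])
    Y = np.vstack(segments)
    coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return coeffs


def template_error(segment, coeffs):
    """Mean squared distance between a segment and a template trajectory."""
    A = design_matrix(len(segment), coeffs.shape[0] - 1)
    return float(np.mean((segment - A @ coeffs) ** 2))


def classify(segment, templates):
    """Label of the template (dict: label -> coeffs) that fits the segment best."""
    return min(templates, key=lambda label: template_error(segment, templates[label]))


# Synthetic demonstration with one feature that either rises or falls
# through the syllable (made-up data, not TIMIT).
rng = np.random.default_rng(0)


def synthetic(n_frames, rising):
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    clean = t if rising else 1.0 - t
    return clean + 0.05 * rng.standard_normal((n_frames, 1))


templates = {
    "rising": fit_template([synthetic(n, True) for n in (8, 10, 12)]),
    "falling": fit_template([synthetic(n, False) for n in (8, 10, 12)]),
}
print(classify(synthetic(9, True), templates))
```

Because the trajectory is linear in its coefficients, pooling all training segments of a class into one stacked system gives the template in a single least-squares solve. Classification in this sketch requires the syllable boundaries to be known in advance, which matches the segmented-data setting described above.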
Future Work
Several areas have been identified for future work:
- Continued development of new trajectory-based syllable models, possibly with dynamical system models of the trajectories. Such models are clearly related to segmental Markov models, and this will be examined.
- Development of training and recognition algorithms for the new models, to remove the need for segmented data. Ultimately this may include embedded training of the neural network and syllable models together, techniques for parameter sharing between models, etc.
- Exploration of more feature systems, with articulatory or perceptual bases.
- Use of real articulatory data acquired using the electro-magnetic articulograph facility to develop and train articulatory models.
- Adoption of a larger and more realistic database, such as the Wall Street Journal.
Publications
- You can find all the publications relating to Espresso here.
- Student reports and dissertations can be found here.
- Conference posters are available for ICPhS99 and ICSLP98.
- The final report to the EPSRC and a summary are also available.
Funding
This work was funded by the Engineering and Physical Sciences Research Council (EPSRC), grant number GR/L59566: "Phonetically Featured Syllables for Speech Recognition", under the "Realising Our Potential Awards" (ROPA) scheme.
Contact Simon King for more details.