The Centre for Speech Technology Research, The University of Edinburgh

Phonetically-featured syllables for speech recognition

The first grant for the Espresso project ran from October 1997 to September 1999.


The use of syllables as the unit for recognition, rather than the ubiquitous phone, has two advantages:

  1. syllables exhibit far less context dependence than phones
  2. co-articulation can be modelled by allowing phonetic (or other) features to overlap within syllables.

The second property means that groups of features (such as voicing, frication, etc.) are not rigidly aligned at phone boundaries, and can spread into neighbouring segments. In traditional phone-based recognisers, this is accommodated by context-dependent units (typically triphones). We believe that a syllable model can represent this feature overlap in a more natural way by assigning feature values to whole syllables rather than to phones. One possible scheme might be to describe a syllable as having a voiced coda, nasal onset, and so on.
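The scheme described above can be sketched as a small data structure. This is only an illustration, not the project's actual representation: the class name Syllable, the three positions (onset, nucleus, coda), the feature names and the example word are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syllable:
    """A syllable described by features on its parts, not by a phone string.

    Hypothetical structure for illustration only.
    """
    label: str
    onset: frozenset = frozenset()
    nucleus: frozenset = frozenset()
    coda: frozenset = frozenset()

    def has(self, position: str, feature: str) -> bool:
        """Return True if the given part of the syllable carries the feature."""
        if position not in ("onset", "nucleus", "coda"):
            raise ValueError(f"unknown position: {position}")
        return feature in getattr(self, position)


# Assumed example: the word "mad" as a syllable with a nasal onset
# and a voiced coda. The feature inventory here is made up.
mad = Syllable(
    label="mad",
    onset=frozenset({"nasal", "voiced"}),
    nucleus=frozenset({"vowel", "voiced"}),
    coda=frozenset({"stop", "voiced"}),
)

print(mad.has("onset", "nasal"))  # nasal onset
print(mad.has("coda", "voiced"))  # voiced coda
```

Because each feature is attached to a part of the syllable rather than to a single phone, a feature such as voicing can be stated once for every part it spans, with no need for a separate context-dependent unit.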


Summary of findings

Experimental work used the TIMIT database, which has high-quality word- and phone-level labels.

Feature detection using

Two feature systems have been investigated: Chomsky and Halle's binary system from "The Sound Pattern of English" (SPE) and a multivalued system adapted from work by Kirchhoff (see the ICSLP '98 paper for details). Both systems can be successfully recognised by neural networks (NNs) and, less successfully, by HMMs.

Phone and syllable recognition using HMMs

The multivalued system described above has been used in recognition experiments using HMMs. Models were trained to recognise either phones or syllables from the NN output. The results of these experiments, and of the feature detection experiments, are reported in the ICSLP '98 paper.

Syllable classification using simple trajectory templates

A new syllable model is under development. A simple version has been investigated and shows promise. This model represents the changing feature values over the course of the syllable as polynomial trajectories. Training these models from segmented data involves a straightforward least-squares solution of linear equations. Classification experiments (i.e. with segmented test data) have demonstrated the potential of this approach.
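The idea of fitting polynomial trajectories by least squares can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's implementation: it handles a single feature per syllable, normalises time to the range 0 to 1, pools all training tokens of a class into one fit, and classifies by mean squared error. All function names and the toy data are hypothetical.

```python
def times(n):
    """Normalised time points in [0, 1] for a token of n frames (n >= 2)."""
    return [i / (n - 1) for i in range(n)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_trajectory(tokens, degree=2):
    """Least-squares polynomial coefficients, pooled over all tokens of a class.

    Builds and solves the normal equations (A^T A) w = A^T y, where each row
    of A is [1, t, t^2, ...] for one frame.
    """
    size = degree + 1
    ata = [[0.0] * size for _ in range(size)]
    aty = [0.0] * size
    for token in tokens:
        for t, y in zip(times(len(token)), token):
            row = [t ** k for k in range(size)]
            for i in range(size):
                aty[i] += row[i] * y
                for j in range(size):
                    ata[i][j] += row[i] * row[j]
    return solve(ata, aty)


def distance(coeffs, token):
    """Mean squared error between a token and a polynomial trajectory."""
    total = 0.0
    for t, y in zip(times(len(token)), token):
        predicted = sum(c * t ** k for k, c in enumerate(coeffs))
        total += (predicted - y) ** 2
    return total / len(token)


def classify(models, token):
    """Return the label whose trajectory is closest to the segmented token."""
    return min(models, key=lambda label: distance(models[label], token))


# Toy, made-up training data: one feature value per frame, two classes.
training = {
    "rising": [[0.0, 0.5, 1.0], [0.1, 0.4, 0.6, 0.9]],
    "falling": [[1.0, 0.5, 0.0], [0.9, 0.6, 0.3, 0.1]],
}
models = {label: fit_trajectory(tokens) for label, tokens in training.items()}

print(classify(models, [0.0, 0.3, 0.7, 1.0, 1.0]))
```

Normalising time lets tokens of different lengths share one trajectory. With several features, one simple option would be to fit each feature independently and sum the per-feature errors at classification time.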

Future Work

Several areas have been identified for future work:



This work was funded by the Engineering and Physical Sciences Research Council (EPSRC), grant number GR/L59566: "Phonetically Featured Syllables for Speech Recognition", under the "Realising Our Potential Awards" (ROPA) scheme.

Contact Simon King for more details.