The Centre for Speech Technology Research, The University of Edinburgh

03 Jun 2003

Antti-Veikko Rosti (Cambridge University)


Switching linear dynamical systems for speech recognition

Currently the most popular acoustic model for speech recognition is the hidden Markov model (HMM). However, HMMs are based on a series of assumptions, some of which are known to be poor; in particular, the assumption that successive speech frames are conditionally independent given the state that generated them. To overcome this, segment models have been proposed, which model whole segments of frames rather than individual frames. One form is the stochastic segment model (SSM), which uses a standard linear dynamical system to model the sequence of observations within a segment. Here the dynamics are modelled by a first-order Gauss-Markov process in a low-dimensional state space, and the feature vector is a noise-corrupted linear transformation of the state vector. Although the training and recognition algorithms are more complex than those of HMMs, standard techniques can be used for inference with SSMs.
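As a rough sketch of the state-space form described above (the symbols A, C, Q and R are illustrative notation for this summary, not taken from the talk), the state and observation equations of such a linear dynamical system can be written as

    x_{t+1} = A x_t + w_t,   w_t \sim \mathcal{N}(0, Q)
    o_t     = C x_t + v_t,   v_t \sim \mathcal{N}(0, R)

where x_t is the low-dimensional state vector evolving as a first-order Gauss-Markov process, and o_t is the observed feature vector, a noise-corrupted linear transformation of x_t.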

For the SSM, segments are assumed to be independent. Intuitively, this is not always valid due to co-articulation between the modelling units. Switching linear dynamical systems (SLDS) have therefore been proposed. In an SLDS, the posterior distribution of the state vector is propagated between segments. Unfortunately, exact inference in an SLDS is intractable because the number of posterior components grows exponentially with time. In this talk, approximate methods for inference in SLDSs will be presented. First, approximate methods based on a heuristic Viterbi-like algorithm will be covered. Alternatively, variational learning may be used. Finally, approaches based on Markov chain Monte Carlo methods can be applied, including a training scheme based on stochastic expectation maximisation (SEM). For the SEM scheme, convergence and implementation issues specific to SLDSs will be discussed in detail.
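As an illustration of the basic computation that the SSM and the approximate SLDS inference schemes build on, below is a minimal Kalman filter pass over one linear dynamical segment. This is a standard textbook routine written for this summary, not code from the talk, and the function and variable names are purely illustrative.

    import numpy as np

    def kalman_filter(obs, A, C, Q, R, mu0, P0):
        """Filter one segment of observations with a linear dynamical system.

        obs    : (T, d_obs) array of feature vectors
        A, Q   : state transition matrix and process noise covariance
        C, R   : observation matrix and observation noise covariance
        mu0, P0: prior mean and covariance of the initial state
        Returns the filtered state means and covariances.
        """
        T = obs.shape[0]
        d = A.shape[0]
        means = np.zeros((T, d))
        covs = np.zeros((T, d, d))
        mu, P = mu0, P0
        for t in range(T):
            # Predict: propagate the state estimate through the linear dynamics
            if t > 0:
                mu = A @ mu
                P = A @ P @ A.T + Q
            # Update: correct the prediction with the observed feature vector
            S = C @ P @ C.T + R                 # innovation covariance
            K = P @ C.T @ np.linalg.inv(S)      # Kalman gain
            mu = mu + K @ (obs[t] - C @ mu)
            P = P - K @ C @ P
            means[t], covs[t] = mu, P
        return means, covs

In the approximate SLDS schemes mentioned above, a pass of this kind would typically be run conditioned on a hypothesised switching (segment) sequence, with the final state posterior of one segment carried over as the prior for the next, in line with the propagation of the state posterior between segments described in the abstract.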

