29 Jan 2002
Dr Philip Jackson (University of Birmingham)
Data-driven, non-linear, formant-to-acoustic mapping for ASR
The underlying dynamics of speech can be captured in an automatic speech recognition system via an articulatory representation, which resides in a domain other than that of the acoustic observations. Given a set of models in this hidden domain, a mapping is therefore required to relate the intermediate representation to the acoustic domain. In this talk, two methods for mapping from formants to short-term spectra will be compared: multi-layer perceptrons (MLPs) and radial basis function (RBF) networks. Both are capable of providing non-linear transformations, and both were trained on features extracted from the TIMIT database. Various schemes for dividing the frames of speech data according to their phone class will also be discussed. Results show that the RBF networks outperform the MLPs by approximately 10% in terms of RMS error, and that a classification based on discrete regions of the articulatory space gives the greatest improvement over a single network.
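As background to the methods compared in the talk, the RBF-network approach can be sketched in a few lines: Gaussian basis functions centred on points in the input space feed a linear output layer, which is solved by least squares. The sketch below is illustrative only; the toy 2-D input and smooth target stand in for the actual formant-to-spectrum data, and the centre-selection and width choices are assumptions, not details from the talk.

```python
import numpy as np

def rbf_features(X, centers, width):
    # Gaussian radial basis activations:
    # phi[i, j] = exp(-||x_i - c_j||^2 / (2 * width^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def train_rbf(X, Y, n_centers=40, width=0.4, seed=0):
    # Choose centres as a random subset of the training inputs
    # (one common heuristic), then solve the linear output layer
    # in closed form by least squares.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_centers, replace=False)]
    Phi = rbf_features(X, centers, width)
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return centers, width, W

def predict_rbf(X, centers, width, W):
    return rbf_features(X, centers, width) @ W

# Toy stand-in for the formant-to-spectrum task: a smooth
# non-linear map from a 2-D "formant" input to a 2-D target.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
Y = np.hstack([np.sin(3.0 * X[:, :1]), np.cos(2.0 * X[:, 1:])])

centers, width, W = train_rbf(X, Y)
pred = predict_rbf(X, centers, width, W)
rms = np.sqrt(np.mean((pred - Y) ** 2))
```

Unlike an MLP, which needs iterative gradient training, the RBF network's output weights have a closed-form solution once the centres are fixed, which is one practical reason the two architectures are worth comparing on the same task.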