The Centre for Speech Technology Research, The University of Edinburgh

Publications by Heng Lu

[1] Heng Lu, Simon King, and Oliver Watts. Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis. In 8th ISCA Workshop on Speech Synthesis, pages 281-285, Barcelona, Spain, August 2013. [ bib | .pdf ]
Conventional statistical parametric speech synthesis relies on decision trees to cluster together similar contexts, resulting in tied-parameter context-dependent hidden Markov models (HMMs). However, decision tree clustering has a major weakness: it uses hard division and subdivides the model space based on one feature at a time, fragmenting the data and failing to exploit interactions between linguistic context features. These linguistic features themselves are also problematic, being noisy and of varied relevance to the acoustics. We propose to combine our previous work on vector-space representations of linguistic context, which have the added advantage of working directly from textual input, and Deep Neural Networks (DNNs), which can directly accept such continuous representations as input. The outputs of the network are probability distributions over speech features. Maximum Likelihood Parameter Generation is then used to create parameter trajectories, which in turn drive a vocoder to generate the waveform. Various configurations of the system are compared, using both conventional and vector space context representations and with the DNN making speech parameter predictions at two different temporal resolutions: frames, or states. Both objective and subjective results are presented.

[2] Heng Lu and Simon King. Factorized context modelling for text-to-speech synthesis. In Proc. ICASSP, Vancouver, Canada, May 2013. [ bib | .pdf ]
Because speech units are so context-dependent, a large number of linguistic context features are generally used by HMM-based Text-to-Speech (TTS) speech synthesis systems, via context-dependent models. Since it is impossible to train separate models for every context, decision trees are used to discover the most important combinations of features that should be modelled. The task of the decision tree is very hard: to generalize from a very small observed part of the context feature space to the rest. Decision trees also have a major weakness: they cannot directly take advantage of factorial properties, because they subdivide the model space based on one feature at a time. We propose a Dynamic Bayesian Network (DBN) based Mixed Memory Markov Model (MMMM) to provide factorization of the context space. The results of a listening test are provided as evidence that the model successfully learns the factorial nature of this space.

[3] Heng Lu and Simon King. Using Bayesian networks to find relevant context features for HMM-based speech synthesis. In Proc. Interspeech, Portland, Oregon, USA, September 2012. [ bib | .pdf ]
Speech units are highly context-dependent, so taking contextual features into account is essential for speech modelling. Context is employed in HMM-based Text-to-Speech speech synthesis systems via context-dependent phone models. A very wide context is taken into account, represented by a large set of contextual factors. However, most of these factors probably have no significant influence on the speech, most of the time. To discover which combinations of features should be taken into account, decision tree-based context clustering is used. But the space of context-dependent models is vast, and the number of contexts seen in the training data is only a tiny fraction of this space, so the task of the decision tree is very hard: to generalise from observations of a tiny fraction of the space to the rest of the space, whilst ignoring uninformative or redundant context features. The structure of the context feature space has not been systematically studied for speech synthesis. In this paper we discover a dependency structure by learning a Bayesian Network over the joint distribution of the features and the speech. We demonstrate that it is possible to discard the majority of context features with minimal impact on quality, measured by a perceptual test.

Keywords: HMM-based speech synthesis, Bayesian Networks, context information