Integrating Articulatory Features into HMM Synthesis

Perceptual test stimuli

This page presents synthetic stimuli used in perceptual tests recently conducted at CSTR. The aim was to evaluate our work on introducing articulatory control into HMM-based speech synthesis. These audio examples accompany the journal paper:

[1] Z. Ling, K. Richmond, J. Yamagishi, and R. Wang. Integrating articulatory features into HMM-based parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing, 2008. (submitted).

The following table gives the synthetic stimuli used in the listening test described in Section III.E of [1].

Change in tongue height (cm)	+1.5	+1.0	+0.5	-0.5	-1.0	-1.5
bet	+1.5	+1.0	+0.5	-0.5	-1.0	-1.5
hem	+1.5	+1.0	+0.5	-0.5	-1.0	-1.5
led	+1.5	+1.0	+0.5	-0.5	-1.0	-1.5
peck	+1.5	+1.0	+0.5	-0.5	-1.0	-1.5
set	+1.5	+1.0	+0.5	-0.5	-1.0	-1.5

Synthetic stimuli used to evaluate vowel quality modification using feature-dependency modelling

This experiment was designed to demonstrate the feasibility of controlling the quality of synthesised vowels by manipulating articulatory features. Five monosyllabic words ("bet", "hem", "led", "peck", and "set") containing the vowel [ɛ] were selected and embedded in the carrier sentence "Now we'll say ... again". We then synthesised 35 stimuli: 7 variants of each word, each with a different modification to the articulatory parameters corresponding to tongue height. The modification ranged from -1.5 cm (lower) to +1.5 cm (higher) in 0.5 cm steps, which includes the unmodified case (0 cm); the table above lists the six non-zero modifications for each word. All other parameters remained constant. Note that the articulatory parameters are generated by the trained model itself, so no new articulatory data is required for synthesis.
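The stimulus design described above can be enumerated in a few lines of Python. This is only a sketch of the experimental grid, not the synthesis system itself; the names WORDS, OFFSETS_CM, CARRIER and stimulus_grid are made up for illustration.

```python
from itertools import product

# The five test words and the carrier sentence come from the description above.
WORDS = ["bet", "hem", "led", "peck", "set"]
CARRIER = "Now we'll say {word} again."

# Tongue-height offsets in cm: -1.5 to +1.5 in 0.5 cm steps (0.0 = unmodified).
OFFSETS_CM = [step * 0.5 for step in range(-3, 4)]


def stimulus_grid():
    """Return one (word, offset_cm, sentence) tuple per stimulus."""
    return [
        (word, offset, CARRIER.format(word=word))
        for word, offset in product(WORDS, OFFSETS_CM)
    ]


grid = stimulus_grid()
print(len(grid))  # 5 words x 7 offsets = 35
```

Each tuple pairs a word with one tongue-height offset, so the grid covers every combination exactly once.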

Listening to these examples, we clearly hear a change in vowel quality as the tongue is raised: "set" comes to sound like "sit" ([ɪ] vowel). Conversely, as the tongue is lowered, "set" comes to sound like "sat" ([æ] vowel). This is consistent with phonetic knowledge of speech production: [ɪ] is articulated with a higher tongue position than [ɛ], and [æ] with a lower one.

Brief background

The hidden Markov model (HMM), widely used in speech recognition for many years, has over the last decade also been researched intensively as a means of speech synthesis. To synthesise speech, the spectra, F0 and segment durations are predicted directly from the HMMs and then passed to a parametric synthesiser, which generates the speech waveform. This method can produce highly intelligible and consistent speech.

A significant advantage over waveform concatenation (e.g. unit selection) is that HMM-based synthesis is far more flexible. Many model adaptation and interpolation methods can be used to control the model parameters and hence diversify the generated speech. However, this flexibility typically relies on data-driven learning and is constrained by the nature of the available training or adaptation data. For example, to synthesise foreign words, speech samples containing the necessary sounds must first be provided. It is very difficult to apply phonetic knowledge directly to the generation of acoustic features.

In this work, we have attempted to integrate articulatory data into HMM-based synthesis with two aims. First, we hope to benefit from the attractive properties of parallel articulatory parameters (smoothly and slowly varying, subject to physiological constraints etc.) to improve the statistical modelling of the acoustic parameters. Second, we aim to introduce articulatory control; since articulatory features have physiological meaning, it is far more convenient to modify them according to linguistic knowledge than to modify the acoustic features directly. Hence, we want to model the relationship between articulation and acoustics so that we can manipulate the articulatory description directly in order to change the synthesised speech.
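To make the idea of articulatory control concrete, the toy sketch below assumes a simple linear dependency in which the mean of the acoustic features is A·y + mu for an articulatory vector y. This linear form, the matrix A, the vector mu, the two-dimensional feature vectors and the choice of index for tongue height are all illustrative assumptions with made-up numbers; see [1] for the formulation actually used.

```python
# Toy illustration only: every number and name below is hypothetical.
TONGUE_HEIGHT = 1  # assumed index of the tongue-height dimension


def shift_tongue_height(artic, delta_cm):
    """Return a copy of the articulatory vector with tongue height shifted."""
    shifted = list(artic)
    shifted[TONGUE_HEIGHT] += delta_cm
    return shifted


def acoustic_mean(A, mu, artic):
    """Acoustic mean under the assumed linear dependency: A @ artic + mu."""
    return [
        sum(a * y for a, y in zip(row, artic)) + m
        for row, m in zip(A, mu)
    ]


A = [[0.2, -0.8], [0.5, 0.1]]  # made-up dependency matrix (acoustic x articulatory)
mu = [1.0, 0.0]                # made-up acoustic offset
artic = [0.3, 1.2]             # made-up articulatory features for one frame

base = acoustic_mean(A, mu, artic)
raised = acoustic_mean(A, mu, shift_tongue_height(artic, 0.5))

# Under a linear dependency, the acoustic change is the tongue-height
# column of A scaled by the shift, regardless of the starting position.
change = [r - b for r, b in zip(raised, base)]
print(change)
```

The point of the sketch is that a modification specified in articulatory terms (raise the tongue by 0.5 cm) is turned into an acoustic change by the learned dependency, so the acoustic features never need to be edited by hand.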

(Full details of the techniques mentioned here can be found in the paper referenced above.)

Personnel

Zhen-Hua Ling, Korin Richmond, Junichi Yamagishi, Ren-Hua Wang.