Integrating Articulatory Features into HMM Synthesis
Perceptual test stimuli
This page presents synthetic stimuli used in perceptual tests recently
conducted at CSTR. The aim was to evaluate our work on introducing
articulatory control into HMM-based speech synthesis. These audio
examples accompany the journal paper:
[1] Z. Ling, K. Richmond, J. Yamagishi, and R. Wang.
Integrating articulatory features into HMM-based parametric speech
synthesis.
IEEE Transactions on Audio, Speech and Language Processing,
2008 (submitted).
The following table gives the synthetic stimuli used in the listening test described in Section III.E.
Change in tongue height (cm) | +1.5 | +1.0 | +0.5 | 0 | -0.5 | -1.0 | -1.5 |
bet | +1.5 | +1.0 | +0.5 | 0 | -0.5 | -1.0 | -1.5 |
hem | +1.5 | +1.0 | +0.5 | 0 | -0.5 | -1.0 | -1.5 |
led | +1.5 | +1.0 | +0.5 | 0 | -0.5 | -1.0 | -1.5 |
peck | +1.5 | +1.0 | +0.5 | 0 | -0.5 | -1.0 | -1.5 |
set | +1.5 | +1.0 | +0.5 | 0 | -0.5 | -1.0 | -1.5 |
This experiment was designed to demonstrate the feasibility of controlling the quality of synthesised vowels by manipulating articulatory features. Five monosyllabic words ("bet", "hem", "led", "peck", and "set") with the vowel [ε] were selected and embedded in the carrier sentence "Now we'll say ... again". We then synthesised the 35 stimuli in this table: 7 variants of each word, each with a different modification to the articulatory parameters corresponding to tongue height. The modification ranged from -1.5 cm (lower) to +1.5 cm (higher) in 0.5 cm steps. All other parameters were held constant. Note that the articulatory parameters are generated by the trained model itself, so no new articulatory data is required for synthesis.
Listening to these examples, we clearly hear a change in vowel quality as we raise the tongue, so that "set" comes to sound like the word "sit" ([ɪ] vowel). Conversely, when lowering the tongue, the word "set" comes to sound like "sat" ([æ] vowel). This is entirely consistent with knowledge about speech production.
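The stimulus design described above (5 words, 7 tongue-height offsets, one carrier sentence) can be enumerated with a short sketch. The names `WORDS`, `OFFSETS_CM`, `CARRIER` and `build_stimuli` are hypothetical and chosen only for illustration; this is not the tooling used to produce the actual stimuli.

```python
from itertools import product

# Words and tongue-height offsets (cm) as listed in the stimulus table.
WORDS = ["bet", "hem", "led", "peck", "set"]
OFFSETS_CM = [1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5]

# Carrier sentence from the text, with a slot for the test word.
CARRIER = "Now we'll say {word} again."


def build_stimuli(words=WORDS, offsets_cm=OFFSETS_CM):
    """Return one record per (word, offset) pair in table order."""
    return [
        {
            "word": word,
            "sentence": CARRIER.format(word=word),
            "tongue_height_offset_cm": offset,
        }
        for word, offset in product(words, offsets_cm)
    ]


stimuli = build_stimuli()
```

Each record pairs a carrier sentence with one offset, so the list has one entry per cell of the table, and the entries with offset 0 correspond to the unmodified synthesis.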
Brief background
The Hidden Markov Model (HMM), which has been widely used in the speech recognition field for many years, has within the last decade also been intensively researched as a means for speech synthesis. To synthesise speech, the spectra, F0 and segment duration are directly predicted from HMMs and then sent to a parametric synthesiser to generate a speech waveform. This method can produce highly intelligible and consistent speech.
A significant advantage of this method over waveform concatenation (e.g. unit selection) is its much greater flexibility. Many model adaptation and interpolation methods can be adopted to control the model parameters and hence diversify the generated speech. However, this flexibility typically relies on data-driven learning methods and is constrained by the nature of the training or adaptation data available. For example, to synthesise foreign words, speech samples containing the necessary sounds must first be provided. It is very difficult to apply phonetic knowledge directly to the generation of acoustic features.
In this work, we have attempted to integrate articulatory data into
HMM-based synthesis with two aims. First, we hope to benefit from the
attractive properties of parallel articulatory parameters (smoothly
and slowly varying, subject to physiological constraints etc.) to
improve the statistical modelling of the acoustic parameters. Second,
we aim to introduce articulatory control; since articulatory features
have physiological meaning, it is far more convenient to modify them
according to linguistic knowledge than to modify the acoustic features
directly. Hence, we want to model the relationship between
articulation and acoustics so that we can manipulate the articulatory
description directly in order to change the synthesised speech.
(Full details of the techniques mentioned here can be found in the
paper referenced above.)
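As a toy picture of what "manipulating the articulatory description to change the synthesised speech" can mean, the sketch below assumes the acoustic mean is a linear function of the articulatory features. This is an illustrative assumption only and is not claimed to be the model in the paper; the functions `acoustic_mean` and `shift_feature` and all numeric values are invented for the example. The one phonetic fact it relies on is that a higher tongue position goes with a lower first formant (F1).

```python
# Toy illustration only: a linear map from articulatory features to an
# acoustic mean. All coefficients below are invented for this sketch.


def acoustic_mean(artic, weights, bias):
    """Return bias + weights * artic, using plain Python lists."""
    return [
        b + sum(w * x for w, x in zip(row, artic))
        for row, b in zip(weights, bias)
    ]


def shift_feature(artic, index, delta):
    """Return a copy of artic with one feature offset by delta."""
    shifted = list(artic)
    shifted[index] += delta
    return shifted


# Hypothetical setup: one articulatory feature (tongue height offset, cm)
# and one acoustic feature (F1, Hz). The negative weight encodes
# "higher tongue, lower F1".
WEIGHTS = [[-100.0]]
BIAS = [600.0]

neutral = [0.0]
raised = shift_feature(neutral, 0, +1.0)
lowered = shift_feature(neutral, 0, -1.0)

f1_neutral = acoustic_mean(neutral, WEIGHTS, BIAS)
f1_raised = acoustic_mean(raised, WEIGHTS, BIAS)
f1_lowered = acoustic_mean(lowered, WEIGHTS, BIAS)
```

Under these assumed numbers, raising the tongue lowers the predicted F1 and lowering it raises F1, which is the direction of change one would expect when moving from [ε] towards [ɪ] or [æ] respectively.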
Personnel
- Zhen-Hua Ling, Korin Richmond, Junichi Yamagishi, Ren-Hua Wang.