The Centre for Speech Technology Research, The University of Edinburgh

Voice transformation

Project Summary

Transforming the quality and intonation of the speech of one speaker so that it sounds like another speaker

Project Details

Transforming Voice Quality and Intonation


Voice transformation is the process of transforming the characteristics of speech uttered by a source speaker, such that a listener would believe the speech was uttered by a target speaker. In this thesis two aspects of the transformation problem are addressed: voice quality and intonation.

The voice quality transformation component of our system has two main parts, corresponding to the two components of the source-filter model. The first part transforms the spectral envelope, represented by a linear prediction (LPC) model. The transformation is achieved using a Gaussian mixture model trained on aligned speech from the source and target speakers. The second part predicts the spectral detail from the transformed LPC parameters; here we propose a novel approach based on a classifier and residual codebooks. The system has some similarities with earlier work by Kain; however, the work reported here is not restricted to speech spoken in a monotone with mimicked prosody, and on a number of performance metrics it outperforms existing systems.
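The GMM-based envelope mapping described above can be sketched as a joint-density conversion: a GMM is fitted over stacked source and target spectral vectors, and each source frame is then converted by the minimum mean-square-error regression under that joint model. This is a minimal illustration of the general technique, not the exact system from the thesis; the feature dimension, mixture count, and function names are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(X_src, Y_tgt, n_components=4, seed=0):
    """Fit a GMM on joint [source; target] spectral vectors (e.g. LSFs),
    one row per time-aligned frame pair."""
    Z = np.hstack([X_src, Y_tgt])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='full', random_state=seed)
    gmm.fit(Z)
    return gmm

def convert_frame(gmm, x, dim):
    """MMSE mapping of one source vector x of dimension `dim`:
    F(x) = sum_m p(m|x) [mu_y_m + S_yx_m S_xx_m^{-1} (x - mu_x_m)]."""
    mu_x = gmm.means_[:, :dim]
    mu_y = gmm.means_[:, dim:]
    Sxx = gmm.covariances_[:, :dim, :dim]
    Syx = gmm.covariances_[:, dim:, :dim]
    # Posterior responsibilities p(m | x) under the marginal source GMM.
    px = np.array([gmm.weights_[m] *
                   multivariate_normal.pdf(x, mu_x[m], Sxx[m])
                   for m in range(gmm.n_components)])
    post = px / px.sum()
    y = np.zeros(dim)
    for m in range(gmm.n_components):
        y += post[m] * (mu_y[m] + Syx[m] @ np.linalg.solve(Sxx[m], x - mu_x[m]))
    return y
```

In practice the converted envelope parameters would then drive the residual-prediction stage to recover spectral detail.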

We also present a new method for transforming pitch contours from one speaker to another, based on a small, linguistically motivated parameter set. The system performs a piecewise-linear mapping using these parameters. A perceptual experiment clearly demonstrates that the presented system is at least as good as the existing technique for all speaker pairs, and that in many cases it is much better and almost as good as using the target pitch contour itself.
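The core of such a piecewise-linear pitch mapping can be sketched as follows: a handful of matched anchor values (for instance, F0 at linguistically salient points in source and target speech) define line segments, and any source F0 value is mapped by interpolating between them. The choice of anchors shown here is a hypothetical example, not the thesis's actual parameter set.

```python
import numpy as np

def fit_piecewise_linear_map(src_anchors, tgt_anchors):
    """Build a piecewise-linear F0 mapping from matched (source, target)
    anchor values, e.g. F0 at a few salient points of each speaker's range.
    Returns a function mapping any source F0 (Hz) to a target-like F0."""
    pairs = sorted(zip(src_anchors, tgt_anchors))
    src = np.array([s for s, _ in pairs], dtype=float)
    tgt = np.array([t for _, t in pairs], dtype=float)
    def f(f0):
        # np.interp clamps values outside the anchor range to the endpoints.
        return np.interp(f0, src, tgt)
    return f
```

For example, mapping a low male range onto a higher female range with three anchors:

```python
f0_map = fit_piecewise_linear_map([80, 120, 200], [150, 220, 350])
f0_map(100)  # halfway along the first segment: 185.0
```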

Thesis Voice Quality Transformation Examples

The following are example waveforms demonstrating the effectiveness of my system. In these examples, I use the prosody (pitch, i.e. fundamental frequency, and duration) of the target speech, but predict everything else. The system was trained on two minutes of speech, and the examples below were not part of the training set.

Example 1: Source Target Converted
Example 2: Source Target Converted

The following examples demonstrate the results of each part of the system in isolation.

Example: Source Target Residual Prediction Only LSF Prediction Only Fully Converted

Pitch Transformation Examples

The following are example waveforms demonstrating the pitch transformation system I have developed.

Example 1 (JV->MT): Target Normalisation conversion Piecewise linear conversion
Example 2 (FL->SO): Target Normalisation conversion Piecewise linear conversion

If you are interested in this research, please feel free to contact Simon King (link to homepage is below).