|
[1]
|
Rasmus Dall, Christophe Veaux, Junichi Yamagishi, and Simon King.
Analysis of speaker clustering strategies for HMM-based speech
synthesis.
In Proc. Interspeech, Portland, Oregon, USA, September 2012.
[ bib |
.pdf ]
This paper describes a method for speaker clustering,
with the application of building average voice models
for speaker-adaptive HMM-based speech synthesis that
are a good basis for adapting to specific target
speakers. Our main hypothesis is that using
perceptually similar speakers to build the average
voice model will be better than use unselected
speakers, even if the amount of data available from
perceptually similar speakers is smaller. We measure
the perceived similarities among a group of 30 female
speakers in a listening test and then apply multiple
linear regression to automatically predict these
listener judgements of speaker similarity and thus to
identify similar speakers automatically. We then
compare a variety of average voice models trained on
either speakers who were perceptually judged to be
similar to the target speaker, or speakers selected by
the multiple linear regression, or a large global set
of unselected speakers. We find that the average voice
model trained on perceptually similar speakers provides
better performance than the global model, even though
the latter is trained on more data, confirming our main
hypothesis. However, the average voice model using
speakers selected automatically by the multiple linear
regression does not reach the same level of
performance.
|