Building Unit Selection Voice for Festival

Yoko Saikach

Unit selection synthesis, in which appropriate sub-word units are selected from multiple examples in a database of natural speech, has been shown to produce high-quality, natural-sounding speech. However, the quality of such systems is inherently tied to the quality and appropriateness of the database from which the units are selected. How to find the best set of utterances to record, one that covers the acoustic variation as well as possible with minimal redundancy, remains an open research question. The study presented here addresses this problem: the optimization of a textual database of continuous speech. A set of sentences was selected greedily, using frequency-weighted phonetic criteria, from a large textual database that had been phonetically transcribed. Good diphone coverage alone is not sufficient; other factors such as lexical stress, syllable boundary, and position in the word must also be well represented in the database. As pointed out by van Santen, covering all possible features in all contexts would result in a prohibitively large database. The objective is therefore to attain a good compromise between a uniform distribution and a natural distribution while keeping the total inventory to a reasonable size. The textual database selected by these criteria will be recorded for this project, and a unit selection voice will then be built using the Festvox tools. The synthetic speech will be tested and analyzed in order to examine the appropriateness of the database selection criteria.
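
As a rough illustration of greedy, frequency-weighted sentence selection, the following minimal Python sketch picks sentences one at a time by the weighted count of diphones they would newly cover. It is not the procedure used in this study: the function names (diphones, greedy_select), the corpus layout (sentence id mapped to a list of phone symbols), and the normalisation by sentence length are all assumptions made for the example, and the sketch scores diphones only, leaving out lexical stress, syllable boundary, and word position.

```python
from collections import Counter


def diphones(phones):
    """Return the adjacent phone pairs (diphones) of one transcribed sentence."""
    return list(zip(phones, phones[1:]))


def greedy_select(corpus, max_sentences):
    """Greedily choose sentences that add the most frequency-weighted new diphones.

    corpus maps a sentence id to its phone sequence (hypothetical layout).
    Returns the chosen ids in selection order and the set of covered diphones.
    """
    # Weight each diphone by its frequency in the whole corpus (assumed weighting).
    weights = Counter(d for phones in corpus.values() for d in diphones(phones))
    covered = set()
    selected = []
    remaining = dict(corpus)

    def gain(item):
        _, phones = item
        new_units = set(diphones(phones)) - covered
        # Assumption: divide by length so long sentences are not favoured by size alone.
        return sum(weights[d] for d in new_units) / max(len(phones), 1)

    while remaining and len(selected) < max_sentences:
        best_id, best_phones = max(remaining.items(), key=gain)
        if gain((best_id, best_phones)) == 0:
            break  # nothing left adds uncovered diphones
        selected.append(best_id)
        covered |= set(diphones(best_phones))
        del remaining[best_id]
    return selected, covered


if __name__ == "__main__":
    toy_corpus = {
        "s1": ["k", "ae", "t"],
        "s2": ["k", "ae", "t", "s"],
        "s3": ["d", "ao", "g"],
        "s4": ["k", "ae", "t"],
    }
    chosen, units = greedy_select(toy_corpus, max_sentences=10)
    print(chosen, len(units))
```

Extending the sketch to the richer criteria described above would amount to replacing each diphone with a tuple that also carries, for example, a stress flag or a position-in-word label; the unit inventory then grows quickly, which is the size trade-off the abstract refers to.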