Euromasters summer school: Tutorial 3 - Festival Speech Synthesis

Introduction

This tutorial is designed as an introduction to the Festival Speech Synthesis System, and to building voices for the new Multisyn engine.

The commands that you have to type in this tutorial come in two types: firstly, those that you type into a Unix shell, and secondly, those that you type into Festival. Shell commands should be typed at a prompt which looks something like this:

[machine]user:

It won't actually say `machine' and `user'; it will show the name of the machine you are sitting at and the username you are logged in as.

Once Festival is running, the prompt will change (to remind you that Festival is running) and look like this:

festival>

You need to make sure you type the right commands at the right prompt, or you will get errors. Additionally, remember that all Festival commands MUST be enclosed in brackets, otherwise you will get errors.

All of the commands that you need to type in this tutorial are presented in boxes with blue backgrounds (or grey, if this document is printed in black and white), and the prompt is included before each command to make clear which are shell commands and which are Festival commands.

Environment settings

The very first time you login, you need to run the following command to create some files and directories for you:

[machine]user: /group/cstr/projects/euromasters/tutorial3/bin/emcreatedirs

Then, each time you log in (including the first time) you need to set a number of variables and paths for the system to run correctly:

[machine]user: source emsetup.sh

Festival basics

First you need to familiarise yourself with Festival. Festival is started by running the `festival' command at a shell prompt.

Running Festival

[machine]user: festival

Once Festival is running you can issue it commands to make it speak, change voice, or do many other things.

To make Festival speak you can use the SayText command:

Making Festival speak

festival>(SayText "Hello world")
Don't forget the brackets.

If you want to keep the data structure that Festival generates during synthesis, you need to set a variable to the result of the SayText command. You do this as follows:

Keeping an utterance structure

festival>(set! utt (SayText "Hello World"))
Notice the two sets of brackets! This sets a variable called utt to the utterance that SayText returns.

Once you have synthesised an utterance you can do lots of things with it. Here are a few examples.

Utterance commands

festival>(utt.play utt)
festival>(utt.relation.print utt 'Word)
festival>(utt.relation.print utt 'Segment)
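
You can also save the results of synthesis to disk for later inspection. For example (the filenames here are just examples; utt.save.wave writes the synthesised waveform, while utt.save writes the whole utterance structure):

festival>(utt.save.wave utt "example.wav" 'riff)
festival>(utt.save utt "example.utt")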

To change voice in Festival you need to run a voice command to select a new voice. All voice command names start with voice_. Try the following voice:

Selecting a voice in Festival

festival>(voice_cmu_us_awb_arctic_hts)

Compare the quality of this voice to the original voice.

Once you have started to type a command in Festival, pressing the TAB key will list any completions of that command that Festival recognises. So if you type (voice_ and press TAB, it will tell you which voices are available. You should see three voices listed: the default diphone voice, voice_kal_diphone; the HTS voice you just loaded, voice_cmu_us_awb_arctic_hts; and a multisyn voice called voice_em_nina_multisyn, which won't run yet as this is the voice you are going to build! (You will also see a voice_reset function, which just resets the current voice.)
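
For example, you can switch back to the default diphone voice and listen to the same sentence again for comparison:

festival>(voice_kal_diphone)
festival>(SayText "Hello world")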

Finally, to exit Festival and return to the shell, use the exit command.

Exiting Festival

festival>(exit)

The voice data

First you must decide which voice you want to build. Your options are summarised below:

Voice Name   Voice Description
2000         A large database general purpose unit selection synthesiser
2000f        A unit selection synthesiser for the communicator flight information domain
500          A small database unit selection synthesiser

Each voice uses a different subset of the available data, and which subset you choose will determine the characteristics of your voice.

Selecting which voice to build

Once you have chosen which voice you want to build, run the following command (replacing VOICE with either 2000, 2000f or 500):

[machine]user: select_voice VOICE
This will create a file called utts.data which contains a list of the utterances that will be used by your voice. You should load this file into an editor (emacs or xemacs is recommended) and view it. This file defines your speech database.
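
Each line of utts.data pairs a file id with the text of the corresponding recording; an entry typically looks something like this (the id and sentence shown are made up for illustration):

( nina_0001 "This is an example sentence." )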

The actual voice is built from speech data in a number of different formats. Some of the formats are prepared in advance for you; others you will have to generate yourself. The different directories of data are listed below.

Directory   Description                               Type
wav         Wave files for the utterances             provided
pm          Pitch mark files                          provided
mfcc        MFCCs for alignment                       provided
lab         Label files from automatic alignment      to be made
utt         Festival utterance structures             to be made
f0          Pitch contours                            provided
coef        MFCCs + f0, for join cost                 provided
coef2       MFCCs + f0, stripped for join cost        to be made
lpc         LPC and residuals, used for synthesis     provided

Under normal circumstances you would have to generate all of the provided files yourself, but some are provided here to save time.

Building the voice

Labelling the data

The first step in building your voice is to segment your data. This is done using a forced alignment technique based on HTK. Some models have already been trained for you (to save time); all you need to do is run the final alignment stage and generate the label files.

Automatic Segmentation

First you need to generate an initial label file containing the phone sequences for each utterance.

[machine]user: festival build_unitsel.scm
festival>(make_initial_phone_labs "utts.data" "utts.mlf" 'unilex-rpx)
festival>(exit)

[unilex-rpx is the name of the pronunciation lexicon which is used]

Next you need to run the alignment script to generate an aligned label file:

[machine]user: align_voice

This will take about 20 minutes for the 500 sentence voice, or about 1 hour and 20 minutes for the 2000 and 2000f sentence voices. This may be a good time to go for lunch or take a look at the Festival manual.

Then you need to split the label file into individual files for each utterance.

[machine]user: break_mlf aligned.mlf lab

This creates a number of files in the lab directory.

Checking the segmentation

You can examine the label files using wavesurfer. If you find any gross misalignments, you should fix them at this stage if you have time. However, be careful to only adjust the times of existing labels. Do not insert, delete or modify the label names, or the voice will probably fail to build.

Using Wavesurfer

Start Wavesurfer

[machine]user: wavesurfer
Now, from the File menu select Chooser, and click on the Load file list... button. Load the file filelist, and select the first file in the list.

This should load the waveform. Now, in the waveform window, right-click in the dark blue box where the filename is displayed and select Apply configuration. Select the Euromasters configuration.

Putting the voice together

The next stage is to build utterance files for the database. These files describe the linguistic structure of each utterance in terms of phrases, words, syllables etc. The timings from the aligned label files are incorporated into this structure.

Building the utterance files

Festival is used to build the utterance files.

[machine]user: festival build_unitsel.scm
festival>(build_utts "utts.data" 'unilex-rpx)
festival>(exit)

This should create a number of files in the utt directory.
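
As a quick sanity check, you can load one of the generated utterance files back into Festival and print its Segment relation, which should now contain the timings from your label files (substitute a real filename from your utt directory for ninaXXXX):

[machine]user: festival
festival>(set! utt (utt.load nil "utt/ninaXXXX.utt"))
festival>(utt.relation.print utt 'Segment)
festival>(exit)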

If you have changed a label file in such a way that the voice fails to build, you can fix it with the following procedure, which replaces the broken label file with the original.

Fixing broken label files

** You only need to follow this procedure if your voice failed to build **

Create a new label directory

[machine]user: mkdir lab2

Break the mlf file into this new directory

[machine]user: break_mlf aligned.mlf lab2

Copy the file in question (Look at the error generated by the voice building procedure)

[machine]user: cp lab2/ninaXXXX.lab lab/ninaXXXX.lab

Where XXXX completes the filename in question. Now rebuild the voice.

The final step is to generate the join cost coefficients. This step extracts appropriate frames which relate to the join points used by your labelling.

Making the join cost coefficients

Run the following script

[machine]user: strip_join_cost_coefs coef coef2 utt/*.utt

Your voice should now be built and is ready for testing.

Testing the voice

To use your voice you need to run Festival and then select the voice.

Loading the voice into Festival

Start Festival and load the voice:

[machine]user: festival
festival>(voice_em_nina_multisyn)

To make the voice speak, use the SayText command:

festival>(SayText "Hello, These are my first words.")

Try different types of sentences and see how the voice behaves. In particular, try to generate example sentences giving flight information. Find someone next to you who has built a different voice and compare how the same text sounds in each of your voices.
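
One way to make such a comparison is to synthesise the same sentence with your multisyn voice and with the HTS voice from earlier (the sentence here is just an example):

festival>(voice_em_nina_multisyn)
festival>(SayText "The flight to Boston leaves at half past nine.")
festival>(voice_cmu_us_awb_arctic_hts)
festival>(SayText "The flight to Boston leaves at half past nine.")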

Tuning the voice

There are a number of parameters you can change which control the voice. The most obvious ones allow you to set the level of beam pruning that is done. Other parameters let you change things like the target cost and the back-off rules for substituting missing diphones.

Setting the level of pruning

To set the beam width for each list of candidate diphones:

festival>(du_voice.set_ob_pruning_beam currentMultisynVoice N)
Where N is a number between 0 and 1 (-1 switches object pruning off).

To set the beam width to control the paths which are kept at each stage:

festival>(du_voice.set_pruning_beam currentMultisynVoice N)
Where N is a number greater than 0 (-1 switches pruning off). Both beams default to 0.25.

Synthesise utterances with different levels of pruning and compare the output.
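
For example (the beam widths below are arbitrary values either side of the 0.25 default):

festival>(du_voice.set_pruning_beam currentMultisynVoice 0.1)
festival>(SayText "The moon is a balloon")
festival>(du_voice.set_pruning_beam currentMultisynVoice 0.5)
festival>(SayText "The moon is a balloon")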

You can see more information about which units were chosen by printing the Unit relation.

The Unit relation

festival>(set! utt (SayText "The moon is a balloon"))
festival>(utt.relation.print utt 'Unit)

For each diphone a list of features describes which diphone was chosen and how good it was thought to be.
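
If you want to process this information programmatically rather than read the full printout, you can use the utterance access functions. For example, this minimal sketch lists just the names of the diphones that were chosen:

festival>(mapcar (lambda (unit) (item.name unit)) (utt.relation.items utt 'Unit))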

Identifying problems

You now have sufficient knowledge to identify problems with your voice. Find an example of some badly synthesised speech, and try to track down what the problem is.

  • Check the Unit relation to see what the problem is.
  • First check the actual diphones that have been chosen, as substitutions may have been made to accommodate missing diphones.
  • Secondly, check the target and join costs around the area where the problem arises; if these are high, then this was probably the best the synthesiser could do.
  • Finally look at the source filenames and timings and go and see if the data is labelled badly at this point. If the labelling is particularly bad you could fix it and rebuild the voice, but there is no guarantee that the synthesiser will then pick the same units.

Additional Exercises

These exercises are designed to be more challenging than the main voice building exercise, and you are not necessarily expected to get this far or to be able to complete them all. Feel free to pick and choose from them if you have time.

  1. Build a second voice, using a different subset of the data from your first voice, by following the above instructions. (You will need to create a new directory and run the emcreatedirs script in it. Additionally, after the voice is built, you will need to run festival in this directory to load the data files for the new voice.)
  2. Use the utterance access functions to find the first phone of each word in an utterance (a skeleton sketch follows this list).
  3. Define a function in scheme which prints out an annotated version of a synthesised word sequence, syllable structure and segments, including annotation of stressed syllables and anything else you think is important.
  4. Define a scheme function which prints a readable summary of the diphones chosen, costs and source information.
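
As a starting point for exercises 2-4, here is a minimal sketch of walking a relation and doing something for each item (the function name print_word_names is just a suggestion; adapt the body to the exercise you are attempting):

festival>(define (print_word_names utt)
           (mapcar
            (lambda (word) (format t "%s\n" (item.name word)))
            (utt.relation.items utt 'Word)))
festival>(print_word_names (SayText "Hello world"))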

Useful resources

The Festival Manual can be found at http://www.cstr.ed.ac.uk/projects/festival/manual

The Festvox documentation for building voices can be found at http://www.festvox.org/festvox/festvox_toc.html (This includes a scheme overview and tutorial)

The online version of this document and accompanying slides used to introduce each session can be found at http://data.cstr.ed.ac.uk/euromasters

The full set of tools for building multisyn voices, including further documentation on their use can be found at http://www.cstr.ed.ac.uk/downloads/festival/multisyn_build




Rob Clark 2005-07-05