Synthesising speech with appropriate intonation using unit selection in a limited domain

Rachel Baker

Synthetic speech with contextually appropriate intonation is more comprehensible, more natural, more believable, and easier to listen to than synthetic speech that lacks it. This project explores using unit selection to synthesise appropriate intonation in limited domain synthesis. Unit selection involves selecting one of multiple recorded examples of each unit, based on how closely it matches the target unit and how smoothly it will join with its neighbouring units. Limited domain synthesis is a type of unit selection in which a full set of unit combinations is not recorded, so only words that were recorded can be synthesised. The voice is being built using FestVox and is designed for the flight booking domain. Appropriate intonation is defined in terms of Steedman's theme/rheme theory. The theme of a phrase is the part that ties it to the previous discourse, and is marked intonationally with an L+H* pitch accent and an LH% boundary tone. The rheme is the speaker's new contribution, and is marked with an H* pitch accent and an LL% boundary tone. Words are marked with a pitch accent only if they distinguish a phrase from other possibilities that could appear in its place. Recorded units will be tagged with their pitch accent type (H*, L+H*, or none) and boundary tone type (LL%, LH%, or none), and words will only be synthesised using units that were recorded in an equivalent context. All prosody in the output therefore comes from the original speaker's recordings. This method has the advantages that it involves no signal processing and requires no explicit prediction of points on the intonation contour.
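
The selection step described above can be illustrated with a minimal sketch. It is not the FestVox implementation: the `Unit` class, the `select_units` function, and the use of a pitch difference at the join as the join cost are all assumptions made for illustration. Here the "equivalent context" requirement is treated as a hard filter on the pitch accent and boundary tone tags, and the remaining candidates are chosen by a dynamic programming search that minimises the total join cost.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Unit:
    """One recorded word-sized unit (hypothetical representation)."""
    word: str
    accent: Optional[str]    # "H*", "L+H*", or None
    boundary: Optional[str]  # "LL", "LH", or None
    start_f0: float          # assumed pitch at the start of the unit (Hz)
    end_f0: float            # assumed pitch at the end of the unit (Hz)


Target = Tuple[str, Optional[str], Optional[str]]  # (word, accent, boundary)


def join_cost(left: Unit, right: Unit) -> float:
    """Illustrative join cost: pitch mismatch across the join."""
    return abs(left.end_f0 - right.start_f0)


def select_units(targets: List[Target], inventory: List[Unit]) -> Optional[List[Unit]]:
    """Return the lowest join-cost unit sequence, or None if some target
    has no unit recorded with the same word, accent, and boundary tone."""
    if not targets:
        return []

    # Hard filter: only units recorded in an equivalent intonational context.
    lattice = []
    for word, accent, boundary in targets:
        candidates = [u for u in inventory
                      if (u.word, u.accent, u.boundary) == (word, accent, boundary)]
        if not candidates:
            return None
        lattice.append(candidates)

    # Dynamic programming: keep the cheapest path ending in each candidate.
    best = [(0.0, [u]) for u in lattice[0]]
    for candidates in lattice[1:]:
        new_best = []
        for unit in candidates:
            cost, path = min(
                ((c + join_cost(p[-1], unit), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost, path + [unit]))
        best = new_best

    return min(best, key=lambda item: item[0])[1]


# Made-up inventory for a flight booking phrase (values are illustrative only).
inventory = [
    Unit("flight", "H*", None, 120, 110),
    Unit("flight", "L+H*", "LH", 100, 140),
    Unit("to", None, None, 112, 108),
    Unit("to", None, None, 150, 145),
    Unit("Paris", "H*", "LL", 109, 80),
    Unit("Paris", "H*", "LL", 140, 90),
]

chosen = select_units(
    [("flight", "H*", None), ("to", None, None), ("Paris", "H*", "LL")],
    inventory,
)
```

Because the tags act as a filter rather than a weighted cost, a request for a word in a context that was never recorded fails outright instead of falling back to a unit with the wrong intonation. A real system might prefer a soft target cost; that choice is outside what the abstract specifies.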