The main goal of Text-to-Speech (TTS) systems is to produce natural and intelligible human-sounding speech. People tend to feel constrained when they interact with a robotic-sounding synthesizer, so the closer the voice is to a real human one, the better the system will be accepted.
Concatenation-based synthesis is a commonly used method and forms the basis of the presented work. The basic idea of concatenative synthesis is to join together prerecorded diphones (units that span the transition between two adjacent phonemes) to form the desired utterance.
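As a rough illustration of the joining step, the following minimal Python sketch turns a phoneme sequence into diphone names and concatenates the matching recordings. The function names, the "_" silence symbol, the "a-b" naming scheme, and the toy inventory are assumptions made for this example only; they are not taken from the presented system, which would also need to smooth the joins between units.

```python
from typing import Dict, List

Waveform = List[float]  # toy stand-in for a buffer of audio samples


def phonemes_to_diphones(phonemes: List[str]) -> List[str]:
    """Pair each phoneme with its successor, padding with silence ('_') at both ends."""
    padded = ["_"] + phonemes + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def concatenate(diphones: List[str], inventory: Dict[str, Waveform]) -> Waveform:
    """Join the prerecorded diphone waveforms in order (no smoothing at the joins)."""
    out: Waveform = []
    for name in diphones:
        if name not in inventory:
            raise KeyError(f"missing diphone: {name}")
        out.extend(inventory[name])
    return out


# Hypothetical inventory: real entries would hold recorded samples.
toy_inventory: Dict[str, Waveform] = {
    "_-s": [0.0, 0.1],
    "s-o": [0.2],
    "o-l": [0.3],
    "l-_": [0.0],
}

units = phonemes_to_diphones(["s", "o", "l"])  # Spanish "sol"
signal = concatenate(units, toy_inventory)
```

Padding with a silence symbol means the first and last phonemes also get a transition unit, so an utterance of n phonemes uses n + 1 diphones.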
Nowadays the most widely used synthesis applications either run on large server platforms (for example, for telephony) or are embedded, the latter consuming as few resources as possible. The presented system is constrained to a small footprint and manageable computing requirements.
In this presentation I explain the process followed to create a concatenation-based, slot-sentence synthesizer for Spanish, targeted at embedded applications.
Slot-sentence synthesis refers to the way the system handles prosody. For the kind of applications targeted here, no general prosody module is created; instead, a word/slot-based synthesis is used.
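One possible reading of the slot idea can be sketched as follows: a fixed carrier sentence contains named slots, and each slot is filled from a closed vocabulary, so the prosody of every unit is tied to its position instead of being computed by a general module. The template format, the "word@slot" unit naming, and the function name are hypothetical and chosen only for this sketch; how the presented system actually stores and selects its slot units is not described here.

```python
from typing import Dict, List


def fill_slots(template: List[str], values: Dict[str, str]) -> List[str]:
    """Return the unit sequence for a carrier sentence.

    Entries written as '{slot}' are replaced by the chosen word, tagged with
    the slot name so that a position-specific version of the word can be
    selected. All other entries are fixed carrier units.
    """
    units: List[str] = []
    for item in template:
        if item.startswith("{") and item.endswith("}"):
            slot = item[1:-1]
            if slot not in values:
                raise KeyError(f"no value given for slot: {slot}")
            units.append(f"{values[slot]}@{slot}")
        else:
            units.append(item)
    return units


# Hypothetical carrier sentence: "son las <hora> y <minutos>"
template = ["son las", "{hora}", "y", "{minutos}"]
units = fill_slots(template, {"hora": "tres", "minutos": "diez"})
```

Tagging a word with its slot reflects the assumption that the same word may need a different intonation depending on where it falls in the sentence.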
Together with the general synthesis procedure, some explanation of Spanish linguistics is given to show how the system is tuned for Castilian Spanish.