The Spoken Output Labelling Explorer
The SOLE project is researching ways of coupling natural language generation (NLG) systems with Text-to-Speech (TTS) systems.
A major aspect of SOLE is to couple two existing state-of-the-art systems to form a single speech generation system. The ILEX system, developed jointly by HCRC and the Department of Artificial Intelligence, is the NLG system, operating in the museum labelling domain. Festival is the speech synthesis system developed at CSTR. We expect that coupling these two systems will have a positive effect on intonation: ILEX will be able to tell Festival when it is comparing or contrasting two objects, when it is referring to old or new information, when it is using a parenthetical or starting a new paragraph, etc., and Festival will decide, based on this information, whether it needs to pause, to emphasise or deemphasise, to modify its pitch range, etc.
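As a rough sketch of the kind of mapping involved, the fragment below shows discourse annotations being turned into prosodic adjustments. The tag names and rules here are purely illustrative assumptions, not the actual ILEX-Festival interface:

```python
# Hypothetical sketch of an NLG-to-TTS interface: the NLG side labels a span
# with a discourse annotation, and the TTS side maps that label to prosody.
# Tag names and rule values are illustrative only.

def prosody_hints(annotation):
    """Map a discourse annotation to prosodic adjustments."""
    rules = {
        "contrast":      {"emphasis": +1},          # contrasting two objects
        "given":         {"emphasis": -1},          # old information: deaccent
        "new":           {"emphasis": +1},          # new information: accent
        "parenthetical": {"pitch_range": "narrow"}, # lower, flatter pitch
        "paragraph":     {"pause_ms": 600},         # pause at a paragraph start
    }
    return rules.get(annotation, {})

print(prosody_hints("parenthetical"))  # {'pitch_range': 'narrow'}
```

In the real system the TTS side makes these decisions itself from the markup, rather than receiving explicit prosodic commands; the point of the sketch is only the division of labour.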
SOLE is aimed at providing a general interface between NLG and TTS systems, but for purposes of development and demonstration we have built a domain-specific demonstrator system. Continuing the work on the ILEX project, the domain chosen is one of an automated museum guide whose purpose is to provide intelligent tailored information on exhibits. Eventually the system will be contained in a portable unit so that visitors can receive spoken information on their tour of a museum.
A goal of the project is to formalise the ways in which any TTS and NLG systems can be integrated. The experience gained from the integration will be used to aid the development of a speech synthesis markup language, SOLEML.
Personnel (University of Edinburgh)
- Janet Hitzeman, RA
Human Communication Research Centre/Centre for Speech Technology Research
- Chris Mellish, PI
Department of Artificial Intelligence
- Jon Oberlander, PI
Human Communication Research Centre
- Paul Taylor, PI
Centre for Speech Technology Research
The project aims to develop a rhetorically-oriented markup language and to evaluate its effectiveness as input to a text-to-speech system:
- to adapt the current written-text oriented ILEX system for the sorts of construction more commonly found in spoken language,
- to produce a new intonation module which is suitable for discourse. A module will be written that maps from the representations used in ILEX to the phonological representations of intonation already in use in Festival,
- and to evaluate the results of the above and use them to formalise an enhanced version of SOLEML.
In the first phase of the project, we:
- collected a corpus of ILEX-like descriptive texts and recorded three speakers reading them,
- decided on a set of linguistic constructs that contribute to intonation and developed an SGML tagset (SOLEML version 1) based on them,
- wrote a document describing the tagset, the motivation behind it, our hypotheses concerning what each type of tagged construct will contribute to intonation and what sort of interaction we expect when these constructs co-occur,
- annotated the text with SOLEML tags,
- marked accents on the speech by looking at the F0 contours,
- gave the ILEX system a way of sending generated text to Festival (this is the baseline version of the SOLE system, "S0"),
- trained Festival to use the SOLEML tags when it predicts accent placement, and
- began development of the S1 version of the SOLE system, which involves getting ILEX to automatically tag the text it produces with SOLEML tags before sending it to Festival.
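The accent-placement step above can be illustrated with a toy sketch. This is not Festival's actual trained model; the rules, and every tag name except rhet-emph, are hypothetical:

```python
# Toy illustration of how SOLEML-style tags can drive accent-placement
# prediction. The real system trains a statistical model; these hand-written
# rules only show the shape of the decision.

def predict_accent(word, tags):
    """Return True if the word should carry a pitch accent."""
    if "rhet-emph" in tags:     # contrastive/emphatic phrase: accent it
        return True
    if "given" in tags:         # previously mentioned entity: deaccent it
        return False
    # Crude fallback: treat longer alphabetic words as accentable content words.
    return word.isalpha() and len(word) > 3

print(predict_accent("jewel", {"rhet-emph"}))  # True
print(predict_accent("it", {"given"}))         # False
```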
We found that the SOLEML tags most useful as predictors of accent placement are our annotation of noun phrases for syntactic, semantic and reference type, and rhet-emph (our annotation of the phrases within the rhetorical structure that receive emphasis because they are contrastive, etc.). Our results are reported in the article "On the Use of Automatically Generated Discourse-Level Information in a Concept-to-Speech Synthesis System", which will appear in the Proceedings of the International Conference on Spoken Language Processing, Australia, December 1998. We are currently completing phase 1 of the project by comparing different statistical techniques for accent placement prediction and by completing development of the S1 version of the SOLE system.
In the second phase of the project (beginning September 1998), we plan to improve upon our results by adding to the set of SOLEML tags and by predicting accent size and contours as well as accent placement. Specifically, we will:
- annotate the speech with Tilt parameters, which will allow us to predict the size and contours of the accents as well as their positions,
- add SOLEML tags to annotate complete syntactic information and consider other types of annotation, such as annotating verbs for old/new information,
- explore cascading techniques for predicting accent placement/type: instead of asking whether a particular syllable has an accent, we will ask whether a particular NP, rhetorical structure, etc., has an accent and, if so, where that accent is placed. In determining accent placement, we will compare statistical techniques with a technique based on metrical trees and their associated accent-placement rules, and
- develop the S2 version of the SOLE system based on our results.
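The Tilt representation mentioned above describes each accent by its rise and fall amplitudes and durations, collapsed into a single shape parameter in [-1, 1] (+1 a pure rise, -1 a pure fall, 0 a symmetric rise-fall). A minimal sketch, assuming the standard amplitude/duration formulation of the tilt parameter:

```python
# Sketch of the tilt shape parameter from the Tilt intonation model:
# the amplitude and duration asymmetries of an accent's rise and fall
# are averaged into one value describing the accent's overall shape.

def tilt(rise_amp, fall_amp, rise_dur, fall_dur):
    t_amp = (abs(rise_amp) - abs(fall_amp)) / (abs(rise_amp) + abs(fall_amp))
    t_dur = (rise_dur - fall_dur) / (rise_dur + fall_dur)
    return 0.5 * (t_amp + t_dur)

print(tilt(30.0, 0.0, 0.12, 0.0))   # 1.0: pure rise
print(tilt(20.0, 20.0, 0.10, 0.10)) # 0.0: symmetric rise-fall accent
```

Predicting these continuous parameters, rather than just accent presence, is what allows accent size and contour to be controlled as well as placement.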
Issues that might be appropriate for other grants under the EPSRC programme include the following:
- Once we've shown that this method can greatly improve the quality of synthesised speech in monologue, the obvious next step is to adapt the method for spoken dialogue systems.
- Because of the improvement we expect in synthesised intonation, the resulting speech will be easier to understand, and will therefore be particularly well-suited to systems for the blind or speech-impaired, where understandability is crucial. To take advantage of this, we would need to link SOLE to existing systems and evaluate the resulting systems.
- In order to test the suitability of SOLEML as a standard, it would be useful to encourage other NLG researchers to generate SOLEML-tagged text and those with other types of synthesis systems to work with SOLEML.
- There are gaps in the psycholinguistic literature describing the relationships between linguistic constructs and intonation, and it would be useful to fill those gaps in order to improve the SOLEML tagset.
- After training, Festival will present us with various hypotheses concerning the effect on intonation when various linguistic factors interact. It would be useful to confirm these hypotheses with psycholinguistic experiments.
This work is being funded by the Engineering and Physical Sciences Research Council.
Grant reference GR/L50341
Duration: May 1997 - April 2000
- Poesio, M., R. Henschel, J. Hitzeman, R. Kibble, S. Montague and K. van Deemter (1999). "First Results with a Statistical Model of Noun Phrase Generation", in Proceedings of the Workshop on Architectures and Mechanisms for Language Processing, Edinburgh, September.
- Poesio, M., R. Henschel, J. Hitzeman, R. Kibble, S. Montague and K. van Deemter (1999). "Towards An Annotation Scheme For Noun Phrase Generation", in Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC-99), H. Uszkoreit, T. Brants and B. Krenn (eds.), Bergen, Norway.
- Poesio, M., R. Henschel, J. Hitzeman and R. Kibble (1999). "Statistical NP Generation: A First Report", in Proceedings of the ESSLLI Workshop on NP Generation, Utrecht, August.
- Hitzeman, J., A. W. Black, C. Mellish, J. Oberlander, M. Poesio and P. Taylor (1999). "An Annotation Scheme for Concept-to-Speech Synthesis", in Proceedings of the European Workshop on Natural Language Generation, pp. 59-66, Toulouse, France.
- Hitzeman, J., A. W. Black, P. Taylor, C. Mellish and J. Oberlander (1998). "On the Use of Automatically Generated Discourse-Level Information in a Concept-to-Speech Synthesis System", in Proceedings of ICSLP, Australia.
- Hitzeman, J. and M. Poesio (1998). "Long Distance Pronominalisation and Global Focus", in Proceedings of ACL/COLING, Montreal.
Contact Janet Hitzeman for more details.