Simple4All
Project Summary
The Simple4All project created speech synthesis technology which learns from data with little or no expert supervision, and continually improves simply by being used. It was one of the first attempts to use lightly-supervised and unsupervised methods to create speech synthesis systems.
Project Details
This page is an archive of the main content from the simple4all.org website.
Consortium
- The University of Edinburgh (co-ordinator)
- Aalto University
- The Technical University of Cluj-Napoca
- Universidad Politécnica de Madrid
- University of Helsinki
Project aims
To be accepted by users, the voice of a spoken interaction system must be natural and appropriate to the content; the same voice cannot serve every application. Yet creating a speech synthesiser for a new language or domain is expensive, because current technology relies on labelled data and human expertise: systems comprise rules, statistical models, and data that require careful tuning by experienced engineers.
As a result, speech synthesis is available from only a small number of vendors, offering generic products that are not tailored to any application domain. Systems are not portable: creating a bespoke system for a specific application is hard, because it requires substantial effort to re-engineer every component. Take-up by potential end users is limited, and the range of feasible applications is narrow. Synthesis is often an off-the-shelf component, providing a highly inappropriate speaking style for applications such as dialogue, speech translation, games, personal assistants, communication aids, SMS-to-speech conversion, e-learning, toys, and a multitude of other applications where a specific speaking style is important.
We are developing methods that enable systems to be built from audio and text data, and to continue learning after deployment, making general-purpose or specialised systems feasible for any domain or language. Our objectives are:
- ADAPTABILITY: create highly portable and adaptable speech synthesis technology suitable for any domain or language
- LEARNING FROM DATA AND INTERACTION: provide a complete, consistent framework in which every component of a speech synthesis system can be learned and improved
- SPEAKING STYLE: enable the generation of natural, conversational, highly expressive synthetic speech which is appropriate to the wider context
- DEMONSTRATION AND EVALUATION: demonstrate the automatic creation of a new speech synthesiser from scratch, and feedback-driven online learning, validated through perceptual evaluations.
Outputs - public deliverables (reports)
Download all public deliverables in one package
Outputs - tools and software
- ALISA - A lightly supervised speech segmentation and alignment tool
- Ossian - Language processing (front end) building tools
- Norma - Translation-based text normalisation tool
- Dexter - Diarisation system, including style-diarisation
- LILDA - LDA-based language identification tool; a sketch of the underlying idea appears after this list. For technical details, refer to Zhang, Clark & Wang (2014), "Unsupervised Language Filtering using the Latent Dirichlet Allocation", Proc. Interspeech 2014.
- MiLex - Unsupervised stress detection and prediction toolkit
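To give a flavour of the LDA-based language filtering idea behind LILDA, here is a minimal sketch using scikit-learn. This is not LILDA's implementation: the feature choice (character n-grams), the number of topics, and the majority-topic filtering rule are illustrative assumptions; for the actual method, see the cited paper.

```python
# A minimal sketch of unsupervised LDA-based language filtering, in the
# spirit of LILDA but NOT its implementation. Feature choice, topic count,
# and the majority-topic rule are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat and looked at the window",
    "speech synthesis systems learn from audio and text data",
    "the systems improve continually simply by being used",
    "el gato se sento en la alfombra y miro por la ventana",
    "los sistemas de sintesis de voz aprenden de los datos",
]

# Character n-grams separate languages well without any labelled data.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
counts = vectorizer.fit_transform(docs)

# Fit an unsupervised topic model; on mixed-language text the topics
# tend to align with languages.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Keep only documents whose dominant topic is the corpus-majority topic,
# i.e. filter out the minority language.
dominant = doc_topics.argmax(axis=1)
majority = np.bincount(dominant).argmax()
kept = [doc for doc, topic in zip(docs, dominant) if topic == majority]
print(kept)
```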
Outputs - data
- The Tundra Corpus - European language audiobooks
- User Feedback Data - Synthetic samples and the spoken user feedback. Demo page
- Spanish Speaking Styles Corpora: please contact Prof. Juan Manuel Montero of the Universidad Politécnica de Madrid
- Blizzard 2014 Annotations
- Blizzard 2014 Hub Task Annotations: a data set with annotation labels generated for the Blizzard Challenge 2014 Hub Task, used to train and evaluate synthetic voices in six Indian languages.
- Blizzard 2014 Spoke Task Labels: a data set with intermediate annotation labels generated for the Blizzard Challenge 2014 Spoke Task, used to build a multilingual synthesis system in six Indian languages.
- Text Normalisation Datasets in English, Spanish, and Romanian
- Romanian Broadcast News
This speech and text dataset was collected and processed with the available Simple4All tools to demonstrate speaking-style adaptation: a Romanian baseline synthetic voice, built from the standard RSS corpus, is adapted towards the prosody and expressive style of the main presenter of the broadcast news. The speaker diarisation toolkit Dexter is used to automatically select the speech of the main presenter from the broadcasts (discriminating between music, noise, and speakers), and the Voice Cloning Toolkit is then used for speaking-style adaptation.
The dataset contains about 6 hours of Romanian broadcast news: speech, noise overlapping speech, music, and overlapping speakers. Each recording is about 10 minutes long, with the text available for the main presenter. To evaluate diarisation performance, five of the broadcasts are labelled at the speaker level and the corresponding RTTM labels are provided; a sketch of reading these labels is shown below. The speech of the main presenter is extracted automatically, yielding about 50 minutes of speech labelled with the available text at the sentence level. This data is used to demonstrate the speaking-style adaptation.
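The RTTM labels can be read with a few lines of code. Below is a minimal sketch assuming standard RTTM SPEAKER records (type, file, channel, onset, duration, with the speaker name in the eighth field); the file name and the heuristic of picking the speaker with the most speech time as the main presenter are illustrative assumptions, not part of the released data.

```python
# A minimal sketch of reading RTTM speaker labels and collecting the main
# presenter's segments. Assumes standard RTTM "SPEAKER" records; the path
# and presenter-selection heuristic are illustrative assumptions.
from collections import defaultdict

def load_speaker_segments(rttm_path):
    """Return {speaker_id: [(start, end), ...]} from an RTTM file."""
    segments = defaultdict(list)
    with open(rttm_path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            onset, duration = float(fields[3]), float(fields[4])
            speaker = fields[7]
            segments[speaker].append((onset, onset + duration))
    return segments

segments = load_speaker_segments("news_recording.rttm")  # hypothetical file

# Assume the main presenter is the speaker with the most total speech time.
total_time = {spk: sum(e - s for s, e in segs) for spk, segs in segments.items()}
presenter = max(total_time, key=total_time.get)
print(f"{presenter}: {total_time[presenter]:.1f} s "
      f"in {len(segments[presenter])} segments")
```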
- Romanian Parliamentary Speeches
The dataset contains speech and text from Romanian parliamentary speeches given at various public meetings between 2011 and 2014. The text is intended for modelling and detecting the political-speech genre; the audio data may be used to create a neutral political synthetic voice.
The speech material is labelled at the speaker level, and the text has been checked against the speech. The recordings were made at a sampling rate of 44 kHz with 16 bits per sample. In total there are 21 hours of speech from 339 speakers across 1031 speech interventions. The audio contains some reverberation, reflecting realistic recording conditions in the parliamentary meeting room.
Each folder contains the following files:
- YYYY_MM_DD.wav (the audio recording of the meeting from DD/MM/YYYY);
- YYYY_MM_DD.txt (the text corresponding to the whole meeting);
- Label_Track.txt (the speaker-level annotation, one segment per line in the format: t_start t_stop Speaker_Id, where Speaker_Id is encoded as V1, V2, V3, …; see the parsing sketch after this list);
- files named Vi.txt and Vi-j.txt, containing the text of speaker Vi for its first and j-th intervention, respectively.
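Below is a minimal sketch of parsing Label_Track.txt and computing per-speaker totals, assuming one segment per line in the t_start t_stop Speaker_Id format with times in seconds; the folder name in the example is an illustrative assumption.

```python
# A minimal sketch of parsing Label_Track.txt. Assumes "t_start t_stop
# Speaker_Id" per line with times in seconds; the path is hypothetical.
from collections import defaultdict

def load_label_track(path):
    """Return {speaker_id: [(t_start, t_stop), ...]}."""
    segments = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            t_start, t_stop, speaker = float(fields[0]), float(fields[1]), fields[2]
            segments[speaker].append((t_start, t_stop))
    return segments

segments = load_label_track("2013_05_21/Label_Track.txt")  # hypothetical folder
for speaker, segs in sorted(segments.items()):
    total = sum(stop - start for start, stop in segs)
    print(f"{speaker}: {len(segs)} interventions, {total:.1f} s of speech")
```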
Personnel
Funding Source
EU Seventh Framework Programme (FP7)