P. Swietojanski, A. Ghoshal, and S. Renals. Hybrid acoustic models for distant and multichannel large vocabulary speech recognition. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), December 2013. [ bib | DOI | .pdf ]

We investigate the application of deep neural network (DNN)-hidden Markov model (HMM) hybrid acoustic models for far-field speech recognition of meetings recorded using microphone arrays. We show that the hybrid models achieve significantly better accuracy than conventional systems based on Gaussian mixture models (GMMs). We observe up to 8% absolute word error rate (WER) reduction from a discriminatively trained GMM baseline when using a single distant microphone, and between 4–6% absolute WER reduction when using beamforming on various combinations of array channels. By training the networks on audio from multiple channels, we find the networks can recover significant part of accuracy difference between the single distant microphone and beamformed configurations. Finally, we show that the accuracy of a network recognising speech from a single distant microphone can approach that of a multi-microphone setup by training with data from other microphones.

Joris Driesen and Steve Renals. Lightly supervised automatic subtitling of weather forecasts. In Proc. Automatic Speech Recognition and Understanding Workshop, Olomouc, Czech Republic, December 2013. [ bib | DOI | .pdf ]

Since subtitling television content is a costly process, there are large potential advantages to automating it, using automatic speech recognition (ASR). However, training the necessary acoustic models can be a challenge, since the available training data usually lacks verbatim orthographic transcriptions. If there are approximate transcriptions, this problem can be overcome using light supervision methods. In this paper, we perform speech recognition on broadcasts of Weatherview, BBC's daily weather report, as a first step towards automatic subtitling. For training, we use a large set of past broadcasts, using their manually created subtitles as approximate transcriptions. We discuss and and compare two different light supervision methods, applying them to this data. The best training set finally obtained with these methods is used to create a hybrid deep neural networkbased recognition system, which yields high recognition accuracies on three separate Weatherview evaluation sets.

Joris Driesen, Peter Bell, Mark Sinclair, and Steve Renals. Description of the UEDIN system for German ASR. In Proc IWSLT, Heidelberg, Germany, December 2013. [ bib | .pdf ]

In this paper we describe the ASR system for German built at the University of Edinburgh (UEDIN) for the 2013 IWSLT evaluation campaign. For ASR, the major challenge to overcome, was to find suitable acoustic training data. Due to the lack of expertly transcribed acoustic speech data for German, acoustic model training had to be performed on publicly available data crawled from the internet. For evaluation, lack of a manual segmentation into utterances was handled in two different ways: by generating an automatic segmentation, and by treating entire input files as a single segment. Demonstrating the latter method is superior in the current task, we obtained a WER of 28.16% on the dev set and 36.21% on the test set.

C. Bhatt, A. Popescu-Belis, M. Habibi, S. Ingram, S. Masneri, F. McInnes, N. Pappas, and O. Schreer. Multi-factor segmentation for topic visualization and recommendation: the MUST-VIS system. In Proceedings of ACM Multimedia 2013, Barcelona, Spain, October 2013. [ bib | .pdf ]

This paper presents the MUST-VIS system for the MediaMixer/VideoLectures.NET Temporal Segmentation and Annotation Grand Challenge. The system allows users to visualize a lecture as a series of segments represented by keyword clouds, with relations to other similar lectures and segments. Segmentation is performed using a multi-factor algorithm which takes advantage of the audio (through automatic speech recognition and word-based segmentation) and video (through the detection of actions such as writing on the blackboard). The similarity across segments and lectures is computed using a content-based recommendation algorithm. Overall, the graph-based representation of segment similarity appears to be a promising and cost-effective approach to navigating lecture databases.

Atef Ben Youssef, Hiroshi Shimodaira, and David A. Braude. Head motion analysis and synthesis over different tasks. In Proc. Intelligent Virtual Agents, pages 285-294. Springer, September 2013. [ bib | .pdf ]

It is known that subjects vary in their head movements. This paper presents an analysis of this variety over different tasks and speakers and their impact on head motion synthesis. Measured head and articulatory movements acquired by an ElectroMagnetic Articulograph (EMA) synchronously recorded with audio was used. Data set of speech of 12 people recorded on different tasks confirms that the head motion variate over tasks and speakers. Experimental results confirmed that the proposed models were capable of learning and synthesising task-dependent head motions from speech. Subjective evaluation of synthesised head motion using task models shows that trained models on the matched task is better than mismatched one and free speech data provide models that predict preferred motion by the participants compared to read speech data.

C. Valentini-Botinhao, J. Yamagishi, S. King, and Y. Stylianou. Combining perceptually-motivated spectral shaping with loudness and duration modification for intelligibility enhancement of HMM-based synthetic speech in noise. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]

M. Cooke, C. Mayo, and C. Valentini-Botinhao. Intelligibility-enhancing speech modifications: the Hurricane Challenge. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]

Cassia Valentini-Botinhao, Mirjam Wester, Junichi Yamagishi, and Simon King. Using neighbourhood density and selective SNR boosting to increase the intelligibility of synthetic speech in noise. In 8th ISCA Workshop on Speech Synthesis, pages 133-138, Barcelona, Spain, August 2013. [ bib | .pdf ]

Motivated by the fact that words are not equally confusable, we explore the idea of using word-level intelligibility predictions to selectively boost the harder-to-understand words in a sentence, aiming to improve overall intelligibility in the presence of noise. First, the intelligibility of a set of words from dense and sparse phonetic neighbourhoods was evaluated in isolation. The resulting intelligibility scores were used to inform two sentencelevel experiments. In the first experiment the signal-to-noise ratio of one word was boosted to the detriment of another word. Sentence intelligibility did not generally improve. The intelligibility of words in isolation and in a sentence were found to be significantly different, both in clean and in noisy conditions. For the second experiment, one word was selectively boosted while slightly attenuating all other words in the sentence. This strategy was successful for words that were poorly recognised in that particular context. However, a reliable predictor of word-in-context intelligibility remains elusive, since this involves – as our results indicate – semantic, syntactic and acoustic information about the word and the sentence.

Thomas Merritt and Simon King. Investigating the shortcomings of HMM synthesis. In 8th ISCA Workshop on Speech Synthesis, pages 185-190, Barcelona, Spain, August 2013. [ bib | .pdf ]

This paper presents the beginnings of a framework for formal testing of the causes of the current limited quality of HMM (Hidden Markov Model) speech synthesis. This framework separates each of the effects of modelling to observe their independent effects on vocoded speech parameters in order to address the issues that are restricting the progression to highly intelligible and natural-sounding speech synthesis. The simulated HMM synthesis conditions are performed on spectral speech parameters and tested via a pairwise listening test, asking listeners to perform a “same or different” judgement on the quality of the synthesised speech produced between these conditions. These responses are then processed using multidimensional scaling to identify the qualities in modelled speech that listeners are attending to and thus forms the basis of why they are distinguishable from natural speech. The future improvements to be made to the framework will finally be discussed which include the extension to more of the parameters modelled during speech synthesis.

Nicholas W D Evans, Tomi Kinnunen, and Junichi Yamagishi. Spoofing and countermeasures for automatic speaker verification. In Interspeech 2013, 14th Annual Conference of the International Speech Communication Association, August 25-29, 2013, Lyon, France, Lyon, FRANCE, August 2013. [ bib | .pdf ]

Maria Astrinaki, Alexis Moinet, Junichi Yamagishi, Korin Richmond, Zhen-Hua Ling, Simon King, and Thierry Dutoit. Mage - reactive articulatory feature control of HMM-based parametric speech synthesis. In 8th ISCA Workshop on Speech Synthesis, pages 227-231, Barcelona, Spain, August 2013. [ bib | .pdf ]

Qiong Hu, Korin Richmond, Junichi Yamagishi, and Javier Latorre. An experimental comparison of multiple vocoder types. In 8th ISCA Workshop on Speech Synthesis, pages 155-160, Barcelona, Spain, August 2013. [ bib | .pdf ]

This paper presents an experimental comparison of a broad range of the leading vocoder types which have been previously described. We use a reference implementation of each of these to create stimuli for a listening test using copy synthesis. The listening test is performed using both Lombard and normal read speech stimuli, and with two types of question for comparison. Multi-dimensional Scaling (MDS) is conducted on the listener responses to analyse similarities in terms of quality between the vocoders. Our MDS and clustering results show that the vocoders which use a sinusoidal synthesis approach are perceptually distinguishable from the source-filter vocoders. To help further interpret the axes of the resulting MDS space, we test for correlations with standard acoustic quality metrics and find one axis is strongly correlated with PESQ scores. We also find both speech style and the format of the listening test question may influence test results. Finally, we also present preference test results which compare each vocoder with the natural speech.

Peter Bell, Hitoshi Yamamoto, Pawel Swietojanski, Youzheng Wu, Fergus McInnes, Chiori Hori, and Steve Renals. A lecture transcription system combining neural network acoustic and language models. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]

This paper presents a new system for automatic transcription of lectures. The system combines a number of novel features, including deep neural network acoustic models using multi-level adaptive networks to incorporate out-of-domain information, and factored recurrent neural network language models. We demonstrate that the system achieves large improvements on the TED lecture transcription task from the 2012 IWSLT evaluation - our results are currently the best reported on this task, showing an relative WER reduction of more than 16% compared to the closest competing system from the evaluation.

Adriana Stan, Peter Bell, Junichi Yamagishi, and Simon King. Lightly supervised discriminative training of grapheme models for improved sentence-level alignment of speech and text data. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]

This paper introduces a method for lightly supervised discriminative training using MMI to improve the alignment of speech and text data for use in training HMM-based TTS systems for low-resource languages. In TTS applications, due to the use of long-span contexts, it is important to select training utterances which have wholly correct transcriptions. In a low-resource setting, when using poorly trained grapheme models, we show that the use of MMI discriminative training at the grapheme-level enables us to increase the amount of correctly aligned data by 40%, while maintaining a 7% sentence error rate and 0.8% word error rate. We present the procedure for lightly supervised discriminative training with regard to the objective of minimising sentence error rate.

H. Christensen, M. Aniol, P. Bell, P. Green, T. Hain, S. King, and P. Swietojanski. Combining in-domain and out-of-domain speech data for automatic recognition of disordered speech. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]

Recently there has been increasing interest in ways of using out-of-domain (OOD) data to improve automatic speech recognition performance in domains where only limited data is available. This paper focuses on one such domain, namely that of disordered speech for which only very small databases exist, but where normal speech can be considered OOD. Standard approaches for handling small data domains use adaptation from OOD models into the target domain, but here we investigate an alternative approach with its focus on the feature extraction stage: OOD data is used to train feature-generating deep belief neural networks. Using AMI meeting and TED talk datasets, we investigate various tandem-based speaker independent systems as well as maximum a posteriori adapted speaker dependent systems. Results on the UAspeech isolated word task of disordered speech are very promising with our overall best system (using a combination of AMI and TED data) giving a correctness of 62.5%; an increase of 15% on previously best published results based on conventional model adaptation. We show that the relative benefit of using OOD data varies considerably from speaker to speaker and is only loosely correlated with the severity of a speaker's impairments.

Kayoko Yanagisawa, Javier Latorre, Vincent Wan, Mark J. F. Gales, and Simon King. Noise robustness in HMM-TTS speaker adaptation. In 8th ISCA Workshop on Speech Synthesis, pages 139-144, Barcelona, Spain, August 2013. [ bib | .pdf ]

Speaker adaptation for TTS applications has been receiving more attention in recent years for applications such as voice customisation or voice banking. If these applications are offered as an Internet service, there is no control on the quality of the data that can be collected. It can be noisy with people talking in the background or recorded in a reverberant environment. This makes the adaptation more difficult. This paper explores the effect of different levels of additive and convolutional noise on speaker adaptation techniques based on cluster adaptive training (CAT) and average voice model (AVM). The results indicate that although both techniques suffer degradation to some extent, CAT is in general more robust than AVM.

Rubén San-Segundo, Juan Manuel Montero, Mircea Giurgiu, Ioana Muresan, and Simon King. Multilingual number transcription for text-to-speech conversion. In 8th ISCA Workshop on Speech Synthesis, pages 85-89, Barcelona, Spain, August 2013. [ bib | .pdf ]

This paper describes the text normalization module of a text to speech fully-trainable conversion system and its application to number transcription. The main target is to generate a language independent text normalization module, based on data instead of on expert rules. This paper proposes a general architecture based on statistical ma- chine translation techniques. This proposal is composed of three main modules: a tokenizer for splitting the text input into a token graph, a phrase-based translation module for token translation, and a post-processing module for removing some tokens. This architecture has been evaluated for number transcription in several languages: English, Spanish and Romanian. Number transcription is an important aspect in the text normalization problem.

Heng Lu, Simon King, and Oliver Watts. Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis. In 8th ISCA Workshop on Speech Synthesis, pages 281-285, Barcelona, Spain, August 2013. [ bib | .pdf ]

Conventional statistical parametric speech synthesis relies on decision trees to cluster together similar contexts, result- ing in tied-parameter context-dependent hidden Markov models (HMMs). However, decision tree clustering has a major weak- ness: it use hard division and subdivides the model space based on one feature at a time, fragmenting the data and failing to exploit interactions between linguistic context features. These linguistic features themselves are also problematic, being noisy and of varied relevance to the acoustics. We propose to combine our previous work on vector-space representations of linguistic context, which have the added ad- vantage of working directly from textual input, and Deep Neural Networks (DNNs), which can directly accept such continuous representations as input. The outputs of the network are probability distributions over speech features. Maximum Likelihood Parameter Generation is then used to create parameter trajectories, which in turn drive a vocoder to generate the waveform. Various configurations of the system are compared, using both conventional and vector space context representations and with the DNN making speech parameter predictions at two dif- ferent temporal resolutions: frames, or states. Both objective and subjective results are presented.

H. Bourlard, M. Ferras, N. Pappas, A. Popescu-Belis, S. Renals, F. McInnes, P. Bell, S. Ingram, and M. Guillemot. Processing and linking audio events in large multimedia archives: The EU inEvent project. In Proceedings of SLAM 2013 (First Workshop on Speech, Language and Audio in Multimedia), Marseille, France, August 2013. [ bib | .pdf ]

In the inEvent EU project, we aim at structuring, retrieving, and sharing large archives of networked, and dynamically changing, multimedia recordings, mainly consisting of meetings, videoconferences, and lectures. More specifically, we are developing an integrated system that performs audiovisual processing of multimedia recordings, and labels them in terms of interconnected "hyper-events" (a notion inspired from hyper-texts). Each hyper-event is composed of simpler facets, including audio-video recordings and metadata, which are then easier to search, retrieve and share. In the present paper, we mainly cover the audio processing aspects of the system, including speech recognition, speaker diarization and linking (across recordings), the use of these features for hyper-event indexing and recommendation, and the search portal. We present initial results for feature extraction from lecture recordings using the TED talks.

Yoshitaka Mamiya, Adriana Stan, Junichi Yamagishi, Peter Bell, Oliver Watts, Robert Clark, and Simon King. Using adaptation to improve speech transcription alignment in noisy and reverberant environments. In 8th ISCA Workshop on Speech Synthesis, pages 61-66, Barcelona, Spain, August 2013. [ bib | .pdf ]

When using data retrieved from the internet to create new speech databases, the recording conditions can often be highly variable within and between sessions. This variance influences the overall performance of any automatic speech and text alignment techniques used to process this data. In this paper we discuss the use of speaker adaptation methods to address this issue. Starting from a baseline system for automatic sentence-level segmentation and speech and text alignment based on GMMs and grapheme HMMs, respectively, we employ Maximum A Posteriori (MAP) and Constrained Maximum Likelihood Linear Regression (CMLLR) techniques to model the variation in the data in order to increase the amount of confidently aligned speech. We tested 29 different scenarios, which include reverberation, 8 talker babble noise and white noise, each in various combinations and SNRs. Results show that the MAP-based segmentation's performance is very much influenced by the noise type, as well as the presence or absence of reverberation. On the other hand, the CMLLR adaptation of the acoustic models gives an average 20% increase in the aligned data percentage for the majority of the studied scenarios.

Oliver Watts, Adriana Stan, Rob Clark, Yoshitaka Mamiya, Mircea Giurgiu, Junichi Yamagishi, and Simon King. Unsupervised and lightly-supervised learning for rapid construction of TTS systems in multiple languages from 'found' data: evaluation and analysis. In 8th ISCA Workshop on Speech Synthesis, pages 121-126, Barcelona, Spain, August 2013. [ bib | .pdf ]

This paper presents techniques for building text-to-speech front-ends in a way that avoids the need for language-specific expert knowledge, but instead relies on universal resources (such as the Unicode character database) and unsupervised learning from unannotated data to ease system development. The acquisition of expert language-specific knowledge and expert annotated data is a major bottleneck in the development of corpus-based TTS systems in new languages. The methods presented here side-step the need for such resources as pronunciation lexicons, phonetic feature sets, part of speech tagged data, etc. The paper explains how the techniques introduced are applied to the 14 languages of a corpus of `found' audiobook data. Results of an evaluation of the intelligibility of the systems resulting from applying these novel techniques to this data are presented.

Jaime Lorenzo-Trueba, Roberto Barra-Chicote, Junichi Yamagishi, Oliver Watts, and Juan M. Montero. Towards speaking style transplantation in speech synthesis. In 8th ISCA Workshop on Speech Synthesis, pages 179-183, Barcelona, Spain, August 2013. [ bib | .pdf ]

One of the biggest challenges in speech synthesis is the production of naturally sounding synthetic voices. This means that the resulting voice must be not only of high enough quality but also that it must be able to capture the natural expressiveness imbued in human speech. This paper focus on solving the expressiveness problem by proposing a set of different techniques that could be used for extrapolating the expressiveness of proven high quality speaking style models into neutral speakers in HMM-based synthesis. As an additional advantage, the proposed techniques are based on adaptation approaches, which means that they can be used with little training data (around 15 minutes of training data are used in each style for this pa- per). For the final implementation, a set of 4 speaking styles were considered: news broadcasts, live sports commentary, interviews and parliamentary speech. Finally, the implementation of the 5 techniques were tested through a perceptual evaluation that proves that the deviations between neutral and speaking style average models can be learned and used to imbue expressiveness into target neutral speakers as intended.

Adriana Stan, Oliver Watts, Yoshitaka Mamiya, Mircea Giurgiu, Rob Clark, Junichi Yamagishi, and Simon King. TUNDRA: A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]

Simple4All Tundra (version 1.0) is the first release of a standardised multilingual corpus designed for text-to-speech research with imperfect or found data. The corpus consists of approximately 60 hours of speech data from audiobooks in 14 languages, as well as utterance-level alignments obtained with a lightly-supervised process. Future versions of the corpus will include finer-grained alignment and prosodic annotation, all of which will be made freely available. This paper gives a general outline of the data collected so far, as well as a detailed description of how this has been done, emphasizing the minimal language-specific knowledge and manual intervention used to compile the corpus. To demonstrate its potential use, text-to-speech systems have been built for all languages using unsupervised or lightly supervised methods, also briefly presented in the paper.

Oliver Watts, Adriana Stan, Yoshitaka Mamiya, Antti Suni, José Martín Burgos, and Juan Manuel Montero. The Simple4All entry to the Blizzard Challenge 2013. In Proc. Blizzard Challenge 2013, August 2013. [ bib | .pdf ]

We describe the synthetic voices entered into the 2013 Blizzard Challenge by the SIMPLE4ALL consortium. The 2013 Blizzard Challenge presents an opportunity to test and benchmark some of the tools we have been developing to address two problems of interest: 1) how best to learn from plentiful 'found' data, and 2) how to produce systems in arbitrary new languages with minimal annotated data and language-specific expertise on the part of the system builders. We here explain how our tools were used to address these problems on the different tasks of the challenge, and provide some discussion of the evaluation results.

Atef Ben Youssef, Hiroshi Shimodaira, and David A. Braude. Articulatory features for speech-driven head motion synthesis. In Proc. Interspeech, pages 2758-2762, Lyon, France, August 2013. [ bib | .pdf ]

This study investigates the use of articulatory features for speech-driven head motion synthesis as opposed to prosody features such as F0 and energy which have been mainly used in the literature. In the proposed approach, multi-stream HMMs are trained jointly on the synchronous streams of speech and head motion data. Articulatory features can be regarded as an intermediate parametrisation of speech that are expected to have a close link with head movement. Measured head and articulatory movements acquired by EMA were synchronously recorded with speech. Measured articulatory data was compared to those predicted from speech using an HMM-based inversion mapping system trained in a semi-supervised fashion. Canonical correlation analysis (CCA) on a data set of free speech of 12 people shows that the articulatory features are more correlated with head rotation than prosodic and/or cepstral speech features. It is also shown that the synthesised head motion using articulatory features give higher correlations with the original head motion than when only prosodic features are used.

David A. Braude, Hiroshi Shimodaira, and Atef Ben Youssef. Template-warping based speech driven head motion synthesis. In Proc. Interspeech, pages 2763-2767, Lyon, France, August 2013. [ bib | .pdf ]

We propose a method for synthesising head motion from speech using a combination of an Input-Output Markov model (IOMM) and Gaussian mixture models trained in a supervised manner. A key difference of this approach compared to others is to model the head motion in each angle as a series of templates of motion rather than trying to recover a frame-wise function. The templates were chosen to reflect natural patterns in the head motion, and states for the IOMM were chosen based on statistics of the templates. This reduces the search space for the trajectories and stops impossible motions such as discontinuities from being possible. For synthesis our system warps the templates to account for the acoustic features and the other angles’ warping parameters. We show our system is capable of recovering the statistics of the motion that were chosen for the states. Our system was then compared to a baseline that used a frame-wise mapping that is based on previously published work. A subjective preference test that includes multiple speakers showed participants have a preference for the segment based approach. Both of these systems were trained on storytelling free speech.

Karel Vesely, Arnab Ghoshal, Lukáš Burget, and Daniel Povey. Sequence-discriminative training of deep neural networks. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), Lyon, France, August 2013. [ bib | .pdf ]

Sequence-discriminative training of deep neural networks (DNNs) is investigated on a 300 hour American English conversational telephone speech task. Different sequence-discriminative criteria - maximum mutual information (MMI), minimum phone error (MPE), state-level minimum Bayes risk (sMBR), and boosted MMI - are compared. Two different heuristics are investigated to improve the performance of the DNNs trained using sequence-based criteria - lattices are re-generated after the first iteration of training; and, for MMI and BMMI, the frames where the numerator and denominator hypotheses are disjoint are removed from the gradient computation. Starting from a competitive DNN baseline trained using cross-entropy, different sequence-discriminative criteria are shown to lower word error rates by 8-9% relative, on average. Little difference is noticed between the different sequence-based criteria that are investigated. The experiments are done using the open-source Kaldi toolkit, which makes it possible for the wider community to reproduce these results.

Keywords: myPubs, neural networks, discriminative training

Korin Richmond, Zhenhua Ling, Junichi Yamagishi, and Benigno Uría. On the evaluation of inversion mapping performance in the acoustic domain. In Proc. Interspeech, pages 1012-1016, Lyon, France, August 2013. [ bib | .pdf ]

The two measures typically used to assess the performance of an inversion mapping method, where the aim is to estimate what articulator movements gave rise to a given acoustic signal, are root mean squared (RMS) error and correlation. In this paper, we investigate whether “task-based” evaluation using an articulatory-controllable HMM-based speech synthesis system can give useful additional information to complement these measures. To assess the usefulness of this evaluation approach, we use articulator trajectories estimated by a range of different inversion mapping methods as input to the synthesiser, and measure their performance in the acoustic domain in terms of RMS error of the generated acoustic parameters and with a listening test involving 30 participants. We then compare these results with the standard RMS error and correlation measures calculated in the articulatory domain. Interestingly, in the acoustic evaluation we observe one method performs with no statistically significant difference from measured articulatory data, and cases where statistically significant differences between methods exist which are not reflected in the results of the two standard measures. From our results, we conclude such task-based evaluation can indeed provide interesting extra information, and gives a useful way to compare inversion methods.

Keywords: Inversion mapping, evaluation, HMM synthesis

James Scobbie, Alice Turk, Christian Geng, Simon King, Robin Lickley, and Korin Richmond. The Edinburgh speech production facility DoubleTalk corpus. In Proc. Interspeech, Lyon, France, August 2013. [ bib | .pdf ]

The DoubleTalk articulatory corpus was collected at the Edinburgh Speech Production Facility (ESPF) using two synchronized Carstens AG500 electromagnetic articulometers. The first release of the corpus comprises orthographic transcriptions aligned at phrasal level to EMA and audio data for each of 6 mixed-dialect speaker pairs. It is available from the ESPF online archive. A variety of tasks were used to elicit a wide range of speech styles, including monologue (a modified Comma Gets a Cure and spontaneous story-telling), structured spontaneous dialogue (Map Task and Diapix), a wordlist task, a memory-recall task, and a shadowing task. In this session we will demo the corpus with various examples.

Keywords: discourse, EMA, spontaneous speech

Maria Astrinaki, Alexis Moinet, Junichi Yamagishi, Korin Richmond, Zhen-Hua Ling, Simon King, and Thierry Dutoit. Mage - HMM-based speech synthesis reactively controlled by the articulators. In 8th ISCA Workshop on Speech Synthesis, page 243, Barcelona, Spain, August 2013. [ bib | .pdf ]

In this paper, we present the recent progress in the MAGE project. MAGE is a library for realtime and interactive (reactive) parametric speech synthesis using hidden Markov models (HMMs). Here, it is broadened in order to support not only the standard acoustic features (spectrum and f0) to model and synthesize speech but also to combine acoustic and articulatory features, such as tongue, lips and jaw positions. Such an integration enables the user to have a straight forward and meaningful control space to intuitively modify the synthesized phones in real time only by configuring the position of the articulators.

Keywords: speech synthesis, reactive, articulators

Chee-Ming Ting, Simon King, Sh-Hussain Salleh, and A. K. Ariff. Discriminative tandem features for HMM-based EEG classification. In Proc. 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 13), Osaka, Japan, July 2013. [ bib | .pdf ]

We investigate the use of discriminative feature extractors in tandem configuration with generative EEG classification system. Existing studies on dynamic EEG classification typically use hidden Markov models (HMMs) which lack discriminative capability. In this paper, a linear and a non-linear classifier are discriminatively trained to produce complementary input features to the conventional HMM system. Two sets of tandem features are derived from linear discriminant analysis (LDA) projection output and multilayer perceptron (MLP) class-posterior probability, before appended to the standard autoregressive (AR) features. Evaluation on a two-class motor-imagery classification task shows that both the proposed tandem features yield consistent gains over the AR baseline, resulting in significant relative improvement of 6.2% and 11.2% for the LDA and MLP features respectively. We also explore portability of these features across different subjects.

Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Today, Heiga Zen, Junichi Yamagishi, and Keiichiro Oura. Speech synthesis based on hidden markov models. Proceedings of the IEEE, 101(6), June 2013. (in press). [ bib ]

This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech. The main advantage of this approach is its flexibility in changing speaker identities, emotions, and speaking styles. This paper also discusses the relation between the HMM-based approach and the more conventional unit-selection approach that has dominated over the last decades. Finally, advanced techniques for future developments are described.

C. Valentini-Botinhao, E. Godoy, Y. Stylianou, B. Sauert, S. King, and J. Yamagishi. Improving intelligibility in noise of HMM-generated speech via noise-dependent and -independent methods. In Proc. ICASSP, Vancouver, Canada, May 2013. [ bib | .pdf ]

H. Lu and S. King. Factorized context modelling for text-to-speech synthesis. In Proc. ICASSP, Vancouver, Canada, May 2013. [ bib | .pdf ]

Because speech units are so context-dependent, a large number of linguistic context features are generally used by HMM- based Text-to-Speech (TTS) speech synthesis systems, via context-dependent models. Since it is impossible to train separate models for every context, decision trees are used to discover the most important combinations of features that should be modelled. The task of the decision tree is very hard to generalize from a very small observed part of the context feature space to the rest and they have a major weakness: they cannot directly take advantage of factorial properties: they subdivide the model space based on one feature at a time. We propose a Dynamic Bayesian Network (DBN) based Mixed Memory Markov Model (MMMM) to provide factorization of the context space. The results of a listening test are provided as evidence that the model successfully learns the factorial nature of this space.

Mark Sinclair and Simon King. Where are the challenges in speaker diarization? In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, Vancouver, British Columbia, USA, May 2013. [ bib | .pdf ]

We present a study on the contributions to Diarization Error Rate by the various components of speaker diarization system. Following on from an earlier study by Huijbregts and Wooters, we extend into more areas and draw somewhat different conclusions. From a series of experiments combining real, oracle and ideal system components, we are able to conclude that the primary cause of error in diarization is the training of speaker models on impure data, something that is in fact done in every current system. We conclude by suggesting ways to improve future systems, including a focus on training the speaker models from smaller quantities of pure data instead of all the data, as is currently done.

Ramya Rasipuram, Peter Bell, and Mathew Magimai.-Doss. Grapheme and multilingual posterior features for under-resourced speech recognition: a study on Scottish Gaelic. In Proc. ICASSP, Vancouver, Canada, May 2013. [ bib | .pdf ]

Standard automatic speech recognition (ASR) systems use phonemes as subword units. Thus, one of the primary resources required to build a good ASR system is a well developed phoneme pronunciation lexicon. However, under-resourced languages typically lack such lexical resources. In this paper, we investigate recently proposed grapheme-based ASR in the framework of Kullback-Leibler divergence based hidden Markov model (KL-HMM) for under-resourced languages, particularly Scottish Gaelic which has no lexical resources. More specifically, we study the use of grapheme and multilingual phoneme class conditional probabilities (posterior features) as feature observations in the KL-HMM. ASR studies conducted show that the proposed approach yields better system compared to the conventional HMM/GMM approach using cepstral features. Furthermore, grapheme posterior features estimated using both auxiliary data and Gaelic data yield the best system.

Peter Bell, Pawel Swietojanski, and Steve Renals. Multi-level adaptive networks in tandem and hybrid ASR systems. In Proc. ICASSP, Vancouver, Canada, May 2013. [ bib | DOI | .pdf ]

In this paper we investigate the use of Multi-level adaptive networks (MLAN) to incorporate out-of-domain data when training large vocabulary speech recognition systems. In a set of experiments on multi-genre broadcast data and on TED lecture recordings we present results using of out-of-domain features in a hybrid DNN system and explore tandem systems using a variety of input acoustic features. Our experiments indicate using the MLAN approach in both hybrid and tandem systems results in consistent reductions in word error rate of 5-10% relative.

Mark Hartswood, Maria Wolters, Jenny Ure, Stuart Anderson, and Marina Jirotka. Socio-material design for computer mediated social sensemaking. In Proc. CHI Workshop on Explorations in Social Interaction Design, April 2013. [ bib | .pdf ]

Telemonitoring healthcare solutions often struggle to provide the hoped for efficiency improvements in managing chronic illness because of the difficulty interpreting sensor data remotely. Computer-Mediated Social Sensemaking (CMSS) is an approach to solving this problem that leverages the patient's social network to supply the missing contextual detail so that remote doctors can make more accurate decisions. However, implementing CMSS solutions is difficult because users need to know who can see which information, and whether private and confidential information is adequately protected. In this paper, we wish to explore how socio-material design solutions might offer ways of making properties of a CMSS solution tangible to participants so that they can control and understand the implications of their participation.

John Dines, Hui Liang, Lakshmi Saheer, Matthew Gibson, William Byrne, Keiichiro Oura, Keiichi Tokuda, Junichi Yamagishi, Simon King, Mirjam Wester, Teemu Hirsimäki, Reima Karhila, and Mikko Kurimo. Personalising speech-to-speech translation: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. Computer Speech and Language, 27(2):420-437, February 2013. [ bib | DOI | http ]

In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation in which we employ a HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions including the need for better evaluation metrics.

Keywords: Speech-to-speech translation, Cross-lingual speaker adaptation, HMM-based speech synthesis, Speaker adaptation, Voice conversion

Cassia Valentini-Botinhao. Intelligibility enhancement of synthetic speech in noise. PhD thesis, University of Edinburgh, 2013. [ bib | .pdf ]

Speech technology can facilitate human-machine interaction and create new communication interfaces. Text-To-Speech (TTS) systems provide speech output for dialogue, notification and reading applications as well as personalized voices for people that have lost the use of their own. TTS systems are built to produce synthetic voices that should sound as natural, expressive and intelligible as possible and if necessary be similar to a particular speaker. Although naturalness is an important requirement, providing the correct information in adverse conditions can be crucial to certain applications. Speech that adapts or reacts to different listening conditions can in turn be more expressive and natural. In this work we focus on enhancing the intelligibility of TTS voices in additive noise. For that we adopt the statistical parametric paradigm for TTS in the shape of a hidden Markov model (HMM-) based speech synthesis system that allows for flexible enhancement strategies. Little is known about which human speech production mechanisms actually increase intelligibility in noise and how the choice of mechanism relates to noise type, so we approached the problem from another perspective: using mathematical models for hearing speech in noise. To find which models are better at predicting intelligibility of TTS in noise we performed listening evaluations to collect subjective intelligibility scores which we then compared to the models’ predictions. In these evaluations we observed that modifications performed on the spectral envelope of speech can increase intelligibility significantly, particularly if the strength of the modification depends on the noise and its level. We used these findings to inform the decision of which of the models to use when automatically modifying the spectral envelope of the speech according to the noise. We devised two methods, both involving cepstral coefficient modifications. The first was applied during extraction while training the acoustic models and the other when generating a voice using pre-trained TTS models. The latter has the advantage of being able to address fluctuating noise. To increase intelligibility of synthetic speech at generation time we proposed a method for Mel cepstral coefficient modification based on the glimpse proportion measure, the most promising of the models of speech intelligibility that we evaluated. An extensive series of listening experiments demonstrated that this method brings significant intelligibility gains to TTS voices while not requiring additional recordings of clear or Lombard speech. To further improve intelligibility we combined our method with noise-independent enhancement approaches based on the acoustics of highly intelligible speech. This combined solution was as effective for stationary noise as for the challenging competing speaker scenario, obtaining up to 4dB of equivalent intensity gain. Finally, we proposed an extension to the speech enhancement paradigm to account for not only energetic masking of signals but also for linguistic confusability of words in sentences. We found that word level confusability, a challenging value to predict, can be used as an additional prior to increase intelligibility even for simple enhancement methods like energy reallocation between words. These findings motivate further research into solutions that can tackle the effect of energetic masking on the auditory system as well as on higher levels of processing.

Y. Tang, M. Cooke, and C. Valentini-Botinhao. A distortion-weighted glimpse-based intelligibility metric for modified and synthetic speech. In Proc. SPIN, 2013. [ bib | .pdf ]

M. Cooke, C. Mayo, C. Valentini-Botinhao, Y. Stylianou, B. Sauert, and Y. Tang. Evaluating the intelligibility benefit of speech modifications in known noise conditions. Speech Communication, 55:572-585, 2013. [ bib | .pdf ]

The use of live and recorded speech is widespread in applications where correct message reception is important. Furthermore, the deployment of synthetic speech in such applications is growing. Modifications to natural and synthetic speech have therefore been proposed which aim at improving intelligibility in noise. The current study compares the benefits of speech modification algorithms in a large-scale speech intelligibility evaluation and quantifies the equivalent intensity change, defined as the amount in decibels that unmodified speech would need to be adjusted by in order to achieve the same intelligibility as modified speech. Listeners identified keywords in phonetically-balanced sentences representing ten different types of speech: plain and Lombard speech, five types of modified speech, and three forms of synthetic speech. Sentences were masked by either a stationary or a competing speech masker. Modification methods varied in the manner and degree to which they exploited estimates of the masking noise. The best-performing modifications led to equivalent intensity changes of around 5 dB in moderate and high noise levels for the stationary masker, and 3-4 dB in the presence of competing speech. These gains exceed those produced by Lombard speech. Synthetic speech in noise was always less intelligible than plain natural speech, but modified synthetic speech reduced this deficit by a significant amount.

Liang Lu, KK Chin, Arnab Ghoshal, and Steve Renals. Joint uncertainty decoding for noise robust subspace Gaussian mixture models. IEEE Transactions on Audio, Speech and Language Processing, 21(9):1791-1804, 2013. [ bib | DOI | .pdf ]

Joint uncertainty decoding (JUD) is a model-based noise compensation technique for conventional Gaussian Mixture Model (GMM) based speech recognition systems. Unlike vector Taylor series (VTS) compensation which operates on the individual Gaussian components in an acoustic model, JUD clusters the Gaussian components into a smaller number of classes, sharing the compensation parameters for the set of Gaussians in a given class. This significantly reduces the computational cost. In this paper, we investigate noise compensation for subspace Gaussian mixture model (SGMM) based speech recognition systems using JUD. The total number of Gaussian components in an SGMM is typically very large. Therefore direct compensation of the individual Gaussian components, as performed by VTS, is computationally expensive. In this paper we show that JUD-based noise compensation can be successfully applied to SGMMs in a computationally efficient way. We evaluate the JUD/SGMM technique on the standard Aurora 4 corpus. Our experimental results indicate that the JUD/SGMM system results in lower word error rates compared with a conventional GMM system with either VTS-based or JUD-based noise compensation.

Pawel Swietojanski, Arnab Ghoshal, and Steve Renals. Revisiting hybrid and GMM-HMM system combination techniques. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013. [ bib | DOI | .pdf ]

In this paper we investigate techniques to combine hybrid HMM-DNN (hidden Markov model - deep neural network) and tandem HMM-GMM (hidden Markov model - Gaussian mixture model) acoustic models using: (1) model averaging, and (2) lattice combination with Minimum Bayes Risk decoding. We have performed experiments on the “TED Talks” task following the protocol of the IWSLT-2012 evaluation. Our experimental results suggest that DNN-based and GMM- based acoustic models are complementary, with error rates being reduced by up to 8% relative when the DNN and GMM systems are combined at model-level in a multi-pass auto- matic speech recognition (ASR) system. Additionally, further gains were obtained by combining model-averaged lat- tices with the one obtained from baseline systems.

Arnab Ghoshal, Pawel Swietojanski, and Steve Renals. Multilingual training of deep neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013. [ bib | DOI | .pdf ]

We investigate multilingual modeling in the context of a deep neural network (DNN) - hidden Markov model (HMM) hy- brid, where the DNN outputs are used as the HMM state like- lihoods. By viewing neural networks as a cascade of fea- ture extractors followed by a logistic regression classifier, we hypothesise that the hidden layers, which act as feature ex- tractors, will be transferable between languages. As a corol- lary, we propose that training the hidden layers on multiple languages makes them more suitable for such cross-lingual transfer. We experimentally confirm these hypotheses on the GlobalPhone corpus using seven languages from three dif- ferent language families: Germanic, Romance, and Slavic. The experiments demonstrate substantial improvements over a monolingual DNN-HMM hybrid baseline, and hint at av- enues of further exploration.

Z. Ling, K. Richmond, and J. Yamagishi. Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression. Audio, Speech, and Language Processing, IEEE Transactions on, 21(1):207-219, 2013. [ bib | DOI | .pdf ]

In previous work we proposed a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into a hidden Markov model (HMM) based parametric speech synthesiser. In this method, a unified acoustic-articulatory model is trained, and context-dependent linear transforms are used to model the dependency between the two feature streams. In this paper, we go significantly further and propose a feature-space-switched multiple regression HMM to improve the performance of articulatory control. A multiple regression HMM (MRHMM) is adopted to model the distribution of acoustic features, with articulatory features used as exogenous explanatory variables. A separate Gaussian mixture model (GMM) is introduced to model the articulatory space, and articulatory-to-acoustic regression matrices are trained for each component of this GMM, instead of for the context-dependent states in the HMM. Furthermore, we propose a task-specific context feature tailoring method to ensure compatibility between state context features and articulatory features that are manipulated at synthesis time. The proposed method is evaluated on two tasks, using a speech database with acoustic waveforms and articulatory movements recorded in parallel by electromagnetic articulography (EMA). In a vowel identity modification task, the new method achieves better performance when reconstructing target vowels by varying articulatory inputs than our previous approach. A second vowel creation task shows our new method is highly effective at producing a new vowel from appropriate articulatory representations which, even though no acoustic samples for this vowel are present in the training data, is shown to sound highly natural.

Àngel Calzada Defez, Joan Claudi Socoró Carrié, and Robert Clark. Parametric model for vocal effort interpolation with harmonics plus noise models. In Proc. 8th ISCA Speech Synthesis Workshop, pages 25-30, 2013. [ bib | .pdf ]

It is known that voice quality plays an important role in expressive speech. In this paper, we present a methodology for modifying vocal effort level, which can be applied by text-to-speech (TTS) systems to provide the flexibility needed to improve the naturalness of synthesized speech. This extends previous work using low order Linear Prediction Coefficients (LPC) where the flexibility was constrained by the amount of vocal effort levels available in the corpora. The proposed methodology overcomes these limitations by replacing the low order LPC by ninth order polynomials to allow not only vocal effort to be modified towards the available templates, but also to allow the generation of intermediate vocal effort levels between levels available in training data. This flexibility comes from the combination of Harmonics plus Noise Models and using a parametric model to represent the spectral envelope. The conducted perceptual tests demonstrate the effectiveness of the proposed technique in per- forming vocal effort interpolations while maintaining the signal quality in the final synthesis. The proposed technique can be used in unit-selection TTS systems to reduce corpus size while increasing its flexibility, and the techniques could potentially be employed by HMM based speech synthesis systems if appropriate acoustic features are being used.

Sarah Creer, Stuart Cunningham, Phil Green, and Junichi Yamagishi. Building personalised synthetic voices for individuals with severe speech impairment. Computer Speech & Language, 27(6):1178 - 1193, 2013. <ce:title>Special Issue on Speech and Language Processing for Assistive Technology</ce:title>. [ bib | DOI | http ]

Keywords: Voice output communication aid

Catherine Lai, Jean Carletta, and Steve Renals. Detecting summarization hot spots in meetings using group level involvement and turn-taking features. In Proc. Interspeech 2013, Lyon, France, 2013. [ bib | .pdf ]

In this paper we investigate how participant involvement and turn-taking features relate to extractive summarization of meeting dialogues. In particular, we examine whether automatically derived measures of group level involvement, like participation equality and turn-taking freedom, can help detect where summarization relevant meeting segments will be. Results show that classification using turn-taking features performed better than the majority class baseline for data from both AMI and ICSI meeting corpora in identifying whether meeting segments contain extractive summary dialogue acts. The feature based approach also provided better recall than using manual ICSI involvement hot spot annotations. Turn-taking features were additionally found to be predictive of the amount of extractive summary content in a segment. In general, we find that summary content decreases with higher participation equality and overlap, while it increases with the number of very short utterances. Differences in results between the AMI and ICSI data sets suggest how group participatory structure can be used to understand what makes meetings easy or difficult to summarize.

Catherine Lai, Jean Carletta, and Steve Renals. Modelling participant affect in meetings with turn-taking features. In Proceedings of WASSS 2013, Grenoble, France, 2013. [ bib | .pdf ]

This paper explores the relationship between turn-taking and meeting affect. To investigate this, we model post-meeting ratings of satisfaction, cohesion and leadership from participants of AMI corpus meetings using group and individual turn-taking features. The results indicate that participants gave higher satisfaction and cohesiveness ratings to meetings with greater group turn-taking freedom and individual very short utterance rates, while lower ratings were associated with more silence and speaker overlap. Besides broad applicability to satisfaction ratings, turn-taking freedom was found to be a better predictor than equality of speaking time when considering whether participants felt that everyone they had a chance to contribute. If we include dialogue act information, we see that substantive feedback type turns like assessments are more predictive of meeting affect than information giving acts or backchannels. This work highlights the importance of feedback turns and modelling group level activity in multiparty dialogue for understanding the social aspects of speech.

Catherine Lai, Keelan Evanini, and Klaus Zechner. Applying rhythm metrics to non-native spontaneous speech. In Proceedings of SLaTE 2013, Grenoble, France, 2013. [ bib | .pdf ]

This study investigates a variety of rhythm metrics on two corpora of non-native spontaneous speech and compares the nonnative distributions to values from a corpus of native speech. Several of the metrics are shown to differentiate well between native and non-native speakers and to also have moderate correlations with English proficiency scores that were assigned to the non-native speech. The metric that had the highest correlation with English proficiency scores (apart from speaking rate) was rPVIsyl (the raw Pairwise Variability Index for syllables), with r = 0.43.

P. Lanchantin, P. Bell, M. Gales, T. Hain, X. Liu, Y. Long, J. Quinnell, S. Renals, O. Saz, M. Seigel, P. Swietojanski, and P. Woodland. Automatic transcription of multi-genre media archives. In Proc. Workshop on Speech, Language and Audio in Multimedia, Marseille, France, 2013. [ bib | .pdf ]

This paper describes some recent results of our collaborative work on developing a speech recognition system for the automatic transcription or media archives from the British Broadcasting Corporation (BBC). Material includes a high diversity of shows with their associated transcriptions. The latter are highly diverse in terms of completeness, reliability and accuracy. First, we investigate how to improve lightly supervised acoustic training when time-stamps information is inaccurate or when speech deviates significantly from the transcription. To address the last issue, word and segment level combination approaches are used between the lightly supervised transcripts and the original programme scripts which yield improved transcriptions. Experimental results show that systems trained using these improved transcriptions consistently outperform those trained using only the original lightly supervised decoding hypotheses. Secondly, we show that the recognition task may benefit from systems trained on a combination of in-domain and out-of-domain data. Working with tandem HMMs, we present Multi-level Adaptive Networks, a novel technique for incorporating information from out-of domain posterior features using deep neural network. We show that it provides a substantial reduction in WER over other systems including PLP baseline, in-domain tandem features and best out-of-domain tandem features.

David Adam Braude, Hiroshi Shimodaira, and Atef Ben Youssef. Template-warping based speech driven head motion synthesis. In Interspeech, pages 2763 - 2767, 2013. [ bib | .pdf ]

We propose a method for synthesising head motion from speech using a combination of an Input-Output Markov model (IOMM) and Gaussian mixture models trained in a supervised manner. A key difference of this approach compared to others is to model the head motion in each angle as a series of templates of motion rather than trying to recover a frame-wise function. The templates were chosen to reflect natural patterns in the head motion, and states for the IOMM were chosen based on statistics of the templates. This reduces the search space for the trajectories and stops impossible motions such as discontinuities from being possible. For synthesis our system warps the templates to account for the acoustic features and the other angles' warping parameters. We show our system is capable of recovering the statistics of the motion that were chosen for the states. Our system was then compared to a baseline that used a frame-wise mapping that is based on previously published work. A subjective preference test that includes multiple speakers showed participants have a preference for the segment based approach. Both of these systems were trained on storytelling free speech.

Keywords: Head motion synthesis, GMMs, IOMM

Javier Tejedor, Doroteo T. Toledano, Dong Wang, Simon King, and Jose Colas. Feature analysis for discriminative confidence estimation in spoken term detection. Computer Speech and Language, To appear, 2013. [ bib | .pdf ]

Discriminative confidence based on multi-layer perceptrons (MLPs) and multiple features has shown significant advantage compared to the widely used lattice-based confidence in spoken term detection (STD). Although the MLP-based framework can handle any features derived from a multitude of sources, choosing all possible features may lead to over complex models and hence less generality. In this paper, we design an extensive set of features and analyze their contribution to STD individually and as a group. The main goal is to choose a small set of features that are sufficiently informative while keeping the model simple and generalizable. We employ two established models to conduct the analysis: one is linear regression which targets for the most relevant features and the other is logistic linear regression which targets for the most discriminative features. We find the most informative features are comprised of those derived from diverse sources (ASR decoding, duration and lexical properties) and the two models deliver highly consistent feature ranks. STD experiments on both English and Spanish data demonstrate significant performance gains with the proposed feature sets.

P. Lal and S. King. Cross-lingual automatic speech recognition using tandem features. IEEE Transactions on Audio, Speech, and Language Processing, To appear, 2013. [ bib | DOI | .pdf ]

Automatic speech recognition depends on large amounts of transcribed speech recordings in order to estimate the parameters of the acoustic model. Recording such large speech corpora is time-consuming and expensive; as a result, sufficient quantities of data exist only for a handful of languages — there are many more languages for which little or no data exist. Given that there are acoustic similarities between speech in different languages, it may be fruitful to use data from a well-resourced source language to estimate the acoustic models for a recogniser in a poorly-resourced target language. Previous approaches to this task have often involved making assumptions about shared phonetic inventories between the languages. Unfortunately pairs of languages do not generally share a common phonetic inventory. We propose an indirect way of transferring information from a source language acoustic model to a target language acoustic model without having to make any assumptions about the phonetic inventory overlap. To do this, we employ tandem features, in which class-posteriors from a separate classifier are decorrelated and appended to conventional acoustic features. Tandem features have the advantage that the language of the speech data used to train the classifier need not be the same as the target language to be recognised. This is because the class-posteriors are not used directly, so do not have to be over any particular set of classes. We demonstrate the use of tandem features in cross-lingual settings, including training on one or several source languages. We also examine factors which may predict a priori how much relative improvement will be brought about by using such tandem features, for a given source and target pair. In addition to conventional phoneme class-posteriors, we also investigate whether articulatory features (AFs) - a multistream, discrete, multi-valued labelling of speech — can be used instead. This is motivated by an assumption that AFs are less language-specific than a phoneme set.

Keywords: Acoustics;Data models;Hidden Markov models;Speech;Speech recognition;Training;Transforms;Automatic speech recognition;Multilayer perceptrons

Yoshitaka Mamiya, Junichi Yamagishi, Oliver Watts, Robert A.J. Clark, Simon King, and Adriana Stan. Lightly supervised gmm vad to use audiobook for speech synthesiser. In Proc. ICASSP, 2013. [ bib | .pdf ]

Audiobooks have been focused on as promising data for training Text-to-Speech (TTS) systems. However, they usually do not have a correspondence between audio and text data. Moreover, they are usually divided only into chapter units. In practice, we have to make a correspondence of audio and text data before we use them for building TTS synthesisers. However aligning audio and text data is time-consuming and involves manual labor. It also requires persons skilled in speech processing. Previously, we have proposed to use graphemes for automatically aligning speech and text data. This paper further integrates a lightly supervised voice activity detection (VAD) technique to detect sentence boundaries as a pre-processing step before the grapheme approach. This lightly supervised technique requires time stamps of speech and silence only for the first fifty sentences. Combining those, we can semi-automatically build TTS systems from audiobooks with minimum manual intervention. From subjective evaluations we analyse how the grapheme-based aligner and/or the proposed VAD technique impact the quality of HMM-based speech synthesisers trained on audiobooks.

Liang Lu, Arnab Ghoshal, and Steve Renals. Noise adaptive training for subspace Gaussian mixture models. In Proc. Interspeech, 2013. [ bib | .pdf ]

Noise adaptive training (NAT) is an effective approach to normalise environmental distortions when training a speech recogniser on noise-corrupted speech. This paper investigates the model-based NAT scheme using joint uncertainty decoding (JUD) for subspace Gaussian mixture models (SGMMs). A typical SGMM acoustic model has much larger number of surface Gaussian components, which makes it computationally infeasible to compensate each Gaussian explicitly. JUD tackles this problem by sharing the compensation parameters among the Gaussians and hence reduces the computational and memory demands. For noise adaptive training, JUD is reformulated into a generative model, which leads to an efficient expectation-maximisation (EM) based algorithm to update the SGMM acoustic model parameters. We evaluated the SGMMs with NAT on the Aurora 4 database, and obtained higher recognition accuracy compared to systems without adaptive training.

Liang Lu, Arnab Ghoshal, and Steve Renals. Acoustic data-driven pronunciation lexicon for large vocabulary speech recognition. In Proc. ASRU, 2013. [ bib | DOI | .pdf ]

Speech recognition systems normally use handcrafted pronunciation lexicons designed by linguistic experts. Building and maintaining such a lexicon is expensive and time consuming. This paper concerns automatically learning a pronunciation lexicon for speech recognition. We assume the availability of a small seed lexicon and then learn the pronunciations of new words directly from speech that is transcribed at word-level. We present two implementations for refining the putative pronunciations of new words based on acoustic evidence. The first one is an expectation maximization (EM) algorithm based on weighted finite state transducers (WFSTs) and the other is its Viterbi approximation. We carried out experiments on the Switchboard corpus of conversational telephone speech. The expert lexicon has a size of more than 30,000 words, from which we randomly selected 5,000 words to form the seed lexicon. By using the proposed lexicon learning method, we have significantly improved the accuracy compared with a lexicon learned using a grapheme-to-phoneme transformation, and have obtained a word error rate that approaches that achieved using a fully handcrafted lexicon.

Liang Lu. Subspace Gaussian Mixture Models for Automatic Speech Recognition. PhD thesis, University of Edinburgh, 2013. [ bib | .pdf ]

In most of state-of-the-art speech recognition systems, Gaussian mixture models (GMMs) are used to model the density of the emitting states in the hidden Markov models (HMMs). In a conventional system, the model parameters of each GMM are estimated directly and independently given the alignment. This results a large number of model parameters to be estimated, and consequently, a large amount of training data is required to fit the model. In addition, different sources of acoustic variability that impact the accuracy of a recogniser such as pronunciation variation, accent, speaker factor and environmental noise are only weakly modelled and factorized by adaptation techniques such as maximum likelihood linear regression (MLLR), maximum a posteriori adaptation (MAP) and vocal tract length normalisation (VTLN). In this thesis, we will discuss an alternative acoustic modelling approach - the subspace Gaussian mixture model (SGMM), which is expected to deal with these two issues better. In an SGMM, the model parameters are derived from low-dimensional model and speaker subspaces that can capture phonetic and speaker correlations. Given these subspaces, only a small number of state-dependent parameters are required to derive the corresponding GMMs. Hence, the total number of model parameters can be reduced, which allows acoustic modelling with a limited amount of training data. In addition, the SGMM-based acoustic model factorizes the phonetic and speaker factors and within this framework, other source of acoustic variability may also be explored. In this thesis, we propose a regularised model estimation for SGMMs, which avoids overtraining in case that the training data is sparse. We will also take advantage of the structure of SGMMs to explore cross-lingual acoustic modelling for low-resource speech recognition. Here, the model subspace is estimated from out-domain data and ported to the target language system. In this case, only the state-dependent parameters need to be estimated which relaxes the requirement of the amount of training data. To improve the robustness of SGMMs against environmental noise, we propose to apply the joint uncertainty decoding (JUD) technique that is shown to be efficient and effective. We will report experimental results on the Wall Street Journal (WSJ) database and GlobalPhone corpora to evaluate the regularisation and cross-lingual modelling of SGMMs. Noise compensation using JUD for SGMM acoustic models is evaluated on the Aurora 4 database.

David A. Braude, Hiroshi Shimodaira, and Atef Ben Youssef. The University of Edinburgh head-motion and audio storytelling (UoE-HaS) dataset. In Proc. Intelligent Virtual Agents, pages 466-467. Springer, 2013. [ bib | .pdf ]

In this paper we announce the release of a large dataset of storytelling monologue with motion capture for the head and body. Initial tests on the dataset indicate that head motion is more dependant on the speaker than the style of speech.

Elizabeth Godoy, Catherine Mayo, and Yannis Stylianou. Linking loudness increases in normal and Lombard speech to decreasing vowel formant separation. In Proc. Interspeech, 2013. [ bib | .PDF ]

The increased vocal effort associated with the Lombard reflex produces speech that is perceived as louder and judged to be more intelligible in noise than normal speech. Previous work illustrates that, on average, Lombard increases in loudness result from boosting spectral energy in a frequency band spanning the range of formants F1-F3, particularly for voiced speech. Observing additionally that increases in loudness across spoken sentences are spectro-temporally localized, the goal of this work is to further isolate these regions of maximal loudness by linking them to specific formant trends, explicitly considering here the vowel formant separation. For both normal and Lombard speech, this work illustrates that, as loudness increases in frequency bands containing formants (e.g. F1-F2 or F2-F3), the observed separation between formant frequencies decreases. From a production standpoint, these results seem to highlight a physiological trait associated with how humans increase the loudness of their speech, namely moving vocal tract resonances closer together. Particularly, for Lombard speech, this phenomena is exaggerated: that is, the Lombard speech is louder and formants in corresponding spectro-temporal regions are even closer together

Catherine Mayo, Fiona Gibbon, and Robert A. J. Clark. Phonetically trained and untrained adults' transcription of place of articulation for intervocalic lingual stops with intermediate acoustic cues. Journal of Speech, Language and Hearing Research, 56:779-791, 2013. [ bib | DOI ]

Purpose: In this study, the authors aimed to investigate how listener training and the presence of intermediate acoustic cues influence transcription variability for conflicting cue speech stimuli. Method: Twenty listeners with training in transcribing disordered speech, and 26 untrained listeners, were asked to make forced-choice labeling decisions for synthetic vowel–consonant–vowel (VCV) sequences "a doe" and "a go". Both the VC and CV transitions in these stimuli ranged through intermediate positions, from appropriate for /d/ to appropriate for /g/. Results: Both trained and untrained listeners gave more weight to the CV transitions than to the VC transitions. However, listener behavior was not uniform: The results showed a high level of inter- and intratranscriber inconsistency, with untrained listeners showing a nonsignificant tendency to be more influenced than trained listeners by CV transitions. Conclusions: Listeners do not assign consistent categorical labels to the type of intermediate, conflicting transitional cues that were present in the stimuli used in the current study and that are also present in disordered articulations. Although listener inconsistency in assigning labels to intermediate productions is not increased as a result of phonetic training, neither is it reduced by such training.

Keywords: speech perception, intermediate acoustic cues, phonetic transcription, multilevel logistic regression

Christian Geng, Alice Turk, James M. Scobbie, Cedric Macmartin, Philip Hoole, Korin Richmond, Alan Wrench, Marianne Pouplier, Ellen Gurman Bard, Ziggy Campbell, Catherine Dickie, Eddie Dubourg, William Hardcastle, Evia Kainada, Simon King, Robin Lickley, Satsuki Nakai, Steve Renals, Kevin White, and Ronny Wiegand. Recording speech articulation in dialogue: Evaluating a synchronized double electromagnetic articulography setup. Journal of Phonetics, 41(6):421 - 431, 2013. [ bib | DOI | http | .pdf ]

Abstract We demonstrate the workability of an experimental facility that is geared towards the acquisition of articulatory data from a variety of speech styles common in language use, by means of two synchronized electromagnetic articulography (EMA) devices. This approach synthesizes the advantages of real dialogue settings for speech research with a detailed description of the physiological reality of speech production. We describe the facility's method for acquiring synchronized audio streams of two speakers and the system that enables communication among control room technicians, experimenters and participants. Further, we demonstrate the feasibility of the approach by evaluating problems inherent to this specific setup: The first problem is the accuracy of temporal synchronization of the two {EMA} machines, the second is the severity of electromagnetic interference between the two machines. Our results suggest that the synchronization method used yields an accuracy of approximately 1 ms. Electromagnetic interference was derived from the complex-valued signal amplitudes. This dependent variable was analyzed as a function of the recording status - i.e. on/off - of the interfering machine's transmitters. The intermachine distance was varied between 1 m and 8.5 m. Results suggest that a distance of approximately 6.5 m is appropriate to achieve data quality comparable to that of single speaker recordings.

Ingmar Steiner, Korin Richmond, and Slim Ouni. Speech animation using electromagnetic articulography as motion capture data. In Proc. 12th International Conference on Auditory-Visual Speech Processing, pages 55-60, Annecy, France, 2013. [ bib | .pdf ]

Electromagnetic articulography (EMA) captures the position and orientation of a number of markers, attached to the articulators, during speech. As such, it performs the same function for speech that conventional motion capture does for full-body movements acquired with optical modalities, a long-time staple technique of the animation industry. In this paper, EMA data is processed from a motion-capture perspective and applied to the visualization of an existing multimodal corpus of articulatory data, creating a kinematic 3D model of the tongue and teeth by adapting a conventional motion capture based animation paradigm. This is accomplished using off-the-shelf, open-source software. Such an animated model can then be easily integrated into multimedia applications as a digital asset, allowing the analysis of speech production in an intuitive and accessible manner. The processing of the EMA data, its co-registration with 3D data from vocal tract magnetic resonance imaging (MRI) and dental scans, and the modeling workflow are presented in detail, and several issues discussed.

Keywords: speech production, articulatory data, electromagnetic articulography, vocal tract, motion capture, visualization

Peter Bell, Fergus McInnes, Siva Reddy Gangireddy, Mark Sinclair, Alexandra Birch, and Steve Renals. The UEDIN english ASR system for the IWSLT 2013 evaluation. In Proc. International Workshop on Spoken Language Translation, 2013. [ bib | .pdf ]

This paper describes the University of Edinburgh (UEDIN) English ASR system for the IWSLT 2013 Evaluation. Notable features of the system include deep neural network acoustic models in both tandem and hybrid configuration, cross-domain adaptation with multi-level adaptive networks, and the use of a recurrent neural network language model. Improvements to our system since the 2012 evaluation - which include the use of a significantly improved n-gram language model - result in a 19% relative WER reduction on the set.

E. Zwyssig, F. Faubel, S. Renals, and M. Lincoln. Recognition of overlapping speech using digital MEMS microphone arrays. In Proc IEEE ICASSP, 2013. [ bib | DOI | .pdf ]

This paper presents a new corpus comprising single and overlapping speech recorded using digital MEMS and analogue microphone arrays. In addition to this, the paper presents results from speech separation and recognition experiments on this data. The corpus is a reproduction of the multi-channel Wall Street Journal audio-visual corpus (MC-WSJ-AV), containing recorded speech in both a meeting room and an anechoic chamber using two different microphone types as well as two different array geometries. The speech separation and speech recognition experiments were performed using SRP-PHAT-based speaker localisation, superdirective beamforming and multiple post-processing schemes, such as residual echo suppression and binary masking. Our simple, cMLLR-based recognition system matches the performance of state-of-the-art ASR systems on the single speaker task and outperforms them on overlapping speech. The corpus will be made publicly available via the LDC in spring 2013.