P. Swietojanski and S. Renals. Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models. In Proc. IEEE Workshop on Spoken Language Technology, Lake Tahoe, USA, December 2014. [ bib | .pdf ]
This paper proposes a simple yet effective model-based neural network speaker adaptation technique that learns speaker-specific hidden unit contributions given adaptation data, without requiring any form of speaker-adaptive training, or labelled adaptation data. An additional amplitude parameter is defined for each hidden unit; the amplitude parameters are tied for each speaker, and are learned using unsupervised adaptation. We conducted experiments on the TED talks data, as used in the International Workshop on Spoken Language Translation (IWSLT) evaluations. Our results indicate that the approach can reduce word error rates on standard IWSLT test sets by about 8–15% relative compared to unadapted systems, with a further reduction of 4–6% relative when combined with feature-space maximum likelihood linear regression (fMLLR). The approach can be employed in most existing feed-forward neural network architectures, and we report results using various hidden unit activation functions: sigmoid, maxout, and rectifying linear units (ReLU).
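As a rough illustration of the idea in this abstract, the sketch below rescales each hidden unit's output by a speaker-specific amplitude. The function names, the `2 * sigmoid(r)` re-parameterisation of the amplitude, and the toy activations are assumptions made for illustration; the abstract states only that one amplitude parameter is defined per hidden unit and tied per speaker.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lhuc_amplitudes(r):
    """Map unconstrained per-unit parameters r to amplitudes in (0, 2).

    Assumption: amplitude = 2 * sigmoid(r), so r = 0 gives an amplitude
    of 1.0, i.e. the unadapted network.
    """
    return [2.0 * sigmoid(ri) for ri in r]

def adapt_hidden(hidden, r):
    """Scale each hidden-unit activation by its speaker-specific amplitude."""
    amps = lhuc_amplitudes(r)
    return [a * h for a, h in zip(amps, hidden)]

hidden = [0.2, 0.9, 0.5]       # hypothetical hidden-layer activations
r_speaker = [0.0, 0.0, 0.0]    # untrained speaker parameters: identity scaling
unadapted = adapt_hidden(hidden, r_speaker)
```

In an adaptation pass, only `r_speaker` would be updated on the speaker's data while the network weights stay fixed; that training loop is omitted here.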
Peter Bell, Pawel Swietojanski, Joris Driesen, Mark Sinclair, Fergus McInnes, and Steve Renals. The UEDIN ASR systems for the IWSLT 2014 evaluation. In Proc. IWSLT, South Lake Tahoe, USA, December 2014. [ bib | .pdf ]
This paper describes the University of Edinburgh (UEDIN) ASR systems for the 2014 IWSLT Evaluation. Notable features of the English system include deep neural network acoustic models in both tandem and hybrid configuration with the use of multi-level adaptive networks, LHUC adaptation and Maxout units. The German system includes lightly supervised training and a new method for dictionary generation. Our voice activity detection system now uses a semi-Markov model to incorporate a prior on utterance lengths. There are improvements of up to 30% relative WER on the tst2013 English test set.
Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King. Intelligibility enhancement of speech in noise. In Proceedings of the Institute of Acoustics, volume 36, pages 96-103, Birmingham, UK, October 2014. [ bib | .pdf ]
To maintain communication success, humans change the way they speak and hear according to many factors, like the age, gender, native language and social relationship between talker and listener. Other factors are dictated by how communication takes place, such as environmental factors like an active competing speaker or limitations on the communication channel. As in natural interaction, we expect to communicate with and use synthetic voices that can also adapt to different listening scenarios and keep the level of intelligibility high. Research in speech technology needs to account for this to change the way we transmit, store and artificially generate speech accordingly.
P. Swietojanski, A. Ghoshal, and S. Renals. Convolutional neural networks for distant speech recognition. Signal Processing Letters, IEEE, 21(9):1120-1124, September 2014. [ bib | DOI | .pdf ]
We investigate convolutional neural networks (CNNs) for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM). In the MDM case we explore a beamformed signal input representation compared with the direct use of multiple acoustic channels as a parallel input to the CNN. We have explored different weight sharing approaches, and propose a channel-wise convolution with two-way pooling. Our experiments, using the AMI meeting corpus, found that CNNs improve the word error rate (WER) by 6.5% relative compared to conventional deep neural network (DNN) models and 15.7% over a discriminatively trained Gaussian mixture model (GMM) baseline. For cross-channel CNN training, the WER improves by 3.5% relative over the comparable DNN structure. Compared with the best beamformed GMM system, cross-channel convolution reduces the WER by 9.7% relative, and matches the accuracy of a beamformed DNN.
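Several abstracts in this list report gains as relative WER reductions. A minimal sketch of that arithmetic follows; the function name and the example figures are hypothetical and are not taken from any of the papers.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Hypothetical figures: going from 50.0% to 45.0% WER is a drop of
# 5 points absolute, which is 10% relative.
example = relative_wer_reduction(50.0, 45.0)
```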
C. Valentini-Botinhao and M. Wester. Using linguistic predictability and the Lombard effect to increase the intelligibility of synthetic speech in noise. In Proc. Interspeech, pages 2063-2067, Singapore, September 2014. [ bib | .pdf ]
In order to predict which words in a sentence are harder to understand in noise it is necessary to consider not only audibility but also semantic or linguistic information. This paper focuses on using linguistic predictability to inform an intelligibility enhancement method that uses Lombard-adapted synthetic speech to modify low predictable words in Speech Perception in Noise (SPIN) test sentences. Word intelligibility in the presence of speech-shaped noise was measured using plain, Lombard and a combination of the two synthetic voices. The findings show that the Lombard voice increases intelligibility in noise but the intelligibility gap between words in a high and low predictable context still remains. Using a Lombard voice when a word is unpredictable is a good strategy, but if a word is predictable from its context the Lombard benefit only occurs when other words in the sentence are also modified.
Antti Suni, Tuomo Raitio, Dhananjaya Gowda, Reima Karhila, Matt Gibson, and Oliver Watts. The Simple4All entry to the Blizzard Challenge 2014. In Proc. Blizzard Challenge 2014, September 2014. [ bib | .pdf ]
We describe the synthetic voices entered into the 2014 Blizzard Challenge by the SIMPLE4ALL consortium. The 2014 Blizzard Challenge presents an opportunity to test and benchmark some of the tools we have been developing to address the problem of how to produce systems in arbitrary new languages with minimal annotated data and language-specific expertise on the part of the system builders. We here explain how our tools were used to address these problems on the different tasks of the challenge, and provide some discussion of the evaluation results. Several additions to the system used to build voices for the previous Challenge are described: naive alphabetisation, unsupervised syllabification, and glottal flow pulse prediction using deep neural networks.
Thomas Merritt, Tuomo Raitio, and Simon King. Investigating source and filter contributions, and their interaction, to statistical parametric speech synthesis. In Proc. Interspeech, pages 1509-1513, Singapore, September 2014. [ bib | .pdf ]
This paper presents an investigation of the separate perceptual degradations introduced by the modelling of source and filter features in statistical parametric speech synthesis. This is achieved using stimuli in which various permutations of natural, vocoded and modelled source and filter are combined, optionally with the addition of filter modifications (e.g. global variance or modulation spectrum scaling). We also examine the assumption of independence between source and filter parameters. Two complementary perceptual testing paradigms are adopted. In the first, we ask listeners to perform “same or different quality” judgements between pairs of stimuli from different configurations. In the second, we ask listeners to give an opinion score for individual stimuli. Combining the findings from these tests, we draw some conclusions regarding the relative contributions of source and filter to the currently rather limited naturalness of statistical parametric synthetic speech, and test whether current independence assumptions are justified.
Qiong Hu, Yannis Stylianou, Ranniery Maia, Korin Richmond, Junichi Yamagishi, and Javier Latorre. An investigation of the application of dynamic sinusoidal models to statistical parametric speech synthesis. In Proc. Interspeech, pages 780-784, Singapore, September 2014. [ bib | .pdf ]
This paper applies a dynamic sinusoidal synthesis model to statistical parametric speech synthesis (HTS). For this, we utilise regularised cepstral coefficients to represent both the static amplitude and dynamic slope of selected sinusoids for statistical modelling. During synthesis, a dynamic sinusoidal model is used to reconstruct speech. A preference test is conducted to compare the selection of different sinusoids for cepstral representation. Our results show that when integrated with HTS, a relatively small number of sinusoids selected according to a perceptual criterion can produce quality comparable to using all harmonics. A Mean Opinion Score (MOS) test shows that our proposed statistical system is preferred to one using mel-cepstra from pitch synchronous spectral analysis.
L.-H. Chen, T. Raitio, C. Valentini-Botinhao, J. Yamagishi, and Z.-H. Ling. DNN-Based Stochastic Postfilter for HMM-Based Speech Synthesis. In Proc. Interspeech, pages 1954-1958, Singapore, September 2014. [ bib | .pdf ]
In this paper we propose a deep neural network to model the conditional probability of the spectral differences between natural and synthetic speech. This allows us to reconstruct the spectral fine structures in speech generated by HMMs. We compared the new stochastic data-driven postfilter with global variance based parameter generation and modulation spectrum enhancement. Our results confirm that the proposed method significantly improves the segmental quality of synthetic speech compared to the conventional methods.
C. Valentini-Botinhao, M. Toman, M. Pucher, D. Schabus, and J. Yamagishi. Intelligibility Analysis of Fast Synthesized Speech. In Proc. Interspeech, pages 2922-2926, Singapore, September 2014. [ bib | .pdf ]
In this paper we analyse the effect of speech corpus and compression method on the intelligibility of synthesized speech at fast rates. We recorded English and German language voice talents at a normal and a fast speaking rate and trained an HSMM-based synthesis system based on the normal and the fast data of each speaker. We compared three compression methods: scaling the variance of the state duration model, interpolating the duration models of the fast and the normal voices, and applying a linear compression method to generated speech. Word recognition results for the English voices show that generating speech at normal speaking rate and then applying linear compression resulted in the most intelligible speech at all tested rates. A similar result was found when evaluating the intelligibility of the natural speech corpus. For the German voices, interpolation was found to be better at moderate speaking rates but the linear method was again more successful at very high rates, for both blind and sighted participants. These results indicate that using fast speech data does not necessarily create more intelligible voices and that linear compression can more reliably provide higher intelligibility, particularly at higher rates.
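The three compression methods compared in this abstract can be sketched at the level of per-state durations. This is an illustrative simplification under stated assumptions: the function names are hypothetical, the variance-based rule `d = mean + rho * variance` is assumed as the form of the duration control, and the linear method is shown on durations as a stand-in for time-scale compression of the generated waveform.

```python
def variance_scaled(means, variances, rho):
    """Shift each state duration by rho times its variance.

    Assumption: d = mean + rho * variance, with rho < 0 giving faster
    speech. Durations are floored at one frame.
    """
    return [max(1.0, m + rho * v) for m, v in zip(means, variances)]

def interpolated(normal_means, fast_means, alpha):
    """Linearly interpolate duration means.

    alpha = 0 gives the normal-rate voice, alpha = 1 the fast voice.
    """
    return [(1.0 - alpha) * n + alpha * f
            for n, f in zip(normal_means, fast_means)]

def linearly_compressed(durations, rate):
    """Uniformly shorten durations by a speaking-rate factor.

    rate = 2.0 halves every duration.
    """
    return [d / rate for d in durations]
```

The linear method treats every segment identically, whereas the two model-based methods can shorten some states more than others.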
Siva Reddy Gangireddy, Fergus McInnes, and Steve Renals. Feed forward pre-training for recurrent neural network language models. In Proc. Interspeech, pages 2620-2624, September 2014. [ bib | .pdf ]
The recurrent neural network language model (RNNLM) has been demonstrated to consistently reduce perplexities and automatic speech recognition (ASR) word error rates across a variety of domains. In this paper we propose a pre-training method for the RNNLM, by sharing the output weights of the feed forward neural network language model (NNLM) with the RNNLM. This is accomplished by first fine-tuning the weights of the NNLM, which are then used to initialise the output weights of an RNNLM with the same number of hidden units. We have carried out text-based experiments on the Penn Treebank Wall Street Journal data, and ASR experiments on the TED talks data used in the International Workshop on Spoken Language Translation (IWSLT) evaluation campaigns. Across the experiments, we observe small improvements in perplexity and ASR word error rate.
Mark Sinclair, Peter Bell, Alexandra Birch, and Fergus McInnes. A semi-Markov model for speech segmentation with an utterance-break prior. In Proc. Interspeech, September 2014. [ bib | .pdf ]
Speech segmentation is the problem of finding the end points of a speech utterance for passing to an automatic speech recognition (ASR) system. The quality of this segmentation can have a large impact on the accuracy of the ASR system; in this paper we demonstrate that it can have an even larger impact on downstream natural language processing tasks – in this case, machine translation. We develop a novel semi-Markov model which allows the segmentation of audio streams into speech utterances which are optimised for the desired distribution of sentence lengths for the target domain. We compare this with existing state-of-the-art methods and show that it is able to achieve not only improved ASR performance, but also to yield significant benefits to a speech translation task.
Gustav Eje Henter, Thomas Merritt, Matt Shannon, Catherine Mayo, and Simon King. Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech. In Proc. Interspeech, volume 15, pages 1504-1508, September 2014. [ bib | .pdf ]
Acoustic models used for statistical parametric speech synthesis typically incorporate many modelling assumptions. It is an open question to what extent these assumptions limit the naturalness of synthesised speech. To investigate this question, we recorded a speech corpus where each prompt was read aloud multiple times. By combining speech parameter trajectories extracted from different repetitions, we were able to quantify the perceptual effects of certain commonly used modelling assumptions. Subjective listening tests show that taking the source and filter parameters to be conditionally independent, or using diagonal covariance matrices, significantly limits the naturalness that can be achieved. Our experimental results also demonstrate the shortcomings of mean-based parameter generation.
Keywords: speech synthesis, acoustic modelling, stream independence, diagonal covariance matrices, repeated speech
Matthew Aylett, Rasmus Dall, Arnab Ghoshal, Gustav Eje Henter, and Thomas Merritt. A flexible front-end for HTS. In Proc. Interspeech, pages 1283-1287, September 2014. [ bib | .pdf ]
Parametric speech synthesis techniques depend on full context acoustic models generated by language front-ends, which analyse linguistic and phonetic structure. HTS, the leading parametric synthesis system, can use a number of different front-ends to generate full context models for synthesis and training. In this paper we explore the use of a new text processing front-end that has been added to the speech recognition toolkit Kaldi as part of an ongoing project to produce a new parametric speech synthesis system, Idlak. The use of XML specification files, a modular design, and modern coding and testing approaches, make the Idlak front-end ideal for adding, altering and experimenting with the contexts used in full context acoustic models. The Idlak front-end was evaluated against the standard Festival front-end in the HTS system. Results from the Idlak front-end compare well with the more mature Festival front-end (Idlak - 2.83 MOS vs Festival - 2.85 MOS), although a slight reduction in naturalness perceived by non-native English speakers can be attributed to Festival’s insertion of non-punctuated pauses.
Wei Zhang, Robert A. J. Clark, and Yongyuan Wang. Unsupervised language filtering using the latent Dirichlet allocation. In Proc. Interspeech, pages 1268-1272, September 2014. [ bib | .pdf ]
To automatically build from scratch the language processing component for a speech synthesis system in a new language, a purified text corpus is needed in which any words and phrases from other languages are clearly identified or excluded. When using found data, and where there is no inherent linguistic knowledge of the language or languages contained in the data, identifying the pure data is a difficult problem. We propose an unsupervised language identification approach based on Latent Dirichlet Allocation where we take the raw n-gram count as features without any smoothing, pruning or interpolation. The Latent Dirichlet Allocation topic model is reformulated for the language identification task and Collapsed Gibbs Sampling is used to train an unsupervised language identification model. We show that such a model is highly capable of identifying the primary language in a corpus and filtering out other languages present.
Susana Palmaz López-Peláez and Robert A. J. Clark. Speech synthesis reactive to dynamic noise environmental conditions. In Proc. Interspeech, pages 2927-2931, September 2014. [ bib | .pdf ]
This paper addresses the issue of generating synthetic speech in changing noise conditions. We will investigate the potential improvements that can be introduced by using a speech synthesiser that is able to modulate between a normal speech style and a speech style produced in a noisy environment according to a changing level of noise. We demonstrate that an adaptive system where the speech style is changed to suit the noise conditions maintains intelligibility and improves naturalness compared to traditional systems.
Philip N Garner, Rob Clark, Jean-Philippe Goldman, Pierre-Edouard Honnet, Maria Ivanova, Alexandros Lazaridis, Hui Liang, Beat Pfister, Manuel Sam Ribeiro, Eric Wehrli, et al. Translation and prosody in Swiss languages. In Nouveaux cahiers de linguistique française, 31. 3rd Swiss Workshop on Prosody, Geneva, Switzerland, September 2014. [ bib | .pdf ]
The SIWIS project aims to investigate spoken language translation, where both the speaker characteristics and prosody are translated. This means the translation carries not only spoken content, but also speaker identification, emotion and intent. We describe the background of the project, and present some initial approaches and results. These include the design and collection of a Swiss bilingual database that both enables research in Swiss accented speech processing, and facilitates reliable evaluation.
Nicholas W D Evans, Tomi Kinnunen, Junichi Yamagishi, Zhizheng Wu, Federico Alegre, and Phillip De Leon. Speaker recognition anti-spoofing. In S. Marcel, S. Li, and M. Nixon, editors, Handbook of Biometric Anti-spoofing. Springer, June 2014. [ bib | DOI | .pdf ]
Progress in the development of spoofing countermeasures for automatic speaker recognition is less advanced than equivalent work related to other biometric modalities. This chapter outlines the potential for even state-of-the-art automatic speaker recognition systems to be spoofed. While the use of a multitude of different datasets, protocols and metrics complicates the meaningful comparison of different vulnerabilities, we review previous work related to impersonation, replay, speech synthesis and voice conversion spoofing attacks. The article also presents an analysis of the early work to develop spoofing countermeasures. The literature shows that there is significant potential for automatic speaker verification systems to be spoofed, that significant further work is required to develop generalised countermeasures, that there is a need for standard datasets, evaluation protocols and metrics and that greater emphasis should be placed on text-dependent scenarios.
Atef Ben Youssef, Hiroshi Shimodaira, and David Braude. Speech driven talking head from estimated articulatory features. In Proc. ICASSP, pages 4606-4610, Florence, Italy, May 2014. [ bib | .pdf ]
In this paper, we present a talking head in which the lips and head motion are controlled using articulatory movements estimated from speech. A phone-size HMM-based inversion mapping is employed and trained in a semi-supervised fashion. The advantage of using articulatory features is that they can drive the lip motions and they have a close link with head movements. Speech inversion normally requires training data recorded with an electromagnetic articulograph (EMA), which restricts the naturalness of head movements. The present study considers a more realistic recording condition where the training data for the target speaker are recorded with a usual motion capture system rather than EMA. Different temporal clustering techniques are investigated for HMM-based mapping as well as a GMM-based frame-wise mapping as a baseline system. Objective and subjective experiments show that the synthesised motions are more natural using an HMM system than a GMM one, and estimated EMA features outperform prosodic features.
Mirjam Wester and Cassie Mayo. Accent rating by native and non-native listeners. In Proceedings of ICASSP, pages 7749-7753, Florence, Italy, May 2014. [ bib | .pdf ]
This study investigates the influence of listener native language with respect to talker native language on perception of degree of foreign accent in English. Listeners from native English, Finnish, German and Mandarin backgrounds rated the accentedness of native English, Finnish, German and Mandarin talkers producing a controlled set of English sentences. Results indicate that non-native listeners, like native listeners, are able to classify non-native talkers as foreign-accented, and native talkers as unaccented. However, while non-native talkers received higher accentedness ratings than native talkers from all listener groups, non-native listeners judged talkers with non-native accents less harshly than did native English listeners. Similarly, non-native listeners assigned higher degrees of foreign accent to native English talkers than did native English listeners. It seems that non-native listeners give accentedness ratings that are less extreme, or closer to the centre of the rating scale in both directions, than those used by native listeners.
Tiberiu Boroș, Adriana Stan, Oliver Watts, and Stefan Daniel Dumitrescu. RSS-TOBI - a prosodically enhanced Romanian speech corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May 2014. [ bib | .pdf ]
This paper introduces a recent development of a Romanian speech corpus to include prosodic annotations of the speech data in the form of ToBI labels. We describe the methodology of determining the required pitch patterns that are common for the Romanian language, annotate the speech resource, and then provide a comparison of two text-to-speech synthesis systems to establish the benefits of adding this type of information to our speech resource. The result is a publicly available speech dataset which can be used to further develop speech synthesis systems or to automatically learn the prediction of ToBI labels from text in the Romanian language.
Oliver Watts, Siva Gangireddy, Junichi Yamagishi, Simon King, Steve Renals, Adriana Stan, and Mircea Giurgiu. Neural net word representations for phrase-break prediction without a part of speech tagger. In Proc. ICASSP, pages 2618-2622, Florence, Italy, May 2014. [ bib | .pdf ]
The use of shared projection neural nets of the sort used in language modelling is proposed as a way of sharing parameters between multiple text-to-speech system components. We experiment with pretraining the weights of such a shared projection on an auxiliary language modelling task and then apply the resulting word representations to the task of phrase-break prediction. Doing so allows us to build phrase-break predictors that rival conventional systems without any reliance on conventional knowledge-based resources such as part of speech taggers.
Rasmus Dall, Junichi Yamagishi, and Simon King. Rating naturalness in speech synthesis: The effect of style and expectation. In Proc. Speech Prosody, May 2014. [ bib | .pdf ]
In this paper we present evidence that speech produced spontaneously in a conversation is considered more natural than read prompts. We also explore the relationship between participants' expectations of the speech style under evaluation and their actual ratings. In successive listening tests subjects rated the naturalness of either spontaneously produced, read aloud or written sentences, with instructions toward either conversational, reading or general naturalness. It was found that, when presented with spontaneous or read aloud speech, participants consistently rated spontaneous speech more natural, even when asked to rate naturalness in the reading case. Presented with only text, participants generally preferred transcriptions of spontaneous utterances, except when asked to evaluate naturalness in terms of reading aloud. This has implications for the application of MOS-scale naturalness ratings in speech synthesis, and potentially for the type of data suitable for use in general TTS, in dialogue systems, and specifically in conversational TTS, in which the goal is to reproduce speech as it is produced in a spontaneous conversational setting.
Qiong Hu, Yannis Stylianou, Korin Richmond, Ranniery Maia, Junichi Yamagishi, and Javier Latorre. A fixed dimension and perceptually based dynamic sinusoidal model of speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 6311-6315, Florence, Italy, May 2014. [ bib | .pdf ]
This paper presents a fixed- and low-dimensional, perceptually based dynamic sinusoidal model of speech referred to as PDM (Perceptual Dynamic Model). To decrease and fix the number of sinusoidal components typically used in the standard sinusoidal model, we propose to use only one dynamic sinusoidal component per critical band. For each band, the sinusoid with the maximum spectral amplitude is selected and associated with the centre frequency of that critical band. The model is expanded at low frequencies by incorporating sinusoids at the boundaries of the corresponding bands while at the higher frequencies a modulated noise component is used. A listening test is conducted to compare speech reconstructed with PDM and state-of-the-art models of speech, where all models are constrained to use an equal number of parameters. The results show that PDM is clearly preferred in terms of quality over the other systems.
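The per-band selection step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the band edges in the test data, and the use of the arithmetic midpoint as the band's centre frequency are assumptions, and the low-frequency boundary sinusoids and the modulated noise component are not modelled.

```python
def select_per_band(peaks, band_edges):
    """Keep one sinusoid per band.

    peaks:      list of (frequency_hz, amplitude) pairs.
    band_edges: ascending list of band edges in Hz; band i spans
                [band_edges[i], band_edges[i + 1]).

    For each band, the sinusoid with the maximum amplitude is selected
    and re-assigned to the band's centre frequency (assumed here to be
    the arithmetic midpoint). Bands with no peak are skipped.
    """
    selected = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        in_band = [(f, a) for f, a in peaks if lo <= f < hi]
        if not in_band:
            continue
        _, amp = max(in_band, key=lambda p: p[1])
        selected.append(((lo + hi) / 2.0, amp))
    return selected
```

Because the output has at most one component per band, its dimension is bounded by the number of bands regardless of how many harmonics the frame contains.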
L. Saheer, J. Yamagishi, P.N. Garner, and J. Dines. Combining vocal tract length normalization with hierarchical linear transformations. Selected Topics in Signal Processing, IEEE Journal of, 8(2):262-272, April 2014. [ bib | DOI ]
Keywords: Bayes methods;regression analysis;speaker recognition;speech synthesis;ASR system;CSMAPLR adaptation;MLLR-based adaptation techniques;TTS synthesis;VTLN;age information;automatic speech recognition system;combination techniques;constrained structural maximum a posteriori linear regression adaptation;gender information;hierarchical Bayesian framework;hierarchical linear transformations;mismatched conditions;speaker similarity;speaker specific characteristics;statistical parametric speech synthesis;text-to-speech synthesis;vocal tract length normalization;Adaptation models;Estimation;Hidden Markov models;Regression tree analysis;Speech;Speech synthesis;Transforms;Constrained structural maximum a posteriori linear regression;hidden Markov models;speaker adaptation;statistical parametric speech synthesis;vocal tract length normalization
J.P. Cabral, K. Richmond, J. Yamagishi, and S. Renals. Glottal spectral separation for speech synthesis. Selected Topics in Signal Processing, IEEE Journal of, 8(2):195-208, April 2014. [ bib | DOI | .pdf ]
This paper proposes an analysis method to separate the glottal source and vocal tract components of speech that is called Glottal Spectral Separation (GSS). This method can produce high-quality synthetic speech using an acoustic glottal source model. In the source-filter models commonly used in speech technology applications it is assumed the source is a spectrally flat excitation signal and the vocal tract filter can be represented by the spectral envelope of speech. Although this model can produce high-quality speech, it has limitations for voice transformation because it does not allow control over glottal parameters which are correlated with voice quality. The main problem with using a speech model that better represents the glottal source and the vocal tract filter is that current analysis methods for separating these components are not robust enough to produce the same speech quality as using a model based on the spectral envelope of speech. The proposed GSS method is an attempt to overcome this problem, and consists of the following three steps. Initially, the glottal source signal is estimated from the speech signal. Then, the speech spectrum is divided by the spectral envelope of the glottal source signal in order to remove the glottal source effects from the speech signal. Finally, the vocal tract transfer function is obtained by computing the spectral envelope of the resulting signal. In this work, the glottal source signal is represented using the Liljencrants-Fant model (LF-model). The experiments we present here show that the analysis-synthesis technique based on GSS can produce speech comparable to that of a high-quality vocoder that is based on the spectral envelope representation. However, it also permits control over voice qualities, namely transforming a modal voice into a breathy or tense one, by modifying the glottal parameters.
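The second and third GSS steps described in this abstract can be sketched on magnitude spectra. This is an illustration under stated assumptions rather than the published method: the glottal source spectrum is taken as given (step one, the LF-model fit, is not shown), the function names are hypothetical, and a moving average stands in for whatever spectral envelope estimator the authors use.

```python
def smooth_envelope(mag, width=3):
    """Stand-in envelope estimator: centred moving average over `width` bins.

    Assumption: a real system would use a proper spectral envelope
    estimator; this is only a placeholder.
    """
    half = width // 2
    out = []
    for i in range(len(mag)):
        window = mag[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def gss_vocal_tract(speech_mag, glottal_mag, floor=1e-8):
    """Steps two and three of the description above.

    Divide the speech magnitude spectrum by the envelope of the glottal
    source spectrum, then take the envelope of the result as the vocal
    tract response. `floor` guards against division by zero.
    """
    glottal_env = smooth_envelope(glottal_mag)
    residual = [s / max(g, floor) for s, g in zip(speech_mag, glottal_env)]
    return smooth_envelope(residual)
```

With a spectrally flat glottal spectrum the division only rescales the speech spectrum, which corresponds to the flat-excitation assumption of the conventional source-filter model mentioned in the abstract.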
Keywords: Analytical models;Computational modeling;Estimation;Hidden Markov models;Mathematical model;Speech;Speech synthesis;Glottal spectral separation;LF-model;parametric speech synthesis;voice quality transformation
Maria K. Wolters. The minimal effective dose of reminder technology. In Proceedings of the extended abstracts of the 32nd annual ACM conference on Human factors in computing systems - CHI EA '14, pages 771-780, New York, New York, USA, April 2014. ACM Press. [ bib | DOI | http ]
Remembering to take one's medication on time is hard work. This is true for younger people with no chronic illness as well as older people with many co-morbid conditions that require a complex medication regime. Many technological solutions have been proposed to help with this problem, but is more IT really the solution? In this paper, I argue that technological help should be limited to the minimal effective dose, which depends on the person and their living situation, and may well be zero.
Maria K. Wolters, Elaine Niven, and Robert H. Logie. The art of deleting snapshots. In Proceedings of the extended abstracts of the 32nd annual ACM conference on Human factors in computing systems - CHI EA '14, pages 2521-2526, New York, New York, USA, April 2014. ACM Press. [ bib | DOI | http ]
In this paper, we investigate why people decide to delete snapshots. 74 participants took snapshots of a street festival every three minutes for an hour and were then asked to cull pictures immediately or after a delay of a day, a week, or a month. We found that the ratio of kept to deleted pictures was fairly constant. Deletion criteria fell into six main categories that mostly involved subjective assessments such as whether a photo was sufficiently characteristic. We conclude that automatic tagging of photos for deletion is problematic; interfaces should instead make it easy for users to find and compare similar photos.
C. Valentini-Botinhao, J. Yamagishi, S. King, and R. Maia. Intelligibility enhancement of HMM-generated speech in additive noise by modifying mel cepstral coefficients to increase the glimpse proportion. Computer Speech and Language, 28(2):665-686, 2014. [ bib | DOI | .pdf ]
This paper describes speech intelligibility enhancement for hidden Markov model (HMM) generated synthetic speech in noise. We present a method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech. We update these coefficients such that the Glimpse Proportion – an objective measure of the intelligibility of speech in noise – increases, while keeping the speech energy fixed. An acoustic analysis reveals that the modified speech is boosted in the region 1-4kHz, particularly for vowels, nasals and approximants. Results from listening tests employing speech-shaped noise show that the modified speech is as intelligible as a synthetic voice trained on plain speech whose duration, Mel cepstral coefficients and excitation signal parameters have been adapted to Lombard speech from the same speaker. Our proposed method does not require these additional recordings of Lombard speech. In the presence of a competing talker, both modification and adaptation of spectral coefficients give more modest gains.
Moses Ekpenyong, Eno-Abasi Urua, Oliver Watts, Simon King, and Junichi Yamagishi. Statistical parametric speech synthesis for Ibibio. Speech Communication, 56:243-251, January 2014. [ bib | DOI | http | .pdf ]
Ibibio is a Nigerian tone language, spoken in the south-east coastal region of Nigeria. Like most African languages, it is resource-limited. This presents a major challenge to conventional approaches to speech synthesis, which typically require the training of numerous predictive models of linguistic features such as the phoneme sequence (i.e., a pronunciation dictionary plus a letter-to-sound model) and prosodic structure (e.g., a phrase break predictor). This training is invariably supervised, requiring a corpus of training data labelled with the linguistic feature to be predicted. In this paper, we investigate what can be achieved in the absence of many of these expensive resources, and also with a limited amount of speech recordings. We employ a statistical parametric method, because this has been found to offer good performance even on small corpora, and because it is able to directly learn the relationship between acoustics and whatever linguistic features are available, potentially mitigating the absence of explicit representations of intermediate linguistic layers such as prosody. We present an evaluation that compares systems that have access to varying degrees of linguistic structure. The simplest system only uses phonetic context (quinphones), and this is compared to systems with access to a richer set of context features, with or without tone marking. It is found that the use of tone marking contributes significantly to the quality of synthetic speech. Future work should therefore address the problem of tone assignment using a dictionary and the building of a prediction module for out-of-vocabulary words.
Liang Lu, Arnab Ghoshal, and Steve Renals. Cross-lingual subspace Gaussian mixture model for low-resource speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 22(1):17-27, 2014. [ bib | DOI | .pdf ]
This paper studies cross-lingual acoustic modelling in the context of subspace Gaussian mixture models (SGMMs). SGMMs factorize the acoustic model parameters into a set that is globally shared between all the states of a hidden Markov model (HMM) and another that is specific to the HMM states. We demonstrate that the SGMM global parameters are transferable between languages, particularly when the parameters are trained multilingually. As a result, acoustic models may be trained using limited amounts of transcribed audio by borrowing the SGMM global parameters from one or more source languages, and only training the state-specific parameters on the target language audio. Model regularization using a 1-norm penalty is shown to be particularly effective at avoiding overtraining and leading to lower word error rates. We investigate maximum a posteriori (MAP) adaptation of subspace parameters in order to reduce the mismatch between the SGMM global parameters of the source and target languages. In addition, monolingual and cross-lingual speaker adaptive training is used to reduce the model variance introduced by speakers. We have systematically evaluated these techniques by experiments on the GlobalPhone corpus.
Johanna D. Moore, Leimin Tian, and Catherine Lai. Word-level emotion recognition using high-level features. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 8404 of Lecture Notes in Computer Science, pages 17-31. Springer Berlin Heidelberg, 2014. [ bib | DOI | .pdf ]
In this paper, we investigate the use of high-level features for recognizing human emotions at the word-level in natural conversations with virtual agents. Experiments were carried out on the 2012 Audio/Visual Emotion Challenge (AVEC2012) database, where emotions are defined as vectors in the Arousal-Expectancy-Power-Valence emotional space. Our model using 6 novel disfluency features yields significant improvements compared to those using a large number of low-level spectral and prosodic features, and the overall performance difference between it and the best model of the AVEC2012 Word-Level Sub-Challenge is not significant. Our visual model using the Active Shape Model visual features also yields significant improvements compared to models using the low-level Local Binary Patterns visual features. We built a bimodal model by combining our disfluency and visual feature sets and applying Correlation-based Feature-subset Selection. Considering overall performance on all emotion dimensions, our bimodal model outperforms the second best model of the challenge, and comes close to the best model. It also gives the best result when predicting Expectancy values.
Catherine Lai. Interpreting final rises: Task and role factors. In Proceedings of Speech Prosody 7, Dublin, Ireland, 2014. [ bib | .pdf ]
This paper examines the distribution of utterance final pitch rises in dialogues with different task structures. More specifically, we examine map-task and topical conversation dialogues of Southern Standard British English speakers in the IViE corpus. Overall, we find that the map-task dialogues contain more rising features, where these mainly arise from instructions and affirmatives. While rise features were somewhat predictive of turn-changes, these effects were swamped by task and role effects. Final rises were not predictive of affirmative responses. These findings indicate that while rises can be interpreted as indicating some sort of contingency, it is with respect to the higher level discourse structure rather than the specific utterance bearing the rise. We explore the relationship between rises and the need for co-ordination in dialogue, and hypothesize that the more speakers have to co-ordinate in a dialogue, the more rising features we will see on non-question utterances. In general, these sorts of contextual conditions need to be taken into account when we collect and analyze intonational data, and when we link them to speaker states such as uncertainty or submissiveness.
P. Lanchantin, M. J. F. Gales, S. King, and J. Yamagishi. Multiple-average-voice-based speech synthesis. In Proc. ICASSP, 2014. [ bib ]
This paper describes a novel approach for the speaker adaptation of statistical parametric speech synthesis systems based on the interpolation of a set of average voice models (AVM). Recent results have shown that the quality/naturalness of adapted voices directly depends on the distance from the average voice model that the speaker adaptation starts from. This suggests the use of several AVMs trained on carefully chosen speaker clusters from which a more suitable AVM can be selected/interpolated during the adaptation. In the proposed approach, a multiple-AVM is trained on clusters of speakers, iteratively re-assigned during the estimation process initialised according to metadata. In contrast with the cluster adaptive training (CAT) framework, the training stage is computationally less expensive as the amount of training data and clusters gets larger. Additionally, during adaptation, each AVM constituting the multiple-AVM is first adapted towards the speaker, which suggests a better tuning to the individual speaker of the space in which the interpolation takes place. It is shown via experiments, run on a corpus of British speakers with various regional accents, that the quality/naturalness of synthetic speech of adapted voices is significantly higher than when considering a single factor-independent AVM selected according to the target speaker characteristics.
David Abelman and Robert Clark. Altering speech synthesis prosody through real time natural gestural control. In Proc. Speech Prosody 2014, Dublin, Ireland, 2014. [ bib | .pdf ]
This paper investigates the usage of natural gestural controls to alter synthesised speech prosody in real time (for example, recognising a one-handed beat as a cue to emphasise a certain word in a synthesised sentence). A user’s gestures are recognised using a Microsoft Kinect sensor, and synthesised speech prosody is altered through a series of hand-crafted rules running through a modified HTS engine (pHTS, developed at Universite de Mons). Two sets of preliminary experiments are carried out. Firstly, it is shown that users can control the device to a moderate level of accuracy, though this is projected to improve further as the system is refined. Secondly, it is shown that the prosody of the altered output is significantly preferred to that of the baseline pHTS synthesis. Future work is recommended to focus on learning gestural and prosodic rules from data, and in using an updated version of the underlying pHTS engine. The reader is encouraged to watch a short video demonstration of the work at http://tinyurl.com/gesture-prosody.
P. Swietojanski, J. Li, and J-T Huang. Investigation of maxout networks for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014. [ bib | .pdf ]
We explore the use of maxout neurons in various aspects of acoustic modelling for large vocabulary speech recognition systems, including low-resource scenarios and multilingual knowledge transfer. Through experiments on voice search and short message dictation datasets, we found that maxout networks are around three times faster to train and offer lower or comparable word error rates on several tasks, when compared to networks with logistic nonlinearity. We also present a detailed study of the maxout unit's internal behaviour, suggesting the use of different nonlinearities in different layers.
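As an illustrative aside, the maxout unit that the abstract above compares with logistic units can be sketched in a few lines. This is a minimal sketch of the general maxout definition (the maximum over several affine pieces), not the configuration used in the paper; the function names, the number of pieces, and the toy weights are assumptions made for illustration only.

```python
import math


def affine(x, w, b):
    """Return the dot product w . x plus bias b."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def maxout(x, weights, biases):
    """Maxout unit: the maximum over k affine pieces of the same input.

    `weights` is a list of k weight vectors and `biases` a list of k biases
    (k is a free design choice here, not a value taken from the paper).
    """
    return max(affine(x, w, b) for w, b in zip(weights, biases))


def logistic(x, w, b):
    """Logistic (sigmoid) unit, shown only for comparison."""
    return 1.0 / (1.0 + math.exp(-affine(x, w, b)))


if __name__ == "__main__":
    x = [1.0, -2.0]
    # Two hypothetical affine pieces; the unit outputs the larger one.
    print(maxout(x, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.5]))
    print(logistic(x, [1.0, 0.0], 0.0))
```

With one piece fixed at zero, a two-piece maxout unit reduces to a rectified linear unit, which is one way to see how the unit generalises other piecewise-linear activations.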
S. Renals and P. Swietojanski. Neural networks for distant speech recognition. In The 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), 2014. [ bib | .pdf ]
Distant conversational speech recognition is challenging owing to the presence of multiple, overlapping talkers, additional non-speech acoustic sources, and the effects of reverberation. In this paper we review work on distant speech recognition, with an emphasis on approaches which combine multichannel signal processing with acoustic modelling, and investigate the use of hybrid neural network / hidden Markov model acoustic models for distant speech recognition of meetings recorded using microphone arrays. In particular we investigate the use of convolutional and fully-connected neural networks with different activation functions (sigmoid, rectified linear, and maxout). We performed experiments on the AMI and ICSI meeting corpora, with results indicating that neural network models are capable of significant improvements in accuracy compared with discriminatively trained Gaussian mixture models.
R. Makowski, P. Swietojanski, and R. Wielgat. Automatyczne rozpoznawanie mowy. In T. Zielinski, P. Korohoda, and R. Rumian, editors, Cyfrowe Przetwarzanie Sygnalow w Telekomunikacji. Podstawy, multimedia, transmisja. Wydawnictwo Naukowe PWN - Polish Scientific Publishers PWN, Warszawa, 2014. [ bib | http ]
The book discusses methods for the analysis and processing of digital signals. It makes a breakneck transition from the fundamentals of digital signal processing to the latest fourth-generation LTE technology.
Liang Lu and Steve Renals. Probabilistic linear discriminant analysis for acoustic modelling. IEEE Signal Processing Letters, 21(6):702-706, 2014. [ bib | DOI | .pdf ]
In this letter, we propose a new acoustic modelling approach for automatic speech recognition based on probabilistic linear discriminant analysis (PLDA), which is used to model the state density function for the standard hidden Markov models (HMMs). Unlike the conventional Gaussian mixture models (GMMs), where the correlations are weakly modelled by using diagonal covariance matrices, PLDA captures the correlations of the feature vector in subspaces without vastly expanding the model. It also allows the use of high dimensional feature input, and is therefore more flexible in making use of different types of acoustic features. We performed preliminary experiments on the Switchboard corpus, and demonstrated the feasibility of this acoustic model.
Rasmus Dall, Mirjam Wester, and Martin Corley. The effect of filled pauses and speaking rate on speech comprehension in natural, vocoded and synthetic speech. In Proc. Interspeech, 2014. [ bib | .pdf ]
It has been shown that in natural speech filled pauses can be beneficial to a listener. In this paper, we attempt to discover whether listeners react in a similar way to filled pauses in synthetic and vocoded speech compared to natural speech. We present two experiments focusing on reaction time to a target word. In the first, we replicate earlier work in natural speech, namely that listeners respond faster to a target word following a filled pause than following a silent pause. This is replicated in vocoded but not in synthetic speech. Our second experiment investigates the effect of speaking rate on reaction times as this was potentially a confounding factor in the first experiment. Evidence suggests that slower speech rates lead to slower reaction times in synthetic and in natural speech. Moreover, in synthetic speech the response to a target word after a filled pause is slower than after a silent pause. This finding, combined with an overall slower reaction time, demonstrates a shortfall in current synthesis techniques. Remedying this could help make synthesis less demanding and more pleasant for the listener, and reaction time experiments could thus provide a measure of improvement in synthesis techniques.
Mirjam Wester, M. Luisa Garcia Lecumberri, and Martin Cooke. DIAPIX-FL: A symmetric corpus of problem-solving dialogues in first and second languages. In Proc. Interspeech, 2014. [ bib | .pdf ]
This paper describes a corpus of conversations recorded using an extension of the DiapixUK task: the Diapix Foreign Language corpus (DIAPIX-FL). English and Spanish native talkers were recorded speaking both English and Spanish. The bidirectionality of the corpus makes it possible to separate language (English or Spanish) from speaking in a first language (L1) or second language (L2). An acoustic analysis was carried out to analyse changes in F0, voicing, intensity, spectral tilt and formants that might result from speaking in an L2. The effect of L1 and nativeness on turn types was also studied. Factors that were investigated were pausing, elongations, and incomplete words. Speakers displayed certain patterns that suggest an on-going process of L2 phonological acquisition, such as the overall percentage of voicing in their speech. Results also show an increase in hesitation phenomena (pauses, elongations, incomplete turns), a decrease in produced speech and speech rate, a reduction of F0 range, and a raising of minimum F0 when speaking in the non-native language, which are consistent with more tentative speech and may be used as indicators of non-nativeness.
Rasmus Dall, Marcus Tomalin, Mirjam Wester, William Byrne, and Simon King. Investigating automatic & human filled pause insertion for speech synthesis. In Proc. Interspeech, 2014. [ bib | .pdf ]
Filled pauses are pervasive in conversational speech and have been shown to serve several psychological and structural purposes. Despite this, they are seldom modelled overtly by state-of-the-art speech synthesis systems. This paper seeks to motivate the incorporation of filled pauses into speech synthesis systems by exploring their use in conversational speech, and by comparing the performance of several automatic systems inserting filled pauses into fluent text. Two initial experiments are described which seek to determine whether people's predicted insertion points are consistent with actual practice and/or with each other. The experiments also investigate whether there are 'right' and 'wrong' places to insert filled pauses. The results show good consistency between people's predictions of usage and their actual practice, as well as a perceptual preference for the 'right' placement. The third experiment contrasts the performance of several automatic systems that insert filled pauses into fluent sentences. The best performance (determined by F-score) was achieved through the by-word interpolation of probabilities predicted by Recurrent Neural Network and 4-gram Language Models. The results offer insights into the use and perception of filled pauses by humans, and how automatic systems can be used to predict their locations.
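For readers unfamiliar with by-word interpolation of two models' probabilities, the idea mentioned in the abstract above can be sketched as follows. The abstract does not give the exact formulation, so the linear form, the weight `lam`, the decision threshold, and all function names and toy numbers below are assumptions for illustration, not the authors' method.

```python
def interpolate(p_rnn, p_ngram, lam=0.5):
    """Linearly interpolate two probabilities for the same word position.

    `lam` is a hypothetical interpolation weight in [0, 1].
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_rnn + (1.0 - lam) * p_ngram


def insertion_points(p_rnn_seq, p_ngram_seq, lam=0.5, threshold=0.5):
    """Return word positions whose interpolated filled-pause probability
    meets a (hypothetical) decision threshold."""
    return [
        i
        for i, (p_r, p_n) in enumerate(zip(p_rnn_seq, p_ngram_seq))
        if interpolate(p_r, p_n, lam) >= threshold
    ]


if __name__ == "__main__":
    # Toy per-word probabilities that a filled pause follows each word.
    rnn_probs = [0.9, 0.1, 0.6]
    ngram_probs = [0.7, 0.2, 0.2]
    print(insertion_points(rnn_probs, ngram_probs))
```

In practice the weight would be tuned on held-out data; the fixed value here only keeps the sketch short.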
Catherine Lai and Steve Renals. Incorporating lexical and prosodic information at different levels for meeting summarization. In Proc. Interspeech 2014, 2014. [ bib | .pdf ]
This paper investigates how prosodic features can be used to augment lexical features for meeting summarization. Automatic detection of summary-worthy content using non-lexical features, like prosody, has generally focused on features calculated over dialogue acts. However, a salient role of prosody is to distinguish important words within utterances. To examine whether including more fine-grained prosodic information can help extractive summarization, we perform experiments incorporating lexical and prosodic features at different levels. For the ICSI and AMI meeting corpora, we find that combining prosodic and lexical features at a lower level has better AUROC performance than adding in prosodic features derived over dialogue acts. ROUGE F-scores also show the same pattern for the ICSI data. However, the differences are less clear for the AMI data, where the range of scores is much more compressed. In order to understand the relationship between the generated summaries and differences in standard measures, we look at the distribution of extracted content over the meeting as well as summary redundancy. We find that summaries based on dialogue act level prosody better reflect the amount of human annotated summary content in meeting segments, while summaries derived from prosodically augmented lexical features exhibit less redundancy.
Liang Lu and Steve Renals. Probabilistic linear discriminant analysis with bottleneck features for speech recognition. In Proc. Interspeech, 2014. [ bib | .pdf ]
We have recently proposed a new acoustic model based on probabilistic linear discriminant analysis (PLDA) which enjoys the flexibility of using higher dimensional acoustic features, and is better able to capture the intra-frame feature correlations. In this paper, we investigate the use of bottleneck features obtained from a deep neural network (DNN) for the PLDA-based acoustic model. Experiments were performed on the Switchboard dataset - a large vocabulary conversational telephone speech corpus. We observe significant word error reduction by using the bottleneck features. In addition, we have also compared the PLDA-based acoustic model to three others using Gaussian mixture models (GMMs), subspace GMMs and hybrid deep neural networks (DNNs), and PLDA can achieve comparable or slightly higher recognition accuracy in our experiments.
P. Bell, J. Driesen, and S. Renals. Cross-lingual adaptation with multi-task adaptive networks. In Proc. Interspeech, 2014. [ bib | .pdf ]
Posterior-based or bottleneck features derived from neural networks trained on out-of-domain data may be successfully applied to improve speech recognition performance when data is scarce for the target domain or language. In this paper we combine this approach with the use of a hierarchical deep neural network (DNN) structure - which we term a multi-level adaptive network (MLAN) - and the use of multitask learning. We have applied the technique to cross-lingual speech recognition experiments on recordings of TED talks and European Parliament sessions in English (source language) and German (target language). We demonstrate that the proposed method can lead to improvements over standard methods, even when the quantity of training data for the target language is relatively high. When the complete method is applied, we achieve relative WER reductions of around 13% compared to a monolingual hybrid DNN baseline.
A. Cervone, S. Pareti, P. Bell, I. Prodanof, and T. Caselli. Detecting attribution relations in speech: a corpus study. In Proc. Italian Conference on Computational Linguistics, Pisa, Italy, 2014. [ bib | .pdf ]
In this work we present a methodology for the annotation of Attribution Relations (ARs) in speech which we apply to create a pilot corpus of spoken informal dialogues. This represents the first step towards the creation of a resource for the analysis of ARs in speech and the development of automatic extraction systems. Despite its relevance for speech recognition systems and spoken language understanding, the relation holding between quotations and opinions and their source has been studied and extracted only in written corpora, characterized by a formal register (news, literature, scientific articles). The shift to the informal register and to a spoken corpus widens our view of this relation and poses new challenges. Our hypothesis is that the decreased reliability of the linguistic cues found for written corpora in the fragmented structure of speech could be overcome by including prosodic clues in the system. The analysis of SARC confirms the hypothesis showing the crucial role played by the acoustic level in providing the missing lexical clues.
Nicolas d’Alessandro, Joëlle Tilmanne, Maria Astrinaki, Thomas Hueber, Rasmus Dall, Thierry Ravet, Alexis Moinet, Huseyin Cakmak, Onur Babacan, Adela Barbulescu, Valentin Parfait, Victor Huguenin, Emine Sümeyye Kalaycı, and Qiong Hu. Reactive statistical mapping: Towards the sketching of performative control with data. In Yves Rybarczyk, Tiago Cardoso, João Rosas, and Luis M. Camarinha-Matos, editors, Innovative and Creative Developments in Multimodal Interaction Systems, volume 425 of IFIP Advances in Information and Communication Technology, pages 20-49. Springer Berlin Heidelberg, 2014. [ bib | .pdf ]
This paper presents the results of our participation in the ninth eNTERFACE workshop on multimodal user interfaces. Our target for this workshop was to bring some technologies currently used in speech recognition and synthesis to a new level, i.e. being the core of a new HMM-based mapping system. The idea of statistical mapping has been investigated, more precisely how to use Gaussian Mixture Models and Hidden Markov Models for realtime and reactive generation of new trajectories from input labels and for realtime regression in a continuous-to-continuous use case. As a result, we have developed several proofs of concept, including an incremental speech synthesiser, software for exploring stylistic spaces for gait and facial motion in real-time, a reactive audiovisual laughter and a prototype demonstrating the realtime reconstruction of lower body gait motion strictly from upper body motion, with conservation of the stylistic properties. This project has been the opportunity to formalise HMM-based mapping, integrate several of these innovations into the Mage library and explore the development of a realtime gesture recognition tool.
Herman Kamper, Aren Jansen, Simon King, and S. J. Goldwater. Unsupervised lexical clustering of speech segments using fixed-dimensional acoustic embeddings. In Proc. SLT, 2014. [ bib | .pdf ]
Unsupervised speech processing methods are essential for applications ranging from zero-resource speech technology to modelling child language acquisition. One challenging problem is discovering the word inventory of the language: the lexicon. Lexical clustering is the task of grouping unlabelled acoustic word tokens according to type. We propose a novel lexical clustering model: variable-length word segments are embedded in a fixed-dimensional acoustic space in which clustering is then performed. We evaluate several clustering algorithms and find that the best methods produce clusters with wide variation in sizes, as observed in natural language. The best probabilistic approach is an infinite Gaussian mixture model (IGMM), which automatically chooses the number of clusters. Performance is comparable to that of non-probabilistic Chinese Whispers and average-linkage hierarchical clustering. We conclude that IGMM clustering of fixed-dimensional embeddings holds promise as the lexical clustering component in unsupervised speech processing systems.
Maria Luisa Garcia Lecumberri, Roberto Barra-Chicote, Rubén Pérez Ramón, Junichi Yamagishi, and Martin Cooke. Generating segmental foreign accent. In Fifteenth Annual Conference of the International Speech Communication Association, 2014. [ bib | .pdf ]
For most of us, speaking in a non-native language involves deviating to some extent from native pronunciation norms. However, the detailed basis for foreign accent (FA) remains elusive, in part due to methodological challenges in isolating segmental from suprasegmental factors. The current study examines the role of segmental features in conveying FA through the use of a generative approach in which accent is localised to single consonantal segments. Three techniques are evaluated: the first requires a highly proficient bilingual to produce words with isolated accented segments; the second uses cross-splicing of context-dependent consonants from the non-native language into native words; the third employs hidden Markov model synthesis to blend voice models for both languages. Using English and Spanish as the native/non-native languages respectively, listener cohorts from both languages identified words and rated their degree of FA. All techniques were capable of generating accented words, but to differing degrees. Naturally-produced speech led to the strongest FA ratings and synthetic speech the weakest, which we interpret as the outcome of over-smoothing. Nevertheless, the flexibility offered by synthesising localised accent encourages further development of the method.