|
[1]
|
John Dines, Hui Liang, Lakshmi Saheer, Matthew Gibson, William Byrne, Keiichiro
Oura, Keiichi Tokuda, Junichi Yamagishi, Simon King, Mirjam Wester, Teemu
Hirsimäki, Reima Karhila, and Mikko Kurimo.
Personalising speech-to-speech translation: Unsupervised
cross-lingual speaker adaptation for HMM-based speech synthesis.
Computer Speech and Language, 27(2):420-437, February 2013.
[ bib |
DOI |
http ]
In this paper we present results of unsupervised
cross-lingual speaker adaptation applied to
text-to-speech synthesis. The application of our
research is the personalisation of speech-to-speech
translation in which we employ an HMM statistical
framework for both speech recognition and synthesis.
This framework provides a logical mechanism to adapt
synthesised speech output to the voice of the user by
way of speech recognition. In this work we present
results of several different unsupervised and
cross-lingual adaptation approaches as well as an
end-to-end speaker adaptive speech-to-speech
translation system. Our experiments show that we can
successfully apply speaker adaptation in both
unsupervised and cross-lingual scenarios and our
proposed algorithms seem to generalise well for several
language pairs. We also discuss important future
directions including the need for better evaluation
metrics.
Keywords: Speech-to-speech translation, Cross-lingual speaker
adaptation, HMM-based speech synthesis, Speaker
adaptation, Voice conversion
|
|
[2]
|
Adriana Stan, Peter Bell, and Simon King.
A grapheme-based method for automatic alignment of speech and text
data.
In Proc. IEEE Workshop on Spoken Language Technology, Miami,
Florida, USA, December 2012.
[ bib |
.pdf ]
This paper introduces a method for automatic alignment
of speech data with unsynchronised, imperfect
transcripts, for a domain where no initial acoustic
models are available. Using grapheme-based acoustic
models, word skip networks and orthographic speech
transcripts, we are able to harvest 55% of the speech
with a 93% utterance-level accuracy and 99% word
accuracy for the produced transcriptions. The work is
based on the assumption that there is a high degree of
correspondence between the speech and text, and that a
full transcription of all of the speech is not
required. The method is language independent and the
only prior knowledge and resources required are the
speech and text transcripts, and a few minor user
interventions.
|
|
[3]
|
Heng Lu and Simon King.
Using Bayesian networks to find relevant context features for
HMM-based speech synthesis.
In Proc. Interspeech, Portland, Oregon, USA, September 2012.
[ bib |
.pdf ]
Speech units are highly context-dependent, so taking
contextual features into account is essential for
speech modelling. Context is employed in HMM-based
text-to-speech synthesis systems via
context-dependent phone models. A very wide context is
taken into account, represented by a large set of
contextual factors. However, most of these factors
probably have no significant influence on the speech,
most of the time. To discover which combinations of
features should be taken into account, decision
tree-based context clustering is used. But the space of
context-dependent models is vast, and the number of
contexts seen in the training data is only a tiny
fraction of this space, so the task of the decision
tree is very hard: to generalise from observations of a
tiny fraction of the space to the rest of the space,
whilst ignoring uninformative or redundant context
features. The structure of the context feature space
has not been systematically studied for speech
synthesis. In this paper we discover a dependency
structure by learning a Bayesian Network over the joint
distribution of the features and the speech. We
demonstrate that it is possible to discard the majority
of context features with minimal impact on quality,
measured by a perceptual test.
Keywords: HMM-based speech synthesis, Bayesian Networks, context
information
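As a rough illustration of the underlying question (which
context features matter?), the sketch below ranks context
factors by mutual information with an acoustic target
using scikit-learn. This is a simple stand-in, not the
paper's Bayesian-network method; all names, sizes and
data are placeholders.

    # Crude proxy for context-feature relevance: mutual information
    # between each context factor and a per-phone acoustic summary.
    # Not the Bayesian-network approach of the paper.
    import numpy as np
    from sklearn.feature_selection import mutual_info_regression

    rng = np.random.default_rng(0)
    n_phones, n_context = 5000, 40                    # illustrative sizes
    X = rng.integers(0, 5, size=(n_phones, n_context)).astype(float)
    y = rng.normal(size=n_phones)                     # e.g. mean spectral distortion

    mi = mutual_info_regression(X, y, discrete_features=True, random_state=0)
    print("most relevant context factors:", np.argsort(mi)[::-1][:10])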
|
|
[4]
|
Rasmus Dall, Christophe Veaux, Junichi Yamagishi, and Simon King.
Analysis of speaker clustering strategies for HMM-based speech
synthesis.
In Proc. Interspeech, Portland, Oregon, USA, September 2012.
[ bib |
.pdf ]
This paper describes a method for speaker clustering,
with the application of building average voice models
for speaker-adaptive HMM-based speech synthesis that
are a good basis for adapting to specific target
speakers. Our main hypothesis is that using
perceptually similar speakers to build the average
voice model will give better results than using unselected
speakers, even if the amount of data available from
perceptually similar speakers is smaller. We measure
the perceived similarities among a group of 30 female
speakers in a listening test and then apply multiple
linear regression to automatically predict these
listener judgements of speaker similarity and thus to
identify similar speakers automatically. We then
compare a variety of average voice models trained on
either speakers who were perceptually judged to be
similar to the target speaker, or speakers selected by
the multiple linear regression, or a large global set
of unselected speakers. We find that the average voice
model trained on perceptually similar speakers provides
better performance than the global model, even though
the latter is trained on more data, confirming our main
hypothesis. However, the average voice model using
speakers selected automatically by the multiple linear
regression does not reach the same level of
performance.
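A minimal sketch of the regression-based selection step
described above, assuming each candidate speaker is
summarised by a few acoustic distances to the target and
that averaged listener similarity ratings exist for a
rated subset; all names, sizes and data are illustrative.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    X_rated = rng.normal(size=(20, 4))    # acoustic features of rated speakers
    y_rated = rng.uniform(1, 5, size=20)  # mean listener similarity ratings
    X_pool  = rng.normal(size=(30, 4))    # full pool of candidate speakers

    model = LinearRegression().fit(X_rated, y_rated)
    pred = model.predict(X_pool)

    # Select the speakers predicted to be most similar to the target,
    # to be pooled into the average voice model.
    print("selected speakers:", np.argsort(pred)[::-1][:10])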
|
|
[5]
|
C. Valentini-Botinhao, J. Yamagishi, and S. King.
Speech intelligibility enhancement for HMM-based synthetic speech
in noise.
In Proc. SAPA Workshop, Portland, USA, September 2012.
[ bib |
.pdf ]
It is possible to increase the intelligibility of
speech in noise by enhancing the clean speech signal.
In this paper we demonstrate the effects of modifying
the spectral envelope of synthetic speech according to
the environmental noise. To achieve this, we modify Mel
cepstral coefficients according to an intelligibility
measure that accounts for glimpses of speech in noise:
the Glimpse Proportion measure. We evaluate this method
against a baseline synthetic voice trained only with
normal speech and a topline voice trained with Lombard
speech, as well as natural speech. The intelligibility
of these voices was measured when mixed with
speech-shaped noise and with a competing speaker at
three different levels. The Lombard voices, both
natural and synthetic, were more intelligible than the
normal voices in all conditions. For speech-shaped
noise, the proposed modified voice was as intelligible
as the Lombard synthetic voice without requiring any
recordings of Lombard speech, which are hard to obtain.
However, in the case of competing talker noise, the
Lombard synthetic voice was more intelligible than the
proposed modified voice.
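As a rough illustration of the Glimpse Proportion idea
referred to above: given time-frequency envelopes of
speech and noise in dB, count the proportion of cells in
which speech exceeds the noise by a local SNR threshold
(3 dB is the value usually quoted for this measure). The
shapes and data below are placeholder assumptions.

    import numpy as np

    def glimpse_proportion(speech_db, noise_db, local_snr_db=3.0):
        # Fraction of time-frequency cells where speech is 'glimpsed'
        # above the noise; inputs are (bands x frames) arrays in dB.
        assert speech_db.shape == noise_db.shape
        return (speech_db > noise_db + local_snr_db).mean()

    rng = np.random.default_rng(2)
    speech = rng.normal(60, 10, size=(34, 200))   # 34 bands, 200 frames
    noise = rng.normal(58, 8, size=(34, 200))
    print("GP =", round(glimpse_proportion(speech, noise), 3))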
|
|
[6]
|
Ruben San-Segundo, Juan M. Montero, Veronica Lopez-Ludena, and Simon King.
Detecting acronyms from capital letter sequences in Spanish.
In Proc. Interspeech, Portland, Oregon, USA, September 2012.
[ bib |
.pdf ]
This paper presents an automatic strategy for deciding
how to pronounce a Capital Letter Sequence (CLS) in a
Text-to-Speech (TTS) system. If the CLS is known to the
TTS system, it can be expanded into several words. But
when the CLS is unknown, the system has two
alternatives: spelling it out (abbreviation) or
pronouncing it as a new word (acronym). In Spanish there
is a close relationship between letters and phonemes, so
when a CLS resembles other Spanish words there is a
strong tendency to pronounce it as a standard word. This
paper proposes an automatic method for detecting
acronyms. Additionally, it analyses the discrimination
capability of several features, and several strategies
for combining them in order to obtain the best
classifier. The best classifier achieves a
classification error of 8.45%. In the feature analysis,
the most useful features were the Letter Sequence
Perplexity and the Average N-gram order.
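A toy illustration of the Letter Sequence Perplexity
feature mentioned above: a letter bigram model trained on
ordinary words assigns low perplexity to pronounceable
sequences and high perplexity to unpronounceable ones.
The word list, smoothing and examples are placeholders.

    import math
    from collections import Counter

    def train_letter_bigrams(words):
        # Add-one smoothed letter bigram model over a word list.
        bigrams, unigrams = Counter(), Counter()
        for w in words:
            w = "^" + w.lower() + "$"
            unigrams.update(w[:-1])
            bigrams.update(zip(w, w[1:]))
        vocab = len(set("".join(words).lower())) + 2
        return lambda a, b: (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)

    def letter_perplexity(cls, prob):
        s = "^" + cls.lower() + "$"
        logp = sum(math.log(prob(a, b)) for a, b in zip(s, s[1:]))
        return math.exp(-logp / (len(s) - 1))

    prob = train_letter_bigrams(["casa", "perro", "mesa", "sol", "ovni"])
    for cls in ["OVNI", "BBVA"]:   # acronym-like vs. spelled-out examples
        print(cls, round(letter_perplexity(cls, prob), 2))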
|
|
[7]
|
C. Valentini-Botinhao, J. Yamagishi, and S. King.
Mel cepstral coefficient modification based on the Glimpse
Proportion measure for improving the intelligibility of HMM-generated
synthetic speech in noise.
In Proc. Interspeech, Portland, USA, September 2012.
[ bib ]
We propose a method that modifies the Mel cepstral
coefficients of HMM-generated synthetic speech in order
to increase the intelligibility of the generated speech
when heard by a listener in the presence of a known
noise. This method is based on an approximation we
previously proposed for the Glimpse Proportion measure.
Here we show how to update the Mel cepstral
coefficients using this measure as an optimization
criterion and how to control the amount of distortion
by limiting the frequency resolution of the
modifications. To evaluate the method we built eight
different voices from normal read-text speech data from
a male speaker. Some voices were also built from
Lombard speech data produced by the same speaker.
Listening experiments with speech-shaped noise and with
a single competing talker indicate that our method
significantly improves intelligibility when compared to
unmodified synthetic speech. The voices built from
Lombard speech outperformed the proposed method
particularly for the competing talker case. However,
compared to a voice using only the spectral parameters
from Lombard speech, the proposed method obtains
similar or higher performance.
|
|
[8]
|
C. Valentini-Botinhao, J. Yamagishi, and S. King.
Using an intelligibility measure to create noise robust cepstral
coefficients for HMM-based speech synthesis.
In Proc. LISTA Workshop, Edinburgh, UK, May 2012.
[ bib |
.pdf ]
|
|
[9]
|
C. Valentini-Botinhao, R. Maia, J. Yamagishi, S. King, and H. Zen.
Cepstral analysis based on the Glimpse proportion measure for
improving the intelligibility of HMM-based synthetic speech in noise.
In Proc. ICASSP, pages 3997-4000, Kyoto, Japan, March 2012.
[ bib |
DOI |
.pdf ]
In this paper we introduce a new cepstral coefficient
extraction method based on an intelligibility measure
for speech in noise, the Glimpse Proportion measure.
This new method aims to increase the intelligibility of
speech in noise by modifying the clean speech, and has
applications in scenarios such as public announcement
and car navigation systems. We first explain how the
Glimpse Proportion measure operates and further show
how we approximated it to integrate it into an existing
spectral envelope parameter extraction method commonly
used in the HMM-based speech synthesis framework. We
then demonstrate how this new method changes the
modelled spectrum according to the characteristics of
the noise and show results for a listening test with
vocoded and HMM-based synthetic speech. The test
indicates that the proposed method can significantly
improve intelligibility of synthetic speech in speech
shaped noise.
|
|
[10]
|
Dong Wang, Javier Tejedor, Simon King, and Joe Frankel.
Term-dependent confidence normalization for out-of-vocabulary spoken
term detection.
Journal of Computer Science and Technology, 27(2), 2012.
[ bib |
DOI ]
Spoken Term Detection (STD) is a fundamental component
of spoken information retrieval systems. A key task of
an STD system is to determine reliable detections and
reject false alarms based on certain confidence
measures. The detection posterior probability, which is
often computed from lattices, is a widely used
confidence measure. However, a potential problem of
this confidence measure is that the confidence scores
of detections of all search terms are treated
uniformly, regardless of how much they may differ in
terms of phonetic or linguistic properties. This
problem is particularly evident for out-of-vocabulary
(OOV) terms which tend to exhibit high intra-term
diversity. To address the discrepancy on confidence
levels that the same confidence score may convey for
different terms, a term-dependent decision strategy is
desirable – for example, the term-specific threshold
(TST) approach. In this work, we propose a
term-dependent normalisation technique which
compensates for term diversity in confidence
estimation. In particular, we propose a linear bias
compensation and a discriminative compensation to deal
with the bias problem that is inherent in lattice-based
confidence measuring from which the TST approach
suffers. We tested the proposed technique on speech
data from the multi-party meeting domain with two
state-of-the-art STD systems based on phonemes and
words respectively. The experimental results
demonstrate that the confidence normalisation approach
leads to a significant performance improvement in STD,
particularly for OOV terms with phoneme-based systems.
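A minimal sketch of the linear bias compensation idea:
on development detections, regress the gap between raw
lattice confidence and empirical correctness onto
term-level properties, then subtract the predicted bias
at search time. The features, data and use of ordinary
least squares are illustrative assumptions, not the
paper's exact formulation.

    import numpy as np

    rng = np.random.default_rng(3)
    term_feats = rng.normal(size=(500, 2))     # e.g. phone count, phone-LM score
    conf = rng.uniform(0, 1, size=500)         # raw lattice confidences
    is_hit = (rng.uniform(size=500) < conf).astype(float)

    # Fit a linear model of the term-dependent confidence bias.
    bias = conf - is_hit
    A = np.hstack([term_feats, np.ones((500, 1))])   # add intercept
    w, *_ = np.linalg.lstsq(A, bias, rcond=None)

    def normalise(confidence, term_feature_vec):
        # Subtract the predicted term-dependent bias from a raw confidence.
        return confidence - np.append(term_feature_vec, 1.0) @ w

    print(normalise(0.7, np.array([1.2, -0.3])))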
|
|
[11]
|
Keiichiro Oura, Junichi Yamagishi, Mirjam Wester, Simon King, and Keiichi
Tokuda.
Analysis of unsupervised cross-lingual speaker adaptation for
HMM-based speech synthesis using KLD-based transform mapping.
Speech Communication, 54(6):703-714, 2012.
[ bib |
DOI |
http ]
In the EMIME project, we developed a mobile device
that performs personalized speech-to-speech translation
such that a user's spoken input in one language is used
to produce spoken output in another language, while
continuing to sound like the user's voice. We
integrated two techniques into a single architecture:
unsupervised adaptation for HMM-based TTS using
word-based large-vocabulary continuous speech
recognition, and cross-lingual speaker adaptation
(CLSA) for HMM-based TTS. The CLSA is based on a
state-level transform mapping learned using minimum
Kullback-Leibler divergence between pairs of HMM states
in the input and output languages. Thus, an
unsupervised cross-lingual speaker adaptation system
was developed. End-to-end speech-to-speech translation
systems for four languages (English, Finnish, Mandarin,
and Japanese) were constructed within this framework.
In this paper, the English-to-Japanese adaptation is
evaluated. Listening tests demonstrate that adapted
voices sound more similar to a target speaker than
average voices and that differences between supervised
and unsupervised cross-lingual speaker adaptation are
small. Calculating the KLD state-mapping on only the
first 10 mel-cepstral coefficients leads to huge
savings in computational costs, without any detrimental
effect on the quality of the synthetic speech.
Keywords: HMM-based speech synthesis, Unsupervised speaker
adaptation, Cross-lingual speaker adaptation,
Speech-to-speech translation
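A minimal numpy sketch of the KLD-based state mapping:
compute a symmetrised Kullback-Leibler divergence between
diagonal-covariance Gaussian states of the two languages,
restricted to the first 10 mel-cepstral dimensions as in
the paper, and map each output-language state to its
nearest input-language state. The random state parameters
are placeholders.

    import numpy as np

    def kld_diag(mu1, var1, mu2, var2):
        # KL(N1 || N2) for diagonal-covariance Gaussians.
        return 0.5 * np.sum(np.log(var2 / var1)
                            + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

    def build_state_mapping(mu_in, var_in, mu_out, var_out, n_dims=10):
        mapping = []
        for mo, vo in zip(mu_out[:, :n_dims], var_out[:, :n_dims]):
            klds = [kld_diag(mo, vo, mi, vi) + kld_diag(mi, vi, mo, vo)
                    for mi, vi in zip(mu_in[:, :n_dims], var_in[:, :n_dims])]
            mapping.append(int(np.argmin(klds)))
        return mapping

    rng = np.random.default_rng(4)
    mu_in, var_in = rng.normal(size=(50, 40)), rng.uniform(0.5, 2.0, (50, 40))
    mu_out, var_out = rng.normal(size=(60, 40)), rng.uniform(0.5, 2.0, (60, 40))
    print(build_state_mapping(mu_in, var_in, mu_out, var_out)[:10])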
|
|
[12]
|
Kei Hashimoto, Junichi Yamagishi, William Byrne, Simon King, and Keiichi
Tokuda.
Impacts of machine translation and speech synthesis on
speech-to-speech translation.
Speech Communication, 54(7):857-866, 2012.
[ bib |
DOI |
http ]
This paper analyzes the impacts of machine translation
and speech synthesis on speech-to-speech translation
systems. A typical speech-to-speech translation system
consists of three components: speech recognition,
machine translation and speech synthesis. Many
techniques have been proposed for integration of speech
recognition and machine translation. However,
corresponding techniques have not yet been considered
for speech synthesis. The focus of the current work is
machine translation and speech synthesis, and we
present a subjective evaluation designed to analyze
their impact on speech-to-speech translation. The
results of these analyses show that the naturalness and
intelligibility of the synthesized speech are strongly
affected by the fluency of the translated sentences. In
addition, several features were found to correlate well
with the average fluency of the translated sentences
and the average naturalness of the synthesized speech.
Keywords: Speech-to-speech translation, Machine translation,
Speech synthesis, Subjective evaluation
|
|
[13]
|
Junichi Yamagishi, Christophe Veaux, Simon King, and Steve Renals.
Speech synthesis technologies for individuals with vocal
disabilities: Voice banking and reconstruction.
Acoustical Science and Technology, 33(1):1-5, 2012.
[ bib |
http ]
|
|
[14]
|
Oliver Watts, Junichi Yamagishi, and Simon King.
Unsupervised continuous-valued word features for phrase-break
prediction without a part-of-speech tagger.
In Proc. Interspeech, pages 2157-2160, Florence, Italy, August
2011.
[ bib |
.pdf ]
Part of speech (POS) tags are foremost among the
features conventionally used to predict intonational
phrase-breaks for text to speech (TTS) conversion. The
construction of such systems therefore presupposes the
availability of a POS tagger for the relevant language,
or of a corpus manually tagged with POS. However, such
tools and resources are not available in the majority
of the world’s languages, and manually labelling text
with POS tags is an expensive and time-consuming
process. We therefore propose the use of
continuous-valued features that summarise the
distributional characteristics of word types as
surrogates for POS features. Importantly, such features
are obtained in an unsupervised manner from an untagged
text corpus. We present results on the phrase-break
prediction task, where use of the features closes the
gap in performance between a baseline system (using
only basic punctuation-related features) and a topline
system (incorporating a state-of-the-art POS tagger).
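One plausible instantiation of the unsupervised word
features described above (the paper's exact derivation
may differ): build a word-by-context co-occurrence matrix
from untagged text and reduce it with truncated SVD,
giving each word type a continuous feature vector that
can stand in for POS in the phrase-break predictor. The
toy corpus and dimensionality are assumptions.

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    tokens = "the cat sat on the mat and the dog sat on the rug".split()
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}

    # Directed neighbour co-occurrence counts: C[i, j] = count of j after i.
    C = np.zeros((len(vocab), len(vocab)))
    for left, right in zip(tokens, tokens[1:]):
        C[idx[left], idx[right]] += 1

    # Continuous word features from left- and right-context profiles.
    svd = TruncatedSVD(n_components=3, random_state=0)
    word_features = svd.fit_transform(np.hstack([C, C.T]))
    print(dict(zip(vocab, np.round(word_features, 2))))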
|
|
[15]
|
Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King.
Can objective measures predict the intelligibility of modified
HMM-based synthetic speech in noise?
In Proc. Interspeech, August 2011.
[ bib |
.pdf ]
Synthetic speech can be modified to improve
intelligibility in noise. In order to perform
modifications automatically, it would be useful to have
an objective measure that could predict the
intelligibility of modified synthetic speech for human
listeners. We analysed the impact on intelligibility
– and on how well objective measures predict it –
when we separately modify speaking rate, fundamental
frequency, line spectral pairs and spectral peaks.
Shifting LSPs can increase intelligibility for human
listeners; other modifications had weaker effects.
Among the objective measures we evaluated, the Dau
model and the Glimpse proportion were the best
predictors of human performance.
|
|
[16]
|
Korin Richmond, Phil Hoole, and Simon King.
Announcing the electromagnetic articulography (day 1) subset of the
mngu0 articulatory corpus.
In Proc. Interspeech, pages 1505-1508, Florence, Italy, August
2011.
[ bib |
.pdf ]
This paper serves as an initial announcement of the
availability of a corpus of articulatory data called
mngu0. This corpus will ultimately consist of a
collection of multiple sources of articulatory data
acquired from a single speaker: electromagnetic
articulography (EMA), audio, video, volumetric MRI
scans, and 3D scans of dental impressions. This data
will be provided free for research use. In this first
stage of the release, we are making available one
subset of EMA data, consisting of more than 1,300
phonetically diverse utterances recorded with a
Carstens AG500 electromagnetic articulograph.
Distribution of mngu0 will be managed by a dedicated
“forum-style” web site. This paper both outlines the
general goals motivating the distribution of the data
and the creation of the mngu0 web forum, and also
provides a description of the EMA data contained in
this initial release.
|
|
[17]
|
Ming Lei, Junichi Yamagishi, Korin Richmond, Zhen-Hua Ling, Simon King, and
Li-Rong Dai.
Formant-controlled HMM-based speech synthesis.
In Proc. Interspeech, pages 2777-2780, Florence, Italy, August
2011.
[ bib |
.pdf ]
This paper proposes a novel framework that enables us
to manipulate and control formants in HMM-based speech
synthesis. In this framework, the dependency between
formants and spectral features is modelled by piecewise
linear transforms; formant parameters are effectively
mapped by these to the means of Gaussian distributions
over the spectral synthesis parameters. The spectral
envelope features generated under the influence of
formants in this way may then be passed to high-quality
vocoders to generate the speech waveform. This provides
two major advantages over conventional frameworks.
First, we can achieve spectral modification by changing
formants only in those parts where we want control,
whereas the user must specify all formants manually in
conventional formant synthesisers (e.g. Klatt). Second,
this can produce high-quality speech. Our results show
the proposed method can control vowels in the
synthesized speech by manipulating F1 and F2 without
any degradation in synthesis quality.
|
|
[18]
|
S. Andraszewicz, J. Yamagishi, and S. King.
Vocal attractiveness of statistical speech synthesisers.
In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE
International Conference on, pages 5368-5371, May 2011.
[ bib |
DOI ]
Our previous analysis of speaker-adaptive HMM-based
speech synthesis methods suggested that there are two
possible reasons why average voices can obtain higher
subjective scores than any individual adapted voice: 1)
model adaptation degrades speech quality proportionally
to the distance 'moved' by the transforms, and 2)
psychoacoustic effects relating to the attractiveness
of the voice. This paper is a follow-on from that
analysis and aims to separate these effects out. Our
latest perceptual experiments focus on attractiveness,
using average voices and speaker-dependent voices
without model transformation, and show that using
several speakers to create a voice improves smoothness
(measured by Harmonics-to-Noise Ratio) and reduces the
distance of the final voice from the average voice in
the log F0-F1 space, hence making it more
attractive at the segmental level. However, this is
weakened or overridden at supra-segmental or sentence
levels.
Keywords: speaker-adaptive HMM-based speech synthesis
methods; speaker-dependent voices; statistical speech
synthesisers; vocal attractiveness; hidden Markov
models; speaker recognition; speech synthesis
|
|
[19]
|
Cassia Valentini-Botinhao, Junichi Yamagishi, and Simon King.
Evaluation of objective measures for intelligibility prediction of
HMM-based synthetic speech in noise.
In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE
International Conference on, pages 5112-5115, May 2011.
[ bib |
DOI |
.pdf ]
In this paper we evaluate four objective measures of
speech with regards to intelligibility prediction of
synthesized speech in diverse noisy situations. We
evaluated three intelligibility measures, the Dau
measure, the glimpse proportion and the Speech
Intelligibility Index (SII), and a quality measure, the
Perceptual Evaluation of Speech Quality (PESQ). For the
generation of synthesized speech we used a
state-of-the-art HMM-based speech synthesis system. The noisy
conditions comprised four additive noises. The measures
were compared with subjective intelligibility scores
obtained in listening tests. The results show the Dau
and the glimpse measures to be the best predictors of
intelligibility, with correlations of around 0.83 to
subjective scores. All measures gave less accurate
predictions of intelligibility for synthetic speech
than have previously been found for natural speech; in
particular the SII measure. In additional experiments,
we processed the synthesized speech by an ideal binary
mask before adding noise. The Glimpse measure gave the
most accurate intelligibility predictions in this
situation.
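The evaluation described above reduces to correlating
per-condition objective scores with subjective
intelligibility; a minimal sketch of that step, with
random placeholder scores standing in for real measure
outputs and listening-test results:

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(5)
    subjective = rng.uniform(0, 100, size=24)      # % words correct per condition
    objective = {                                  # placeholder objective scores
        "Dau": subjective + rng.normal(0, 10, 24),
        "Glimpse": subjective + rng.normal(0, 12, 24),
        "SII": subjective + rng.normal(0, 30, 24),
        "PESQ": subjective + rng.normal(0, 25, 24),
    }
    for name, scores in objective.items():
        r, _ = pearsonr(scores, subjective)
        print(f"{name}: r = {r:.2f}")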
|
|
[20]
|
K. Hashimoto, J. Yamagishi, W. Byrne, S. King, and K. Tokuda.
An analysis of machine translation and speech synthesis in
speech-to-speech translation system.
In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE
International Conference on, pages 5108-5111, May 2011.
[ bib |
DOI ]
This paper provides an analysis of the impacts of
machine translation and speech synthesis on
speech-to-speech translation systems. The
speech-to-speech translation system consists of three
components: speech recognition, machine translation and
speech synthesis. Many techniques for integration of
speech recognition and machine translation have been
proposed. However, speech synthesis has not yet been
considered. Therefore, in this paper, we focus on
machine translation and speech synthesis, and report a
subjective evaluation to analyze the impact of each
component. The results of these analyses show that the
naturalness and intelligibility of synthesized speech
are strongly affected by the fluency of the translated
sentences.
Keywords: machine translation; speech recognition;
speech synthesis; speech-to-speech translation system
|
|
[21]
|
Dong Wang, Nicholas Evans, Raphael Troncy, and Simon King.
Handling overlaps in spoken term detection.
In Proc. International Conference on Acoustics, Speech and
Signal Processing, pages 5656-5659, May 2011.
[ bib |
DOI |
.pdf ]
Spoken term detection (STD) systems usually arrive at
many overlapping detections which are often addressed
with some pragmatic approaches, e.g. choosing the best
detection to represent all the overlaps. In this paper
we present a theoretical study based on a concept of
acceptance space. In particular, we present two
confidence estimation approaches based on Bayesian and
evidence perspectives respectively. Analysis shows that
both approaches possess respective advantages and
shortcomings, and that their combination has the
potential to provide an improved confidence estimation.
Experiments conducted on meeting data confirm our
analysis and show considerable performance improvement
with the combined approach, in particular for
out-of-vocabulary spoken term detection with stochastic
pronunciation modeling.
|
|
[22]
|
Dong Wang and Simon King.
Letter-to-sound pronunciation prediction using conditional random
fields.
IEEE Signal Processing Letters, 18(2):122-125, February 2011.
[ bib |
DOI |
.pdf ]
Pronunciation prediction, or letter-to-sound (LTS)
conversion, is an essential task for speech synthesis,
open vocabulary spoken term detection and other
applications dealing with novel words. Most current
approaches (at least for English) employ data-driven
methods to learn and represent pronunciation “rules”
using statistical models such as decision trees, hidden
Markov models (HMMs) or joint-multigram models (JMMs).
The LTS task remains challenging, particularly for
languages with a complex relationship between spelling
and pronunciation such as English. In this paper, we
propose to use a conditional random field (CRF) to
perform LTS because it avoids having to model a
distribution over observations and can perform global
inference, suggesting that it may be more suitable for
LTS than decision trees, HMMs or JMMs. One challenge in
applying CRFs to LTS is that the phoneme and grapheme
sequences of a word are generally of different lengths,
which makes CRF training difficult. To solve this
problem, we employed a joint-multigram model to
generate aligned training exemplars. Experiments
conducted with the AMI05 dictionary demonstrate that a
CRF significantly outperforms other models, especially
if n-best lists of predictions are generated.
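A minimal sketch of CRF-based LTS using the third-party
sklearn-crfsuite package, assuming graphemes and phonemes
have already been aligned one-to-one (e.g. by a
joint-multigram aligner) so that each letter carries one
label; the window features and the tiny training set are
illustrative.

    import sklearn_crfsuite

    def letter_features(word, i):
        # Simple window features around letter i of a word.
        return {
            "letter": word[i],
            "prev": word[i - 1] if i > 0 else "<s>",
            "next": word[i + 1] if i < len(word) - 1 else "</s>",
            "position": str(i),
        }

    # Aligned (letters, phonemes) exemplars; lengths match after alignment.
    train = [("cat", ["k", "ae", "t"]),
             ("kit", ["k", "ih", "t"]),
             ("sit", ["s", "ih", "t"])]
    X = [[letter_features(w, i) for i in range(len(w))] for w, _ in train]
    y = [phones for _, phones in train]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=50)
    crf.fit(X, y)
    print(crf.predict([[letter_features("kat", i) for i in range(3)]]))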
|
|
[23]
|
J. Dines, J. Yamagishi, and S. King.
Measuring the gap between HMM-based ASR and TTS.
IEEE Journal of Selected Topics in Signal Processing, 2011.
(in press).
[ bib |
DOI ]
The EMIME European project is conducting research in
the development of technologies for mobile,
personalised speech-to-speech translation systems. The
hidden Markov model (HMM) is being used as the
underlying technology in both automatic speech
recognition (ASR) and text-to-speech synthesis (TTS)
components, thus, the investigation of unified
statistical modelling approaches has become an implicit
goal of our research. As one of the first steps towards
this goal, we have been investigating commonalities and
differences between HMM-based ASR and TTS. In this
paper we present results and analysis of a series of
experiments that have been conducted on English ASR and
TTS systems measuring their performance with respect to
phone set and lexicon; acoustic feature type and
dimensionality; HMM topology; and speaker adaptation.
Our results show that, although the fundamental
statistical model may be essentially the same, optimal
ASR and TTS performance often demands diametrically
opposed system designs. This represents a major
challenge to be addressed in the investigation of such
unified modelling approaches.
Keywords: Acoustics, Adaptation model, Context modeling, Hidden
Markov models, Speech, Speech recognition, Training,
speech recognition, speech synthesis, unified models
|
|
[24]
|
Adriana Stan, Junichi Yamagishi, Simon King, and Matthew Aylett.
The Romanian speech synthesis (RSS) corpus: Building a high
quality HMM-based speech synthesis system using a high sampling rate.
Speech Communication, 53(3):442-450, 2011.
[ bib |
DOI |
http ]
This paper first introduces a newly-recorded high
quality Romanian speech corpus designed for speech
synthesis, called “RSS”, along with Romanian
front-end text processing modules and HMM-based
synthetic voices built from the corpus. All of these
are now freely available for academic use in order to
promote Romanian speech technology research. The RSS
corpus comprises 3500 training sentences and 500 test
sentences uttered by a female speaker and was recorded
using multiple microphones at 96 kHz sampling
frequency in a hemianechoic chamber. The details of the
new Romanian text processor we have developed are also
given. Using the database, we then revisit some basic
configuration choices of speech synthesis, such as
waveform sampling frequency and auditory frequency
warping scale, with the aim of improving speaker
similarity, which is an acknowledged weakness of
current HMM-based speech synthesisers. As we
demonstrate using perceptual tests, these configuration
choices can make substantial differences to the quality
of the synthetic speech. Contrary to common practice in
automatic speech recognition, higher waveform sampling
frequencies can offer enhanced feature extraction and
improved speaker similarity for HMM-based speech
synthesis.
Keywords: Speech synthesis, HTS, Romanian, HMMs, Sampling
frequency, Auditory scale
|
|
[25]
|
C. Mayo, R. A. J. Clark, and S. King.
Listeners' weighting of acoustic cues to synthetic speech
naturalness: A multidimensional scaling analysis.
Speech Communication, 53(3):311-326, 2011.
[ bib |
DOI ]
The quality of current commercial speech synthesis
systems is now so high that system improvements are
being made at subtle sub- and supra-segmental levels.
Human perceptual evaluation of such subtle improvements
requires a highly sophisticated level of perceptual
attention to specific acoustic characteristics or cues.
However, it is not well understood what acoustic cues
listeners attend to by default when asked to evaluate
synthetic speech. It may, therefore, be potentially
quite difficult to design an evaluation method that
allows listeners to concentrate on only one dimension
of the signal, while ignoring others that are
perceptually more important to them. The aim of the
current study was to determine which acoustic
characteristics of unit-selection synthetic speech are
most salient to listeners when evaluating the
naturalness of such speech. This study made use of
multidimensional scaling techniques to analyse
listeners' pairwise comparisons of synthetic speech
sentences. Results indicate that listeners place a
great deal of perceptual importance on the presence of
artifacts and discontinuities in the speech, somewhat
less importance on aspects of segmental quality, and
very little importance on stress/intonation
appropriateness. These relative differences in
importance will impact on listeners' ability to attend
to these different acoustic characteristics of
synthetic speech, and should therefore be taken into
account when designing appropriate methods of synthetic
speech evaluation.
Keywords: Speech synthesis; Evaluation; Speech perception;
Acoustic cue weighting; Multidimensional scaling
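A minimal sketch of the MDS step, assuming listener
pairwise judgements have already been aggregated into a
symmetric dissimilarity matrix over synthetic stimuli;
the matrix below is a random placeholder.

    import numpy as np
    from sklearn.manifold import MDS

    rng = np.random.default_rng(6)
    n_stimuli = 12
    D = rng.uniform(0, 1, size=(n_stimuli, n_stimuli))
    D = (D + D.T) / 2.0            # symmetric pairwise dissimilarities
    np.fill_diagonal(D, 0.0)

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)
    print(coords[:3])              # perceptual-space coordinates (first 3 stimuli)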
|
|
[26]
|
Dong Wang, Simon King, Nick Evans, and Raphael Troncy.
Direct posterior confidence for out-of-vocabulary spoken term
detection.
In Proc. ACM Multimedia 2010 Searching Spontaneous
Conversational Speech Workshop, October 2010.
[ bib |
DOI |
.pdf ]
Spoken term detection (STD) is a fundamental task in
spoken information retrieval. Compared to conventional
speech transcription and keyword spotting, STD is an
open-vocabulary task and is necessarily required to
address out-of-vocabulary (OOV) terms. Approaches based
on subword units, e.g. phonemes, are widely used to
solve the OOV issue; however, performance on OOV terms
is still significantly inferior to that for
in-vocabulary (INV) terms. The performance degradation
on OOV terms can be attributed to a multitude of
factors. A particular factor we address in this paper
is that the acoustic and language models used for
speech transcribing are highly vulnerable to OOV terms,
which leads to unreliable confidence measures and
error-prone detections. A direct posterior confidence
measure that is derived from discriminative models has
been proposed for STD. In this paper, we utilize this
technique to tackle the weakness of OOV terms in
confidence estimation. Neither acoustic models nor
language models being included in the computation, the
new confidence avoids the weak modeling problem with
OOV terms. Our experiments, set up on multi-party
meeting speech which is highly spontaneous and
conversational, demonstrate that the proposed technique
improves STD performance on OOV terms significantly;
when combined with conventional lattice-based
confidence, a significant improvement in performance is
obtained on both INVs and OOVs. Furthermore, the new
confidence measure technique can be combined together
with other advanced techniques for OOV treatment, such
as stochastic pronunciation modeling and term-dependent
confidence discrimination, which leads to an integrated
solution for OOV STD with greatly improved performance.
|
|
[28]
|
Dong Wang, Simon King, Nick Evans, and Raphael Troncy.
CRF-based stochastic pronunciation modelling for out-of-vocabulary
spoken term detection.
In Proc. Interspeech, Makuhari, Chiba, Japan, September 2010.
[ bib ]
Out-of-vocabulary (OOV) terms present a significant
challenge to spoken term detection (STD). This
challenge, to a large extent, lies in the high degree
of uncertainty in pronunciations of OOV terms. In
previous work, we presented a stochastic pronunciation
modeling (SPM) approach to compensate for this
uncertainty. A shortcoming of our original work,
however, is that the SPM was based on a joint-multigram
model (JMM), which is suboptimal. In this paper, we
propose to use conditional random fields (CRFs) for
letter-to-sound conversion, which significantly
improves quality of the predicted pronunciations. When
applied to OOV STD, we achieve consider- able
performance improvement with both a 1-best system and
an SPM-based system.
|
|
[29]
|
Oliver Watts, Junichi Yamagishi, and Simon King.
The role of higher-level linguistic features in HMM-based speech
synthesis.
In Proc. Interspeech, pages 841-844, Makuhari, Japan,
September 2010.
[ bib |
.pdf ]
We analyse the contribution of higher-level elements
of the linguistic specification of a data-driven speech
synthesiser to the naturalness of the synthetic speech
which it generates. The system is trained using various
subsets of the full feature-set, in which features
relating to syntactic category, intonational phrase
boundary, pitch accent and boundary tones are
selectively removed. Utterances synthesised by the
different configurations of the system are then
compared in a subjective evaluation of their
naturalness. The work presented forms background
analysis for an ongoing set of experiments in
performing text-to-speech (TTS) conversion based on
shallow features: features that can be trivially
extracted from text. By building a range of systems,
each assuming the availability of a different level of
linguistic annotation, we obtain benchmarks for our
on-going work.
|
|
[30]
|
Junichi Yamagishi, Oliver Watts, Simon King, and Bela Usabaev.
Roles of the average voice in speaker-adaptive HMM-based speech
synthesis.
In Proc. Interspeech, pages 418-421, Makuhari, Japan,
September 2010.
[ bib |
.pdf ]
In speaker-adaptive HMM-based speech synthesis, there
are typically a few speakers for which the output
synthetic speech sounds worse than that of other
speakers, despite having the same amount of adaptation
data from within the same corpus. This paper
investigates these fluctuations in quality and
concludes that as mel-cepstral distance from the
average voice becomes larger, the MOS naturalness
scores generally become worse. Although this negative
correlation is not that strong, it suggests a way to
improve the training and adaptation strategies. We also
draw comparisons between our findings and the work of
other researchers regarding “vocal attractiveness.”
Keywords: speech synthesis, HMM, average voice, speaker
adaptation
|
|
[31]
|
Mirjam Wester, John Dines, Matthew Gibson, Hui Liang, Yi-Jian Wu, Lakshmi
Saheer, Simon King, Keiichiro Oura, Philip N. Garner, William Byrne, Yong
Guan, Teemu Hirsimäki, Reima Karhila, Mikko Kurimo, Matt Shannon, Sayaka
Shiota, Jilei Tian, Keiichi Tokuda, and Junichi Yamagishi.
Speaker adaptation and the evaluation of speaker similarity in the
EMIME speech-to-speech translation project.
In Proc. of 7th ISCA Speech Synthesis Workshop, Kyoto, Japan,
September 2010.
[ bib |
.pdf ]
This paper provides an overview of speaker adaptation
research carried out in the EMIME speech-to-speech
translation (S2ST) project. We focus on how speaker
adaptation transforms can be learned from speech in one
language and applied to the acoustic models of another
language. The adaptation is transferred across
languages and/or from recognition models to synthesis
models. The various approaches investigated can all be
viewed as a process in which a mapping is defined in
terms of either acoustic model states or linguistic
units. The mapping is used to transfer either speech
data or adaptation transforms between the two models.
Because the success of speaker adaptation in
text-to-speech synthesis is measured by judging speaker
similarity, we also discuss issues concerning
evaluation of speaker similarity in an S2ST scenario.
|
|
[32]
|
Javier Tejedor, Doroteo T. Toledano, Miguel Bautista, Simon King, Dong Wang,
and Jose Colas.
Augmented set of features for confidence estimation in spoken term
detection.
In Proc. Interspeech, September 2010.
[ bib |
.pdf ]
Discriminative confidence estimation along with
confidence normalisation have been shown to construct
robust decision maker modules in spoken term detection
(STD) systems. Discriminative confidence estimation,
making use of term-dependent features, has been shown to
improve the widely used lattice-based confidence
estimation in STD. In this work, we augment the set of
these term-dependent features and show a significant
improvement in the STD performance both in terms of
ATWV and DET curves in experiments conducted on a
Spanish geographical corpus. This work also proposes a
multiple linear regression analysis to carry out the
feature selection. Next, the most informative features
derived from it are used within the discriminative
confidence on the STD system.
|
|
[33]
|
Oliver Watts, Junichi Yamagishi, and Simon King.
Letter-based speech synthesis.
In Proc. Speech Synthesis Workshop 2010, pages 317-322, Nara,
Japan, September 2010.
[ bib |
.pdf ]
Initial attempts at performing text-to-speech
conversion based on standard orthographic units are
presented, forming part of a larger scheme of training
TTS systems on features that can be trivially extracted
from text. We evaluate the possibility of using the
technique of decision-tree-based context clustering
conventionally used in HMM-based systems for
parameter tying to handle letter-to-sound conversion. We
present the application of a method of compound-feature
discovery to corpus-based speech synthesis. Finally, an
evaluation of intelligibility of letter-based systems
and more conventional phoneme-based systems is
presented.
|
|
[34]
|
Alice Turk, James Scobbie, Christian Geng, Barry Campbell, Catherine Dickie,
Eddie Dubourg, Ellen Gurman Bard, William Hardcastle, Mariam Hartinger, Simon
King, Robin Lickley, Cedric Macmartin, Satsuki Nakai, Steve Renals, Korin
Richmond, Sonja Schaeffler, Kevin White, Ronny Wiegand, and Alan Wrench.
An Edinburgh speech production facility.
Poster presented at the 12th Conference on Laboratory Phonology,
Albuquerque, New Mexico., July 2010.
[ bib |
.pdf ]
|
|
[35]
|
D. Wang, S. King, and J. Frankel.
Stochastic pronunciation modelling for out-of-vocabulary spoken term
detection.
Audio, Speech, and Language Processing, IEEE Transactions on,
PP(99), July 2010.
[ bib |
DOI ]
Spoken term detection (STD) is the name given to the
task of searching large amounts of audio for
occurrences of spoken terms, which are typically single
words or short phrases. One reason that STD is a hard
task is that search terms tend to contain a
disproportionate number of out-of-vocabulary (OOV)
words. The most common approach to STD uses subword
units. This, in conjunction with some method for
predicting pronunciations of OOVs from their written
form, enables the detection of OOV terms but
performance is considerably worse than for
in-vocabulary terms. This performance differential can
be largely attributed to the special properties of
OOVs. One such property is the high degree of
uncertainty in the pronunciation of OOVs. We present a
stochastic pronunciation model (SPM) which explicitly
deals with this uncertainty. The key insight is to
search for all possible pronunciations when detecting
an OOV term, explicitly capturing the uncertainty in
pronunciation. This requires a probabilistic model of
pronunciation, able to estimate a distribution over all
possible pronunciations. We use a joint-multigram model
(JMM) for this and compare the JMM-based SPM with the
conventional soft match approach. Experiments using
speech from the meetings domain demonstrate that the
SPM performs better than soft match in most operating
regions, especially at low false alarm probabilities.
Furthermore, SPM and soft match are found to be
complementary: their combination provides further
performance gains.
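A minimal sketch of the stochastic pronunciation idea:
instead of scoring only one predicted pronunciation, sum
detection evidence over an n-best list of pronunciations
weighted by their probabilities from the letter-to-sound
model. The lookup function and data are placeholders.

    # Placeholder n-best pronunciations for an OOV term, with LTS probabilities.
    pronunciations = [
        (("f", "oh", "t", "oh"), 0.6),
        (("f", "ow", "t", "ow"), 0.3),
        (("p", "h", "oh", "t", "oh"), 0.1),
    ]

    def detection_score(pron):
        # Stand-in for the lattice/index lookup of one pronunciation variant.
        return {"f oh t oh": 0.4, "f ow t ow": 0.7}.get(" ".join(pron), 0.0)

    def spm_score(prons):
        # Expected detection score under the pronunciation distribution.
        return sum(p * detection_score(pron) for pron, p in prons)

    print(spm_score(pronunciations))   # 0.6*0.4 + 0.3*0.7 + 0.1*0.0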
|
|
[36]
|
Mikko Kurimo, William Byrne, John Dines, Philip N. Garner, Matthew Gibson, Yong
Guan, Teemu Hirsimäki, Reima Karhila, Simon King, Hui Liang, Keiichiro
Oura, Lakshmi Saheer, Matt Shannon, Sayaka Shiota, Jilei Tian, Keiichi
Tokuda, Mirjam Wester, Yi-Jian Wu, and Junichi Yamagishi.
Personalising speech-to-speech translation in the EMIME project.
In Proc. of the ACL 2010 System Demonstrations, Uppsala,
Sweden, July 2010.
[ bib |
.pdf ]
In the EMIME project we have studied unsupervised
cross-lingual speaker adaptation. We have employed an
HMM statistical framework for both speech recognition
and synthesis which provides transformation mechanisms
to adapt the synthesized voice in TTS (text-to-speech)
using the recognized voice in ASR (automatic speech
recognition). An important application for this
research is personalised speech-to-speech translation
that will use the voice of the speaker in the input
language to utter the translated sentences in the
output language. In mobile environments this enhances
the users' interaction across language barriers by
making the output speech sound more like the original
speaker's way of speaking, even if she or he could not
speak the output language.
|
|
[37]
|
O. Watts, J. Yamagishi, S. King, and K. Berkling.
Synthesis of child speech with HMM adaptation and voice conversion.
Audio, Speech, and Language Processing, IEEE Transactions on,
18(5):1005-1016, July 2010.
[ bib |
DOI |
.pdf ]
The synthesis of child speech presents challenges both
in the collection of data and in the building of a
synthesizer from that data. We chose to build a
statistical parametric synthesizer using the hidden
Markov model (HMM)-based system HTS, as this technique
has previously been shown to perform well for limited
amounts of data, and for data collected under imperfect
conditions. Six different configurations of the
synthesizer were compared, using both speaker-dependent
and speaker-adaptive modeling techniques, and using
varying amounts of data. For comparison with HMM
adaptation, techniques from voice conversion were used
to transform existing synthesizers to the
characteristics of the target speaker. Speaker-adaptive
voices generally outperformed child speaker-dependent
voices in the evaluation. HMM adaptation outperformed
voice conversion style techniques when using the full
target speaker corpus; with fewer adaptation data,
however, no significant listener preference for either
HMM adaptation or voice conversion methods was found.
Keywords: HMM adaptation techniques; child speech
synthesis; hidden Markov models; speaker-adaptive
modeling technique; speaker-dependent technique;
speaker-adaptive voice; statistical parametric
synthesizer; target speaker corpus; voice conversion;
speech synthesis
|
|
[38]
|
J. Yamagishi, B. Usabaev, S. King, O. Watts, J. Dines, J. Tian, R. Hu, Y. Guan,
K. Oura, K. Tokuda, R. Karhila, and M. Kurimo.
Thousands of voices for HMM-based speech synthesis - analysis and
application of TTS systems built on various ASR corpora.
IEEE Transactions on Audio, Speech and Language Processing,
18(5):984-1004, July 2010.
[ bib |
DOI ]
In conventional speech synthesis, large amounts of
phonetically balanced speech data recorded in highly
controlled recording studio environments are typically
required to build a voice. Although using such data is
a straightforward solution for high quality synthesis,
the number of voices available will always be limited,
because recording costs are high. On the other hand,
our recent experiments with HMM-based speech synthesis
systems have demonstrated that speaker-adaptive
HMM-based speech synthesis (which uses an “average
voice model” plus model adaptation) is robust to
non-ideal speech data that are recorded under various
conditions and with varying microphones, that are not
perfectly clean, and/or that lack phonetic balance.
This enables us to consider building high-quality
voices on “non-TTS” corpora such as ASR corpora.
Since ASR corpora generally include a large number of
speakers, this leads to the possibility of producing an
enormous number of voices automatically. In this paper,
we demonstrate the thousands of voices for HMM-based
speech synthesis that we have made from several popular
ASR corpora such as the Wall Street Journal (WSJ0,
WSJ1, and WSJCAM0), Resource Management, Globalphone,
and SPEECON databases. We also present the results of
associated analysis based on perceptual evaluation, and
discuss remaining issues.
Keywords: Automatic speech recognition (ASR), H Triple S (HTS),
SPEECON database, WSJ database, average voice, hidden
Markov model (HMM)-based speech synthesis, speaker
adaptation, speech synthesis, voice conversion
|
|
[39]
|
R. Barra-Chicote, J. Yamagishi, S. King, J. Manuel Montero, and
J. Macias-Guarasa.
Analysis of statistical parametric and unit-selection speech
synthesis systems applied to emotional speech.
Speech Communication, 52(5):394-404, May 2010.
[ bib |
DOI ]
We have applied two state-of-the-art speech synthesis
techniques (unit selection and HMM-based synthesis) to
the synthesis of emotional speech. A series of
carefully designed perceptual tests to evaluate speech
quality, emotion identification rates and emotional
strength were used for the six emotions which we
recorded - happiness, sadness, anger, surprise, fear,
disgust. For the HMM-based method, we evaluated
spectral and source components separately and
identified which components contribute to which
emotion. Our analysis shows that, although the HMM
method produces significantly better neutral speech,
the two methods produce emotional speech of similar
quality, except for emotions having context-dependent
prosodic patterns. Whilst synthetic speech produced
using the unit selection method has better emotional
strength scores than the HMM-based method, the
HMM-based method has the ability to manipulate the
emotional strength. For emotions that are characterized
by both spectral and prosodic components, synthetic
speech using unit selection methods was more accurately
identified by listeners. For emotions mainly
characterized by prosodic components, HMM-based
synthetic speech was more accurately identified. This
finding differs from previous results regarding
listener judgements of speaker similarity for neutral
speech. We conclude that unit selection methods require
improvements to prosodic modeling and that HMM-based
methods require improvements to spectral modeling for
emotional speech. Certain emotions cannot be reproduced
well by either method.
Keywords: Emotional speech synthesis; HMM-based synthesis; Unit
selection
|
|
[40]
|
Dong Wang, Simon King, Joe Frankel, and Peter Bell.
Stochastic pronunciation modelling and soft match for
out-of-vocabulary spoken term detection.
In Proc. ICASSP, Dallas, Texas, USA, March 2010.
[ bib |
.pdf ]
A major challenge faced by a spoken term detection
(STD) system is the detection of out-of-vocabulary
(OOV) terms. Although a subword-based STD system is
able to detect OOV terms, performance reduction is
always observed compared to in-vocabulary terms. One
challenge that OOV terms bring to STD is the
pronunciation uncertainty. One commonly used approach
to address this problem is a soft matching procedure;
another is the stochastic pronunciation modelling
(SPM) proposed by the authors. In this paper we compare
these two approaches, and combine them using a
discriminative decision strategy. Experimental results
demonstrated that SPM and soft match are highly
complementary, and their combination gives significant
performance improvement to OOV term detection.
Keywords: confidence estimation, spoken term detection, speech
recognition
|
|
[41]
|
Simon King.
Speech synthesis.
In Morgan and Ellis, editors, Speech and Audio Signal
Processing. Wiley, 2010.
[ bib ]
No abstract (this is a book chapter)
|
|
[42]
|
Steve Renals and Simon King.
Automatic speech recognition.
In William J. Hardcastle, John Laver, and Fiona E. Gibbon, editors,
Handbook of Phonetic Sciences, chapter 22. Wiley Blackwell, 2010.
[ bib ]
|
|
[43]
|
Keiichiro Oura, Keiichi Tokuda, Junichi Yamagishi, Mirjam Wester, and Simon
King.
Unsupervised cross-lingual speaker adaptation for HMM-based speech
synthesis.
In Proc. of ICASSP, volume I, pages 4954-4957, 2010.
[ bib |
.pdf ]
In the EMIME project, we are developing a mobile
device that performs personalized speech-to-speech
translation such that a user's spoken input in one
language is used to produce spoken output in another
language, while continuing to sound like the user's
voice. We integrate two techniques, unsupervised
adaptation for HMM-based TTS using a word-based
large-vocabulary continuous speech recognizer and
cross-lingual speaker adaptation for HMM-based TTS,
into a single architecture. Thus, an unsupervised
cross-lingual speaker adaptation system can be
developed. Listening tests show very promising results,
demonstrating that adapted voices sound similar to the
target speaker and that differences between supervised
and unsupervised cross-lingual speaker adaptation are
small.
|
|
[44]
|
Volker Strom and Simon King.
A classifier-based target cost for unit selection speech synthesis
trained on perceptual data.
In Proc. Interspeech, Makuhari, Japan, 2010.
[ bib |
.ps |
.pdf ]
Our goal is to automatically learn a
PERCEPTUALLY-optimal target cost function for a unit
selection speech synthesiser. The approach we take here
is to train a classifier on human perceptual judgements
of synthetic speech. The output of the classifier is
used to make a simple three-way distinction rather than
to estimate a continuously-valued cost. In order to
collect the necessary perceptual data, we synthesised
145,137 short sentences with the usual target cost
switched off, so that the search was driven by the join
cost only. We then selected the 7200 sentences with the
best joins and asked 60 listeners to judge them,
providing their ratings for each syllable. From this,
we derived a rating for each demiphone. Using as input
the same context features employed in our conventional
target cost function, we trained a classifier on these
human perceptual ratings. We synthesised two sets of
test sentences with both our standard target cost and
the new target cost based on the classifier. A/B
preference tests showed that the classifier-based
target cost, which was learned completely automatically
from modest amounts of perceptual data, is almost as
good as our carefully- and expertly-tuned standard
target cost.
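A minimal sketch of the classifier idea: train a
three-class classifier on per-demiphone perceptual
ratings and use the predicted class as a coarse
target-cost contribution during unit selection. The
feature encoding, classifier choice and data below are
illustrative assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # One row per demiphone: encoded context features plus a 3-way
    # perceptual rating (0 = good, 1 = acceptable, 2 = bad).
    rng = np.random.default_rng(7)
    X = rng.integers(0, 4, size=(2000, 12))
    y = rng.integers(0, 3, size=2000)

    clf = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X, y)

    def target_cost(candidate_context):
        # Map the predicted perceptual class to a simple three-level cost.
        return float(clf.predict(candidate_context.reshape(1, -1))[0])

    print(target_cost(rng.integers(0, 4, size=12)))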
|
|
[45]
|
Alice Turk, James Scobbie, Christian Geng, Cedric Macmartin, Ellen Bard, Barry
Campbell, Catherine Dickie, Eddie Dubourg, Bill Hardcastle, Phil Hoole, Evia
Kanaida, Robin Lickley, Satsuki Nakai, Marianne Pouplier, Simon King, Steve
Renals, Korin Richmond, Sonja Schaeffler, Ronnie Wiegand, Kevin White, and
Alan Wrench.
The Edinburgh Speech Production Facility's articulatory corpus of
spontaneous dialogue.
The Journal of the Acoustical Society of America,
128(4):2429-2429, 2010.
[ bib |
DOI ]
The EPSRC-funded Edinburgh Speech Production Facility
is built around two synchronized Carstens AG500
electromagnetic articulographs (EMAs) in order to
capture articulatory/acoustic data from spontaneous
dialogue. An initial articulatory corpus was designed
with two aims. The first was to elicit a range of
speech styles/registers from speakers, and therefore
provide an alternative to fully scripted corpora. The
second was to extend the corpus beyond monologue, by
using tasks that promote natural discourse and
interaction. A subsidiary driver was to use dialects
from outwith North America: dialogues paired up a
Scottish English and a Southern British English
speaker. Tasks. Monologue: Story reading of “Comma
Gets a Cure” [Honorof et al. (2000)], lexical sets
[Wells (1982)], spontaneous story telling,
diadochokinetic tasks. Dialogue: Map tasks [Anderson et
al. (1991)], “Spot the Difference” picture tasks
[Bradlow et al. (2007)], story-recall. Shadowing of
the spontaneous story telling by the second
participant. Each dialogue session includes
approximately 30 min of speech, and there are
acoustics-only baseline materials. We will introduce
the corpus and highlight the role of articulatory
production data in helping provide a fuller
understanding of various spontaneous speech phenomena
by presenting examples of naturally occurring covert
speech errors, accent accommodation, turn taking
negotiation, and shadowing.
|
|
[46]
|
J. Yamagishi and S. King.
Simple methods for improving speaker-similarity of HMM-based speech
synthesis.
In Proc. ICASSP 2010, Dallas, Texas, USA, 2010.
[ bib |
.pdf ]
|
|
[47]
|
Simon King.
A tutorial on HMM speech synthesis (invited paper).
In Sadhana - Academy Proceedings in Engineering Sciences,
Indian Institute of Sciences, 2010.
[ bib |
.pdf ]
Statistical parametric speech synthesis, based on
HMM-like models, has become competitive with
established concatenative techniques over the last few
years. This paper offers a non-mathematical
introduction to this method of speech synthesis. It is
intended to be complementary to the wide range of
excellent technical publications already available.
Rather than offer a comprehensive literature review,
this paper instead gives a small number of carefully
chosen references which are good starting points for
further reading.
|
|
[48]
|
Peter Bell and Simon King.
Diagonal priors for full covariance speech recognition.
In Proc. IEEE Workshop on Automatic Speech Recognition and
Understanding, Merano, Italy, December 2009.
[ bib |
DOI |
.pdf ]
We investigate the use of full covariance Gaussians
for large-vocabulary speech recognition. The large
number of parameters gives high modelling power, but
when training data is limited, the standard sample
covariance matrix is often poorly conditioned, and has
high variance. We explain how these problems may be
solved by the use of a diagonal covariance smoothing
prior, and relate this to the shrinkage estimator, for
which the optimal shrinkage parameter may itself be
estimated from the training data. We also compare the
use of generatively and discriminatively trained
priors. Results are presented on a large vocabulary
conversational telephone speech recognition task.
|
|
[49]
|
Dong Wang, Simon King, and Joe Frankel.
Stochastic pronunciation modelling for spoken term detection.
In Proc. of Interspeech, pages 2135-2138, Brighton, UK,
September 2009.
[ bib |
.pdf ]
A major challenge faced by a spoken term detection
(STD) system is the detection of out-of-vocabulary
(OOV) terms. Although a subword-based STD system is
able to detect OOV terms, performance reduction is
always observed compared to in-vocabulary terms.
Current approaches to STD do not acknowledge the
particular properties of OOV terms, such as
pronunciation uncertainty. In this paper, we use a
stochastic pronunciation model to deal with the
uncertain pronunciations of OOV terms. By considering
all possible term pronunciations, predicted by a
joint-multigram model, we observe a significant
performance improvement.
|
|
[50]
|
Oliver Watts, Junichi Yamagishi, Simon King, and Kay Berkling.
HMM adaptation and voice conversion for the synthesis of child
speech: A comparison.
In Proc. Interspeech 2009, pages 2627-2630, Brighton, U.K.,
September 2009.
[ bib |
.pdf ]
This study compares two different methodologies for
producing data-driven synthesis of child speech from
existing systems that have been trained on the speech
of adults. On one hand, an existing statistical
parametric synthesiser is transformed using model
adaptation techniques, informed by linguistic and
prosodic knowledge, to the speaker characteristics of a
child speaker. This is compared with the application of
voice conversion techniques to convert the output of an
existing waveform concatenation synthesiser with no
explicit linguistic or prosodic knowledge. In a
subjective evaluation of the similarity of synthetic
speech to natural speech from the target speaker, the
HMM-based systems evaluated are generally preferred,
although this is at least in part due to the higher
dimensional acoustic features supported by these
techniques.
|
|
[51]
|
Simon King and Vasilis Karaiskos.
The Blizzard Challenge 2009.
In Proc. Blizzard Challenge Workshop, Edinburgh, UK, September
2009.
[ bib |
.pdf ]
The Blizzard Challenge 2009 was the fifth annual
Blizzard Challenge. As in 2008, UK English and Mandarin
Chinese were the chosen languages for the 2009
Challenge. The English corpus was the same one used in
2008. The Mandarin corpus was provided by iFLYTEK. As
usual, participants with limited resources or limited
experience in these languages had the option of using
unaligned labels that were provided for both corpora
and for the test sentences. An accent-specific
pronunciation dictionary was also available for the
English speaker. This year, the tasks were organised in
the form of `hubs' and `spokes' where each hub task
involved building a general-purpose voice and each
spoke task involved building a voice for a specific
application. A set of test sentences was released to
participants, who were given a limited time in which to
synthesise them and submit the synthetic speech. An
online listening test was conducted to evaluate
naturalness, intelligibility, degree of similarity to
the original speaker and, for one of the spoke tasks,
"appropriateness."
Keywords: Blizzard
|
|
[52]
|
Dong Wang, Simon King, Joe Frankel, and Peter Bell.
Term-dependent confidence for out-of-vocabulary term detection.
In Proc. Interspeech, pages 2139-2142, Brighton, UK, September
2009.
[ bib |
.pdf ]
Within a spoken term detection (STD) system, the
decision maker plays an important role in retrieving
reliable detections. Most of the state-of-the-art STD
systems make decisions based on a confidence measure
that is term-independent, which poses a serious problem
for out-of-vocabulary (OOV) term detection. In this
paper, we study a term-dependent confidence measure
based on confidence normalisation and discriminative
modelling, particularly focusing on its remarkable
effectiveness for detecting OOV terms. Experimental
results indicate that the term-dependent confidence
provides a much more significant improvement for OOV
terms than for in-vocabulary terms.
|
|
[53]
|
Junichi Yamagishi, Mike Lincoln, Simon King, John Dines, Matthew Gibson, Jilei
Tian, and Yong Guan.
Analysis of unsupervised and noise-robust speaker-adaptive
HMM-based speech synthesis systems toward a unified ASR and TTS
framework.
In Proc. Interspeech 2009, Edinburgh, U.K., September 2009.
[ bib ]
For the 2009 Blizzard Challenge we have built an
unsupervised version of the HTS-2008 speaker-adaptive
HMM-based speech synthesis system for English, and a
noise robust version of the systems for Mandarin. They
are designed from a multidisciplinary application point
of view in that we attempt to integrate the components
of the TTS system with other technologies such as ASR.
All the average voice models are trained exclusively
from recognized, publicly available, ASR databases.
Multi-pass LVCSR and confidence scores calculated from
confusion networks are used for the unsupervised
systems, and noisy data recorded in cars or public
spaces is used for the noise robust system. We believe
the developed systems form solid benchmarks and provide
good connections to ASR fields. This paper describes
the development of the systems and reports the results
and analysis of their evaluation.
|
|
[54]
|
J. Dines, J. Yamagishi, and S. King.
Measuring the gap between HMM-based ASR and TTS.
In Proc. Interspeech, pages 1391-1394, Brighton, U.K.,
September 2009.
[ bib ]
The EMIME European project is conducting research in
the development of technologies for mobile,
personalised speech-to-speech translation systems. The
hidden Markov model is being used as the underlying
technology in both automatic speech recognition (ASR)
and text-to-speech synthesis (TTS) components, thus,
the investigation of unified statistical modelling
approaches has become an implicit goal of our research.
As one of the first steps towards this goal, we have
been investigating commonalities and differences
between HMM-based ASR and TTS. In this paper we present
results and analysis of a series of experiments that
have been conducted on English ASR and TTS systems,
measuring their performance with respect to phone set
and lexicon, acoustic feature type and dimensionality
and HMM topology. Our results show that, although the
fundamental statistical model may be essentially the
same, optimal ASR and TTS performance often demands
diametrically opposed system designs. This represents a
major challenge to be addressed in the investigation of
such unified modelling approaches.
|
|
[55]
|
Javier Tejedor, Dong Wang, Simon King, Joe Frankel, and Jose Colas.
A posterior probability-based system hybridisation and combination
for spoken term detection.
In Proc. Interspeech, pages 2131-2134, Brighton, UK, September
2009.
[ bib |
.pdf ]
Spoken term detection (STD) is a fundamental task for
multimedia information retrieval. To improve the
detection performance, we have presented a direct
posterior-based confidence measure generated from a
neural network. In this paper, we propose a
detection-independent confidence estimation based on
the direct posterior confidence measure, in which the
decision making is totally separated from the term
detection. Based on this idea, we first present a
hybrid system which conducts the term detection and
confidence estimation based on different sub-word
units, and then propose a combination method which
merges detections from heterogeneous term detectors
based on the direct posterior-based confidence.
Experimental results demonstrated that the proposed
methods improved system performance considerably for
both English and Spanish.
|
|
[56]
|
J. Yamagishi, Bela Usabaev, Simon King, Oliver Watts, John Dines, Jilei Tian,
Rile Hu, Yong Guan, Keiichiro Oura, Keiichi Tokuda, Reima Karhila, and Mikko
Kurimo.
Thousands of voices for HMM-based speech synthesis.
In Proc. Interspeech, pages 420-423, Brighton, U.K., September
2009.
[ bib |
http ]
Our recent experiments with HMM-based speech synthesis
systems have demonstrated that speaker-adaptive
HMM-based speech synthesis (which uses an ‘average
voice model’ plus model adaptation) is robust to
non-ideal speech data that are recorded under various
conditions and with varying microphones, that are not
perfectly clean, and/or that lack phonetic balance.
This enables us to consider building high-quality voices
on ‘non-TTS’ corpora such as ASR corpora. Since ASR
corpora generally include a large number of speakers,
this leads to the possibility of producing an enormous
number of voices automatically. In this paper we show
thousands of voices for HMM-based speech synthesis that
we have made from several popular ASR corpora such as
the Wall Street Journal databases (WSJ0/WSJ1/WSJCAM0),
Resource Management, Globalphone and Speecon. We report
some perceptual evaluation results and outline the
outstanding issues.
|
|
[57]
|
Dong Wang, Javier Tejedor, Joe Frankel, and Simon King.
Posterior-based confidence measures for spoken term detection.
In Proc. ICASSP, Taiwan, April 2009.
[ bib |
.pdf ]
Confidence measures play a key role in spoken term
detection (STD) tasks. The confidence measure expresses
the posterior probability of the search term appearing
in the detection period, given the speech. Traditional
approaches are based on the acoustic and language model
scores for candidate detections found using automatic
speech recognition, with Bayes' rule being used to
compute the desired posterior probability. In this
paper, we present a novel direct posterior-based
confidence measure which, instead of resorting to the
Bayesian formula, calculates posterior probabilities
from a multi-layer perceptron (MLP) directly. Compared
with traditional Bayesian-based methods, the
direct-posterior approach is conceptually and
mathematically simpler. Moreover, the MLP-based model
does not require assumptions to be made about the
acoustic features such as their statistical
distribution and the independence of static and dynamic
co-efficients. Our experimental results in both English
and Spanish demonstrate that the proposed direct
posterior-based confidence improves STD performance.
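As a rough illustration of the flavour of a direct posterior-based
confidence (not the exact formulation used in the paper), one can
average the per-frame log-posteriors that an MLP assigns to the
hypothesised phones over a candidate detection window:

    # Toy illustration: score a candidate detection by the mean log-posterior
    # of the aligned phone in each frame of the detection window.
    import numpy as np

    def direct_posterior_confidence(posteriors, frame_phones, start, end):
        """posteriors: (n_frames, n_phones) MLP outputs; frame_phones: aligned
        phone index per frame; [start, end): the detection's frame span."""
        frames = np.arange(start, end)
        p = posteriors[frames, frame_phones[start:end]]
        return float(np.mean(np.log(np.clip(p, 1e-10, 1.0))))

    # Example with random posteriors over a 5-phone inventory.
    rng = np.random.default_rng(1)
    post = rng.dirichlet(np.ones(5), size=100)      # rows sum to one
    alignment = rng.integers(0, 5, size=100)        # hypothesised phone per frame
    print(direct_posterior_confidence(post, alignment, 40, 55))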
|
|
[58]
|
Matthew P. Aylett, Simon King, and Junichi Yamagishi.
Speech synthesis without a phone inventory.
In Interspeech, pages 2087-2090, 2009.
[ bib |
.pdf ]
In speech synthesis the unit inventory is decided
using phonological and phonetic expertise. This process
is resource intensive and potentially sub-optimal. In
this paper we investigate how acoustic clustering,
together with lexicon constraints, can be used to build
a self-organised inventory. Six English speech
synthesis systems were built using two frameworks, unit
selection and parametric HTS for three inventory
conditions: 1) a traditional phone set, 2) a system
using orthographic units, and 3) a self-organised
inventory. A listening test showed a strong preference
for the classic system, and for the orthographic system
over the self-organised system. Results also varied by
letter-to-sound complexity and database coverage. This
suggests the self-organised approach failed to
generalise pronunciation as well as introducing noise
above and beyond that caused by orthographic sound
mismatch.
|
|
[59]
|
Junichi Yamagishi, Takashi Nose, Heiga Zen, Zhenhua Ling, Tomoki Toda, Keiichi
Tokuda, Simon King, and Steve Renals.
Robust speaker-adaptive HMM-based text-to-speech synthesis.
IEEE Transactions on Audio, Speech and Language Processing,
17(6):1208-1230, 2009.
[ bib |
http |
www: ]
This paper describes a speaker-adaptive HMM-based
speech synthesis system. The new system, called
“HTS-2007,” employs speaker adaptation (CSMAPLR+MAP),
feature-space adaptive training, mixed-gender modeling,
and full-covariance modeling using CSMAPLR transforms,
in addition to several other techniques that have
proved effective in our previous systems. Subjective
evaluation results show that the new system generates
significantly better quality synthetic speech than
speaker-dependent approaches with realistic amounts of
speech data, and that it bears comparison with
speaker-dependent approaches even when large amounts of
speech data are available. In addition, a comparison
study with several speech synthesis techniques shows
the new system is very robust: It is able to build
voices from less-than-ideal speech data and synthesize
good-quality speech even for out-of-domain sentences.
|
|
[60]
|
R. Barra-Chicote, J. Yamagishi, J.M. Montero, S. King, S. Lutfi, and
J. Macias-Guarasa.
Generacion de una voz sintetica en Castellano basada en HSMM para
la Evaluacion Albayzin 2008: conversion texto a voz.
In V Jornadas en Tecnologia del Habla, pages 115-118, November
2008.
(in Spanish; English title: Generation of an HSMM-based synthetic voice in Castilian Spanish for the Albayzin 2008 Evaluation: text-to-speech conversion).
[ bib |
.pdf ]
|
|
[61]
|
Javier Tejedor, Dong Wang, Joe Frankel, Simon King, and José Colás.
A comparison of grapheme and phoneme-based units for Spanish spoken
term detection.
Speech Communication, 50(11-12):980-991, November-December
2008.
[ bib |
DOI ]
The ever-increasing volume of audio data available
online through the world wide web means that automatic
methods for indexing and search are becoming essential.
Hidden Markov model (HMM) keyword spotting and lattice
search techniques are the two most common approaches
used by such systems. In keyword spotting, models or
templates are defined for each search term prior to
accessing the speech and used to find matches. Lattice
search (referred to as spoken term detection), uses a
pre-indexing of speech data in terms of word or
sub-word units, which can then quickly be searched for
arbitrary terms without referring to the original
audio. In both cases, the search term can be modelled
in terms of sub-word units, typically phonemes. For
in-vocabulary words (i.e. words that appear in the
pronunciation dictionary), the letter-to-sound
conversion systems are accepted to work well. However,
for out-of-vocabulary (OOV) search terms,
letter-to-sound conversion must be used to generate a
pronunciation for the search term. This is usually a
hard decision (i.e. not probabilistic and with no
possibility of backtracking), and errors introduced at
this step are difficult to recover from. We therefore
propose the direct use of graphemes (i.e., letter-based
sub-word units) for acoustic modelling. This is
expected to work particularly well in languages such as
Spanish, where despite the letter-to-sound mapping
being very regular, the correspondence is not
one-to-one, and there will be benefits from avoiding
hard decisions at early stages of processing. In this
article, we compare three approaches for Spanish
keyword spotting or spoken term detection, and within
each of these we compare acoustic modelling based on
phone and grapheme units. Experiments were performed
using the Spanish geographical-domain Albayzin corpus.
Results achieved in the two approaches proposed for
spoken term detection show us that trigrapheme units
for acoustic modelling match or exceed the performance
of phone-based acoustic models. In the method proposed
for keyword spotting, the results achieved with each
acoustic model are very similar.
|
|
[62]
|
Oliver Watts, Junichi Yamagishi, Kay Berkling, and Simon King.
HMM-based synthesis of child speech.
In Proc. of The 1st Workshop on Child, Computer and Interaction
(ICMI'08 post-conference workshop), Crete, Greece, October 2008.
[ bib |
.pdf ]
The synthesis of child speech presents challenges both
in the collection of data and in the building of a
synthesiser from that data. Because only limited data
can be collected, and the domain of that data is
constrained, it is difficult to obtain the type of
phonetically-balanced corpus usually used in speech
synthesis. As a consequence, building a synthesiser
from this data is difficult. Concatenative synthesisers
are not robust to corpora with many missing units (as
is likely when the corpus content is not carefully
designed), so we chose to build a statistical
parametric synthesiser using the HMM-based system HTS.
This technique has previously been shown to perform
well for limited amounts of data, and for data
collected under imperfect conditions. We compared 6
different configurations of the synthesiser, using both
speaker-dependent and speaker-adaptive modelling
techniques, and using varying amounts of data. The
output from these systems was evaluated alongside
natural and vocoded speech, in a Blizzard-style
listening test.
|
|
[63]
|
Peter Bell and Simon King.
A shrinkage estimator for speech recognition with full covariance
HMMs.
In Proc. Interspeech, Brisbane, Australia, September 2008.
Shortlisted for best student paper award.
[ bib |
.pdf ]
We consider the problem of parameter estimation in
full-covariance Gaussian mixture systems for automatic
speech recognition. Due to the high dimensionality of
the acoustic feature vector, the standard sample
covariance matrix has a high variance and is often
poorly-conditioned when the amount of training data is
limited. We explain how the use of a shrinkage
estimator can solve these problems, and derive a
formula for the optimal shrinkage intensity. We present
results of experiments on a phone recognition task,
showing that the estimator gives a performance
improvement over a standard full-covariance system.
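A minimal numerical sketch of the general idea, shrinking a sample
covariance towards its diagonal with a data-driven intensity, is given
below; the intensity formula here is a generic Ledoit-Wolf-style
heuristic assumed for illustration, not the formula derived in the paper.

    # Sketch of diagonal-target covariance shrinkage (illustrative only).
    import numpy as np

    def shrinkage_covariance(X, alpha=None):
        """Return (1 - alpha) * S + alpha * diag(S) for data X (n_frames x dim).

        If alpha is None, a simple data-driven intensity is used: the estimated
        variance of the off-diagonal sample-covariance entries divided by their
        energy (a Ledoit-Wolf-style heuristic, assumed here for illustration).
        """
        n, d = X.shape
        Xc = X - X.mean(axis=0)
        S = Xc.T @ Xc / n                       # sample covariance
        target = np.diag(np.diag(S))            # diagonal shrinkage target
        if alpha is None:
            var_S = ((Xc[:, :, None] * Xc[:, None, :] - S) ** 2).sum(axis=0) / n**2
            off = ~np.eye(d, dtype=bool)
            alpha = float(np.clip(var_S[off].sum() / (S[off] ** 2).sum(), 0.0, 1.0))
        return (1.0 - alpha) * S + alpha * target, alpha

    # Example: 50 frames of 13-dimensional features.
    X = np.random.default_rng(0).standard_normal((50, 13))
    Sigma, alpha = shrinkage_covariance(X)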
|
|
[64]
|
Junichi Yamagishi, Zhenhua Ling, and Simon King.
Robustness of HMM-based speech synthesis.
In Proc. Interspeech 2008, pages 581-584, Brisbane, Australia,
September 2008.
[ bib |
.pdf ]
As speech synthesis techniques become more advanced,
we are able to consider building high-quality voices
from data collected outside the usual highly-controlled
recording studio environment. This presents new
challenges that are not present in conventional
text-to-speech synthesis: the available speech data are
not perfectly clean, the recording conditions are not
consistent, and/or the phonetic balance of the material
is not ideal. Although a clear picture of the
performance of various speech synthesis techniques
(e.g., concatenative, HMM-based or hybrid) under good
conditions is provided by the Blizzard Challenge, it is
not well understood how robust these algorithms are to
less favourable conditions. In this paper, we analyse
the performance of several speech synthesis methods
under such conditions. This is, as far as we know, a
new research topic: “Robust speech synthesis.” As a
consequence of our investigations, we propose a new
robust training method for HMM-based speech synthesis,
for use with speech data collected in unfavourable
conditions.
|
|
[65]
|
Dong Wang, Ivan Himawan, Joe Frankel, and Simon King.
A posterior approach for microphone array based speech recognition.
In Proc. Interspeech, pages 996-999, September 2008.
[ bib |
.pdf ]
Automatic speech recognition (ASR) becomes rather
difficult in the meetings domain because of the adverse
acoustic conditions, including more background noise,
more echo and reverberation, and frequent cross-talk.
Microphone arrays, using various beamforming
algorithms, have been demonstrated to boost ASR
performance dramatically in such noisy and reverberant
environments. However, almost all existing beamforming
measures work in the acoustic domain, resorting to
signal processing theory and geometric explanation.
This limits their application, and induces significant
performance degradation when the geometric properties
are unavailable or hard to estimate, or if
heterogeneous channels exist in the audio system. In
this paper, we present a new posterior-based approach
for array-based speech recognition. The main idea is
that, instead of enhancing the speech signals, we try
to enhance the posterior probabilities that frames
belong to recognition units, e.g., phones. These
enhanced posteriors are then converted to posterior
probability-based features and modelled by HMMs,
leading to a tandem ANN-HMM hybrid system as presented
by Hermansky et al. Experimental results demonstrate
the validity of this posterior approach. With posterior
accumulation or enhancement, significant improvement
was achieved over the single-channel baseline.
Moreover, we can combine acoustic enhancement and
posterior enhancement, leading to a hybrid
acoustic-posterior beamforming approach which works
significantly better than acoustic beamforming alone,
especially in scenarios with moving speakers.
|
|
[66]
|
Joe Frankel, Dong Wang, and Simon King.
Growing bottleneck features for tandem ASR.
In Proc. Interspeech, page 1549, September 2008.
[ bib |
.pdf ]
We present a method for training bottleneck MLPs for
use in tandem ASR. Experiments on meetings data show
that this approach leads to improved performance
compared with training MLPs from a random
initialization.
|
|
[67]
|
Simon King, Keiichi Tokuda, Heiga Zen, and Junichi Yamagishi.
Unsupervised adaptation for HMM-based speech synthesis.
In Proc. Interspeech, pages 1869-1872, Brisbane, Australia,
September 2008.
[ bib |
.PDF ]
It is now possible to synthesise speech using HMMs
with a comparable quality to unit-selection techniques.
Generating speech from a model has many potential
advantages over concatenating waveforms. The most
exciting is model adaptation. It has been shown that
supervised speaker adaptation can yield high-quality
synthetic voices with an order of magnitude less data
than required to train a speaker-dependent model or to
build a basic unit-selection system. Such supervised
methods require labelled adaptation data for the target
speaker. In this paper, we introduce a method capable
of unsupervised adaptation, using only speech from the
target speaker without any labelling.
|
|
[68]
|
Laszlo Toth, Joe Frankel, Gabor Gosztolya, and Simon King.
Cross-lingual portability of MLP-based tandem features - a case
study for English and Hungarian.
In Proc. Interspeech, pages 2695-2698, Brisbane, Australia,
September 2008.
[ bib |
.PDF ]
One promising approach for building ASR systems for
less-resourced languages is cross-lingual adaptation.
Tandem ASR is particularly well suited to such
adaptation, as it includes two cascaded modelling
steps: feature extraction using multi-layer perceptrons
(MLPs), followed by modelling using a standard HMM. The
language-specific tuning can be performed by adjusting
the HMM only, leaving the MLP untouched. Here we
examine the portability of feature extractor MLPs
between an Indo-European (English) and a Finno-Ugric
(Hungarian) language. We present experiments which use
both conventional phone-posterior and articulatory
feature (AF) detector MLPs, both trained on a much
larger quantity of (English) data than the monolingual
(Hungarian) system. We find that the cross-lingual
configurations achieve similar performance to the
monolingual system, and that, interestingly, the AF
detectors lead to slightly worse performance, despite
the expectation that they should be more
language-independent than phone-based MLPs. However,
the cross-lingual system outperforms all other
configurations when the English phone MLP is adapted on
the Hungarian data.
Keywords: tandem, ASR
|
|
[69]
|
Vasilis Karaiskos, Simon King, Robert A. J. Clark, and Catherine Mayo.
The Blizzard Challenge 2008.
In Proc. Blizzard Challenge Workshop, Brisbane, Australia,
September 2008.
[ bib |
.pdf ]
The Blizzard Challenge 2008 was the fourth annual
Blizzard Challenge. This year, participants were asked
to build two voices from a UK English corpus and one
voice from a Mandarin Chinese corpus. This is the
first time that a language other than English has been
included and also the first time that a large UK
English corpus has been available. In addition, the
English corpus contained somewhat more expressive
speech than that found in corpora used in previous
Blizzard Challenges. To assist participants with
limited resources or limited experience in
UK-accented English or Mandarin, unaligned labels
were provided for both corpora and for the test
sentences. Participants could use the provided labels
or create their own. An accent-specific pronunciation
dictionary was also available for the English speaker.
A set of test sentences was released to participants,
who were given a limited time in which to synthesise
them and submit the synthetic speech. An online
listening test was conducted to evaluate
naturalness, intelligibility and degree of similarity
to the original speaker.
Keywords: Blizzard
|
|
[70]
|
Peter Bell and Simon King.
Covariance updates for discriminative training by constrained line
search.
In Proc. Interspeech, Brisbane, Australia, September 2008.
[ bib |
.pdf ]
We investigate the recent Constrained Line Search
algorithm for discriminative training of HMMs and
propose an alternative formula for variance update. We
compare the method to standard techniques on a phone
recognition task.
|
|
[71]
|
Olga Goubanova and Simon King.
Bayesian networks for phone duration prediction.
Speech Communication, 50(4):301-311, April 2008.
[ bib |
DOI ]
In a text-to-speech system, the duration of each phone
may be predicted by a duration model. This model is
usually trained using a database of phones with known
durations; each phone (and the context it appears in)
is characterised by a feature vector that is composed
of a set of linguistic factor values. We describe the
use of a graphical model - a Bayesian network - for
predicting the duration of a phone, given the values
for these factors. The network has one discrete
variable for each of the linguistic factors and a
single continuous variable for the phone's duration.
Dependencies between variables (or the lack of them)
are represented in the BN structure by arcs (or missing
arcs) between pairs of nodes. During training, both the
topology of the network and its parameters are learned
from labelled data. We compare the results of the BN
model with results for sums of products and CART models
on the same data. In terms of the root mean square
error, the BN model performs much better than both CART
and SoP models. In terms of correlation coefficient,
the BN model performs better than the SoP model, and as
well as the CART model. A BN model has certain
advantages over CART and SoP models. Training SoP
models requires a high degree of expertise. CART models
do not deal with interactions between factors in any
explicit way. As we demonstrate, a BN model can also
make accurate predictions of a phone's duration, even
when the values for some of the linguistic factors are
unknown.
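The following toy sketch (with invented factor names and durations)
conveys the general flavour of factor-conditioned duration prediction
with a fallback for unseen contexts; it is a greatly simplified
stand-in, not the Bayesian network with learned structure described in
the paper.

    # Simplified stand-in: group durations by linguistic factor values and
    # predict the conditional mean, backing off to the global mean.
    from collections import defaultdict
    from statistics import mean

    def train(examples):
        """examples: list of (factors_dict, duration_in_seconds)."""
        table = defaultdict(list)
        for factors, dur in examples:
            key = (factors.get("phone"), factors.get("stress"), factors.get("phrase_pos"))
            table[key].append(dur)
        global_mean = mean(d for _, d in examples)
        return {k: mean(v) for k, v in table.items()}, global_mean

    def predict(model, factors):
        table, global_mean = model
        key = (factors.get("phone"), factors.get("stress"), factors.get("phrase_pos"))
        return table.get(key, global_mean)

    data = [({"phone": "aa", "stress": "1", "phrase_pos": "final"}, 0.145),
            ({"phone": "aa", "stress": "0", "phrase_pos": "medial"}, 0.090),
            ({"phone": "t",  "stress": "0", "phrase_pos": "initial"}, 0.055)]
    model = train(data)
    print(predict(model, {"phone": "aa", "stress": "1", "phrase_pos": "final"}))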
|
|
[72]
|
Dong Wang, Joe Frankel, Javier Tejedor, and Simon King.
A comparison of phone and grapheme-based spoken term detection.
In Proc. ICASSP, pages 4969-4972, March-April 2008.
[ bib |
DOI ]
We propose grapheme-based sub-word units for spoken
term detection (STD). Compared to phones, graphemes
have a number of potential advantages. For
out-of-vocabulary search terms, phone-based approaches
must generate a pronunciation using letter-to-sound
rules. Using graphemes obviates this potentially
error-prone hard decision, shifting pronunciation
modelling into the statistical models describing the
observation space. In addition, long-span grapheme
language models can be trained directly from large text
corpora. We present experiments on Spanish and English
data, comparing phone and grapheme-based STD. For
Spanish, where phone and grapheme-based systems give
similar transcription word error rates (WERs),
grapheme-based STD significantly outperforms a
phone-based approach. The converse is found for English,
where the phone-based system outperforms a grapheme
approach. However, we present additional analysis which
suggests that phone-based STD performance levels may be
achieved by a grapheme-based approach despite lower
transcription accuracy, and that the two approaches may
usefully be combined. We propose a number of directions
for future development of these ideas, and suggest that
if grapheme-based STD can match phone-based
performance, the inherent flexibility in dealing with
out-of-vocabulary terms makes this a desirable
approach.
|
|
[73]
|
Matthew P. Aylett and Simon King.
Single speaker segmentation and inventory selection using dynamic
time warping self organization and joint multigram mapping.
In SSW06, pages 258-263, 2008.
[ bib |
.pdf ]
In speech synthesis the inventory of units is decided
by inspection and on the basis of phonological and
phonetic expertise. The ephone (or emergent phone)
project at CSTR is investigating how self organisation
techniques can be applied to build an inventory based
on collected acoustic data together with the
constraints of a synthesis lexicon. In this paper we
will describe a prototype inventory creation method
using dynamic time warping (DTW) for acoustic
clustering and a joint multigram approach for relating
a series of symbols that represent the speech to these
emerged units. We initially examined two symbol sets:
1) a baseline of standard phones, and 2) orthographic
symbols. The success of the approach is evaluated by
comparing word boundaries generated by the emergent
phones against those created using state-of-the-art HMM
segmentation. Initial results suggest the DTW
segmentation can match word boundaries with a root mean
square error (RMSE) of 35ms. Results from mapping units
onto phones resulted in a higher RMSE of 103ms. This
error was increased when multiple multigram types were
added and when the default unit clustering was altered
from 40 (our baseline) to 10. Results for orthographic
matching had a higher RMSE of 125ms. To conclude we
discuss future work that we believe can reduce this
error rate to a level sufficient for the techniques to
be applied to a unit selection synthesis system.
|
|
[74]
|
Volker Strom and Simon King.
Investigating Festival's target cost function using perceptual
experiments.
In Proc. Interspeech, Brisbane, 2008.
[ bib |
.ps |
.pdf ]
We describe an investigation of the target cost used
in the Festival unit selection speech synthesis system.
Our ultimate goal is to automatically learn a
perceptually optimal target cost function. In this
study, we investigated the behaviour of the target cost
for one segment type. The target cost is based on
counting the mismatches in several context features. A
carrier sentence (“My name is Roger”) was synthesised
using all 147,820 possible combinations of the diphones
/n_ei/ and /ei_m/. 92 representative versions were
selected and presented to listeners as 460 pairwise
comparisons. The listeners' preference votes were used
to analyse the behaviour of the target cost, with
respect to the values of its component linguistic
context features.
|
|
[75]
|
J. Frankel and S. King.
Factoring Gaussian precision matrices for linear dynamic models.
Pattern Recognition Letters, 28(16):2264-2272, December 2007.
[ bib |
DOI |
.pdf ]
The linear dynamic model (LDM), also known as the
Kalman filter model, has been the subject of research
in the engineering, control, and more recently, machine
learning and speech technology communities. The
Gaussian noise processes are usually assumed to have
diagonal, or occasionally full, covariance matrices. A
number of recent papers have considered modelling the
precision rather than covariance matrix of a Gaussian
distribution, and this work applies such ideas to the
LDM. A Gaussian precision matrix P can be factored into
the form P = UᵀSU, where U is a transform and S a
diagonal matrix. By varying the form of U, the
covariance can be specified as being diagonal or full,
or used to model a given set of spatial dependencies.
Furthermore, the transform and scaling components can
be shared between models, allowing richer distributions
with only marginally more parameters than required to
specify diagonal covariances. The method described in
this paper allows the construction of models with an
appropriate number of parameters for the amount of
available training data. We provide illustrative
experimental results on synthetic and real speech data
in which models with factored precision matrices and
automatically-selected numbers of parameters are as
good as or better than models with diagonal covariances
on small data sets and as good as models with full
covariance matrices on larger data sets.
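A small numerical illustration of the factorisation itself (here
obtained via an eigendecomposition; the paper's estimation and
parameter-tying schemes are not reproduced) is:

    # Factor a precision matrix P as P = Uᵀ S U with U a transform and S diagonal.
    import numpy as np

    def factor_precision(P):
        """Factor a symmetric positive-definite precision matrix as P = U.T @ S @ U."""
        eigvals, eigvecs = np.linalg.eigh(P)   # P = V diag(w) V.T
        U = eigvecs.T                          # transform
        S = np.diag(eigvals)                   # diagonal scaling
        return U, S

    # Example: recover P from its factors.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    P = A @ A.T + 4 * np.eye(4)                # a positive-definite precision matrix
    U, S = factor_precision(P)
    assert np.allclose(U.T @ S @ U, P)

In practice the appeal of this form is that U can be shared between
models, so each model only needs its own diagonal S.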
|
|
[76]
|
Ö. Çetin, M. Magimai-Doss, A. Kantor, S. King, C. Bartels, J. Frankel, and
K. Livescu.
Monolingual and crosslingual comparison of tandem features derived
from articulatory and phone MLPs.
In Proc. ASRU, Kyoto, December 2007. IEEE.
[ bib |
.pdf ]
In recent years, the features derived from posteriors
of a multilayer perceptron (MLP), known as tandem
features, have proven to be very effective for
automatic speech recognition. Most tandem features to
date have relied on MLPs trained for phone
classification. We recently showed on a relatively
small data set that MLPs trained for articulatory
feature classification can be equally effective. In
this paper, we provide a similar comparison using MLPs
trained on a much larger data set - 2000 hours of
English conversational telephone speech. We also
explore how portable phone- and articulatory
feature-based tandem features are in an entirely different
language - Mandarin - without any retraining. We find
that while phone-based features perform slightly better
in the matched-language condition, they perform
significantly better in the cross-language condition.
Yet, in the cross-language condition, neither approach
is as effective as the tandem features extracted from
an MLP trained on a relatively small amount of
in-domain data. Beyond feature concatenation, we also
explore novel observation modelling schemes that allow
for greater flexibility in combining the tandem and
standard features at hidden Markov model (HMM) outputs.
|
|
[77]
|
J. Frankel, M. Wester, and S. King.
Articulatory feature recognition using dynamic Bayesian networks.
Computer Speech & Language, 21(4):620-640, October 2007.
[ bib |
.pdf ]
We describe a dynamic Bayesian network for
articulatory feature recognition. The model is intended
to be a component of a speech recognizer that avoids
the problems of conventional “beads-on-a-string”
phoneme-based models. We demonstrate that the model
gives superior recognition of articulatory features
from the speech signal compared with a state-of-the-art
neural network system. We also introduce a training
algorithm that offers two major advances: it does not
require time-aligned feature labels and it allows the
model to learn a set of asynchronous feature changes in
a data-driven manner.
|
|
[78]
|
J. Frankel, M. Magimai-Doss, S. King, K. Livescu, and Ö. Çetin.
Articulatory feature classifiers trained on 2000 hours of telephone
speech.
In Proc. Interspeech, Antwerp, Belgium, August 2007.
[ bib |
.pdf ]
This paper is intended to advertise the public
availability of the articulatory feature (AF)
classification multi-layer perceptrons (MLPs) which
were used in the Johns Hopkins 2006 summer workshop. We
describe the design choices, data preparation, AF label
generation, and the training of MLPs for feature
classification on close to 2000 hours of telephone
speech. In addition, we present some analysis of the
MLPs in terms of classification accuracy and confusions
along with a brief summary of the results obtained
during the workshop using the MLPs. We invite
interested parties to make use of these MLPs.
|
|
[79]
|
Junichi Yamagishi, Takao Kobayashi, Steve Renals, Simon King, Heiga Zen, Tomoki
Toda, and Keiichi Tokuda.
Improved average-voice-based speech synthesis using gender-mixed
modeling and a parameter generation algorithm considering GV.
In Proc. 6th ISCA Workshop on Speech Synthesis (SSW-6), August
2007.
[ bib |
.pdf ]
For constructing a speech synthesis system which can
achieve diverse voices, we have been developing a
speaker independent approach of HMM-based speech
synthesis in which statistical average voice models are
adapted to a target speaker using a small amount of
speech data. In this paper, we incorporate a
high-quality speech vocoding method STRAIGHT and a
parameter generation algorithm with global variance
into the system for improving quality of synthetic
speech. Furthermore, we introduce a feature-space
speaker adaptive training algorithm and a gender mixed
modeling technique for conducting further normalization
of the average voice model. We build an English
text-to-speech system using these techniques and show
the performance of the system.
|
|
[80]
|
Robert A. J. Clark, Monika Podsiadlo, Mark Fraser, Catherine Mayo, and Simon
King.
Statistical analysis of the Blizzard Challenge 2007 listening
test results.
In Proc. Blizzard 2007 (in Proc. Sixth ISCA Workshop on Speech
Synthesis), Bonn, Germany, August 2007.
[ bib |
.pdf ]
Blizzard 2007 is the third Blizzard Challenge, in
which participants build voices from a common dataset.
A large listening test is conducted which allows
comparison of systems in terms of naturalness and
intelligibility. New sections were added to the
listening test for 2007 to test the perceived
similarity of the speaker's identity between natural
and synthetic speech. In this paper, we present the
results of the listening test and the subsequent
statistical analysis.
Keywords: Blizzard
|
|
[81]
|
Mark Fraser and Simon King.
The Blizzard Challenge 2007.
In Proc. Blizzard 2007 (in Proc. Sixth ISCA Workshop on Speech
Synthesis), Bonn, Germany, August 2007.
[ bib |
.pdf ]
In Blizzard 2007, the third Blizzard Challenge,
participants were asked to build voices from a dataset,
a defined subset and, following certain constraints, a
subset of their choice. A set of test sentences was
then released to be synthesised. An online evaluation
of the submitted synthesised sentences focused on
naturalness and intelligibility, and added new sections
for degree of similarity to the original speaker,
and similarity in terms of naturalness of pairs of
sentences from different systems. We summarise this
year's Blizzard Challenge and look ahead to possible
designs for Blizzard 2008 in the light of participant
and listener feedback.
Keywords: Blizzard
|
|
[82]
|
Volker Strom, Ani Nenkova, Robert Clark, Yolanda Vazquez-Alvarez, Jason
Brenier, Simon King, and Dan Jurafsky.
Modelling prominence and emphasis improves unit-selection synthesis.
In Proc. Interspeech 2007, Antwerp, Belgium, August 2007.
[ bib |
.pdf ]
We describe the results of large scale perception
experiments showing improvements in synthesising two
distinct kinds of prominence: standard pitch-accent and
strong emphatic accents. Previously prominence
assignment has been mainly evaluated by computing
accuracy on a prominence-labelled test set. By contrast
we integrated an automatic pitch-accent classifier into
the unit selection target cost and showed that
listeners preferred these synthesised sentences. We
also describe an improved recording script for
collecting emphatic accents, and show that generating
emphatic accents leads to further improvements in the
fiction genre over incorporating pitch accent only.
Finally, we show differences in the effects of
prominence between child-directed speech and news and
fiction genres. Index Terms: speech synthesis, prosody,
prominence, pitch accent, unit selection
|
|
[83]
|
Peter Bell and Simon King.
Sparse Gaussian graphical models for speech recognition.
In Proc. Interspeech 2007, Antwerp, Belgium, August 2007.
[ bib |
.pdf ]
We address the problem of learning the structure of
Gaussian graphical models for use in automatic speech
recognition, a means of controlling the form of the
inverse covariance matrices of such systems. With
particular focus on data sparsity issues, we implement
a method for imposing graphical model structure on a
Gaussian mixture system, using a convex optimisation
technique to maximise a penalised likelihood
expression. The results of initial experiments on a
phone recognition task show a performance improvement
over an equivalent full-covariance system.
|
|
[84]
|
Ö. Çetin, A. Kantor, S. King, C. Bartels, M. Magimai-Doss, J. Frankel, and
K. Livescu.
An articulatory feature-based tandem approach and factored
observation modeling.
In Proc. ICASSP, Honolulu, April 2007.
[ bib |
.pdf ]
The so-called tandem approach, where the posteriors of
a multilayer perceptron (MLP) classifier are used as
features in an automatic speech recognition (ASR)
system has proven to be a very effective method. Most
tandem approaches up to date have relied on MLPs
trained for phone classification, and appended the
posterior features to some standard feature hidden
Markov model (HMM). In this paper, we develop an
alternative tandem approach based on MLPs trained for
articulatory feature (AF) classification. We also
develop a factored observation model for characterizing
the posterior and standard features at the HMM outputs,
allowing for separate hidden mixture and state-tying
structures for each factor. In experiments on a subset
of Switchboard, we show that the AF-based tandem
approach is as effective as the phone-based approach,
and that the factored observation model significantly
outperforms the simple feature concatenation approach
while using fewer parameters.
|
|
[85]
|
K. Livescu, Ö. Çetin, M. Hasegawa-Johnson, S. King, C. Bartels, N. Borges,
A. Kantor, P. Lal, L. Yung, S. Bezman, Dawson-Haggerty, B. Woods, J. Frankel,
M. Magimai-Doss, and K. Saenko.
Articulatory feature-based methods for acoustic and audio-visual
speech recognition: Summary from the 2006 JHU Summer Workshop.
In Proc. ICASSP, Honolulu, April 2007.
[ bib |
.pdf ]
We report on investigations, conducted at the 2006
Johns Hopkins Workshop, into the use of articulatory
features (AFs) for observation and pronunciation models
in speech recognition. In the area of observation
modeling, we use the outputs of AF classifiers both
directly, in an extension of hybrid HMM/neural network
models, and as part of the observation vector, an
extension of the tandem approach. In the area of
pronunciation modeling, we investigate a model having
multiple streams of AF states with soft synchrony
constraints, for both audio-only and audio-visual
recognition. The models are implemented as dynamic
Bayesian networks, and tested on tasks from the
Small-Vocabulary Switchboard (SVitchboard) corpus and
the CUAVE audio-visual digits corpus. Finally, we
analyze AF classification and forced alignment using a
newly collected set of feature-level manual
transcriptions.
|
|
[86]
|
K. Livescu, A. Bezman, N. Borges, L. Yung, Ö. Çetin, J. Frankel, S. King,
M. Magimai-Doss, X. Chi, and L. Lavoie.
Manual transcription of conversational speech at the articulatory
feature level.
In Proc. ICASSP, Honolulu, April 2007.
[ bib |
.pdf ]
We present an approach for the manual labeling of
speech at the articulatory feature level, and a new set
of labeled conversational speech collected using this
approach. A detailed transcription, including
overlapping or reduced gestures, is useful for studying
the great pronunciation variability in conversational
speech. It also facilitates the testing of feature
classifiers, such as those used in articulatory
approaches to automatic speech recognition. We describe
an effort to transcribe a small set of utterances drawn
from the Switchboard database using eight articulatory
tiers. Two transcribers have labeled these utterances
in a multi-pass strategy, allowing for correction of
errors. We describe the data collection methods and
analyze the data to determine how quickly and reliably
this type of transcription can be done. Finally, we
demonstrate one use of the new data set by testing a
set of multilayer perceptron feature classifiers against
both the manual labels and forced alignments.
|
|
[87]
|
S. King, J. Frankel, K. Livescu, E. McDermott, K. Richmond, and M. Wester.
Speech production knowledge in automatic speech recognition.
Journal of the Acoustical Society of America, 121(2):723-742,
February 2007.
[ bib |
.pdf ]
Although much is known about how speech is produced,
and research into speech production has resulted in
measured articulatory data, feature systems of
different kinds and numerous models, speech production
knowledge is almost totally ignored in current
mainstream approaches to automatic speech recognition.
Representations of speech production allow simple
explanations for many phenomena observed in speech
which cannot be easily analyzed from either acoustic
signal or phonetic transcription alone. In this
article, we provide a survey of a growing body of work
in which such representations are used to improve
automatic speech recognition.
|
|
[88]
|
J. Frankel and S. King.
Speech recognition using linear dynamic models.
IEEE Transactions on Speech and Audio Processing,
15(1):246-256, January 2007.
[ bib |
.ps |
.pdf ]
The majority of automatic speech recognition (ASR)
systems rely on hidden Markov models, in which Gaussian
mixtures model the output distributions associated with
sub-phone states. This approach, whilst successful,
models consecutive feature vectors (augmented to
include derivative information) as statistically
independent. Furthermore, spatial correlations present
in speech parameters are frequently ignored through the
use of diagonal covariance matrices. This paper
continues the work of Digalakis and others who proposed
instead a first-order linear state-space model which
has the capacity to model underlying dynamics, and
furthermore give a model of spatial correlations. This
paper examines the assumptions made in applying such a
model and shows that the addition of a hidden dynamic
state leads to increases in accuracy over otherwise
equivalent static models. We also propose a
time-asynchronous decoding strategy suited to
recognition with segment models. We describe
implementation of decoding for linear dynamic models
and present TIMIT phone recognition results.
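For orientation, a minimal implementation of the LDM forward
(filtering) recursion, the core inference step such models share with
the Kalman filter, might look like the following; the parameter values
are placeholders rather than anything taken from the paper.

    # Minimal LDM / Kalman-filter forward recursion (illustrative sketch).
    import numpy as np

    def kalman_filter(y, F, Q, H, R, x0, P0):
        """Run the forward (filtering) recursion of a linear dynamic model.

        State:        x_t = F x_{t-1} + w_t,  w_t ~ N(0, Q)
        Observation:  y_t = H x_t     + v_t,  v_t ~ N(0, R)
        Returns filtered state means and the total log-likelihood.
        """
        x, P = x0, P0
        means, loglik = [], 0.0
        for y_t in y:
            # predict
            x = F @ x
            P = F @ P @ F.T + Q
            # update
            e = y_t - H @ x                     # innovation
            S = H @ P @ H.T + R                 # innovation covariance
            K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
            x = x + K @ e
            P = (np.eye(len(x)) - K @ H) @ P
            loglik += -0.5 * (e @ np.linalg.solve(S, e)
                              + np.linalg.slogdet(2 * np.pi * S)[1])
            means.append(x.copy())
        return np.array(means), loglik

    # Example: a 2-D hidden state observed through 3-D feature vectors.
    rng = np.random.default_rng(0)
    y = rng.standard_normal((50, 3))
    F, Q = 0.9 * np.eye(2), 0.1 * np.eye(2)
    H, R = rng.standard_normal((3, 2)), 0.5 * np.eye(3)
    means, ll = kalman_filter(y, F, Q, H, R, np.zeros(2), np.eye(2))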
|
|
[89]
|
Robert A. J. Clark, Korin Richmond, and Simon King.
Multisyn: Open-domain unit selection for the Festival speech
synthesis system.
Speech Communication, 49(4):317-330, 2007.
[ bib |
DOI |
.pdf ]
We present the implementation and evaluation of an
open-domain unit selection speech synthesis engine
designed to be flexible enough to encourage further
unit selection research and allow rapid voice
development by users with minimal speech synthesis
knowledge and experience. We address the issues of
automatically processing speech data into a usable
voice using automatic segmentation techniques and how
the knowledge obtained at labelling time can be
exploited at synthesis time. We describe target cost
and join cost implementation for such a system and
describe the outcome of building voices with a number
of different sized datasets. We show that, in a
competitive evaluation, voices built using this
technology compare favourably to other systems.
|
|
[90]
|
Jithendra Vepa and Simon King.
Subjective evaluation of join cost and smoothing methods for unit
selection speech synthesis.
IEEE Transactions on Speech and Audio Processing,
14(5):1763-1771, September 2006.
[ bib |
.pdf ]
In unit selection-based concatenative speech
synthesis, join cost (also known as concatenation
cost), which measures how well two units can be joined
together, is one of the main criteria for selecting
appropriate units from the inventory. Usually, some
form of local parameter smoothing is also needed to
disguise the remaining discontinuities. This paper
presents a subjective evaluation of three join cost
functions and three smoothing methods. We describe the
design and performance of a listening test. The three
join cost functions were taken from our previous study,
where we proposed join cost functions derived from
spectral distances, which have good correlations with
perceptual scores obtained for a range of concatenation
discontinuities. This evaluation allows us to further
validate their ability to predict concatenation
discontinuities. The units for synthesis stimuli are
obtained from a state-of-the-art unit selection
text-to-speech system: rVoice from Rhetorical Systems
Ltd. In this paper, we report listeners' preferences
for each join cost in combination with each smoothing
method.
|
|
[91]
|
J. Frankel and S. King.
Observation process adaptation for linear dynamic models.
Speech Communication, 48(9):1192-1199, September 2006.
[ bib |
.ps |
.pdf ]
This work introduces two methods for adapting the
observation process parameters of linear dynamic models
(LDM) or other linear-Gaussian models. The first method
uses the expectation-maximization (EM) algorithm to
estimate transforms for location and covariance
parameters, and the second uses a generalized EM (GEM)
approach which reduces computation in making updates
from O(p⁶) to O(p³), where p is the feature
dimension. We present the results of speaker adaptation
on TIMIT phone classification and recognition
experiments with relative error reductions of up to
6%. Importantly, we find minimal differences in the
results from EM and GEM. We therefore propose that the
GEM approach be applied to adaptation of hidden Markov
models which use non-diagonal covariances. We provide
the necessary update equations.
|
|
[92]
|
R. Clark, K. Richmond, V. Strom, and S. King.
Multisyn voices for the Blizzard Challenge 2006.
In Proc. Blizzard Challenge Workshop (Interspeech Satellite),
Pittsburgh, USA, September 2006.
(http://festvox.org/blizzard/blizzard2006.html).
[ bib |
.pdf ]
This paper describes the process of building unit
selection voices for the Festival Multisyn engine using
the ATR dataset provided for the Blizzard Challenge
2006. We begin by discussing recent improvements that
we have made to the Multisyn voice building process,
prompted by our participation in the Blizzard Challenge
2006. We then go on to discuss our interpretation of
the results observed. Finally, we conclude with some
comments and suggestions for the formulation of future
Blizzard Challenges.
|
|
[93]
|
Robert A. J. Clark and Simon King.
Joint prosodic and segmental unit selection speech synthesis.
In Proc. Interspeech 2006, Pittsburgh, USA, September 2006.
[ bib |
.ps |
.pdf ]
We describe a unit selection technique for
text-to-speech synthesis which jointly searches the
space of possible diphone sequences and the space of
possible prosodic unit sequences in order to produce
synthetic speech with more natural prosody. We
demonstrate that this search, although currently
computationally expensive, can achieve improved
intonation compared to a baseline in which only the
space of possible diphone sequences is searched. We
discuss ways in which the search could be made
sufficiently efficient for use in a real-time system.
|
|
[94]
|
Simon King.
Handling variation in speech and language processing.
In Keith Brown, editor, Encyclopedia of Language and
Linguistics. Elsevier, 2nd edition, 2006.
[ bib ]
|
|
[95]
|
Simon King.
Language variation in speech technologies.
In Keith Brown, editor, Encyclopedia of Language and
Linguistics. Elsevier, 2nd edition, 2006.
[ bib ]
|
|
[96]
|
Volker Strom, Robert Clark, and Simon King.
Expressive prosody for unit-selection speech synthesis.
In Proc. Interspeech, Pittsburgh, 2006.
[ bib |
.ps |
.pdf ]
Current unit selection speech synthesis voices cannot
produce emphasis or interrogative contours because of a
lack of the necessary prosodic variation in the
recorded speech database. A method of recording script
design is proposed which addresses this shortcoming.
Appropriate components were added to the target cost
function of the Festival Multisyn engine, and a
perceptual evaluation showed a clear preference over
the baseline system.
|
|
[97]
|
Robert A. J. Clark, Korin Richmond, and Simon King.
Multisyn voices from ARCTIC data for the Blizzard challenge.
In Proc. Interspeech 2005, September 2005.
[ bib |
.pdf ]
This paper describes the process of building unit
selection voices for the Festival Multisyn engine using
four ARCTIC datasets, as part of the Blizzard
evaluation challenge. The build process is almost
entirely automatic, with very little need for human
intervention. We discuss the difference in the
evaluation results for each voice and evaluate the
suitability of the ARCTIC datasets for building this
type of voice.
|
|
[98]
|
C. Mayo, R. A. J. Clark, and S. King.
Multidimensional scaling of listener responses to synthetic speech.
In Proc. Interspeech 2005, Lisbon, Portugal, September 2005.
[ bib |
.pdf ]
|
|
[99]
|
J. Frankel and S. King.
A hybrid ANN/DBN approach to articulatory feature recognition.
In Proc. Eurospeech, Lisbon, September 2005.
[ bib |
.ps |
.pdf ]
Artificial neural networks (ANN) have proven to be
well suited to the task of articulatory feature (AF)
recognition. Previous studies have taken a cascaded
approach where separate ANNs are trained for each
feature group, making the assumption that features are
statistically independent. We address this by using
ANNs to provide virtual evidence to a dynamic Bayesian
network (DBN). This gives a hybrid ANN/DBN model and
allows modelling of inter-feature dependencies. We
demonstrate significant increases in AF recognition
accuracy from modelling dependencies between features,
and present the results of embedded training
experiments in which a set of asynchronous feature
changes are learned. Furthermore, we report on the
application of a Viterbi training scheme in which we
alternate between realigning the AF training labels and
retraining the ANNs.
|
|
[100]
|
Alexander Gutkin and Simon King.
Inductive String Template-Based Learning of Spoken
Language.
In Hugo Gamboa and Ana Fred, editors, Proc. 5th International
Workshop on Pattern Recognition in Information Systems (PRIS-2005), In
conjunction with the 7th International Conference on Enterprise Information
Systems (ICEIS-2005), pages 43-51, Miami, USA, May 2005. INSTICC Press.
[ bib |
.ps.gz |
.pdf ]
This paper deals with the formulation of an alternative
structural approach to the speech recognition problem.
In this approach, we require both the representation
and the learning algorithms defined on it to be
linguistically meaningful, which allows the speech
recognition system to discover the nature of the
linguistic classes of speech patterns corresponding to
the speech waveforms. We briefly discuss the current
formalisms and propose an alternative - a
phonologically inspired string-based inductive speech
representation, defined within an analytical framework
specifically designed to address the issues of class
and object representation. We also present the results
of the phoneme classification experiments conducted on
the TIMIT corpus of continuous speech.
|
|
[101]
|
Alexander Gutkin and Simon King.
Detection of Symbolic Gestural Events in Articulatory
Data for Use in Structural Representations of Continuous Speech.
In Proc. IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP-05), volume I, pages 885-888, Philadelphia, PA,
USA, March 2005. IEEE Signal Processing Society Press.
[ bib |
.ps.gz |
.pdf ]
One of the crucial issues which often needs to be
addressed in structural approaches to speech
representation is the choice of fundamental symbolic
units of representation. In this paper, a
physiologically inspired methodology for defining these
symbolic atomic units in terms of primitive
articulatory events is proposed. It is shown how the
atomic articulatory events (gestures) can be detected
directly in the articulatory data. An algorithm for
evaluating the reliability of the articulatory events
is described and promising results of the experiments
conducted on the MOCHA articulatory database are presented.
|
|
[102]
|
Simon King, Chris Bartels, and Jeff Bilmes.
SVitchboard 1: Small vocabulary tasks from Switchboard 1.
In Proc. Interspeech 2005, Lisbon, Portugal, 2005.
[ bib |
.pdf ]
We present a conversational telephone speech data set
designed to support research on novel acoustic models.
Small vocabulary tasks from 10 words up to 500 words
are defined using subsets of the Switchboard-1 corpus;
each task has a completely closed vocabulary (an OOV
rate of 0%). We justify the need for these tasks,
describe the algorithm for selecting them from a large
corpus, give a statistical analysis of the data and
present baseline whole-word hidden Markov model
recognition results. The goal of the paper is to define
a common data set and to encourage other researchers to
use it.
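The following is a minimal sketch, not the selection
algorithm of the paper: it illustrates the idea of a closed
vocabulary task by keeping only those utterances whose words
all fall inside a chosen word list, so the OOV rate is 0% by
construction. The toy utterances are invented.

    # Minimal sketch (not the paper's algorithm): build a closed-vocabulary
    # task by keeping only utterances fully covered by the chosen word list,
    # so the resulting set has a 0% OOV rate by construction.
    from collections import Counter

    def closed_vocab_subset(utterances, vocab_size):
        """utterances: list of lists of words. Picks the most frequent words
        as the vocabulary, then returns the utterances covered by it."""
        counts = Counter(w for utt in utterances for w in utt)
        vocab = {w for w, _ in counts.most_common(vocab_size)}
        kept = [utt for utt in utterances if all(w in vocab for w in utt)]
        return vocab, kept

    utts = [["yeah"], ["uh", "huh"], ["i", "see"], ["that", "is", "rare"]]
    vocab, kept = closed_vocab_subset(utts, vocab_size=5)
    print(sorted(vocab), kept)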
|
|
[103]
|
Olga Goubanova and Simon King.
Predicting consonant duration with Bayesian belief networks.
In Proc. Interspeech 2005, Lisbon, Portugal, 2005.
[ bib |
.pdf ]
Consonant duration is influenced by a number of
linguistic factors such as the consonant's identity,
within-word position, stress level of the previous and
following vowels, phrasal position of the word
containing the target consonant, its syllabic position,
identity of the previous and following segments. In our
work, consonant duration is predicted from a Bayesian
belief network (BN) consisting of discrete nodes for
the linguistic factors and a single continuous node for
the consonant's duration. Interactions between factors
are represented as conditional dependency arcs in this
graphical model. Given the parameters of the belief
network, the duration of each consonant in the test set
is then predicted as the value with the maximum
probability. We compare the results of the belief
network model with those of sums-of-products (SoP) and
classification and regression tree (CART) models using
the same data. In terms of RMS error, our BN model
performs better than both CART and SoP models. In terms
of the correlation coefficient, our BN model performs
better than the SoP model, and no worse than the CART model. In
addition, the Bayesian model reliably predicts
consonant duration in cases of missing or hidden
linguistic factors.
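As a hedged illustration (not the paper's network), the
sketch below gives a conditional-Gaussian flavour of the
idea: each configuration of discrete linguistic factors
indexes a Gaussian over duration, and prediction returns its
mean, the maximum-probability value. A real belief network
additionally shares statistics across factor configurations
through its dependency structure; the factors and durations
below are invented.

    # Illustrative sketch only: a conditional-Gaussian flavour of duration
    # prediction. Each observed configuration of discrete factors gets a
    # Gaussian over duration; prediction returns its mean. The paper's
    # Bayesian network additionally shares information across factors via
    # its dependency structure, which this table-lookup version does not.
    from collections import defaultdict
    import numpy as np

    class DurationModel:
        def __init__(self):
            self.samples = defaultdict(list)

        def fit(self, rows):
            # rows: list of (factor_tuple, duration_in_ms)
            for factors, dur in rows:
                self.samples[factors].append(dur)
            self.params = {f: (np.mean(d), np.std(d) + 1e-6)
                           for f, d in self.samples.items()}
            self.global_mean = np.mean(
                [d for ds in self.samples.values() for d in ds])

        def predict(self, factors):
            # Back off to the global mean for unseen configurations
            # (a crude stand-in for marginalising missing factors in a BN).
            mean, _ = self.params.get(factors, (self.global_mean, None))
            return mean

    train = [(("s", "onset", "stressed"), 95.0),
             (("s", "onset", "stressed"), 105.0),
             (("s", "coda", "unstressed"), 70.0)]
    m = DurationModel()
    m.fit(train)
    print(m.predict(("s", "onset", "stressed")))   # 100.0
    print(m.predict(("t", "onset", "stressed")))   # global mean back-off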
|
|
[104]
|
M. Wester, J. Frankel, and S. King.
Asynchronous articulatory feature recognition using dynamic
Bayesian networks.
In Proc. IEICE Beyond HMM Workshop, Kyoto, December 2004.
[ bib |
.ps |
.pdf ]
This paper builds on previous work where dynamic
Bayesian networks (DBN) were proposed as a model for
articulatory feature recognition. Using DBNs makes it
possible to model the dependencies between features, an
addition to previous approaches which was found to
improve feature recognition performance. The DBN
results were promising, giving close to the accuracy of
artificial neural nets (ANNs). However, the system was
trained on canonical labels, leading to an overly
strong set of constraints on feature co-occurrence. In
this study, we describe an embedded training scheme
which learns a set of data-driven asynchronous feature
changes where supported by the data. Using a subset of
the OGI Numbers corpus, we describe articulatory
feature recognition experiments using both
canonically-trained and asynchronous DBNs. Performance
using DBNs is found to exceed that of ANNs trained on
an identical task, giving a higher recognition
accuracy. Furthermore, inter-feature dependencies
result in a more structured model, giving rise to fewer
feature combinations in the recognition output. In
addition to an empirical evaluation of this modelling
approach, we give a qualitative analysis, comparing
asynchrony found through our data-driven methods to the
asynchrony which may be expected on the basis of
linguistic knowledge.
|
|
[105]
|
Yoshinori Shiga and Simon King.
Source-filter separation for articulation-to-speech synthesis.
In Proc. ICSLP, Jeju, Korea, October 2004.
[ bib |
.ps |
.pdf ]
In this paper we examine a method for separating out
the vocal-tract filter response from the voice source
characteristic using a large articulatory database. The
method realises such separation for voiced speech using
an iterative approximation procedure under the
assumption that the speech production process is a
linear system composed of a voice source and a
vocal-tract filter, and that each of the components is
controlled independently by different sets of factors.
Experimental results show that the spectral variation
is evidently influenced by the fundamental frequency or
the power of speech, and that the tendency of the
variation may be related closely to speaker identity.
The method enables independent control over the voice
source characteristic in our articulation-to-speech
synthesis.
|
|
[106]
|
Jithendra Vepa and Simon King.
Subjective evaluation of join cost functions used in unit selection
speech synthesis.
In Proc. 8th International Conference on Spoken Language
Processing (ICSLP), Jeju, Korea, October 2004.
[ bib |
.pdf ]
In our previous papers, we have proposed join cost
functions derived from spectral distances, which have
good correlations with perceptual scores obtained for a
range of concatenation discontinuities. To further
validate their ability to predict concatenation
discontinuities, we have chosen the best three spectral
distances and evaluated them subjectively in a
listening test. The unit sequences for synthesis
stimuli are obtained from a state-of-the-art unit
selection text-to-speech system: `rVoice' from Rhetorical
Systems Ltd. In this paper, we report listeners'
preferences for each of the three join cost functions.
|
|
[107]
|
Yoshinori Shiga and Simon King.
Estimating detailed spectral envelopes using articulatory clustering.
In Proc. ICSLP, Jeju, Korea, October 2004.
[ bib |
.ps |
.pdf ]
This paper presents an articulatory-acoustic mapping
where detailed spectral envelopes are estimated. During
the estimation, the harmonics of a range of F0 values
are derived from the spectra of multiple voiced speech
signals vocalized with similar articulator settings.
The envelope formed by these harmonics is represented
by a cepstrum, which is computed by fitting the peaks
of all the harmonics based on the weighted least square
method in the frequency domain. The experimental result
shows that the spectral envelopes are estimated with
the highest accuracy when the cepstral order is 48-64
for a female speaker, which suggests that representing
the real response of the vocal tract requires
high-quefrency elements that conventional speech
synthesis methods are forced to discard in order to
eliminate the pitch component of speech.
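A small numerical sketch of an assumed formulation of the
weighted least-squares cepstral fit described above (not the
authors' code): log harmonic amplitudes pooled from signals
with different F0 are approximated by a truncated cosine
(cepstral) series, with per-harmonic weights. The sampling
rate, test envelope and weights are invented.

    # Sketch only (assumed formulation, not the authors' implementation):
    # fit a truncated cepstral (cosine-series) envelope to log harmonic
    # amplitudes by weighted least squares, pooling harmonics from signals
    # with different F0 so the envelope is sampled densely.
    import numpy as np

    def fit_cepstral_envelope(freqs_hz, log_amps, weights, order, fs=16000.0):
        """freqs_hz: harmonic frequencies pooled over frames/utterances.
        log_amps: their log amplitudes. weights: per-harmonic confidence.
        Returns cepstral coefficients c[0..order]."""
        omega = 2 * np.pi * np.asarray(freqs_hz) / fs   # normalised frequency
        k = np.arange(order + 1)
        A = np.cos(np.outer(omega, k))                  # design matrix
        W = np.sqrt(np.asarray(weights, dtype=float))[:, None]
        c, *_ = np.linalg.lstsq(W * A, W[:, 0] * log_amps, rcond=None)
        return c

    def envelope(c, freqs_hz, fs=16000.0):
        omega = 2 * np.pi * np.asarray(freqs_hz) / fs
        return np.cos(np.outer(omega, np.arange(len(c)))) @ c

    # Toy data: harmonics of two different F0 values sample one envelope.
    rng = np.random.default_rng(0)
    true = lambda f: -0.5 * np.log(1.0 + (f / 2500.0) ** 2)
    f = np.concatenate([np.arange(120, 8000, 120), np.arange(190, 8000, 190)])
    la = true(f) + 0.01 * rng.standard_normal(f.size)
    c = fit_cepstral_envelope(f, la, np.ones_like(f), order=48)
    print(np.max(np.abs(envelope(c, f) - true(f))))     # small residual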
|
|
[108]
|
Alexander Gutkin and Simon King.
Phone classification in pseudo-Euclidean vector spaces.
In Proc. 8th International Conference on Spoken Language
Processing (ICSLP), volume II, pages 1453-1457, Jeju Island, Korea, October
2004.
[ bib |
.ps.gz |
.pdf ]
Recently we have proposed a structural framework for
modelling speech, which is based on patterns of
phonological distinctive features, a linguistically
well-motivated alternative to standard vector-space
acoustic models like HMMs. This framework gives
considerable representational freedom by working with
features that have explicit linguistic interpretation,
but at the expense of the ability to apply the wide
range of analytical decision algorithms available in
vector spaces, restricting oneself to more
computationally expensive and less-developed symbolic
metric tools. In this paper we show that a
dissimilarity-based distance-preserving transition from
the original structural representation to a
corresponding pseudo-Euclidean vector space is
possible. Promising results of phone classification
experiments conducted on the TIMIT database are
reported.
|
|
[109]
|
J. Frankel, M. Wester, and S. King.
Articulatory feature recognition using dynamic Bayesian networks.
In Proc. ICSLP, September 2004.
[ bib |
.ps |
.pdf ]
This paper describes the use of dynamic Bayesian
networks for the task of articulatory feature
recognition. We show that by modeling the dependencies
between a set of 6 multi-leveled articulatory features,
recognition accuracy is increased over an equivalent
system in which features are considered independent.
Results are compared to those found using artificial
neural networks on an identical task.
|
|
[110]
|
Alexander Gutkin and Simon King.
Structural Representation of Speech for Phonetic
Classification.
In Proc. 17th International Conference on Pattern Recognition
(ICPR), volume 3, pages 438-441, Cambridge, UK, August 2004. IEEE Computer
Society Press.
[ bib |
.ps.gz |
.pdf ]
This paper explores the issues involved in using
symbolic metric algorithms for automatic speech
recognition (ASR), via a structural representation of
speech. This representation is based on a set of
phonological distinctive features which is a
linguistically well-motivated alternative to the
“beads-on-a-string” view of speech that is standard
in current ASR systems. We report the promising results
of phoneme classification experiments conducted on a
standard continuous speech task.
|
|
[111]
|
J. Vepa and S. King.
Subjective evaluation of join cost and smoothing methods.
In Proc. 5th ISCA speech synthesis workshop, Pittsburgh, USA,
June 2004.
[ bib |
.pdf ]
In our previous papers, we have proposed join cost
functions derived from spectral distances, which have
good correlations with perceptual scores obtained for a
range of concatenation discontinuities. To further
validate their ability to predict concatenation
discontinuities, we have chosen the best three spectral
distances and evaluated them subjectively in a
listening test. The units for synthesis stimuli are
obtained from a state-of-the-art unit selection
text-to-speech system: `rVoice' from Rhetorical Systems
Ltd. We also compared three different smoothing methods
in this listening test. In this paper, we report
listeners' preferences for each join cost in
combination with each smoothing method.
|
|
[112]
|
Yoshinori Shiga and Simon King.
Accurate spectral envelope estimation for articulation-to-speech
synthesis.
In Proc. 5th ISCA Speech Synthesis Workshop, pages 19-24, CMU,
Pittsburgh, USA, June 2004.
[ bib |
.ps |
.pdf ]
This paper introduces a novel articulatory-acoustic
mapping in which detailed spectral envelopes are
estimated based on the cepstrum, inclusive of the
high-quefrency elements which are discarded in
conventional speech synthesis to eliminate the pitch
component of speech. For this estimation, the method
deals with the harmonics of multiple voiced-speech
spectra so that several sets of harmonics can be
obtained at various pitch frequencies to form a
spectral envelope. The experimental result shows that
the method estimates spectral envelopes with the
highest accuracy when the cepstral order is 48-64,
which suggests that the higher-order coefficients are
required to represent detailed envelopes reflecting the
real vocal-tract responses.
|
|
[113]
|
Jithendra Vepa and Simon King.
Join cost for unit selection speech synthesis.
In Abeer Alwan and Shri Narayanan, editors, Speech Synthesis.
Prentice Hall, 2004.
[ bib |
.ps ]
|
|
[114]
|
Robert A.J. Clark, Korin Richmond, and Simon King.
Festival 2 - build your own general purpose unit selection speech
synthesiser.
In Proc. 5th ISCA workshop on speech synthesis, 2004.
[ bib |
.ps |
.pdf ]
This paper describes version 2 of the Festival speech
synthesis system. Festival 2 provides a development
environment for concatenative speech synthesis, and now
includes a general purpose unit selection speech
synthesis engine. We discuss various aspects of unit
selection speech synthesis, focusing on the research
issues that relate to voice design and the automation
of the voice development process.
|
|
[115]
|
Ben Gillett and Simon King.
Transforming F0 contours.
In Proc. Eurospeech, Geneva, September 2003.
[ bib |
.pdf ]
Voice transformation is the process of transforming
the characteristics of speech uttered by a source
speaker, such that a listener would believe the speech
was uttered by a target speaker. Training F0 contour
generation models for speech synthesis requires a large
corpus of speech. If it were possible to adapt the F0
contour of one speaker to sound like that of another
speaker, using a small, easily obtainable parameter
set, this would be extremely valuable. We present a new
method for the transformation of F0 contours from one
speaker to another based on a small linguistically
motivated parameter set. The system performs a
piecewise linear mapping using these parameters. A
perceptual experiment clearly demonstrates that the
presented system is at least as good as an existing
technique for all speaker pairs, and that in many cases
it is much better and almost as good as using the
target F0 contour.
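A hedged sketch of the general idea only; the paper's
parameter set is linguistically motivated, whereas the
anchor values below are invented. A piecewise linear map
defined by matched source/target F0 landmarks is applied
pointwise to the contour, leaving unvoiced frames unchanged.

    # Sketch only: a piecewise linear F0 mapping defined by a small set of
    # anchor values (hypothetical anchors here; the paper uses a
    # linguistically motivated parameter set).
    import numpy as np

    def map_f0(contour_hz, source_anchors, target_anchors):
        """contour_hz: source F0 contour (0 = unvoiced, left unchanged).
        source_anchors/target_anchors: matched F0 landmarks, e.g. (floor,
        median, topline), in ascending order."""
        contour_hz = np.asarray(contour_hz, dtype=float)
        voiced = contour_hz > 0
        mapped = contour_hz.copy()
        mapped[voiced] = np.interp(contour_hz[voiced],
                                   source_anchors, target_anchors)
        return mapped

    src = [80.0, 120.0, 200.0]    # hypothetical source floor/median/topline
    tgt = [140.0, 200.0, 320.0]   # hypothetical target landmarks
    print(map_f0([0.0, 100.0, 120.0, 180.0], src, tgt))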
|
|
[116]
|
Yoshinori Shiga and Simon King.
Estimating the spectral envelope of voiced speech using multi-frame
analysis.
In Proc. Eurospeech-2003, volume 3, pages 1737-1740, Geneva,
Switzerland, September 2003.
[ bib |
.ps |
.pdf ]
This paper proposes a novel approach for estimating
the spectral envelope of voiced speech independently of
its harmonic structure. Because of the
quasi-periodicity of voiced speech, its spectrum
indicates harmonic structure and only has energy at
frequencies corresponding to integral multiples of F0.
It is hence impossible to identify transfer
characteristics between the adjacent harmonics. In
order to resolve this problem, Multi-frame Analysis
(MFA) is introduced. The MFA estimates a spectral
envelope using many portions of speech which are
vocalised using the same vocal-tract shape. Since each
of the portions usually has a different F0 and ensuing
different harmonic structure, a number of harmonics can
be obtained at various frequencies to form a spectral
envelope. The method thereby gives a closer
approximation to the vocal-tract transfer function.
|
|
[117]
|
James Horlock and Simon King.
Named entity extraction from word lattices.
In Proc. Eurospeech, Geneva, September 2003.
[ bib |
.pdf ]
We present a method for named entity extraction from
word lattices produced by a speech recogniser. Previous
work by others on named entity extraction from speech
has used either a manual transcript or 1-best
recogniser output. We describe how a single Viterbi
search can recover both the named entity sequence and
the corresponding word sequence from a word lattice,
and further that it is possible to trade off an
increase in word error rate for improved named entity
extraction.
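To make the single-pass idea concrete, here is a toy sketch
(the lattice, tag set and scores are invented, not the
paper's models): a Viterbi search over lattice edges in
which each edge is scored jointly with a named-entity tag,
so the best path yields both the word sequence and its
entity labelling.

    # Toy sketch only: joint Viterbi over a word lattice and NE tags. Edge
    # scores and tag models are made up; the point is that one dynamic
    # programming pass recovers both the words and their entity labels.
    import math

    # Lattice edges: (start_node, end_node, word, acoustic_log_score).
    EDGES = [(0, 1, "flights", -1.0), (0, 1, "fights", -0.8),
             (1, 2, "to", -0.2), (2, 3, "boston", -1.1), (2, 3, "bostin", -0.9)]
    TAGS = ["O", "LOC"]

    def emit(word, tag):               # hypothetical log p(word | tag)
        if tag == "LOC":
            return math.log(0.4) if word == "boston" else math.log(0.01)
        return math.log(0.1)

    def trans(prev_tag, tag):          # hypothetical tag-transition log-probs
        return math.log(0.7) if prev_tag == tag else math.log(0.3)

    def joint_viterbi(edges, start, end):
        best = {(start, t): (0.0, []) for t in TAGS}     # (score, path)
        for u, v, w, ac in sorted(edges):                 # topological order
            for t in TAGS:
                for pt in TAGS:
                    if (u, pt) not in best:
                        continue
                    s0, path = best[(u, pt)]
                    s = s0 + ac + emit(w, t) + trans(pt, t)
                    if (v, t) not in best or s > best[(v, t)][0]:
                        best[(v, t)] = (s, path + [(w, t)])
        return max((best[(end, t)] for t in TAGS if (end, t) in best),
                   key=lambda x: x[0])

    print(joint_viterbi(EDGES, start=0, end=3))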
|
|
[118]
|
James Horlock and Simon King.
Discriminative methods for improving named entity extraction on
speech data.
In Proc. Eurospeech, Geneva, September 2003.
[ bib |
.pdf ]
In this paper we present a method of discriminatively
training language models for spoken language
understanding; we show improvements in named entity
F-scores on speech data using these improved language
models. A comparison between theoretical probabilities
associated with manual markup and the actual
probabilities of output markup is used to identify
probabilities requiring adjustment. We present results
which support our hypothesis that improvements in
F-scores are possible by using either previously used
training data or held out development data to improve
discrimination amongst a set of N-gram language models.
|
|
[119]
|
Ben Gillett and Simon King.
Transforming voice quality.
In Proc. Eurospeech, Geneva, September 2003.
[ bib |
.pdf ]
Voice transformation is the process of transforming
the characteristics of speech uttered by a source
speaker, such that a listener would believe the speech
was uttered by a target speaker. In this paper we
address the problem of transforming voice quality. We
do not attempt to transform prosody. Our system has two
main parts corresponding to the two components of the
source-filter model of speech production. The first
component transforms the spectral envelope as
represented by a linear prediction model. The
transformation is achieved using a Gaussian mixture
model, which is trained on aligned speech from source
and target speakers. The second part of the system
predicts the spectral detail from the transformed
linear prediction coefficients. A novel approach is
proposed, which is based on a classifier and residual
codebooks. On the basis of a number of performance
metrics it outperforms existing systems.
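As a hedged illustration of the GMM mapping component only
(the classifier and residual-codebook stage is not shown,
and this is not the authors' code): given a joint GMM over
aligned source and target spectral parameters, conversion
takes the posterior-weighted conditional mean. The mixture
parameters below are invented toy values.

    # Sketch only of the GMM spectral-envelope mapping step. Given a joint
    # GMM over aligned source (x) and target (y) parameter vectors, convert
    # x to the posterior-weighted conditional mean of y. The parameters
    # below are toy values, not trained ones.
    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_convert(x, weights, means, covs, dx):
        """means[m]: joint mean of length dx+dy; covs[m]: joint covariance.
        dx: dimensionality of the source part. Returns the converted y."""
        x = np.asarray(x, dtype=float)
        resp = np.array([w * multivariate_normal.pdf(x, mean=mu[:dx],
                                                     cov=S[:dx, :dx])
                         for w, mu, S in zip(weights, means, covs)])
        resp /= resp.sum()
        y = np.zeros(means[0].shape[0] - dx)
        for r, mu, S in zip(resp, means, covs):
            cond = mu[dx:] + S[dx:, :dx] @ np.linalg.solve(S[:dx, :dx],
                                                           x - mu[:dx])
            y += r * cond
        return y

    # Toy 1-D source, 1-D target, two mixture components.
    weights = [0.5, 0.5]
    means = [np.array([0.0, 1.0]), np.array([2.0, -1.0])]
    covs = [np.array([[1.0, 0.6], [0.6, 1.0]]),
            np.array([[1.0, -0.4], [-0.4, 1.0]])]
    print(gmm_convert([1.5], weights, means, covs, dx=1))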
|
|
[120]
|
Yoshinori Shiga and Simon King.
Estimation of voice source and vocal tract characteristics based on
multi-frame analysis.
In Proc. Eurospeech, volume 3, pages 1749-1752, Geneva,
Switzerland, September 2003.
[ bib |
.ps |
.pdf ]
This paper presents a new approach for estimating
voice source and vocal tract filter characteristics of
voiced speech. When it is required to know the transfer
function of a system in signal processing, the input
and output of the system are experimentally observed
and used to calculate the function. However, in the
case of source-filter separation we deal with in this
paper, only the output (speech) is observed and the
characteristics of the system (vocal tract) and the
input (voice source) must simultaneously be estimated.
Hence the estimation becomes extremely difficult, and it
is usually solved approximately using oversimplified
models. We demonstrate that these characteristics are
separable under the assumption that they are
independently controlled by different factors. The
separation is realised using an iterative approximation
along with the Multi-frame Analysis method, which we
have proposed to find spectral envelopes of voiced
speech with minimum interference of the harmonic
structure.
|
|
[121]
|
K. Richmond, S. King, and P. Taylor.
Modelling the uncertainty in recovering articulation from acoustics.
Computer Speech and Language, 17:153-172, 2003.
[ bib |
.pdf ]
This paper presents an experimental comparison of the
performance of the multilayer perceptron (MLP) with
that of the mixture density network (MDN) for an
acoustic-to-articulatory mapping task. A corpus of
acoustic-articulatory data recorded by electromagnetic
articulography (EMA) for a single speaker was used as
training and test data for this purpose. In theory, the
MDN is able to provide a richer, more flexible
description of the target variables in response to a
given input vector than the least-squares trained MLP.
Our results show that the mean likelihoods of the
target articulatory parameters for an unseen test set
were indeed consistently higher with the MDN than with
the MLP. The increase ranged from approximately 3% to
22%, depending on the articulatory channel in
question. On the basis of these results, we argue that
using a more flexible description of the target domain,
such as that offered by the MDN, can prove beneficial
when modelling the acoustic-to-articulatory mapping.
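For concreteness, a sketch of the comparison criterion only
(not of the networks themselves): the likelihood of a target
articulatory value under the Gaussian mixture an MDN
predicts for a frame, versus under the single fixed-variance
Gaussian implied by a least-squares-trained MLP. All numbers
below are invented.

    # Sketch only of the evaluation criterion, not the networks: compare the
    # likelihood of a target articulatory value under (a) the Gaussian
    # mixture an MDN predicts for a frame and (b) a single Gaussian centred
    # on an MLP's point prediction with a global variance. Invented numbers.
    import numpy as np

    def gauss(x, mean, var):
        return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

    def mdn_likelihood(x, weights, means, variances):
        return float(sum(w * gauss(x, m, v)
                         for w, m, v in zip(weights, means, variances)))

    target = 1.2                              # true articulator position (mm)
    # Hypothetical MDN output for this frame: a bimodal distribution.
    mdn = dict(weights=[0.6, 0.4], means=[1.1, -0.9], variances=[0.04, 0.09])
    # Hypothetical MLP point prediction plus a global residual variance.
    mlp_mean, mlp_var = 0.3, 0.5

    print("MDN likelihood:", mdn_likelihood(target, **mdn))
    print("MLP likelihood:", gauss(target, mlp_mean, mlp_var))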
|
|
[122]
|
Christophe Van Bael and Simon King.
An accent-independent lexicon for automatic speech recognition.
In Proc. ICPhS, pages 1165-1168, 2003.
[ bib |
.pdf ]
Recent work at the Centre for Speech Technology
Research (CSTR) at the University of Edinburgh has
developed an accent-independent lexicon for speech
synthesis (the Unisyn project). The main purpose of this
lexicon is to avoid the problems and cost of writing a
new lexicon for every new accent needed for synthesis.
Only recently, a first attempt has been made to use the
Keyword Lexicon for automatic speech recognition.
|
|
[123]
|
J. Vepa and S. King.
Kalman-filter based join cost for unit-selection speech synthesis.
In Proc. Eurospeech, Geneva, Switzerland, 2003.
[ bib |
.pdf ]
We introduce a new method for computing join cost in
unit-selection speech synthesis which uses a linear
dynamical model (also known as a Kalman filter) to
model line spectral frequency trajectories. The model
uses an underlying subspace in which it makes smooth,
continuous trajectories. This subspace can be seen as
an analogy for underlying articulator movement. Once
trained, the model can be used to measure how well
concatenated speech segments join together. The
objective join cost is based on the error between model
predictions and actual observations. We report
correlations between this measure and mean listener
scores obtained from a perceptual listening experiment.
Our experiments use a state-of-the-art unit-selection
text-to-speech system: `rVoice' from Rhetorical Systems
Ltd.
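A small hedged sketch of the scoring idea: run a Kalman
filter over the frames of the first unit and measure the
one-step prediction error as the filter crosses into the
second unit. The model parameters and toy trajectory below
are invented, not the trained LSF models of the paper.

    # Sketch only: using a linear dynamical model's one-step prediction
    # error across a concatenation point as a join cost. Parameters below
    # are invented; in the paper they are trained on LSF data.
    import numpy as np

    def kalman_join_cost(frames, F, Q, H, R, join_index):
        """frames: (T, obs_dim) observation sequence spanning the join.
        F, Q: state transition and noise; H, R: observation matrix/noise.
        Returns the squared prediction error at the frame after the join."""
        n = F.shape[0]
        x, P = np.zeros(n), np.eye(n)
        cost = None
        for t, y in enumerate(frames):
            x_pred, P_pred = F @ x, F @ P @ F.T + Q        # predict
            innov = y - H @ x_pred
            if t == join_index + 1:
                cost = float(innov @ innov)                 # error across join
            S = H @ P_pred @ H.T + R
            K = P_pred @ H.T @ np.linalg.inv(S)             # update
            x = x_pred + K @ innov
            P = (np.eye(n) - K @ H) @ P_pred
        return cost

    # Toy data: a smooth trajectory with a discontinuity at the join.
    t = np.linspace(0, 1, 20)[:, None]
    left, right = np.sin(2 * np.pi * t[:10]), np.sin(2 * np.pi * t[10:]) + 0.5
    frames = np.vstack([left, right])
    F, H = np.eye(1), np.eye(1)
    Q, R = 0.01 * np.eye(1), 0.01 * np.eye(1)
    print(kalman_join_cost(frames, F, Q, H, R, join_index=9))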
|
|
[124]
|
Simon King.
Dependence and independence in automatic speech recognition and
synthesis.
Journal of Phonetics, 31(3-4):407-411, 2003.
[ bib |
.pdf ]
A short review paper
|
|
[125]
|
J. Vepa, S. King, and P. Taylor.
Objective distance measures for spectral discontinuities in
concatenative speech synthesis.
In Proc. ICSLP, Denver, USA, September 2002.
[ bib |
.pdf ]
In unit selection based concatenative speech systems,
`join cost', which measures how well two units can be
joined together, is one of the main criteria for
selecting appropriate units from the inventory. The
ideal join cost will measure `perceived' discontinuity,
based on easily measurable spectral properties of the
units being joined, in order to ensure smooth and
natural-sounding synthetic speech. In this paper we
report a perceptual experiment conducted to measure the
correlation between `subjective' human perception and
various `objective' spectrally-based measures proposed
in the literature. Our experiments used a
state-of-the-art unit-selection text-to-speech system:
`rVoice' from Rhetorical Systems Ltd.
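To illustrate the kind of objective measure being
correlated with listener scores (a generic sketch, not the
specific measures of the paper): a Euclidean distance
between the spectral feature vectors either side of the
join, correlated against perceptual ratings; the data below
are random.

    # Generic sketch only (not the specific measures evaluated here): an
    # objective join cost as the Euclidean distance between the spectral
    # feature vectors either side of a concatenation point, and its
    # correlation with listener scores over a set of joins. Random data.
    import numpy as np

    def join_cost(left_unit_frames, right_unit_frames):
        """Distance between the last frame of the left unit and the first
        frame of the right unit (e.g. MFCC or LSF vectors)."""
        return float(np.linalg.norm(left_unit_frames[-1] - right_unit_frames[0]))

    rng = np.random.default_rng(1)
    joins = [(rng.standard_normal((30, 13)), rng.standard_normal((30, 13)))
             for _ in range(50)]
    objective = np.array([join_cost(l, r) for l, r in joins])
    perceptual = objective + 0.5 * rng.standard_normal(50)  # pretend scores
    print("correlation:", np.corrcoef(objective, perceptual)[0, 1])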
|
|
[126]
|
J. Vepa, S. King, and P. Taylor.
New objective distance measures for spectral discontinuities in
concatenative speech synthesis.
In Proc. IEEE 2002 workshop on speech synthesis, Santa
Monica, USA, September 2002.
[ bib |
.pdf ]
The quality of unit selection based concatenative
speech synthesis mainly depends on how well two
successive units can be joined together to minimise the
audible discontinuities. The objective measure of
discontinuity used when selecting units is known as the
`join cost'. The ideal join cost will measure
`perceived' discontinuity, based on easily measurable
spectral properties of the units being joined, in order
to ensure smooth and natural-sounding synthetic speech.
In this paper we describe a perceptual experiment
conducted to measure the correlation between
`subjective' human perception and various `objective'
spectrally-based measures proposed in the literature.
Also we report new objective distance measures derived
from various distance metrics based on these spectral
features, which have good correlation with human
perception of concatenation discontinuities. Our
experiments used a state-of-the-art unit-selection
text-to-speech system: `rVoice' from Rhetorical Systems
Ltd.
|
|
[127]
|
Jesper Salomon, Simon King, and Miles Osborne.
Framewise phone classification using support vector machines.
In Proceedings International Conference on Spoken Language
Processing, Denver, 2002.
[ bib |
.ps |
.pdf ]
We describe the use of Support Vector Machines for
phonetic classification on the TIMIT corpus. Unlike
previous work, in which entire phonemes are classified,
our system operates in a framewise manner and
is intended for use as the front-end of a hybrid system
similar to ABBOT. We therefore avoid the problems of
classifying variable-length vectors. Our frame-level
phone classification accuracy on the complete TIMIT
test set is competitive with other results from the
literature. In addition, we address the serious problem
of scaling Support Vector Machines by using
the Kernel Fisher Discriminant.
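A toy sketch of framewise classification with an SVM (the
features and classes are invented; the TIMIT setup and the
kernel Fisher discriminant extension are not shown): each
frame's cepstral vector is classified independently into a
phone label.

    # Toy sketch only: framewise phone classification with a support vector
    # machine. Real experiments would use TIMIT cepstral frames and far more
    # data; the kernel Fisher discriminant variant is not shown.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Pretend frames: 13-dimensional "cepstral" vectors for three classes.
    X = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 13))
                   for c in (-2, 0, 2)])
    y = np.repeat(["s", "ah", "iy"], 100)

    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
    test = rng.normal(loc=2, scale=1.0, size=(5, 13))
    print(clf.predict(test))        # framewise decisions, one label per frame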
|
|
[128]
|
J. Frankel and S. King.
ASR - articulatory speech recognition.
In Proc. Eurospeech, pages 599-602, Aalborg, Denmark,
September 2001.
[ bib |
.ps |
.pdf ]
In this paper we report recent work on a speech
recognition system using a combination of acoustic and
articulatory features as input. Linear dynamic models
are used to capture the trajectories which characterize
each segment type. We describe classification and
recognition tasks for systems based on acoustic data in
conjunction with both real and automatically recovered
articulatory parameters.
|
|
[129]
|
J. Frankel and S. King.
Speech recognition in the articulatory domain: investigating an
alternative to acoustic HMMs.
In Proc. Workshop on Innovations in Speech Processing, April
2001.
[ bib |
.ps |
.pdf ]
We describe a speech recognition system which uses a
combination of acoustic and articulatory features as
input. Linear dynamic models capture the trajectories
which characterize each segment type. We describe
classification and recognition tasks for systems based
on acoustic data in conjunction with both real and
automatically recovered articulatory parameters.
|
|
[130]
|
J. Frankel, K. Richmond, S. King, and P. Taylor.
An automatic speech recognition system using neural networks and
linear dynamic models to recover and model articulatory traces.
In Proc. ICSLP, 2000.
[ bib |
.ps |
.pdf ]
In this paper we describe a speech recognition system
using linear dynamic models and articulatory features.
Experiments are reported in which measured articulation
from the MOCHA corpus has been used, along with those
where the articulatory parameters are estimated from
the speech signal using a recurrent neural network.
|
|
[131]
|
S. King, P. Taylor, J. Frankel, and K. Richmond.
Speech recognition via phonetically-featured syllables.
In PHONUS, volume 5, pages 15-34, Institute of Phonetics,
University of the Saarland, 2000.
[ bib |
.ps |
.pdf ]
We describe recent work on two new automatic speech
recognition systems. The first part of this paper
describes the components of a system based on
phonological features (which we call EspressoA) in
which the values of these features are estimated from
the speech signal before being used as the basis for
recognition. In the second part of the paper, another
system (which we call EspressoB) is described in which
articulatory parameters are used instead of
phonological features and a linear dynamical system
model is used to perform recognition from automatically
estimated values of these articulatory parameters.
|
|
[132]
|
Simon King and Paul Taylor.
Detection of phonological features in continuous speech using neural
networks.
Computer Speech and Language, 14(4):333-353, 2000.
[ bib |
.ps |
.pdf ]
We report work on the first component of a two stage
speech recognition architecture based on phonological
features rather than phones. The paper reports
experiments on three phonological feature systems: 1)
the Sound Pattern of English (SPE) system which uses
binary features, 2) a multi-valued (MV) feature system
which uses traditional phonetic categories such as
manner, place, etc., and 3) Government Phonology (GP)
which uses a set of structured primes. All experiments
used recurrent neural networks to perform feature
detection. In these networks the input layer is a
standard framewise cepstral representation, and the
output layer represents the values of the features. The
system effectively produces a representation of the
most likely phonological features for each input frame.
All experiments were carried out on the TIMIT speaker
independent database. The networks performed well in
all cases, with the average accuracy for a single
feature ranging from 86 to 93 percent. We describe
these experiments in detail, and discuss the
justification and potential advantages of using
phonological features rather than phones for the basis
of speech recognition.
|
|
[133]
|
Simon King and Alan Wrench.
Dynamical system modelling of articulator movement.
In Proc. ICPhS 99, pages 2259-2262, San Francisco, August
1999.
[ bib |
.ps |
.pdf ]
We describe the modelling of articulatory movements
using (hidden) dynamical system models trained on
Electro-Magnetic Articulograph (EMA) data. These models
can be used for automatic speech recognition and to
give insights into articulatory behaviour. They belong
to a class of continuous-state Markov models, which we
believe can offer improved performance over
conventional Hidden Markov Models (HMMs) by better
accounting for the continuous nature of the underlying
speech production process - that is, the movements of
the articulators. To assess the performance of our
models, a simple speech recognition task was used, on
which the models show promising results.
|
|
[134]
|
Simon King, Todd Stephenson, Stephen Isard, Paul Taylor, and Alex Strachan.
Speech recognition via phonetically featured syllables.
In Proc. ICSLP `98, pages 1031-1034, Sydney, Australia,
December 1998.
[ bib |
.ps |
.pdf ]
We describe a speech recogniser which uses a speech
production-motivated phonetic-feature description of
speech. We argue that this is a natural way to describe
the speech signal and offers an efficient intermediate
parameterisation for use in speech recognition. We also
propose to model this description at the syllable
rather than phone level. The ultimate goal of this work
is to generate syllable models whose parameters
explicitly describe the trajectories of the phonetic
features of the syllable. We hope to move away from
Hidden Markov Models (HMMs) of context-dependent phone
units. As a step towards this, we present a preliminary
system which consists of two parts: recognition of the
phonetic features from the speech signal using a neural
network; and decoding of the feature-based description
into phonemes using HMMs.
|
|
[135]
|
Paul A. Taylor, S. King, S. D. Isard, and H. Wright.
Intonation and dialogue context as constraints for speech
recognition.
Language and Speech, 41(3):493-512, 1998.
[ bib |
.ps |
.pdf ]
|
|
[136]
|
Simon King.
Using Information Above the Word Level for Automatic Speech
Recognition.
PhD thesis, University of Edinburgh, 1998.
[ bib |
.ps |
.pdf ]
This thesis introduces a general method for using
information at the utterance level and across
utterances for automatic speech recognition. The method
involves classification of utterances into types. Using
constraints at the utterance level via this
classification method allows information sources to be
exploited which cannot necessarily be used directly for
word recognition. The classification power of three
sources of information is investigated: the language
model in the speech recogniser, dialogue context and
intonation. The method is applied to a challenging
task: the recognition of spontaneous dialogue speech.
The results show success in automatic utterance type
classification, and subsequent word error rate
reduction over a baseline system, when all three
information sources are probabilistically combined.
|
|
[137]
|
Simon King, Thomas Portele, and Florian Höfer.
Speech synthesis using non-uniform units in the Verbmobil project.
In Proc. Eurospeech 97, volume 2, pages 569-572, Rhodes,
Greece, September 1997.
[ bib |
.ps |
.pdf ]
We describe a concatenative speech synthesiser for
British English which uses the HADIFIX inventory
structure originally developed for German by Portele.
An inventory of non-uniform units was investigated with
the aim of improving segmental quality compared to
diphones. A combination of soft (diphone) and hard
concatenation was used, which allowed a dramatic
reduction in inventory size. We also present a unit
selection algorithm which selects an optimum sequence
of units from this inventory for a given phoneme
sequence. The work described is part of the
concept-to-speech synthesiser for the language and
speech project Verbmobil which is funded by the German
Ministry of Science (BMBF).
|
|
[138]
|
Simon King.
Final report for Verbmobil Teilprojekt 4.4.
Technical Report ISSN 1434-8845, IKP, Universität Bonn, January 1997.
Verbmobil-Report 195 available at http://verbmobil.dfki.de.
[ bib ]
Final report for Verbmobil English speech synthesis
|
|
[139]
|
Paul A. Taylor, Simon King, Stephen Isard, Helen Wright, and Jacqueline Kowtko.
Using intonation to constrain language models in speech recognition.
In Proc. Eurospeech'97, Rhodes, 1997.
[ bib |
.pdf ]
This paper describes a method for using intonation to
reduce word error rate in a speech recognition system
designed to recognise spontaneous dialogue speech. We
use a form of dialogue analysis based on the theory of
conversational games. Different move types under this
analysis conform to different language models.
Different move types are also characterised by
different intonational tunes. Our overall recognition
strategy is first to predict from intonation the type
of game move that a test utterance represents, and then
to use a bigram language model for that type of move
during recognition.
|
|
[140]
|
Simon King.
Users Manual for Verbmobil Teilprojekt 4.4.
IKP, Universität Bonn, October 1996.
[ bib ]
Verbmobil English synthesiser user's manual
|
|
[141]
|
Simon King.
Inventory design for Verbmobil Teilprojekt 4.4.
Technical report, IKP, Universität Bonn, October 1996.
[ bib ]
Inventory design for Verbmobil English speech
synthesis
|
|
[142]
|
Paul A. Taylor, Hiroshi Shimodaira, Stephen Isard, Simon King, and Jacqueline
Kowtko.
Using prosodic information to constrain language models for spoken
dialogue.
In Proc. ICSLP `96, Philadelphia, 1996.
[ bib |
.ps |
.pdf ]
We present work intended to improve speech recognition
performance for computer dialogue by taking into
account the way that dialogue context and intonational
tune interact to limit the possibilities for what an
utterance might be. We report here on the extra
constraint achieved in a bigram language model
expressed in terms of entropy by using separate
submodels for different sorts of dialogue acts and
trying to predict which submodel to apply by analysis
of the intonation of the sentence being recognised.
|
|
[143]
|
Stephen Isard, Simon King, Paul A. Taylor, and Jacqueline Kowtko.
Prosodic information in a speech recognition system intended for
dialogue.
In IEEE Workshop in speech recognition, Snowbird, Utah, 1995.
[ bib ]
We report on an automatic speech recognition system
intended for use in dialogue, whose original aspect is
its use of prosodic information for two different
purposes. The first is to improve the word level
accuracy of the system. The second is to constrain the
language model applied to a given utterance by taking
into account the way that dialogue context and
intonational tune interact to limit the possibilities
for what an utterance might be.
|