|
[1]
|
Z. Ling, K. Richmond, and J. Yamagishi.
Articulatory control of HMM-based parametric speech synthesis using
feature-space-switched multiple regression.
IEEE Transactions on Audio, Speech, and Language Processing,
21(1):207-219, 2013.
[ bib |
DOI ]
In previous work we proposed a method to control the
characteristics of synthetic speech flexibly by
integrating articulatory features into a hidden Markov
model (HMM) based parametric speech synthesiser. In
this method, a unified acoustic-articulatory model is
trained, and context-dependent linear transforms are
used to model the dependency between the two feature
streams. In this paper, we go significantly further and
propose a feature-space-switched multiple regression
HMM to improve the performance of articulatory control.
A multiple regression HMM (MRHMM) is adopted to model
the distribution of acoustic features, with
articulatory features used as exogenous explanatory
variables. A separate Gaussian mixture model (GMM) is
introduced to model the articulatory space, and
articulatory-to-acoustic regression matrices are
trained for each component of this GMM, instead of for
the context-dependent states in the HMM. Furthermore,
we propose a task-specific context feature tailoring
method to ensure compatibility between state context
features and articulatory features that are manipulated
at synthesis time. The proposed method is evaluated on
two tasks, using a speech database with acoustic
waveforms and articulatory movements recorded in
parallel by electromagnetic articulography (EMA). In a
vowel identity modification task, the new method
achieves better performance when reconstructing target
vowels by varying articulatory inputs than our previous
approach. A second vowel creation task shows our new
method is highly effective at producing a new vowel
from appropriate articulatory representations which,
even though no acoustic samples for this vowel are
present in the training data, is shown to sound highly
natural.
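
As a rough illustration of the regression scheme described in this
abstract (not the authors' implementation), the following numpy
sketch computes an acoustic mean as a GMM-posterior-weighted
combination of per-component articulatory-to-acoustic regressions;
all dimensions, parameters and names are hypothetical.

    import numpy as np

    # Illustrative dimensions: K mixtures over articulatory space,
    # DA articulatory dims, DY acoustic dims (all hypothetical).
    K, DA, DY = 4, 6, 13
    rng = np.random.default_rng(0)
    weights = np.full(K, 1.0 / K)                # GMM mixture weights
    means = rng.normal(size=(K, DA))             # GMM means (articulatory space)
    variances = np.ones((K, DA))                 # diagonal covariances
    A = rng.normal(scale=0.1, size=(K, DY, DA))  # per-component regression matrices
    b = rng.normal(scale=0.1, size=(K, DY))      # per-component bias vectors

    def log_gauss_diag(x, mu, var):
        # Log density of a diagonal-covariance Gaussian, broadcast over components.
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

    def acoustic_mean(x_art):
        # Mix the per-component regressed means by the GMM posterior of x_art.
        log_post = np.log(weights) + log_gauss_diag(x_art, means, variances)
        post = np.exp(log_post - np.logaddexp.reduce(log_post))
        per_component = A @ x_art + b            # shape (K, DY)
        return post @ per_component              # shape (DY,)

    print(acoustic_mean(rng.normal(size=DA)).shape)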
|
|
[2]
|
Korin Richmond and Steve Renals.
Ultrax: An animated midsagittal vocal tract display for speech
therapy.
In Proc. Interspeech, Portland, Oregon, USA, September 2012.
[ bib |
.pdf ]
Speech sound disorders (SSD) are the most common
communication impairment in childhood, and can hamper
social development and learning. Current speech therapy
interventions rely predominantly on the auditory skills
of the child, as little technology is available to
assist in diagnosis and therapy of SSDs. Realtime
visualisation of tongue movements has the potential to
bring enormous benefit to speech therapy. Ultrasound
scanning offers this possibility, although its display
may be hard to interpret. Our ultimate goal is to
exploit ultrasound to track tongue movement, while
displaying a simplified, diagrammatic vocal tract that
is easier for the user to interpret. In this paper, we
outline a general approach to this problem, combining a
latent space model with a dimensionality reducing model
of vocal tract shapes. We assess the feasibility of
this approach using magnetic resonance imaging (MRI)
scans to train a model of vocal tract shapes, which is
animated using electromagnetic articulography (EMA)
data from the same speaker.
Keywords: Ultrasound, speech therapy, vocal tract visualisation
|
|
[3]
|
Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi.
Vowel creation by articulatory control in HMM-based parametric
speech synthesis.
In Proc. Interspeech, Portland, Oregon, USA, September 2012.
[ bib |
.pdf ]
This paper presents a method to produce a new vowel
by articulatory control in hidden Markov model (HMM)
based parametric speech synthesis. A multiple
regression HMM (MRHMM) is adopted to model the
distribution of acoustic features, with articulatory
features used as external auxiliary variables. The
dependency between acoustic and articulatory features
is modelled by a group of linear transforms that are
either estimated context-dependently or determined by
the distribution of articulatory features. Vowel
identity is removed from the set of context features
used to ensure compatibility between the
context-dependent model parameters and the articulatory
features of a new vowel. At synthesis time, acoustic
features are predicted according to the input
articulatory features as well as context information.
With an appropriate articulatory feature sequence, a
new vowel can be generated even when it does not exist
in the training set. Experimental results show this
method is effective in creating the English vowel /2/
by articulatory control without using any acoustic
samples of this vowel.
Keywords: Speech synthesis, articulatory features,
multiple-regression hidden Markov model
|
|
[4]
|
Benigno Uria, Iain Murray, Steve Renals, and Korin Richmond.
Deep architectures for articulatory inversion.
In Proc. Interspeech, Portland, Oregon, USA, September 2012.
[ bib |
.pdf ]
We implement two deep architectures for the
acoustic-articulatory inversion mapping problem: a deep
neural network and a deep trajectory mixture density
network. We find that in both cases, deep architectures
produce more accurate predictions than shallow
architectures and that this is due to the higher
expressive capability of a deep model and not a
consequence of adding more adjustable parameters. We
also find that a deep trajectory mixture density
network is able to obtain better inversion accuracies
than smoothing the results of a deep neural network.
Our best model obtained an average root mean square
error of 0.885 mm on the MNGU0 test dataset.
Keywords: Articulatory inversion, deep neural network, deep
belief network, deep regression network, pretraining
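
To make the architecture comparison concrete, here is a minimal
numpy sketch of the forward pass of a deep versus a shallow
feedforward inversion network; the layer sizes, input window and
untrained random weights are purely illustrative and do not
reproduce the models evaluated in the paper.

    import numpy as np

    rng = np.random.default_rng(1)

    def init_mlp(sizes):
        # Random weights for a fully connected net; sizes = [in, h1, ..., out].
        return [(rng.normal(scale=1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
                for m, n in zip(sizes[:-1], sizes[1:])]

    def forward(params, x):
        # tanh hidden layers, linear output: acoustic frames -> articulator positions.
        for W, b in params[:-1]:
            x = np.tanh(x @ W + b)
        W, b = params[-1]
        return x @ W + b

    # Hypothetical sizes: 40-d acoustic context window in, 12 EMA coordinates out.
    deep = init_mlp([40, 300, 300, 300, 12])    # three hidden layers
    shallow = init_mlp([40, 1000, 12])          # one wide hidden layer

    acoustic = rng.normal(size=(5, 40))         # a small batch of input frames
    print(forward(deep, acoustic).shape, forward(shallow, acoustic).shape)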
|
|
[5]
|
Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi.
Vowel creation by articulatory control in HMM-based parametric
speech synthesis.
In Proc. The Listening Talker Workshop, page 72, Edinburgh, UK,
May 2012.
[ bib |
.pdf ]
|
|
[6]
|
Ingmar Steiner, Korin Richmond, Ian Marshall, and Calum D. Gray.
The magnetic resonance imaging subset of the mngu0 articulatory
corpus.
The Journal of the Acoustical Society of America,
131(2):EL106-EL111, January 2012.
[ bib |
DOI |
.pdf ]
This paper announces the availability of the magnetic
resonance imaging (MRI) subset of the mngu0 corpus, a
collection of articulatory speech data from one speaker
containing different modalities. This subset comprises
volumetric MRI scans of the speaker's vocal tract
during sustained production of vowels and consonants,
as well as dynamic mid-sagittal scans of repetitive
consonant-vowel (CV) syllable production. For
reference, high-quality acoustic recordings of the
speech material are also available. The raw data are
made freely available for research purposes.
Keywords: audio recording; magnetic resonance imaging; speech
processing
|
|
[7]
|
Benigno Uria, Steve Renals, and Korin Richmond.
A deep neural network for acoustic-articulatory speech inversion.
In Proc. NIPS 2011 Workshop on Deep Learning and Unsupervised
Feature Learning, Sierra Nevada, Spain, December 2011.
[ bib |
.pdf ]
In this work, we implement a deep belief network for
the acoustic-articulatory inversion mapping problem. We
find that adding up to 3 hidden-layers improves
inversion accuracy. We also show that this improvement
is due to the higher expressive capability of a deep
model and not a consequence of adding more adjustable
parameters. Additionally, we show unsupervised
pretraining of the system improves its performance in
all cases, even for a 1 hidden-layer model. Our
implementation obtained an average root mean square
error of 0.95 mm on the MNGU0 test dataset, beating all
previously published results.
|
|
[8]
|
Korin Richmond, Phil Hoole, and Simon King.
Announcing the electromagnetic articulography (day 1) subset of the
mngu0 articulatory corpus.
In Proc. Interspeech, pages 1505-1508, Florence, Italy, August
2011.
[ bib |
.pdf ]
This paper serves as an initial announcement of the
availability of a corpus of articulatory data called
mngu0. This corpus will ultimately consist of a
collection of multiple sources of articulatory data
acquired from a single speaker: electromagnetic
articulography (EMA), audio, video, volumetric MRI
scans, and 3D scans of dental impressions. This data
will be provided free for research use. In this first
stage of the release, we are making available one
subset of EMA data, consisting of more than 1,300
phonetically diverse utterances recorded with a
Carstens AG500 electromagnetic articulograph.
Distribution of mngu0 will be managed by a dedicated
“forum-style” web site. This paper both outlines the
general goals motivating the distribution of the data
and the creation of the mngu0 web forum, and also
provides a description of the EMA data contained in
this initial release.
|
|
[9]
|
Ming Lei, Junichi Yamagishi, Korin Richmond, Zhen-Hua Ling, Simon King, and
Li-Rong Dai.
Formant-controlled HMM-based speech synthesis.
In Proc. Interspeech, pages 2777-2780, Florence, Italy, August
2011.
[ bib |
.pdf ]
This paper proposes a novel framework that enables us
to manipulate and control formants in HMM-based speech
synthesis. In this framework, the dependency between
formants and spectral features is modelled by piecewise
linear transforms; formant parameters are effectively
mapped by these to the means of Gaussian distributions
over the spectral synthesis parameters. The spectral
envelope features generated under the influence of
formants in this way may then be passed to high-quality
vocoders to generate the speech waveform. This provides
two major advantages over conventional frameworks.
First, we can achieve spectral modification by changing
formants only in those parts where we want control,
whereas the user must specify all formants manually in
conventional formant synthesisers (e.g. Klatt). Second,
this can produce high-quality speech. Our results show
the proposed method can control vowels in the
synthesized speech by manipulating F1 and F2 without
any degradation in synthesis quality.
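
A minimal sketch of the core dependency assumed above: the mean of
a Gaussian over spectral parameters is a linear function of formant
parameters, so shifting F1/F2 shifts the generated spectral
envelope. The single transform, dimensions and formant values below
are hypothetical stand-ins for the paper's piecewise
(decision-tree-clustered) transforms.

    import numpy as np

    rng = np.random.default_rng(2)
    DY, DF = 40, 4        # spectral feature dim, number of formant parameters (hypothetical)

    # A single linear transform standing in for one leaf of the piecewise transforms.
    A = rng.normal(scale=0.05, size=(DY, DF))
    b = rng.normal(size=DY)

    def spectral_mean(formants):
        # Gaussian mean over spectral parameters as a linear function of formants.
        return A @ formants + b

    f_orig = np.array([600.0, 1200.0, 2500.0, 3500.0])   # Hz, illustrative values
    f_mod = f_orig.copy()
    f_mod[:2] = [350.0, 2200.0]                          # push F1/F2 towards another vowel

    delta = spectral_mean(f_mod) - spectral_mean(f_orig) # induced change in the spectral mean
    print(delta.shape)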
|
|
[10]
|
Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi.
Feature-space transform tying in unified acoustic-articulatory
modelling of articulatory control of HMM-based speech synthesis.
In Proc. Interspeech, pages 117-120, Florence, Italy, August
2011.
[ bib |
.pdf ]
In previous work, we have proposed a method to control
the characteristics of synthetic speech flexibly by
integrating articulatory features into hidden Markov
model (HMM) based parametric speech synthesis. A
unified acoustic-articulatory model was trained and a
piecewise linear transform was adopted to describe the
dependency between these two feature streams. The
transform matrices were trained for each HMM state and
were tied based on each state's context. In this paper,
an improved acoustic-articulatory modelling method is
proposed. A Gaussian mixture model (GMM) is introduced
to model the articulatory space and the cross-stream
transform matrices are trained for each Gaussian
mixture instead of context-dependently. This means the
dependency relationship can vary with the change of
articulatory features flexibly. Our results show this
method improves the effectiveness of control over vowel
quality by modifying articulatory trajectories without
degrading naturalness.
|
|
[11]
|
J.P. Cabral, S. Renals, J. Yamagishi, and K. Richmond.
HMM-based speech synthesiser using the LF-model of the glottal
source.
In Proc. IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 4704-4707, May 2011.
[ bib |
DOI |
.pdf ]
A major factor which causes a deterioration in speech
quality in HMM-based speech synthesis is the use of a
simple delta pulse signal to generate the excitation of
voiced speech. This paper sets out a new approach to
using an acoustic glottal source model in HMM-based
synthesisers instead of the traditional pulse signal.
The goal is to improve speech quality and to better
model and transform voice characteristics. We have
found the new method decreases buzziness and also
improves prosodic modelling. A perceptual evaluation
has supported this finding by showing a 55.6% preference
for the new system, as against the baseline.
This improvement, while not being as significant as we
had initially expected, does encourage us to work on
developing the proposed speech synthesiser further.
|
|
[12]
|
Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi.
An analysis of HMM-based prediction of articulatory movements.
Speech Communication, 52(10):834-846, October 2010.
[ bib |
DOI ]
This paper presents an investigation into predicting
the movement of a speaker's mouth from text input using
hidden Markov models (HMM). A corpus of human
articulatory movements, recorded by electromagnetic
articulography (EMA), is used to train HMMs. To predict
articulatory movements for input text, a suitable model
sequence is selected and a maximum-likelihood parameter
generation (MLPG) algorithm is used to generate output
articulatory trajectories. Unified
acoustic-articulatory HMMs are introduced to integrate
acoustic features when an acoustic signal is also
provided with the input text. Several aspects of this
method are analyzed in this paper, including the
effectiveness of context-dependent modeling, the role
of supplementary acoustic input, and the
appropriateness of certain model structures for the
unified acoustic-articulatory models. When text is the
sole input, we find that fully context-dependent models
significantly outperform monophone and quinphone
models, achieving an average root mean square (RMS)
error of 1.945 mm and an average correlation
coefficient of 0.600. When both text and acoustic
features are given as input to the system, the
difference between the performance of quinphone models
and fully context-dependent models is no longer
significant. The best performance overall is achieved
using unified acoustic-articulatory quinphone HMMs with
separate clustering of acoustic and articulatory model
parameters, a synchronous-state sequence, and a
dependent-feature model structure, with an RMS error of
0.900 mm and a correlation coefficient of 0.855 on
average. Finally, we also apply the same quinphone HMMs
to the acoustic-articulatory, or inversion, mapping
problem, where only acoustic input is available. An
average root mean square (RMS) error of 1.076 mm and an
average correlation coefficient of 0.812 are achieved.
Taken together, our results demonstrate how text and
acoustic inputs both contribute to the prediction of
articulatory movements in the method used.
Keywords: Hidden Markov model; Articulatory features; Parameter
generation
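
The maximum-likelihood parameter generation step mentioned above
can be illustrated with a toy one-dimensional version: given
per-frame Gaussian means and variances for a static feature and its
(central-difference) delta, the smooth static trajectory is the
solution of a banded least-squares problem. This numpy sketch
ignores the multi-stream, state-sequence machinery of the full
system.

    import numpy as np

    def mlpg(means, variances):
        # means, variances: (T, 2) arrays of [static, delta] Gaussian parameters
        # per frame. Returns the static trajectory maximising the likelihood
        # under the constraint delta_t = 0.5 * (c[t+1] - c[t-1]).
        T = means.shape[0]
        W = np.zeros((2 * T, T))            # window matrix: statics -> [static; delta]
        for t in range(T):
            W[2 * t, t] = 1.0               # static window
            if 0 < t < T - 1:               # central-difference delta window
                W[2 * t + 1, t - 1] = -0.5
                W[2 * t + 1, t + 1] = 0.5
        mu = means.reshape(-1)
        prec = 1.0 / variances.reshape(-1)  # diagonal precisions
        lhs = W.T @ (prec[:, None] * W)
        rhs = W.T @ (prec * mu)
        return np.linalg.solve(lhs, rhs)

    # Toy input: noisy static means, zero delta means, per-frame variances.
    T = 50
    rng = np.random.default_rng(3)
    static = np.sin(np.linspace(0, np.pi, T)) + 0.1 * rng.normal(size=T)
    means = np.stack([static, np.zeros(T)], axis=1)
    variances = np.stack([np.full(T, 0.01), np.full(T, 0.001)], axis=1)
    print(mlpg(means, variances).shape)     # (50,) smoothed trajectory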
|
|
[13]
|
Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi.
HMM-based text-to-articulatory-movement prediction and analysis of
critical articulators.
In Proc. Interspeech, pages 2194-2197, Makuhari, Japan,
September 2010.
[ bib |
.pdf ]
In this paper we present a method to predict the
movement of a speaker's mouth from text input using
hidden Markov models (HMM). We have used a corpus of
human articulatory movements, recorded by
electromagnetic articulography (EMA), to train HMMs. To
predict articulatory movements from text, a suitable
model sequence is selected and the maximum-likelihood
parameter generation (MLPG) algorithm is used to
generate output articulatory trajectories. In our
experiments, we find that fully context-dependent
models outperform monophone and quinphone models,
achieving an average root mean square (RMS) error of
1.945mm when state durations are predicted from text,
and 0.872mm when natural state durations are used.
Finally, we go on to analyze the prediction error for
different EMA dimensions and phone types. We find a
clear pattern emerges that the movements of so-called
critical articulators can be predicted more accurately
than the average performance.
Keywords: Hidden Markov model, articulatory features, parameter
generation, critical articulators
|
|
[14]
|
Daniel Felps, Christian Geng, Michael Berger, Korin Richmond, and Ricardo
Gutierrez-Osuna.
Relying on critical articulators to estimate vocal tract spectra in
an articulatory-acoustic database.
In Proc. Interspeech, pages 1990-1993, September 2010.
[ bib |
.pdf ]
We present a new phone-dependent feature weighting
scheme that can be used to map articulatory
configurations (e.g. EMA) onto vocal tract spectra
(e.g. MFCC) through table lookup. The approach consists
of assigning feature weights according to a feature's
ability to predict the acoustic distance between
frames. Since an articulator's predictive accuracy is
phone-dependent (e.g., lip location is a better
predictor for bilabial sounds than for palatal sounds),
a unique weight vector is found for each phone.
Inspection of the weights reveals a correspondence with
the expected critical articulators for many phones. The
proposed method reduces overall cepstral error by 6%
when compared to a uniform weighting scheme. Vowels
show the greatest benefit, though improvements occur
for 80% of the tested phones.
Keywords: speech production, speech synthesis
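
A minimal numpy sketch of the table-lookup idea described above,
assuming a toy articulatory-acoustic table and a hypothetical
phone-dependent weight vector; the real scheme learns one weight
vector per phone from the data.

    import numpy as np

    rng = np.random.default_rng(4)
    N, DA, DY = 500, 12, 24                 # table size, EMA dim, MFCC dim (hypothetical)
    ema_table = rng.normal(size=(N, DA))    # stored articulatory frames
    mfcc_table = rng.normal(size=(N, DY))   # their paired spectral frames

    # One weight vector per phone in the real scheme; here a single invented
    # vector that emphasises (say) lip-related dimensions for a bilabial.
    w_bilabial = np.ones(DA)
    w_bilabial[:2] = 4.0

    def lookup(ema_query, weights):
        # Nearest stored frame under a phone-dependent weighted Euclidean distance.
        d2 = ((ema_table - ema_query) ** 2 * weights).sum(axis=1)
        return mfcc_table[np.argmin(d2)]

    print(lookup(rng.normal(size=DA), w_bilabial).shape)   # (24,)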
|
|
[15]
|
Korin Richmond, Robert Clark, and Sue Fitt.
On generating Combilex pronunciations via morphological analysis.
In Proc. Interspeech, pages 1974-1977, Makuhari, Japan,
September 2010.
[ bib |
.pdf ]
Combilex is a high-quality lexicon that has been
developed specifically for speech technology purposes
and recently released by CSTR. Combilex benefits from
many advanced features. This paper explores one of
these: the ability to generate fully-specified
transcriptions for morphologically derived words
automatically. This functionality was originally
implemented to encode the pronunciations of derived
words in terms of their constituent morphemes, thus
accelerating lexicon development and ensuring a high
level of consistency. In this paper, we propose this
method of modelling pronunciations can be exploited
further by combining it with a morphological parser,
thus yielding a method to generate full transcriptions
for unknown derived words. Not only could this
accelerate adding new derived words to Combilex, but it
could also serve as an alternative to conventional
letter-to-sound rules. This paper presents preliminary
work indicating this is a promising direction.
Keywords: combilex lexicon, letter-to-sound rules,
grapheme-to-phoneme conversion, morphological
decomposition
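
As a toy illustration of the general idea (not the Combilex
formalism or phone set), a derived word's pronunciation can be
assembled from a base-form entry plus an affix entry, with a
respelling rule applied at the morpheme boundary; every entry and
rule below is invented for the example.

    # Toy base-form lexicon and affix entries (invented phone strings).
    base_lexicon = {
        "happy": ["h", "a", "p", "ii"],
        "help":  ["h", "e", "l", "p"],
    }
    suffixes = {
        "ness": ["n", "@", "s"],
        "ful":  ["f", "@", "l"],
    }

    def derive(base, suffix):
        # Concatenate morpheme pronunciations; a real system would also apply
        # stress placement and boundary rules at this point.
        phones = list(base_lexicon[base])
        if suffix == "ness" and phones[-1] == "ii":
            phones[-1] = "i"                # toy boundary rule: happy -> happi-
        return phones + suffixes[suffix]

    print(derive("happy", "ness"))          # ['h', 'a', 'p', 'i', 'n', '@', 's']
    print(derive("help", "ful"))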
|
|
[16]
|
João Cabral, Steve Renals, Korin Richmond, and Junichi Yamagishi.
Transforming voice source parameters in a HMM-based speech
synthesiser with glottal post-filtering.
In Proc. 7th ISCA Speech Synthesis Workshop (SSW7), pages
365-370, NICT/ATR, Kyoto, Japan, September 2010.
[ bib |
.pdf ]
Control over voice quality, e.g. breathy and tense
voice, is important for speech synthesis applications.
For example, transformations can be used to modify
aspects of the voice related to the speaker's identity
and to improve expressiveness. However, it is hard to
modify voice characteristics of the synthetic speech
without degrading speech quality. State-of-the-art
statistical speech synthesisers, in particular, do not
typically allow control over parameters of the
glottal source, which are strongly correlated with
voice quality. Consequently, the control of voice
characteristics in these systems is limited. In
contrast, the HMM-based speech synthesiser proposed in
this paper uses an acoustic glottal source model. The
system passes the glottal signal through a whitening
filter to obtain the excitation of voiced sounds. This
technique, called glottal post-filtering, allows the
voice characteristics of the synthetic speech to be
transformed by modifying the source model parameters. We
evaluated the proposed synthesiser in a perceptual
experiment, in terms of speech naturalness, intelligibility, and
similarity to the original speaker's voice. The results
show that it performed as well as a HMM-based
synthesiser, which generates the speech signal with a
commonly used high-quality speech vocoder.
Keywords: HMM-based speech synthesis, voice quality, glottal
post-filter
|
|
[17]
|
Gregor Hofer and Korin Richmond.
Comparison of HMM and TMDN methods for lip synchronisation.
In Proc. Interspeech, pages 454-457, Makuhari, Japan,
September 2010.
[ bib |
.pdf ]
This paper presents a comparison between a hidden
Markov model (HMM) based method and a novel artificial
neural network (ANN) based method for lip
synchronisation. Both model types were trained on
motion tracking data, and a perceptual evaluation was
carried out comparing the output of the models, both to
each other and to the original tracked data. It was
found that the ANN-based method was judged
significantly better than the HMM based method.
Furthermore, the original data was not judged
significantly better than the output of the ANN method.
Keywords: hidden Markov model (HMM), mixture density network,
lip synchronisation, inversion mapping
|
|
[18]
|
Alice Turk, James Scobbie, Christian Geng, Barry Campbell, Catherine Dickie,
Eddie Dubourg, Ellen Gurman Bard, William Hardcastle, Mariam Hartinger, Simon
King, Robin Lickley, Cedric Macmartin, Satsuki Nakai, Steve Renals, Korin
Richmond, Sonja Schaeffler, Kevin White, Ronny Wiegand, and Alan Wrench.
An Edinburgh speech production facility.
Poster presented at the 12th Conference on Laboratory Phonology,
Albuquerque, New Mexico., July 2010.
[ bib |
.pdf ]
|
|
[19]
|
Gregor Hofer, Korin Richmond, and Michael Berger.
Lip synchronization by acoustic inversion.
Poster at Siggraph 2010, 2010.
[ bib |
.pdf ]
|
|
[20]
|
Alice Turk, James Scobbie, Christian Geng, Cedric Macmartin, Ellen Bard, Barry
Campbell, Catherine Dickie, Eddie Dubourg, Bill Hardcastle, Phil Hoole, Evia
Kanaida, Robin Lickley, Satsuki Nakai, Marianne Pouplier, Simon King, Steve
Renals, Korin Richmond, Sonja Schaeffler, Ronnie Wiegand, Kevin White, and
Alan Wrench.
The Edinburgh Speech Production Facility's articulatory corpus of
spontaneous dialogue.
The Journal of the Acoustical Society of America,
128(4):2429-2429, 2010.
[ bib |
DOI ]
The EPSRC-funded Edinburgh Speech Production Facility is
built around two synchronized Carstens AG500
electromagnetic articulographs (EMAs) in order to
capture articulatory/acoustic data from spontaneous
dialogue. An initial articulatory corpus was designed
with two aims. The first was to elicit a range of
speech styles/registers from speakers, and therefore
provide an alternative to fully scripted corpora. The
second was to extend the corpus beyond monologue, by
using tasks that promote natural discourse and
interaction. A subsidiary driver was to use dialects
from outwith North America: dialogues paired up a
Scottish English and a Southern British English
speaker. Tasks. Monologue: Story reading of “Comma
Gets a Cure” [Honorof et al. (2000)], lexical sets
[Wells (1982)], spontaneous story telling,
diadochokinetic tasks. Dialogue: Map tasks [Anderson et
al. (1991)], “Spot the Difference” picture tasks
[Bradlow et al. (2007)], story-recall. Shadowing of
the spontaneous story telling by the second
participant. Each dialogue session includes
approximately 30 min of speech, and there are
acoustics-only baseline materials. We will introduce
the corpus and highlight the role of articulatory
production data in helping provide a fuller
understanding of various spontaneous speech phenomena
by presenting examples of naturally occurring covert
speech errors, accent accommodation, turn taking
negotiation, and shadowing.
|
|
[21]
|
K. Richmond.
Preliminary inversion mapping results with a new EMA corpus.
In Proc. Interspeech, pages 2835-2838, Brighton, UK, September
2009.
[ bib |
.pdf ]
In this paper, we apply our inversion mapping method,
the trajectory mixture density network (TMDN), to a new
corpus of articulatory data, recorded with a Carstens
AG500 electromagnetic articulograph. This new data set,
mngu0, is relatively large and phonetically rich, among
other beneficial characteristics. We obtain good
results, with a root mean square (RMS) error of only
0.99mm. This compares very well with our previous
lowest result of 1.54mm RMS error for equivalent coils
of the MOCHA fsew0 EMA data. We interpret this as
showing the mngu0 data set is potentially more
consistent than the fsew0 data set, and is very useful
for research which calls for articulatory trajectory
data. It also supports our view that the TMDN is very
much suited to the inversion mapping problem.
Keywords: acoustic-articulatory inversion mapping, neural
network
|
|
[22]
|
K. Richmond, R. Clark, and S. Fitt.
Robust LTS rules with the Combilex speech technology lexicon.
In Proc. Interspeech, pages 1295-1298, Brighton, UK, September
2009.
[ bib |
.pdf ]
Combilex is a high quality pronunciation lexicon aimed
at speech technology applications that has recently
been released by CSTR. Combilex benefits from several
advanced features. This paper evaluates one of these:
the explicit alignment of phones to graphemes in a
word. This alignment can help to rapidly develop robust
and accurate letter-to-sound (LTS) rules, without
needing to rely on automatic alignment methods. To
evaluate this, we used Festival's LTS module, comparing
its standard automatic alignment with Combilex's
explicit alignment. Our results show using Combilex's
alignment improves LTS accuracy: 86.50% words correct
as opposed to 84.49%, with our most general form of
lexicon. In addition, building LTS models is greatly
accelerated, as the need to list allowed alignments is
removed. Finally, loose comparison with other studies
indicates Combilex is a superior quality lexicon in
terms of consistency and size.
Keywords: combilex, letter-to-sound rules, grapheme-to-phoneme
conversion
|
|
[23]
|
I. Steiner and K. Richmond.
Towards unsupervised articulatory resynthesis of German utterances
using EMA data.
In Proc. Interspeech, pages 2055-2058, Brighton, UK, September
2009.
[ bib |
.pdf ]
As part of ongoing research towards integrating an
articulatory synthesizer into a text-to-speech (TTS)
framework, a corpus of German utterances recorded with
electromagnetic articulography (EMA) is resynthesized
to provide training data for statistical models. The
resynthesis is based on a measure of similarity between
the original and resynthesized EMA trajectories,
weighted by articulatory relevance. Preliminary results
are discussed and future work outlined.
Keywords: articulatory speech synthesis, copy synthesis,
electromagnetic articulography, EMA
|
|
[24]
|
Z. Ling, K. Richmond, J. Yamagishi, and R. Wang.
Integrating articulatory features into HMM-based parametric speech
synthesis.
IEEE Transactions on Audio, Speech and Language Processing,
17(6):1171-1185, August 2009.
IEEE SPS 2010 Young Author Best Paper Award.
[ bib |
DOI ]
This paper presents an investigation of ways to
integrate articulatory features into Hidden Markov
Model (HMM)-based parametric speech synthesis,
primarily with the aim of improving the performance of
acoustic parameter generation. The joint distribution
of acoustic and articulatory features is estimated
during training and is then used for parameter
generation at synthesis time in conjunction with a
maximum-likelihood criterion. Different model
structures are explored to allow the articulatory
features to influence acoustic modeling: model
clustering, state synchrony and cross-stream feature
dependency. The results of objective evaluation show
that the accuracy of acoustic parameter prediction can
be improved when shared clustering and
asynchronous-state model structures are adopted for
combined acoustic and articulatory features. More
significantly, our experiments demonstrate that
modeling the dependency between these two feature
streams can make speech synthesis more flexible. The
characteristics of synthetic speech can be easily
controlled by modifying generated articulatory features
as part of the process of acoustic parameter
generation.
|
|
[25]
|
J. Cabral, S. Renals, K. Richmond, and J. Yamagishi.
HMM-based speech synthesis with an acoustic glottal source model.
In Proc. The First Young Researchers Workshop in Speech
Technology, April 2009.
[ bib |
.pdf ]
A major cause of degradation of speech quality in
HMM-based speech synthesis is the use of a simple delta
pulse signal to generate the excitation of voiced
speech. This paper describes a new approach to using an
acoustic glottal source model in HMM-based
synthesisers. The goal is to improve speech quality and
parametric flexibility to better model and transform
voice characteristics.
|
|
[26]
|
I. Steiner and K. Richmond.
Generating gestural timing from EMA data using articulatory
resynthesis.
In Proc. 8th International Seminar on Speech Production,
Strasbourg, France, December 2008.
[ bib ]
As part of ongoing work to integrate an articulatory
synthesizer into a modular TTS platform, a method is
presented which allows gestural timings to be generated
automatically from EMA data. Further work is outlined
which will adapt the vocal tract model and phoneset to
English using new articulatory data, and use
statistical trajectory models.
|
|
[27]
|
Zhen-Hua Ling, Korin Richmond, Junichi Yamagishi, and Ren-Hua Wang.
Articulatory control of HMM-based parametric speech synthesis
driven by phonetic knowledge.
In Proc. Interspeech, pages 573-576, Brisbane, Australia,
September 2008.
[ bib |
.PDF ]
This paper presents a method to control the
characteristics of synthetic speech flexibly by
integrating articulatory features into a Hidden Markov
Model (HMM)-based parametric speech synthesis system.
In contrast to model adaptation and interpolation
approaches for speaking style control, this method is
driven by phonetic knowledge, and target speech samples
are not required. The joint distribution of parallel
acoustic and articulatory features considering
cross-stream feature dependency is estimated. At
synthesis time, acoustic and articulatory features are
generated simultaneously based on the
maximum-likelihood criterion. The synthetic speech can
be controlled flexibly by modifying the generated
articulatory features according to arbitrary phonetic
rules in the parameter generation process. Our
experiments show that the proposed method is effective
in both changing the overall character of synthesized
speech and in controlling the quality of a specific
vowel.
|
|
[28]
|
C. Qin, M. Carreira-Perpiñán, K. Richmond, A. Wrench, and S. Renals.
Predicting tongue shapes from a few landmark locations.
In Proc. Interspeech, pages 2306-2309, Brisbane, Australia,
September 2008.
[ bib |
.PDF ]
We present a method for predicting the midsagittal
tongue contour from the locations of a few landmarks
(metal pellets) on the tongue surface, as used in
articulatory databases such as MOCHA and the Wisconsin
XRDB. Our method learns a mapping using ground-truth
tongue contours derived from ultrasound data and
drastically improves over spline interpolation. We also
determine the optimal locations of the landmarks, and
the number of landmarks required to achieve a desired
prediction error: 3-4 landmarks are enough to achieve
0.3-0.2 mm error per point on the tongue.
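
The mapping described above can be imitated with a simple
regression from a few landmark values to a full contour. The sketch
below uses synthetic sinusoidal "contours" and ridge regression
purely to show the shape of the computation; the paper's method is
trained on ultrasound-derived tongue contours and also optimises
the landmark placement.

    import numpy as np

    rng = np.random.default_rng(5)

    # Synthetic "contours": 40 points per curve, heights only.
    n_train, n_pts = 200, 40
    phase = rng.uniform(0, 2 * np.pi, size=(n_train, 1))
    xs = np.linspace(0, 1, n_pts)
    contours = np.sin(2 * np.pi * xs + phase) + 0.05 * rng.normal(size=(n_train, n_pts))

    landmark_idx = [5, 15, 25, 35]          # hypothetical pellet positions
    X = contours[:, landmark_idx]           # landmark heights (the "EMA" input)
    Y = contours                            # the full contour (the target)

    # Ridge regression from a few landmarks to the whole contour.
    lam = 1e-3
    Xb = np.hstack([X, np.ones((n_train, 1))])
    W = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ Y)

    test = np.sin(2 * np.pi * xs + 0.7)
    pred = np.hstack([test[landmark_idx], 1.0]) @ W
    print(round(float(np.sqrt(np.mean((pred - test) ** 2))), 4))   # contour RMS error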
|
|
[29]
|
J. Cabral, S. Renals, K. Richmond, and J. Yamagishi.
Glottal spectral separation for parametric speech synthesis.
In Proc. Interspeech, pages 1829-1832, Brisbane, Australia,
September 2008.
[ bib |
.PDF ]
|
|
[30]
|
K. Richmond.
Trajectory mixture density networks with multiple mixtures for
acoustic-articulatory inversion.
In M. Chetouani, A. Hussain, B. Gas, M. Milgram, and J.-L. Zarader,
editors, Advances in Nonlinear Speech Processing, International
Conference on Non-Linear Speech Processing, NOLISP 2007, volume 4885 of
Lecture Notes in Computer Science, pages 263-272. Springer-Verlag Berlin
Heidelberg, December 2007.
[ bib |
DOI |
.pdf ]
We have previously proposed a trajectory model which
is based on a mixture density network (MDN) trained
with target variables augmented with dynamic features
together with an algorithm for estimating maximum
likelihood trajectories which respects the constraints
between those features. In this paper, we have extended
that model to allow diagonal covariance matrices and
multiple mixture components in the trajectory MDN
output probability density functions. We have evaluated
this extended model on an inversion mapping task and
found the trajectory model works well, outperforming
smoothing of equivalent trajectories using low-pass
filtering. Increasing the number of mixture components
in the TMDN improves results further.
|
|
[31]
|
K. Richmond.
A multitask learning perspective on acoustic-articulatory inversion.
In Proc. Interspeech, Antwerp, Belgium, August 2007.
[ bib |
.pdf ]
This paper proposes the idea that by viewing an
inversion mapping MLP from a Multitask Learning
perspective, we may be able to relax two constraints
which are inherent in using electromagnetic
articulography as a source of articulatory information
for speech technology purposes. As a first step to
evaluating this idea, we perform an inversion mapping
experiment in an attempt to ascertain whether the
hidden layer of a “multitask” MLP can act
beneficially as a hidden representation that is shared
between inversion mapping subtasks for multiple
articulatory targets. Our results in the case of the
tongue dorsum x-coordinate indicate this is indeed the
case and show good promise. Results for the tongue
dorsum y-coordinate however are not so clear-cut, and
will require further investigation.
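
A minimal numpy sketch of the multitask view discussed above: one
shared hidden representation feeds a separate linear output per
articulatory channel. Sizes, channel names and the untrained random
weights are hypothetical.

    import numpy as np

    rng = np.random.default_rng(6)
    D_in, D_hid = 40, 100                        # hypothetical acoustic and hidden sizes
    channels = ["td_x", "td_y", "ul_x", "ul_y"]  # example articulatory targets

    W_shared = rng.normal(scale=0.1, size=(D_in, D_hid))
    b_shared = np.zeros(D_hid)
    heads = {name: (rng.normal(scale=0.1, size=D_hid), 0.0) for name in channels}

    def predict(acoustic_frame):
        # One shared hidden representation, one linear output per channel:
        # the multitask reading of the inversion MLP.
        h = np.tanh(acoustic_frame @ W_shared + b_shared)
        return {name: float(h @ w + b) for name, (w, b) in heads.items()}

    print(predict(rng.normal(size=D_in)))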
|
|
[32]
|
K. Richmond, V. Strom, R. Clark, J. Yamagishi, and S. Fitt.
Festival Multisyn voices for the 2007 Blizzard Challenge.
In Proc. Blizzard Challenge Workshop (in Proc. SSW6), Bonn,
Germany, August 2007.
[ bib |
.pdf ]
This paper describes selected aspects of the Festival
Multisyn entry to the Blizzard Challenge 2007. We
provide an overview of the process of building the
three required voices from the speech data provided.
This paper focuses on new features of Multisyn which
are currently under development and which have been
employed in the system used for this Blizzard
Challenge. These differences are the application of a
more flexible phonetic lattice representation during
forced alignment labelling and the use of a pitch
accent target cost component. Finally, we also examine
aspects of the speech data provided for this year's
Blizzard Challenge and raise certain issues for
discussion concerning the aim of comparing voices made
with differing subsets of the data provided.
|
|
[33]
|
S. King, J. Frankel, K. Livescu, E. McDermott, K. Richmond, and M. Wester.
Speech production knowledge in automatic speech recognition.
Journal of the Acoustical Society of America, 121(2):723-742,
February 2007.
[ bib |
.pdf ]
Although much is known about how speech is produced,
and research into speech production has resulted in
measured articulatory data, feature systems of
different kinds and numerous models, speech production
knowledge is almost totally ignored in current
mainstream approaches to automatic speech recognition.
Representations of speech production allow simple
explanations for many phenomena observed in speech
which cannot be easily analyzed from either acoustic
signal or phonetic transcription alone. In this
article, we provide a survey of a growing body of work
in which such representations are used to improve
automatic speech recognition.
|
|
[34]
|
J. Cabral, S. Renals, K. Richmond, and J. Yamagishi.
Towards an improved modeling of the glottal source in statistical
parametric speech synthesis.
In Proc. of the 6th ISCA Workshop on Speech Synthesis, Bonn,
Germany, 2007.
[ bib |
.pdf ]
This paper proposes the use of the Liljencrants-Fant
model (LF-model) to represent the glottal source signal
in HMM-based speech synthesis systems. These systems
generally use a pulse train to model the periodicity of
the excitation signal of voiced speech. However, this
model produces a strong and uniform harmonic structure
throughout the spectrum of the excitation which makes
the synthetic speech sound buzzy. The use of a mixed
band excitation and phase manipulation reduces this
effect but it can result in degradation of the speech
quality if the noise component is not weighted
carefully. In turn, the LF-waveform has a decaying
spectrum at higher frequencies, which is more similar
to the real glottal source excitation signal. We
conducted a perceptual experiment to test the
hypothesis that the LF-model can perform as well as or
better than the pulse train in a HMM-based speech
synthesizer. In the synthesis, we used the mean values
of the LF-parameters, calculated by measurements of the
recorded speech. The result of this study is important
not only regarding the improvement in speech quality of
these type of systems, but also because the LF-model
can be used to model many characteristics of the
glottal source, such as voice quality, which are
important for voice transformation and generation of
expressive speech.
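
For reference, a simplified LF-model glottal flow derivative can be
generated as below. The sketch fixes the exponential growth
constant directly and approximates the return-phase constant,
rather than solving the LF area-balance equations, so it is an
illustrative waveform generator only; parameter names and values
are assumptions.

    import numpy as np

    def lf_pulse(fs=16000, f0=120.0, Ra=0.01, Rg=1.2, Rk=0.3, alpha=300.0):
        # Simplified LF-model glottal flow derivative for one period.
        # alpha is fixed rather than solved from the area-balance constraint,
        # and eps is approximated, so this is illustrative, not a calibrated fit.
        T0 = 1.0 / f0
        Tp = T0 / (2.0 * Rg)          # instant of peak glottal flow
        Te = Tp * (1.0 + Rk)          # instant of main excitation (minimum of e(t))
        Ta = Ra * T0                  # return-phase time constant
        wg = np.pi / Tp
        Ee = 1.0
        E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))
        eps = 1.0 / Ta                # exact eps solves eps*Ta = 1 - exp(-eps*(T0 - Te))

        t = np.arange(int(fs * T0)) / fs
        return np.where(
            t <= Te,
            E0 * np.exp(alpha * t) * np.sin(wg * t),
            -(Ee / (eps * Ta)) * (np.exp(-eps * (t - Te)) - np.exp(-eps * (T0 - Te))),
        )

    excitation = np.concatenate([lf_pulse() for _ in range(5)])   # a few periods
    print(excitation.shape, float(excitation.min()))              # min approx. -Ee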
|
|
[35]
|
Robert A. J. Clark, Korin Richmond, and Simon King.
Multisyn: Open-domain unit selection for the Festival speech
synthesis system.
Speech Communication, 49(4):317-330, 2007.
[ bib |
DOI |
.pdf ]
We present the implementation and evaluation of an
open-domain unit selection speech synthesis engine
designed to be flexible enough to encourage further
unit selection research and allow rapid voice
development by users with minimal speech synthesis
knowledge and experience. We address the issues of
automatically processing speech data into a usable
voice using automatic segmentation techniques and how
the knowledge obtained at labelling time can be
exploited at synthesis time. We describe target cost
and join cost implementation for such a system and
describe the outcome of building voices with a number
of different sized datasets. We show that, in a
competitive evaluation, voices built using this
technology compare favourably to other systems.
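
The target-cost/join-cost search at the heart of such an engine can
be sketched as a small Viterbi dynamic programme over candidate
units; the toy candidates and cost functions below are invented for
illustration and bear no relation to Multisyn's actual cost
components.

    def select_units(candidates, target_cost, join_cost):
        # Viterbi search: `candidates` is a list (one entry per target position)
        # of lists of candidate units; costs are callables. Returns the unit
        # sequence with minimum total target-plus-join cost.
        best = [{u: (target_cost(0, u), None) for u in candidates[0]}]
        for t in range(1, len(candidates)):
            layer = {}
            for u in candidates[t]:
                prev, cost = min(((p, c + join_cost(p, u))
                                  for p, (c, _) in best[-1].items()),
                                 key=lambda x: x[1])
                layer[u] = (cost + target_cost(t, u), prev)
            best.append(layer)
        u, (total, _) = min(best[-1].items(), key=lambda kv: kv[1][0])
        path = [u]
        for t in range(len(candidates) - 1, 0, -1):
            path.append(best[t][path[-1]][1])
        return list(reversed(path)), total

    # Invented toy candidates and costs: three candidates per position.
    cands = [[(t, i) for i in range(3)] for t in range(4)]
    tc = lambda t, u: 0.5 * abs(u[1] - 1)       # prefer the middle candidate
    jc = lambda a, b: 0.3 * abs(a[1] - b[1])    # prefer smooth joins
    print(select_units(cands, tc, jc))          # ([(0, 1), (1, 1), (2, 1), (3, 1)], 0.0)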
|
|
[36]
|
Sue Fitt and Korin Richmond.
Redundancy and productivity in the speech technology lexicon - can we
do better?
In Proc. Interspeech 2006, September 2006.
[ bib |
.pdf ]
Current lexica for speech technology typically contain
much redundancy, while omitting useful information. A
comparison with lexica in other media and for other
purposes is instructive, as it highlights some features
we may borrow for text-to-speech and speech recognition
lexica. We describe some aspects of the new lexicon we
are producing, Combilex, whose structure and
implementation is specifically designed to reduce
redundancy and improve the representation of productive
elements of English. Most importantly, many English
words are predictable derivations of baseforms, or
compounds. Storing the lexicon as a combination of
baseforms and derivational rules speeds up lexicon
development, and improves coverage and maintainability.
|
|
[37]
|
R. Clark, K. Richmond, V. Strom, and S. King.
Multisyn voices for the Blizzard Challenge 2006.
In Proc. Blizzard Challenge Workshop (Interspeech Satellite),
Pittsburgh, USA, September 2006.
(http://festvox.org/blizzard/blizzard2006.html).
[ bib |
.pdf ]
This paper describes the process of building unit
selection voices for the Festival Multisyn engine using
the ATR dataset provided for the Blizzard Challenge
2006. We begin by discussing recent improvements that
we have made to the Multisyn voice building process,
prompted by our participation in the Blizzard Challenge
2006. We then go on to discuss our interpretation of
the results observed. Finally, we conclude with some
comments and suggestions for the formulation of future
Blizzard Challenges.
|
|
[38]
|
K. Richmond.
A trajectory mixture density network for the acoustic-articulatory
inversion mapping.
In Proc. Interspeech, Pittsburgh, USA, September 2006.
[ bib |
.pdf ]
This paper proposes a trajectory model which is based
on a mixture density network trained with target
features augmented with dynamic features together with
an algorithm for estimating maximum likelihood
trajectories which respects constraints between the
static and derived dynamic features. This model was
evaluated on an inversion mapping task. We found the
introduction of the trajectory model successfully
reduced root mean square error by up to 7.5%, as
well as increasing correlation scores.
|
|
[39]
|
Robert A.J. Clark, Korin Richmond, and Simon King.
Multisyn voices from ARCTIC data for the Blizzard Challenge.
In Proc. Interspeech 2005, September 2005.
[ bib |
.pdf ]
This paper describes the process of building unit
selection voices for the Festival Multisyn engine using
four ARCTIC datasets, as part of the Blizzard
evaluation challenge. The build process is almost
entirely automatic, with very little need for human
intervention. We discuss the difference in the
evaluation results for each voice and evaluate the
suitability of the ARCTIC datasets for building this
type of voice.
|
|
[40]
|
G. Hofer, K. Richmond, and R. Clark.
Informed blending of databases for emotional speech synthesis.
In Proc. Interspeech, September 2005.
[ bib |
.ps |
.pdf ]
The goal of this project was to build a unit selection
voice that could portray emotions with varying
intensities. A suitable definition of an emotion was
developed along with a descriptive framework that
supported the work carried out. A single speaker was
recorded portraying happy and angry speaking styles.
Additionally a neutral database was also recorded. A
target cost function was implemented that chose units
according to emotion mark-up in the database. The
Dictionary of Affect supported the emotional target
cost function by providing an emotion rating for words
in the target utterance. If a word was particularly
'emotional', units from that emotion were favoured. In
addition, intensity could be varied, which resulted in a
bias to select a greater number of emotional units. A
perceptual evaluation was carried out and subjects were
able to reliably recognise emotions with varying
amounts of emotional units present in the target
utterance.
|
|
[41]
|
L. Onnis, P. Monaghan, K. Richmond, and N. Chater.
Phonology impacts segmentation in speech processing.
Journal of Memory and Language, 53(2):225-237, 2005.
[ bib |
.pdf ]
Peña, Bonatti, Nespor and Mehler (2002) investigated an
artificial language where the structure of words was
determined by nonadjacent dependencies between
syllables. They found that segmentation of continuous
speech could proceed on the basis of these
dependencies. However, Peña et al.'s artificial
language contained a confound in terms of phonology, in
that the dependent syllables began with plosives and
the intervening syllables began with continuants. We
consider three hypotheses concerning the role of
phonology in speech segmentation in this task: (1)
participants may recruit probabilistic phonotactic
information from their native language to the
artificial language learning task; (2) phonetic
properties of the stimuli, such as the gaps that
precede unvoiced plosives, can influence segmentation;
and (3) grouping by phonological similarity between
dependent syllables contributes to learning the
dependency. In a series of experiments controlling the
phonological and statistical structure of the language,
we found that segmentation performance is influenced by
the three factors in different degrees. Learning of
non-adjacent dependencies did not occur when (3) is
eliminated. We suggest that phonological processing
provides a fundamental contribution to distributional
analysis.
|
|
[42]
|
D. Toney, D. Feinberg, and K. Richmond.
Acoustic features for profiling mobile users of conversational
interfaces.
In S. Brewster and M. Dunlop, editors, 6th International
Symposium on Mobile Human-Computer Interaction - MobileHCI 2004, pages
394-398, Glasgow, Scotland, September 2004. Springer.
[ bib ]
Conversational interfaces allow human users to use
spoken language to interact with computer-based
information services. In this paper, we examine the
potential for personalizing speech-based human-computer
interaction according to the user's gender and age. We
describe a system that uses acoustic features of the
user's speech to automatically estimate these physical
characteristics. We discuss the difficulties of
implementing this process in relation to the high level
of environmental noise that is typical of mobile
human-computer interaction.
|
|
[43]
|
Robert A.J. Clark, Korin Richmond, and Simon King.
Festival 2 - build your own general purpose unit selection speech
synthesiser.
In Proc. 5th ISCA workshop on speech synthesis, 2004.
[ bib |
.ps |
.pdf ]
This paper describes version 2 of the Festival speech
synthesis system. Festival 2 provides a development
environment for concatenative speech synthesis, and now
includes a general purpose unit selection speech
synthesis engine. We discuss various aspects of unit
selection speech synthesis, focusing on the research
issues that relate to voice design and the automation
of the voice development process.
|
|
[44]
|
K. Richmond, S. King, and P. Taylor.
Modelling the uncertainty in recovering articulation from acoustics.
Computer Speech and Language, 17:153-172, 2003.
[ bib |
.pdf ]
This paper presents an experimental comparison of the
performance of the multilayer perceptron (MLP) with
that of the mixture density network (MDN) for an
acoustic-to-articulatory mapping task. A corpus of
acoustic-articulatory data recorded by electromagnetic
articulography (EMA) for a single speaker was used as
training and test data for this purpose. In theory, the
MDN is able to provide a richer, more flexible
description of the target variables in response to a
given input vector than the least-squares trained MLP.
Our results show that the mean likelihoods of the
target articulatory parameters for an unseen test set
were indeed consistently higher with the MDN than with
the MLP. The increase ranged from approximately 3% to
22%, depending on the articulatory channel in
question. On the basis of these results, we argue that
using a more flexible description of the target domain,
such as that offered by the MDN, can prove beneficial
when modelling the acoustic-to-articulatory mapping.
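
The difference in target-domain modelling discussed above comes
down to the MDN's output layer parameterising a conditional mixture
density rather than a single point estimate. A minimal numpy sketch
for one articulatory channel, with a hypothetical number of
mixtures and random stand-in network outputs:

    import numpy as np

    rng = np.random.default_rng(8)
    M = 3                                   # mixture components (hypothetical)

    def mdn_density(raw, target):
        # Map 3*M raw network outputs to a 1-D Gaussian mixture over one
        # articulatory channel and evaluate its density at `target`.
        logits, means, log_sigma = np.split(raw, 3)
        w = np.exp(logits - np.logaddexp.reduce(logits))   # softmax mixture weights
        sigma = np.exp(log_sigma)                          # positive std deviations
        comp = np.exp(-0.5 * ((target - means) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        return float(w @ comp)

    raw_outputs = rng.normal(size=3 * M)    # stands in for the trained MDN's output layer
    print(mdn_density(raw_outputs, target=0.2))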
|
|
[45]
|
K. Richmond.
Estimating Articulatory Parameters from the Acoustic Speech
Signal.
PhD thesis, The Centre for Speech Technology Research, Edinburgh
University, 2002.
[ bib |
.ps ]
A successful method for inferring articulation from
the acoustic speech signal would find many
applications: low bit-rate speech coding, visual
representation of speech, and the possibility of
improved automatic speech recognition to name but a
few. It is unsurprising, therefore, that researchers
have been investigating the acoustic-to-articulatory
inversion mapping for several decades now. A great
variety of approaches and models have been applied to
the problem. Unfortunately, the overwhelming majority
of these attempts have faced difficulties in
satisfactorily assessing performance in terms of
genuine human articulation. However, technologies such
as electromagnetic articulography (EMA) mean that
measurement of human articulation during speech has
become increasingly accessible. Crucially, a large
corpus of acoustic-articulatory data during
phonetically-diverse, continuous speech has recently
been recorded at Queen Margaret College, Edinburgh. One
of the primary motivations of this thesis is to exploit
the availability of this remarkable resource. Among the
data-driven models which have been employed in previous
studies, the feedforward multilayer perceptron (MLP) in
particular has been used several times with promising
results. Researchers have cited advantages in terms of
memory requirement and execution speed as a significant
factor motivating their use. Furthermore, the MLP is
well known as a universal function approximator; an MLP
of suitable form can in theory represent any arbitrary
mapping function. Therefore, using an MLP in
conjunction with the relatively large quantities of
acoustic-articulatory data arguably represents a
promising and useful first research step for the
current thesis, and a significant part of this thesis
is occupied with doing this. Having demonstrated an MLP
which performs well enough to provide a reasonable
baseline, we go on to critically evaluate the
suitability of the MLP for the inversion mapping. The
aim is to find ways to improve modelling accuracy
further. Considering what model of the target
articulatory domain is provided in the MLP is key in
this respect. It has been shown that the outputs of an
MLP trained with the sum-of-squares error function
approximate the mean of the target data points
conditioned on the input vector. In many situations,
this is an appropriate and sufficient solution. In
other cases, however, this conditional mean is an
inconveniently limiting model of data in the target
domain, particularly for ill-posed problems where the
mapping may be multi-valued. Substantial evidence
exists which shows that multiple articulatory
configurations are able to produce the same acoustic
signal. This means that a system intended to map from a
point in acoustic space can be faced with multiple
candidate articulatory configurations. Therefore,
despite the impressive ability of the MLP to model
mapping functions, it may prove inadequate in certain
respects for performing the acoustic-to-articulatory
inversion mapping. Mixture density networks (MDN)
provide a principled method to model arbitrary
probability density functions over the target domain,
conditioned on the input vector. In theory, therefore,
the MDN offers a superior model of the target domain
compared to the MLP. We hypothesise that this advantage
will prove beneficial in the case of the
acoustic-to-articulatory inversion mapping.
Accordingly, this thesis aims to test this hypothesis
and directly compare the performance of MDN with MLP on
exactly the same acoustic-to-articulatory inversion
task.
|
|
[46]
|
K. Richmond.
Mixture density networks, human articulatory data and
acoustic-to-articulatory inversion of continuous speech.
In Proc. Workshop on Innovation in Speech Processing, pages
259-276. Institute of Acoustics, April 2001.
[ bib |
.ps ]
|
|
[47]
|
A. Wrench and K. Richmond.
Continuous speech recognition using articulatory data.
In Proc. ICSLP 2000, Beijing, China, 2000.
[ bib |
.ps |
.pdf ]
In this paper we show that there is measurable
information in the articulatory system which can help
to disambiguate the acoustic signal. We measure
directly the movement of the lips, tongue, jaw, velum
and larynx and parameterise this articulatory feature
space using principal components analysis. The
parameterisation is developed and evaluated using a
speaker dependent phone recognition task on a specially
recorded TIMIT corpus of 460 sentences. The results
show that there is useful supplementary information
contained in the articulatory data which yields a small
but significant improvement in phone recognition
accuracy of 2%. However, preliminary attempts to
estimate the articulatory data from the acoustic signal
and use this to supplement the acoustic input have not
yielded any significant improvement in phone accuracy.
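
The articulatory parameterisation step can be illustrated with a
plain numpy PCA over toy EMA-style frames; the frame dimensionality
and number of retained components below are hypothetical, not those
used in the paper.

    import numpy as np

    rng = np.random.default_rng(9)
    frames = rng.normal(size=(1000, 14))    # toy EMA-style frames: 7 coils x (x, y)

    def pca_transform(X, n_components):
        # Project articulatory frames onto their top principal components.
        mu = X.mean(axis=0)
        Xc = X - mu
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        components = Vt[:n_components]
        return Xc @ components.T, components, mu

    scores, components, mu = pca_transform(frames, n_components=6)
    print(scores.shape)                     # (1000, 6) articulatory parameters per frame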
|
|
[48]
|
J. Frankel, K. Richmond, S. King, and P. Taylor.
An automatic speech recognition system using neural networks and
linear dynamic models to recover and model articulatory traces.
In Proc. ICSLP, 2000.
[ bib |
.ps |
.pdf ]
In this paper we describe a speech recognition system
using linear dynamic models and articulatory features.
Experiments are reported in which measured articulation
from the MOCHA corpus has been used, along with those
where the articulatory parameters are estimated from
the speech signal using a recurrent neural network.
|
|
[49]
|
S. King, P. Taylor, J. Frankel, and K. Richmond.
Speech recognition via phonetically-featured syllables.
In PHONUS, volume 5, pages 15-34, Institute of Phonetics,
University of the Saarland, 2000.
[ bib |
.ps |
.pdf ]
We describe recent work on two new automatic speech
recognition systems. The first part of this paper
describes the components of a system based on
phonological features (which we call EspressoA) in
which the values of these features are estimated from
the speech signal before being used as the basis for
recognition. In the second part of the paper, another
system (which we call EspressoB) is described in which
articulatory parameters are used instead of
phonological features and a linear dynamical system
model is used to perform recognition from automatically
estimated values of these articulatory parameters.
|
|
[50]
|
K. Richmond.
Estimating velum height from acoustics during continuous speech.
In Proc. Eurospeech, volume 1, pages 149-152, Budapest,
Hungary, 1999.
[ bib |
.ps |
.pdf ]
This paper reports on present work, in which a
recurrent neural network is trained to estimate `velum
height' during continuous speech. Parallel
acoustic-articulatory data comprising more than 400
read TIMIT sentences is obtained using
electromagnetic articulography (EMA). This data is
processed and used as training data for a range of
neural network sizes. The network demonstrating the
highest accuracy is identified. This performance is
then evaluated in detail by analysing the network's
output for each phonetic segment contained in 50
hand-labelled utterances set aside for testing
purposes.
|
|
[51]
|
K. Richmond.
A proposal for the compartmental modelling of stellate cells in the
anteroventral cochlear nucleus, using realistic auditory nerve inputs.
Master's thesis, Centre for Cognitive Science, University of
Edinburgh, September 1997.
[ bib ]
|
|
[52]
|
K. Richmond, A. Smith, and E. Amitay.
Detecting subject boundaries within text: A language-independent
statistical approach.
In Proc. The Second Conference on Empirical Methods in Natural
Language Processing, pages 47-54, Brown University, Providence, USA, August
1997.
[ bib |
.ps |
.pdf ]
We describe here an algorithm for detecting subject
boundaries within text based on a statistical lexical
similarity measure. Hearst has already tackled this
problem with good results (Hearst, 1994). One of her
main assumptions is that a change in subject is
accompanied by a change in vocabulary. Using this
assumption, but by introducing a new measure of word
significance, we have been able to build a robust and
reliable algorithm which exhibits improved accuracy
without sacrificing language independence.
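
A toy Python sketch of the family of methods this work belongs to:
score candidate boundaries by the drop in lexical similarity
between adjacent word blocks. This uses plain cosine similarity
over fixed-size blocks, not the paper's word-significance measure,
and the sample text is invented.

    import re
    from collections import Counter

    def cosine(a, b):
        num = sum(a[w] * b[w] for w in a.keys() & b.keys())
        den = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
        return num / den if den else 0.0

    def boundary_scores(text, block=20):
        # Score each gap between adjacent fixed-size word blocks by the drop
        # in lexical similarity; high scores suggest a subject boundary.
        words = re.findall(r"[a-z']+", text.lower())
        blocks = [Counter(words[i:i + block]) for i in range(0, len(words), block)]
        return [1.0 - cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]

    sample = ("The striker scored twice and the match ended in a narrow win. " * 5 +
              "Interest rates rose again as the central bank warned on inflation. " * 5)
    scores = boundary_scores(sample)
    print(scores.index(max(scores)))        # gap with the largest vocabulary shift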
|