|
[1]
|
P. Swietojanski, A. Ghoshal, and S. Renals.
Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR.
In Proc. IEEE Workshop on Spoken Language Technology, Miami,
Florida, USA, December 2012.
[ bib |
.pdf ]
We investigate the use of cross-lingual acoustic data to initialise deep neural network (DNN) acoustic models by means
of unsupervised restricted Boltzmann machine (RBM) pretraining.
DNNs for German are pretrained using one or all of German, Portuguese, Spanish and Swedish. The DNNs are used in a tandem configuration, where the network outputs are used as features for a hidden Markov model (HMM) whose
emission densities are modeled by Gaussian mixture models (GMMs), as well as in a hybrid configuration, where the network outputs are used as the HMM state likelihoods. The experiments show that unsupervised pretraining is more crucial
for the hybrid setups, particularly with limited amounts of transcribed training data. More importantly, unsupervised pretraining is shown to be language-independent.
|
|
[2]
|
P. Bell, M. Gales, P. Lanchantin, X. Liu, Y. Long, S. Renals, P. Swietojanski,
and P. Woodland.
Transcription of multi-genre media archives using out-of-domain data.
In Proc. IEEE Workshop on Spoken Language Technology, Miami,
Florida, USA, December 2012.
[ bib |
.pdf ]
We describe our work on developing a speech
recognition system for multi-genre media archives. The
high diversity of the data makes this a challenging
recognition task, which may benefit from systems
trained on a combination of in-domain and out-of-domain
data. Working with tandem HMMs, we present Multi-level
Adaptive Networks (MLAN), a novel technique for
incorporating information from out-of-domain posterior
features using deep neural networks. We show that it
provides a substantial reduction in WER over other
systems, with relative WER reductions of 15% over a
PLP baseline, 9% over in-domain tandem features and
8% over the best out-of-domain tandem features.
|
|
[3]
|
Korin Richmond and Steve Renals.
Ultrax: An animated midsagittal vocal tract display for speech
therapy.
In Proc. Interspeech, Portland, Oregon, USA, September 2012.
[ bib |
.pdf ]
Speech sound disorders (SSD) are the most common
communication impairment in childhood, and can hamper
social development and learning. Current speech therapy
interventions rely predominantly on the auditory skills
of the child, as little technology is available to
assist in diagnosis and therapy of SSDs. Realtime
visualisation of tongue movements has the potential to
bring enormous benefit to speech therapy. Ultrasound
scanning offers this possibility, although its display
may be hard to interpret. Our ultimate goal is to
exploit ultrasound to track tongue movement, while
displaying a simplified, diagrammatic vocal tract that
is easier for the user to interpret. In this paper, we
outline a general approach to this problem, combining a
latent space model with a dimensionality reducing model
of vocal tract shapes. We assess the feasibility of
this approach using magnetic resonance imaging (MRI)
scans to train a model of vocal tract shapes, which is
animated using electromagnetic articulography (EMA)
data from the same speaker.
Keywords: Ultrasound, speech therapy, vocal tract visualisation
|
|
[4]
|
Benigno Uria, Iain Murray, Steve Renals, and Korin Richmond.
Deep architectures for articulatory inversion.
In Proc. Interspeech, Portland, Oregon, USA, September 2012.
[ bib |
.pdf ]
We implement two deep architectures for the
acoustic-articulatory inversion mapping problem: a deep
neural network and a deep trajectory mixture density
network. We find that in both cases, deep architectures
produce more accurate predictions than shallow
architectures and that this is due to the higher
expressive capability of a deep model and not a
consequence of adding more adjustable parameters. We
also find that a deep trajectory mixture density
network is able to obtain better inversion accuracies
than smoothing the results of a deep neural network.
Our best model obtained an average root mean square
error of 0.885 mm on the MNGU0 test dataset.
Keywords: Articulatory inversion, deep neural network, deep
belief network, deep regression network, pretraining
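As a point of reference for the error figure quoted above, the
sketch below shows one common way of computing an average root
mean square error over articulator channels in millimetres; the
array shapes and channel layout are assumptions for illustration,
not details taken from the paper.

import numpy as np

def average_rms_error_mm(pred, target):
    """pred, target: arrays of shape (frames, channels), coordinates in mm."""
    # RMS error for each articulator channel, then averaged over channels.
    per_channel = np.sqrt(np.mean((pred - target) ** 2, axis=0))
    return float(per_channel.mean())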
|
|
[5]
|
L. Lu, A. Ghoshal, and S. Renals.
Maximum a posteriori adaptation of subspace Gaussian mixture models
for cross-lingual speech recognition.
In Proc. ICASSP, 2012.
[ bib |
.pdf ]
This paper concerns cross-lingual acoustic modeling in
the case when there are limited target language
resources. We build on an approach in which a subspace
Gaussian mixture model (SGMM) is adapted to the target
language by reusing the globally shared parameters
estimated from out-of-language training data. In
current cross-lingual systems, these parameters are
fixed when training the target system, which can give
rise to a mismatch between the source and target
systems. We investigate a maximum a posteriori (MAP)
adaptation approach to alleviate the potential
mismatch. In particular, we focus on the adaptation of
phonetic subspace parameters using a matrix variate
Gaussian prior distribution. Experiments on the
GlobalPhone corpus show that the MAP adaptation approach
yields word error rate reductions, compared with the
cross-lingual baseline systems and systems updated
using maximum likelihood, for training conditions with
1 hour and 5 hours of target language data.
Keywords: Subspace Gaussian Mixture Model, Maximum a Posteriori
Adaptation, Cross-lingual Speech Recognition
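For orientation, the matrix variate Gaussian prior mentioned above
has the generic density below, and MAP adaptation maximises the data
log-likelihood plus a weighted log-prior; this is the textbook form,
with an assumed prior-weighting factor \tau, not the paper's exact
update equations.

    p(M) \propto \exp\Big( -\tfrac{1}{2}\,\mathrm{tr}\big[ \Omega_r^{-1} (M - M_0)\, \Omega_c^{-1} (M - M_0)^{\top} \big] \Big),
    \qquad
    \hat{M}_{\mathrm{MAP}} = \arg\max_{M} \big[ \log p(X \mid M) + \tau \log p(M) \big],

where M_0 is the prior mean and \Omega_r, \Omega_c are the row and
column covariance matrices.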
|
|
[6]
|
E. Zwyssig, S. Renals, and M. Lincoln.
Determining the number of speakers in a meeting using microphone
array features.
In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE
International Conference on, pages 4765-4768, 2012.
[ bib ]
|
|
[7]
|
E. Zwyssig, S. Renals, and M. Lincoln.
On the effect of SNR and superdirective beamforming in speaker
diarisation in meetings.
In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE
International Conference on, pages 4177-4180, 2012.
[ bib ]
|
|
[8]
|
L. Lu, A. Ghoshal, and S. Renals.
Joint uncertainty decoding with unscented transform for noise robust
subspace Gaussian mixture model.
In Proc. Sapa-Scale workshop, 2012.
[ bib |
.pdf ]
Common noise compensation techniques use vector Taylor
series (VTS) to approximate the mismatch function.
Recent work shows that the approximation accuracy may
be improved by sampling. One such sampling technique is
the unscented transform (UT), which draws samples
deterministically from the clean speech and noise models
to derive the noise-corrupted speech parameters. This
paper applies the UT to noise compensation of the
subspace Gaussian mixture model (SGMM). Since the UT
requires a relatively small number of samples for
accurate estimation, it has a significantly lower
computational cost than other random sampling techniques.
However, the number of surface Gaussians in an SGMM is
typically very large, making the direct application of
UT, for compensating individual Gaussian components,
computationally impractical. In this paper, we avoid
the computational burden by employing UT in the
framework of joint uncertainty decoding (JUD), which
groups all the Gaussian components into a small number of
classes, sharing the compensation parameters by class.
We evaluate the JUD-UT technique for an SGMM system
using the Aurora 4 corpus. Experimental results
indicate that UT can lead to increased accuracy
compared to the VTS approximation if the JUD phase factor
is untuned, and to similar accuracy if the phase factor
is tuned empirically.
Keywords: noise compensation, SGMM, JUD, UT
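For background, the unscented transform referred to above propagates
an n-dimensional Gaussian N(\mu, \Sigma) through a nonlinearity using
a small, deterministically chosen set of sigma points; the standard
construction (textbook form, not the paper's specific
parameterisation) is

    x_0 = \mu, \quad
    x_i = \mu + \big( \sqrt{(n+\kappa)\Sigma} \big)_i, \quad
    x_{n+i} = \mu - \big( \sqrt{(n+\kappa)\Sigma} \big)_i, \quad i = 1, \dots, n,

with weights w_0 = \kappa/(n+\kappa) and w_i = 1/\big(2(n+\kappa)\big),
so only 2n+1 samples are needed rather than a large random-sampling
budget.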
|
|
[9]
|
L. Lu, KK Chin, A. Ghoshal, and S. Renals.
Noise compensation for subspace Gaussian mixture models.
In Proc. INTERSPEECH, 2012.
[ bib |
.pdf ]
Joint uncertainty decoding (JUD) is an effective
model-based noise compensation technique for
conventional Gaussian mixture model (GMM) based speech
recognition systems. In this paper, we apply JUD to
subspace Gaussian mixture model (SGMM) based acoustic
models. The total number of Gaussians in the SGMM
acoustic model is usually much larger than for
conventional GMMs, which limits the application of
approaches which explicitly compensate each Gaussian,
such as vector Taylor series (VTS). However, by
clustering the Gaussian components into a number of
regression classes, JUD-based noise compensation can be
successfully applied to SGMM systems. We evaluate the
JUD/SGMM technique using the Aurora 4 corpus, and the
experimental results indicate that it is more accurate
than conventional GMM-based systems using either VTS or
JUD noise compensation.
Keywords: acoustic modelling, noise compensation, SGMM, JUD
|
|
[10]
|
Junichi Yamagishi, Christophe Veaux, Simon King, and Steve Renals.
Speech synthesis technologies for individuals with vocal
disabilities: Voice banking and reconstruction.
Acoustical Science and Technology, 33(1):1-5, 2012.
[ bib |
http ]
|
|
[11]
|
Benigno Uria, Steve Renals, and Korin Richmond.
A deep neural network for acoustic-articulatory speech inversion.
In Proc. NIPS 2011 Workshop on Deep Learning and Unsupervised
Feature Learning, Sierra Nevada, Spain, December 2011.
[ bib |
.pdf ]
In this work, we implement a deep belief network for
the acoustic-articulatory inversion mapping problem. We
find that adding up to three hidden layers improves
inversion accuracy. We also show that this improvement
is due to the higher expressive capability of a deep
model and not a consequence of adding more adjustable
parameters. Additionally, we show that unsupervised
pretraining of the system improves its performance in
all cases, even for a one-hidden-layer model. Our
implementation obtained an average root mean square
error of 0.95 mm on the MNGU0 test dataset, beating all
previously published results.
|
|
[12]
|
J.P. Cabral, S. Renals, J. Yamagishi, and K. Richmond.
HMM-based speech synthesiser using the LF-model of the glottal
source.
In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE
International Conference on, pages 4704-4707, May 2011.
[ bib |
DOI |
.pdf ]
A major factor which causes a deterioration in speech
quality in HMM-based speech synthesis is the use of a
simple delta pulse signal to generate the excitation of
voiced speech. This paper sets out a new approach to
using an acoustic glottal source model in HMM-based
synthesisers instead of the traditional pulse signal.
The goal is to improve speech quality and to better
model and transform voice characteristics. We have
found the new method decreases buzziness and also
improves prosodic modelling. A perceptual evaluation
supported this finding, showing a 55.6% preference for
the new system over the baseline. This improvement,
while not as large as we had initially expected,
encourages us to continue developing the proposed
speech synthesiser.
|
|
[13]
|
L. Lu, A. Ghoshal, and S. Renals.
Regularized subspace Gaussian mixture models for speech recognition.
IEEE Signal Processing Letters, 18(7):419-422, 2011.
[ bib |
.pdf ]
Subspace Gaussian mixture models (SGMMs) provide a
compact representation of the Gaussian parameters in an
acoustic model, but may still suffer from over-fitting
with insufficient training data. In this letter, the
SGMM state parameters are estimated using a penalized
maximum-likelihood objective, based on ℓ1 and
ℓ2 regularization, as well as their combination,
referred to as the elastic net, for robust model
estimation. Experiments on the 5000-word Wall Street
Journal transcription task show word error rate
reduction and improved model robustness with
regularization.
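Schematically, the penalised objective for an SGMM state vector v
described above has the generic elastic-net form below, where
\lambda_1 and \lambda_2 are regularisation weights; this illustrates
the criterion rather than reproducing the letter's exact notation.

    \hat{v} = \arg\max_{v} \Big[ \mathcal{L}(v) - \lambda_1 \lVert v \rVert_1 - \lambda_2 \lVert v \rVert_2^2 \Big],

with \lambda_2 = 0 giving pure \ell_1 (lasso) and \lambda_1 = 0
giving pure \ell_2 (ridge) regularisation.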
|
|
[14]
|
Jonathan Kilgour, Jean Carletta, and Steve Renals.
The Ambient Spotlight: Personal meeting capture with a microphone
array.
In Proc. HSCMA, 2011.
[ bib |
DOI |
.pdf ]
We present the Ambient Spotlight system for personal
meeting capture based on a portable USB microphone
array and a laptop. The system combines distant speech
recognition and content linking with personal
productivity tools, and enables recognised meeting
recordings to be integrated with desktop search,
calendar, and email.
|
|
[15]
|
L. Lu, A. Ghoshal, and S. Renals.
Regularized subspace Gaussian mixture models for cross-lingual
speech recognition.
In Proc. ASRU, 2011.
[ bib |
.pdf ]
We investigate cross-lingual acoustic modelling for
low resource languages using the subspace Gaussian
mixture model (SGMM). We assume the presence of
acoustic models trained on multiple source languages,
and use the global subspace parameters from those
models for improved modelling in a target language with
limited amounts of transcribed speech. Experiments on
the GlobalPhone corpus using Spanish, Portuguese, and
Swedish as source languages and German as target
language (with 1 hour and 5 hours of transcribed audio)
show that multilingually trained SGMM shared parameters
result in lower word error rates (WERs) than using
those from a single source language. We also show that
regularizing the estimation of the SGMM state vectors
by penalizing their ℓ1-norm helps to overcome
numerical instabilities and leads to lower WER.
|
|
[16]
|
João Cabral, Steve Renals, Korin Richmond, and Junichi Yamagishi.
Transforming voice source parameters in a HMM-based speech
synthesiser with glottal post-filtering.
In Proc. 7th ISCA Speech Synthesis Workshop (SSW7), pages
365-370, NICT/ATR, Kyoto, Japan, September 2010.
[ bib |
.pdf ]
Control over voice quality, e.g. breathy and tense
voice, is important for speech synthesis applications.
For example, transformations can be used to modify
aspects of the voice related to the speaker's identity
and to improve expressiveness. However, it is hard to
modify voice characteristics of the synthetic speech
without degrading speech quality. State-of-the-art
statistical speech synthesisers, in particular, do not
typically allow control over parameters of the glottal
source, which are strongly correlated with voice
quality. Consequently, the control of voice
characteristics in these systems is limited. In
contrast, the HMM-based speech synthesiser proposed in
this paper uses an acoustic glottal source model. The
system passes the glottal signal through a whitening
filter to obtain the excitation of voiced sounds. This
technique, called glottal post-filtering, allows the
voice characteristics of the synthetic speech to be
transformed by modifying the source model parameters.
We evaluated the proposed synthesiser in a perceptual
experiment, in terms of speech naturalness,
intelligibility, and similarity to the original
speaker's voice. The results show that it performed as
well as an HMM-based synthesiser which generates the
speech signal with a commonly used high-quality speech
vocoder.
Keywords: HMM-based speech synthesis, voice quality, glottal
post-filter
|
|
[17]
|
Ravi Chander Vipperla, Steve Renals, and Joe Frankel.
Augmentation of adaptation data.
In Proc. Interspeech, pages 530-533, Makuhari, Japan,
September 2010.
[ bib |
.pdf ]
Linear regression based speaker adaptation approaches
can improve Automatic Speech Recognition (ASR) accuracy
significantly for a target speaker. However, when the
available adaptation data is limited to a few seconds,
the accuracy of the speaker-adapted models is often
worse than that of speaker-independent models. In this
paper, we propose an approach to select a set of
reference speakers acoustically close to the target
speaker whose data can be used to augment the
adaptation data. To determine the acoustic similarity
of two speakers, we propose a distance metric based on
transforming sample points in the acoustic space with
the regression matrices of the two speakers. We show
the validity of this approach through a speaker
identification task. ASR results on SCOTUS and AMI
corpora with limited adaptation data of 10 to 15
seconds augmented by data from selected reference
speakers show a significant improvement in Word Error
Rate over speaker independent and speaker adapted
models.
|
|
[18]
|
Alice Turk, James Scobbie, Christian Geng, Barry Campbell, Catherine Dickie,
Eddie Dubourg, Ellen Gurman Bard, William Hardcastle, Mariam Hartinger, Simon
King, Robin Lickley, Cedric Macmartin, Satsuki Nakai, Steve Renals, Korin
Richmond, Sonja Schaeffler, Kevin White, Ronny Wiegand, and Alan Wrench.
An Edinburgh speech production facility.
Poster presented at the 12th Conference on Laboratory Phonology,
Albuquerque, New Mexico., July 2010.
[ bib |
.pdf ]
|
|
[19]
|
Erich Zwyssig, Mike Lincoln, and Steve Renals.
A digital microphone array for distant speech recognition.
In Proc. IEEE ICASSP-10, pages 5106-5109, 2010.
[ bib |
DOI |
.pdf ]
In this paper, the design, implementation and testing
of a digital microphone array is presented. The array
uses digital MEMS microphones which integrate the
microphone, amplifier and analogue to digital converter
on a single chip in place of the analogue microphones
and external audio interfaces currently used. The
device has the potential to be smaller, cheaper and
more flexible than typical analogue arrays; however, the
effect on speech recognition performance of using
digital microphones is as yet unknown. In order to
evaluate the effect, an analogue array and the new
digital array are used to simultaneously record test
data for a speech recognition experiment. Initial
results employing no adaptation show that performance
using the digital array is significantly worse (14%
absolute WER) than the analogue device. Subsequent
experiments using MLLR and CMLLR channel adaptation
reduce this gap, and employing MLLR for both channel
and speaker adaptation reduces the difference between
the arrays to 4.5% absolute WER.
|
|
[20]
|
Steve Renals.
Recognition and understanding of meetings.
In Proc. NAACL/HLT, pages 1-9, 2010.
[ bib |
.pdf ]
This paper is about interpreting human communication
in meetings using audio, video and other signals.
Automatic meeting recognition and understanding is
extremely challenging, since communication in a meeting
is spontaneous and conversational, and involves
multiple speakers and multiple modalities. This leads
to a number of significant research problems in signal
processing, in speech recognition, and in discourse
interpretation, taking account of both individual and
group behaviours. Addressing these problems requires an
interdisciplinary effort. In this paper, I discuss the
capture and annotation of multimodal meeting recordings
- resulting in the AMI meeting corpus - and how we have
built on this to develop techniques and applications
for the recognition and interpretation of meetings.
|
|
[21]
|
Jonathan Kilgour, Jean Carletta, and Steve Renals.
The Ambient Spotlight: Queryless desktop search from meeting
speech.
In Proc ACM Multimedia 2010 Workshop SSCS 2010, 2010.
[ bib |
DOI |
.pdf ]
It has recently become possible to record any small
meeting using a laptop equipped with a plug-and-play
USB microphone array. We show the potential for such
recordings in a personal aid that allows project
managers to record their meetings and, when reviewing
them afterwards through a standard calendar interface,
to find relevant documents on their computer. This
interface is intended to supplement or replace the
textual searches that managers typically perform. The
prototype, which relies on meeting speech recognition
and topic segmentation, formulates and runs desktop
search queries in order to present its results.
|
|
[22]
|
Songfang Huang and Steve Renals.
Hierarchical Bayesian language models for conversational speech
recognition.
IEEE Transactions on Audio, Speech and Language Processing,
18(8):1941-1954, January 2010.
[ bib |
DOI |
http |
.pdf ]
Traditional n-gram language models are widely used in
state-of-the-art large vocabulary speech recognition
systems. This simple model suffers from some
limitations, such as overfitting of maximum-likelihood
estimation and the lack of rich contextual knowledge
sources. In this paper, we exploit a hierarchical
Bayesian interpretation for language modeling, based on
a nonparametric prior called the Pitman-Yor process.
This offers a principled approach to language model
smoothing, embedding the power-law distribution for
natural language. Experiments on the recognition of
conversational speech in multiparty meetings
demonstrate that by using hierarchical Bayesian
language models, we are able to achieve significant
reductions in perplexity and word error rate.
Keywords: AMI corpus, conversational speech recognition,
hierarchical Bayesian model, language model (LM),
meetings, smoothing
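For reference, the hierarchical Pitman-Yor predictive rule underlying
this family of language models has the standard form below, where d
is the discount, \theta the strength parameter, c the context counts,
t the table counts, and \pi(u) the back-off context; its structure
mirrors interpolated Kneser-Ney smoothing.

    P(w \mid u) = \frac{c_{uw} - d\, t_{uw}}{\theta + c_{u\cdot}}
                + \frac{\theta + d\, t_{u\cdot}}{\theta + c_{u\cdot}}\, P\big(w \mid \pi(u)\big).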
|
|
[23]
|
Songfang Huang and Steve Renals.
Power law discounting for n-gram language models.
In Proc. IEEE ICASSP-10, pages 5178-5181, 2010.
[ bib |
DOI |
http |
.pdf ]
We present an approximation to the Bayesian
hierarchical Pitman-Yor process language model which
maintains the power law distribution over word tokens,
while not requiring a computationally expensive
approximate inference process. This approximation,
which we term power law discounting, has a similar
computational complexity to interpolated and modified
Kneser-Ney smoothing. We performed experiments on
meeting transcription using the NIST RT06s evaluation
data and the AMI corpus, with a vocabulary of 50,000
words and a language model training set of up to 211
million words. Our results indicate that power law
discounting results in statistically significant
reductions in perplexity and word error rate compared
to both interpolated and modified Kneser-Ney smoothing,
while producing similar results to the hierarchical
Pitman-Yor process language model.
|
|
[24]
|
Maria K. Wolters, Karl B. Isaac, and Steve Renals.
Evaluating speech synthesis intelligibility using Amazon Mechanical
Turk.
In Proc. 7th Speech Synthesis Workshop (SSW7), pages 136-141,
2010.
[ bib |
.pdf ]
Microtask platforms such as Amazon Mechanical Turk
(AMT) are increasingly used to create speech and
language resources. AMT in particular allows
researchers to quickly recruit a large number of fairly
demographically diverse participants. In this study, we
investigated whether AMT can be used for comparing the
intelligibility of speech synthesis systems. We
conducted two experiments in the lab and via AMT, one
comparing US English diphone to US English
speaker-adaptive HTS synthesis and one comparing UK
English unit selection to UK English speaker-dependent
HTS synthesis. While AMT word error rates were worse
than lab error rates, AMT results were more sensitive
to relative differences between systems. This is mainly
due to the larger number of listeners. Boxplots and
multilevel modelling allowed us to identify listeners
who performed particularly badly, while thresholding
was sufficient to eliminate rogue workers. We conclude
that AMT is a viable platform for synthetic speech
intelligibility comparisons.
|
|
[25]
|
Steve Renals and Simon King.
Automatic speech recognition.
In William J. Hardcastle, John Laver, and Fiona E. Gibbon, editors,
Handbook of Phonetic Sciences, chapter 22. Wiley Blackwell, 2010.
[ bib ]
|
|
[26]
|
Ravi Chander Vipperla, Steve Renals, and Joe Frankel.
Ageing voices: The effect of changes in voice parameters on ASR
performance.
EURASIP Journal on Audio, Speech, and Music Processing, 2010.
[ bib |
DOI |
http |
.pdf ]
With ageing, human voices undergo several changes
which are typically characterized by increased
hoarseness and changes in articulation patterns. In
this study, we have examined the effect of ageing on
Automatic Speech Recognition (ASR) and found that the
Word Error Rates (WER) on older voices are about 9%
absolute higher than those of adult voices. Subsequently,
we compared several voice source parameters including
fundamental frequency, jitter, shimmer, harmonicity and
cepstral peak prominence of adult and older males.
Several of these parameters show statistically
significant differences between the two groups. However,
artificially increasing jitter and shimmer measures does
not affect the ASR accuracies significantly.
Artificially lowering the fundamental frequency
degrades the ASR performance marginally but this drop
in performance can be overcome to some extent using
Vocal Tract Length Normalisation (VTLN). Overall, we
observe that the changes in the voice source parameters
do not have a significant impact on ASR performance.
A comparison of the likelihood scores of all the phonemes
for the two age groups shows that there is a systematic
mismatch in the acoustic space of the two age groups. A
comparison of the phoneme recognition rates shows that
mid vowels, nasals and phonemes that depend on the
ability to create constrictions with the tongue tip for
articulation are more affected by ageing than other
phonemes.
|
|
[27]
|
Steve Renals and Thomas Hain.
Speech recognition.
In Alex Clark, Chris Fox, and Shalom Lappin, editors, Handbook
of Computational Linguistics and Natural Language Processing. Wiley
Blackwell, 2010.
[ bib ]
|
|
[28]
|
Alice Turk, James Scobbie, Christian Geng, Cedric Macmartin, Ellen Bard, Barry
Campbell, Catherine Dickie, Eddie Dubourg, Bill Hardcastle, Phil Hoole, Evia
Kanaida, Robin Lickley, Satsuki Nakai, Marianne Pouplier, Simon King, Steve
Renals, Korin Richmond, Sonja Schaeffler, Ronnie Wiegand, Kevin White, and
Alan Wrench.
The Edinburgh Speech Production Facility's articulatory corpus of
spontaneous dialogue.
The Journal of the Acoustical Society of America,
128(4):2429-2429, 2010.
[ bib |
DOI ]
The EPSRC-funded Edinburgh Speech Production Facility is
built around two synchronized Carstens AG500
electromagnetic articulographs (EMAs) in order to
capture articulatory/acoustic data from spontaneous
dialogue. An initial articulatory corpus was designed
with two aims. The first was to elicit a range of
speech styles/registers from speakers, and therefore
provide an alternative to fully scripted corpora. The
second was to extend the corpus beyond monologue, by
using tasks that promote natural discourse and
interaction. A subsidiary driver was to use dialects
from outwith North America: dialogues paired up a
Scottish English and a Southern British English
speaker. Tasks. Monologue: Story reading of “Comma
Gets a Cure” [Honorof et al. (2000)], lexical sets
[Wells (1982)], spontaneous story telling,
diadochokinetic tasks. Dialogue: Map tasks [Anderson et
al. (1991)], “Spot the Difference” picture tasks
[Bradlow et al. (2007)], story‐recall. Shadowing of
the spontaneous story telling by the second
participant. Each dialogue session includes
approximately 30 min of speech, and there are
acoustics‐only baseline materials. We will introduce
the corpus and highlight the role of articulatory
production data in helping provide a fuller
understanding of various spontaneous speech phenomena
by presenting examples of naturally occurring covert
speech errors, accent accommodation, turn taking
negotiation, and shadowing.
|
|
[29]
|
Jonathan Kilgour, Jean Carletta, and Steve Renals.
The Ambient Spotlight: Personal multimodal search without query.
In Proc. ICMI-MLMI, 2010.
[ bib |
DOI |
http |
.pdf ]
The Ambient Spotlight is a prototype system based on
personal meeting capture using a laptop and a portable
microphone array. The system automatically recognises
and structures the meeting content using automatic
speech recognition, topic segmentation and extractive
summarisation. The recognised speech in the meeting is
used to construct queries to automatically link meeting
segments to other relevant material, both multimodal
and textual. The interface to the system is constructed
around a standard calendar interface, and it is
integrated with the laptop's standard indexing, search
and retrieval.
|
|
[30]
|
Maria Wolters, Ravichander Vipperla, and Steve Renals.
Age recognition for spoken dialogue systems: Do we need it?
In Proc. Interspeech, September 2009.
[ bib |
.pdf ]
When deciding whether to adapt relevant aspects of the
system to the particular needs of older users, spoken
dialogue systems often rely on automatic detection of
chronological age. In this paper, we show that vocal
ageing as measured by acoustic features is an
unreliable indicator of the need for adaptation. Simple
lexical features greatly improve the prediction of both
relevant aspects of cognition and interaction style.
Lexical features also boost age group prediction. We
suggest that adaptation should be based on observed
behaviour, not on chronological age, unless it is not
feasible to build classifiers for relevant adaptation
decisions.
|
|
[31]
|
Songfang Huang and Steve Renals.
A parallel training algorithm for hierarchical Pitman-Yor process
language models.
In Proc. Interspeech'09, pages 2695-2698, Brighton, UK,
September 2009.
[ bib |
.pdf ]
The Hierarchical Pitman-Yor Process Language Model
(HPYLM) is a Bayesian language model based on a
non-parametric prior, the Pitman-Yor Process. It has
been demonstrated, both theoretically and practically,
that the HPYLM can provide better smoothing for
language modeling, compared with state-of-the-art
approaches such as interpolated Kneser-Ney and modified
Kneser-Ney smoothing. However, estimation of Bayesian
language models is expensive in terms of both
computation time and memory; the inference is
approximate and requires a number of iterations to
converge. In this paper, we present a parallel training
algorithm for the HPYLM, which enables the approach to
be applied in the context of automatic speech
recognition, using large training corpora with large
vocabularies. We demonstrate the effectiveness of the
proposed algorithm by estimating language models from
corpora for meeting transcription containing over 200
million words, and observe significant reductions in
perplexity and word error rate.
|
|
[32]
|
J. Cabral, S. Renals, K. Richmond, and J. Yamagishi.
HMM-based speech synthesis with an acoustic glottal source model.
In Proc. The First Young Researchers Workshop in Speech
Technology, April 2009.
[ bib |
.pdf ]
A major cause of degradation of speech quality in
HMM-based speech synthesis is the use of a simple delta
pulse signal to generate the excitation of voiced
speech. This paper describes a new approach to using an
acoustic glottal source model in HMM-based
synthesisers. The goal is to improve speech quality and
parametric flexibility to better model and transform
voice characteristics.
|
|
[33]
|
Gabriel Murray, Thomas Kleinbauer, Peter Poller, Tilman Becker, Steve Renals,
and Jonathan Kilgour.
Extrinsic summarization evaluation: A decision audit task.
ACM Transactions on Speech and Language Processing, 6(2):1-29,
2009.
[ bib |
DOI |
http |
.pdf ]
In this work we describe a large-scale extrinsic
evaluation of automatic speech summarization
technologies for meeting speech. The particular task is
a decision audit, wherein a user must satisfy a complex
information need, navigating several meetings in order
to gain an understanding of how and why a given
decision was made. We compare the usefulness of
extractive and abstractive technologies in satisfying
this information need, and assess the impact of
automatic speech recognition (ASR) errors on user
performance. We employ several evaluation methods for
participant performance, including post-questionnaire
data, human subjective and objective judgments, and a
detailed analysis of participant browsing behavior. We
find that while ASR errors affect user satisfaction on
an information retrieval task, users can adapt their
browsing behavior to complete the task satisfactorily.
Results also indicate that users consider extractive
summaries to be intuitive and useful tools for browsing
multimodal meeting data. We discuss areas in which
automatic summarization techniques can be improved in
comparison with gold-standard meeting abstracts.
|
|
[34]
|
Ravi Chander Vipperla, Maria Wolters, Kallirroi Georgila, and Steve Renals.
Speech input from older users in smart environments: Challenges and
perspectives.
In Proc. HCI International: Universal Access in Human-Computer
Interaction. Intelligent and Ubiquitous Interaction Environments, number
5615 in Lecture Notes in Computer Science. Springer, 2009.
[ bib |
DOI |
http |
.pdf ]
Although older people are an important user group for
smart environments, there has been relatively little
work on adapting natural language interfaces to their
requirements. In this paper, we focus on a particularly
thorny problem: processing speech input from older
users. Our experiments on the MATCH corpus show clearly
that we need age-specific adaptation in order to
recognize older users' speech reliably. Language models
need to cover typical interaction patterns of older
people, and acoustic models need to accommodate older
voices. Further research is needed into intelligent
adaptation techniques that will allow existing large,
robust systems to be adapted with relatively small
amounts of in-domain, age-appropriate data. In
addition, older users need to be supported with
adequate strategies for handling speech recognition
errors.
|
|
[35]
|
Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira.
Evaluation of a hierarchical reinforcement learning spoken dialogue
system.
Computer Speech and Language, 24(2):395-429, 2009.
[ bib |
DOI |
.pdf ]
We describe an evaluation of spoken dialogue
strategies designed using hierarchical reinforcement
learning agents. The dialogue strategies were learnt in
a simulated environment and tested in a laboratory
setting with 32 users. These dialogues were used to
evaluate three types of machine dialogue behaviour:
hand-coded, fully-learnt and semi-learnt. These
experiments also served to evaluate the realism of
simulated dialogues using two proposed metrics
contrasted with ‘Precision-Recall’. The learnt
dialogue behaviours used the Semi-Markov Decision
Process (SMDP) model, and we report the first
evaluation of this model in a realistic conversational
environment. Experimental results in the travel
planning domain provide evidence to support the
following claims: (a) hierarchical semi-learnt dialogue
agents are a better alternative (with higher overall
performance) than deterministic or fully-learnt
behaviour; (b) spoken dialogue strategies learnt with
highly coherent user behaviour and conservative
recognition error rates (keyword error rate of 20%)
can outperform a reasonable hand-coded strategy; and
(c) hierarchical reinforcement learning dialogue agents
are feasible and promising for the (semi) automatic
design of optimized dialogue behaviours in larger-scale
systems.
|
|
[36]
|
Junichi Yamagishi, Takashi Nose, Heiga Zen, Zhenhua Ling, Tomoki Toda, Keiichi
Tokuda, Simon King, and Steve Renals.
Robust speaker-adaptive HMM-based text-to-speech synthesis.
IEEE Transactions on Audio, Speech and Language Processing,
17(6):1208-1230, 2009.
[ bib |
http |
www: ]
This paper describes a speaker-adaptive HMM-based
speech synthesis system. The new system, called
“HTS-2007,” employs speaker adaptation (CSMAPLR+MAP),
feature-space adaptive training, mixed-gender modeling,
and full-covariance modeling using CSMAPLR transforms,
in addition to several other techniques that have
proved effective in our previous systems. Subjective
evaluation results show that the new system generates
significantly better quality synthetic speech than
speaker-dependent approaches with realistic amounts of
speech data, and that it bears comparison with
speaker-dependent approaches even when large amounts of
speech data are available. In addition, a comparison
study with several speech synthesis techniques shows
the new system is very robust: It is able to build
voices from less-than-ideal speech data and synthesize
good-quality speech even for out-of-domain sentences.
|
|
[37]
|
Y. Hifny and S. Renals.
Speech recognition using augmented conditional random fields.
IEEE Transactions on Audio, Speech and Language Processing,
17(2):354-365, 2009.
[ bib |
http |
.pdf ]
Acoustic modeling based on hidden Markov models (HMMs)
is employed by state-of-the-art stochastic speech
recognition systems. Although HMMs are a natural choice
to warp the time axis and model the temporal phenomena
in the speech signal, their conditional independence
properties limit their ability to model spectral
phenomena well. In this paper, a new acoustic modeling
paradigm based on augmented conditional random fields
(ACRFs) is investigated and developed. This paradigm
addresses some limitations of HMMs while maintaining
many of the aspects which have made them successful. In
particular, the acoustic modeling problem is
reformulated in a data driven, sparse, augmented space
to increase discrimination. Acoustic context modeling
is explicitly integrated to handle the sequential
phenomena of the speech signal. We present an efficient
framework for estimating these models that ensures
scalability and generality. In the TIMIT phone
recognition task, a phone error rate of 23.0% was
recorded on the full test set, a significant
improvement over comparable HMM-based systems.
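As background, augmented CRFs build on the standard linear-chain
conditional random field, in which the state sequence s given the
observations o is modelled as follows (generic form; the augmented,
sparse feature space used in the paper refines this):

    p(s \mid o) = \frac{1}{Z(o)} \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(s_{t-1}, s_t, o, t) \Big),

where the f_k are feature functions, the \lambda_k their weights, and
Z(o) the observation-dependent normaliser.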
|
|
[38]
|
Songfang Huang and Steve Renals.
Unsupervised language model adaptation based on topic and role
information in multiparty meetings.
In Proc. Interspeech'08, pages 833-836, Brisbane, Australia,
September 2008.
[ bib |
.pdf ]
We continue our previous work on the modeling of topic
and role information from multiparty meetings using a
hierarchical Dirichlet process (HDP), in the context of
language model adaptation. In this paper we focus on
three problems: 1) an empirical analysis of the HDP as
a nonparametric topic model; 2) the mismatch problem of
vocabularies of the baseline n-gram model and the HDP;
and 3) an automatic speech recognition experiment to
further verify the effectiveness of our adaptation
framework. Experiments on a large meeting corpus of
more than 70 hours of speech data show consistent and
significant improvements in terms of word error rate
for language model adaptation based on the topic and
role information.
|
|
[39]
|
C. Qin, M. Carreira-Perpiñán, K. Richmond, A. Wrench, and S. Renals.
Predicting tongue shapes from a few landmark locations.
In Proc. Interspeech, pages 2306-2309, Brisbane, Australia,
September 2008.
[ bib |
.PDF ]
We present a method for predicting the midsagittal
tongue contour from the locations of a few landmarks
(metal pellets) on the tongue surface, as used in
articulatory databases such as MOCHA and the Wisconsin
XRDB. Our method learns a mapping using ground-truth
tongue contours derived from ultrasound data and
drastically improves over spline interpolation. We also
determine the optimal locations of the landmarks, and
the number of landmarks required to achieve a desired
prediction error: 3-4 landmarks are enough to achieve
0.3-0.2 mm error per point on the tongue.
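To make the regression setting concrete, the sketch below fits a
simple ridge-regularised linear map from a few landmark coordinates
to a full contour; the paper's predictive model is more
sophisticated, and the shapes and landmark choices here are
assumptions for illustration only.

import numpy as np

def fit_contour_predictor(landmarks, contours, alpha=1e-3):
    """landmarks: (N, 2k) x/y coords of k pellets; contours: (N, 2m) contour points."""
    X = np.hstack([landmarks, np.ones((landmarks.shape[0], 1))])  # append bias term
    # Ridge-regularised least squares: W = (X'X + alpha*I)^{-1} X'Y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ contours)

def predict_contour(W, landmarks):
    X = np.hstack([landmarks, np.ones((landmarks.shape[0], 1))])
    return X @ W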
|
|
[40]
|
J. Cabral, S. Renals, K. Richmond, and J. Yamagishi.
Glottal spectral separation for parametric speech synthesis.
In Proc. Interspeech, pages 1829-1832, Brisbane, Australia,
September 2008.
[ bib |
.PDF ]
This paper presents a method to control the
characteristics of synthetic speech flexibly by
integrating articulatory features into a Hidden Markov
Model (HMM)-based parametric speech synthesis system.
In contrast to model adaptation and interpolation
approaches for speaking style control, this method is
driven by phonetic knowledge, and target speech samples
are not required. The joint distribution of parallel
acoustic and articulatory features considering
cross-stream feature dependency is estimated. At
synthesis time, acoustic and articulatory features are
generated simultaneously based on the
maximum-likelihood criterion. The synthetic speech can
be controlled flexibly by modifying the generated
articulatory features according to arbitrary phonetic
rules in the parameter generation process. Our
experiments show that the proposed method is effective
in both changing the overall character of synthesized
speech and in controlling the quality of a specific
vowel.
|
|
[41]
|
Songfang Huang and Steve Renals.
Using participant role in multiparty meetings as prior knowledge for
nonparametric topic modeling.
In Proc. ICML/UAI/COLT Workshop on Prior Knowledge for Text and
Language Processing, pages 21-24, Helsinki, Finland, July 2008.
[ bib |
.pdf ]
In this paper we introduce our attempts to incorporate
the participant role information in multiparty meetings
for document modeling using the hierarchical Dirichlet
process. The perplexity and automatic speech
recognition results demonstrate that the participant
role information is a promising prior knowledge source
to be combined with language models for automatic
speech recognition and interaction modeling for
multiparty meetings.
|
|
[42]
|
Steve Renals, Thomas Hain, and Hervé Bourlard.
Interpretation of multiparty meetings: The AMI and AMIDA
projects.
In IEEE Workshop on Hands-Free Speech Communication and
Microphone Arrays, 2008. HSCMA 2008, pages 115-118, 2008.
[ bib |
DOI |
http |
.pdf ]
The AMI and AMIDA projects are collaborative EU
projects concerned with the automatic recognition and
interpretation of multiparty meetings. This paper
provides an overview of the advances we have made in
these projects with a particular focus on the
multimodal recording infrastructure, the publicly
available AMI corpus of annotated meeting recordings,
and the speech recognition framework that we have
developed for this domain.
Keywords: AMI corpus; Meetings; evaluation; speech recognition
|
|
[43]
|
Ravichander Vipperla, Steve Renals, and Joe Frankel.
Longitudinal study of ASR performance on ageing voices.
In Proc. Interspeech, Brisbane, 2008.
[ bib |
.pdf ]
This paper presents the results of a longitudinal
study of ASR performance on ageing voices. Experiments
were conducted on the audio recordings of the
proceedings of the Supreme Court Of The United States
(SCOTUS). Results show that the Automatic Speech
Recognition (ASR) Word Error Rates (WERs) for elderly
voices are significantly higher than those of adult
voices. The word error rate increases gradually as the
age of the elderly speakers increases. Use of maximum
likelihood linear regression (MLLR) based speaker
adaptation on ageing voices improves the WER, though
performance remains considerably worse than for adult
voices. Speaker adaptation does, however, reduce the
increase in WER with age during old age.
|
|
[44]
|
Le Zhang and Steve Renals.
Acoustic-articulatory modelling with the trajectory HMM.
IEEE Signal Processing Letters, 15:245-248, 2008.
[ bib |
.pdf ]
In this letter, we introduce a hidden Markov model
(HMM)-based inversion system to recover articulatory
movements from speech acoustics. Trajectory HMMs are
used as generative models for modelling articulatory
data. Experiments on the MOCHA-TIMIT corpus indicate
that the jointly trained acoustic-articulatory models
are more accurate (lower RMS error) than the separately
trained ones, and that trajectory HMM training results
in greater accuracy compared with conventional maximum
likelihood HMM training. Moreover, the system has the
ability to synthesize articulatory movements directly
from a textual representation.
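For context, the trajectory HMM mentioned above imposes the explicit
constraint o = Wc between the static feature sequence c and the
static-plus-dynamic observations o, so the maximum-likelihood
trajectory for a given state alignment takes the familiar
parameter-generation form (standard formulation, not notation
specific to this letter):

    \bar{c} = \big( W^{\top} \Sigma^{-1} W \big)^{-1} W^{\top} \Sigma^{-1} \mu,

where \mu and \Sigma stack the means and covariances of the states
along the alignment and W is the window matrix that appends the
delta features.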
|
|
[45]
|
Gabriel Murray, Thomas Kleinbauer, Peter Poller, Steve Renals, and Jonathan
Kilgour.
Extrinsic summarization evaluation: A decision audit task.
In Machine Learning for Multimodal Interaction (Proc. MLMI
'08), number 5237 in Lecture Notes in Computer Science, pages 349-361.
Springer, 2008.
[ bib |
DOI |
.pdf ]
In this work we describe a large-scale extrinsic
evaluation of automatic speech summarization
technologies for meeting speech. The particular task is
a decision audit, wherein a user must satisfy a complex
information need, navigating several meetings in order
to gain an understanding of how and why a given
decision was made. We compare the usefulness of
extractive and abstractive technologies in satisfying
this information need, and assess the impact of
automatic speech recognition (ASR) errors on user
performance. We employ several evaluation methods for
participant performance, including post-questionnaire
data, human subjective and objective judgments, and an
analysis of participant browsing behaviour.
|
|
[46]
|
Gabriel Murray and Steve Renals.
Detecting action items in meetings.
In Machine Learning for Multimodal Interaction (Proc. MLMI
'08), number 5237 in Lecture Notes in Computer Science, pages 208-213.
Springer, 2008.
[ bib |
DOI |
http |
.pdf ]
We present a method for detecting action items in
spontaneous meeting speech. Using a supervised approach
incorporating prosodic, lexical and structural
features, we can classify such items with a high degree
of accuracy. We also examine how well various feature
subclasses can perform this task on their own.
|
|
[47]
|
Giulia Garau and Steve Renals.
Combining spectral representations for large vocabulary continuous
speech recognition.
IEEE Transactions on Audio, Speech and Language Processing,
16(3):508-518, 2008.
[ bib |
DOI |
http |
.pdf ]
In this paper we investigate the combination of
complementary acoustic feature streams in large
vocabulary continuous speech recognition (LVCSR). We
have explored the use of acoustic features obtained
using a pitch-synchronous analysis, STRAIGHT, in
combination with conventional features such as mel
frequency cepstral coefficients. Pitch-synchronous
acoustic features are of particular interest when used
with vocal tract length normalisation (VTLN) which is
known to be affected by the fundamental frequency. We
have combined these spectral representations directly
at the acoustic feature level using heteroscedastic
linear discriminant analysis (HLDA) and at the system
level using ROVER. We evaluated this approach on three
LVCSR tasks: dictated newspaper text (WSJCAM0),
conversational telephone speech (CTS), and multiparty
meeting transcription. The CTS and meeting
transcription experiments were both evaluated using
standard NIST test sets and evaluation protocols. Our
results indicate that combining conventional and
pitch-synchronous acoustic feature sets using HLDA
results in a consistent, significant decrease in word
error rate across all three tasks. Combining at the
system level using ROVER resulted in a further
significant decrease in word error rate.
|
|
[48]
|
Heidi Christensen, Yoshihiko Gotoh, and Steve Renals.
A cascaded broadcast news highlighter.
IEEE Transactions on Audio, Speech and Language Processing,
16:151-161, 2008.
[ bib |
DOI |
http |
.pdf ]
This paper presents a fully automatic news skimming
system which takes a broadcast news audio stream and
provides the user with the segmented, structured and
highlighted transcript. This constitutes a system with
three different, cascading stages: converting the audio
stream to text using an automatic speech recogniser,
segmenting into utterances and stories and finally
determining which utterance should be highlighted using
a saliency score. Each stage must operate on the
erroneous output from the previous stage in the system;
an effect which is naturally amplified as the data
progresses through the processing stages. We present a
large corpus of transcribed broadcast news data
enabling us to investigate to which degree information
worth highlighting survives this cascading of
processes. Both extrinsic and intrinsic experimental
results indicate that mistakes in story boundary
detection have a strong impact on the quality of
highlights, whereas erroneous utterance boundaries
cause only minor problems. Further, the difference in
transcription quality does not affect the overall
performance greatly.
|
|
[49]
|
Songfang Huang and Steve Renals.
Modeling topic and role information in meetings using the
hierarchical Dirichlet process.
In A. Popescu-Belis and R. Stiefelhagen, editors, Machine
Learning for Multimodal Interaction V, volume 5237 of Lecture Notes in
Computer Science, pages 214-225. Springer, 2008.
[ bib |
.pdf ]
In this paper, we address the modeling of topic and
role information in multiparty meetings, via a
nonparametric Bayesian model called the hierarchical
Dirichlet process. This model provides a powerful
solution to topic modeling and a flexible framework for
the incorporation of other cues such as speaker role
information. We present our modeling framework for
topic and role on the AMI Meeting Corpus, and
illustrate the effectiveness of the approach in the
context of adapting a baseline language model in a
large-vocabulary automatic speech recognition system
for multiparty meetings. The adapted LM produces
significant improvements in terms of both perplexity
and word error rate.
|
|
[50]
|
Gabriel Murray and Steve Renals.
Meta comments for summarizing meeting speech.
In Machine Learning for Multimodal Interaction (Proc. MLMI
'08), number 5237 in Lecture Notes in Computer Science, pages 236-247.
Springer, 2008.
[ bib |
DOI |
http |
.pdf ]
This paper is about the extractive summarization of
meeting speech, using the ICSI and AMI corpora. In the
first set of experiments we use prosodic, lexical,
structural and speaker-related features to select the
most informative dialogue acts from each meeting, with
the hypothesis being that such a rich mixture of
features will yield the best results. In the second
part, we present an approach in which the
identification of “meta-comments” is used to create
more informative summaries that provide an increased
level of abstraction. We find that the inclusion of
these meta comments improves summarization performance
according to several evaluation metrics.
|
|
[51]
|
Giulia Garau and Steve Renals.
Pitch adaptive features for LVCSR.
In Proc. Interspeech '08, 2008.
[ bib |
.pdf ]
We have investigated the use of a pitch adaptive
spectral representation on large vocabulary speech
recognition, in conjunction with speaker normalisation
techniques. We have compared the effect of a smoothed
spectrogram to the pitch adaptive spectral analysis by
decoupling these two components of STRAIGHT.
Experiments performed on a large vocabulary meeting
speech recognition task highlight the importance of
combining a pitch adaptive spectral representation with
a conventional fixed window spectral analysis. We found
evidence that STRAIGHT pitch adaptive features are more
speaker independent than conventional MFCCs without
pitch adaptation, and thus also provide better
performance when combined using feature combination
techniques such as Heteroscedastic Linear Discriminant
Analysis.
|
|
[52]
|
Alfred Dielmann and Steve Renals.
Recognition of dialogue acts in multiparty meetings using a switching
DBN.
IEEE Transactions on Audio, Speech and Language Processing,
16(7):1303-1314, 2008.
[ bib |
DOI |
http |
.pdf ]
This paper is concerned with the automatic recognition
of dialogue acts (DAs) in multiparty conversational
speech. We present a joint generative model for DA
recognition in which segmentation and classification of
DAs are carried out in parallel. Our approach to DA
recognition is based on a switching dynamic Bayesian
network (DBN) architecture. This generative approach
models a set of features, related to lexical content
and prosody, and incorporates a weighted interpolated
factored language model. The switching DBN coordinates
the recognition process by integrating the component
models. The factored language model, which is estimated
from multiple conversational data corpora, is used in
conjunction with additional task-specific language
models. In conjunction with this joint generative
model, we have also investigated the use of a
discriminative approach, based on conditional random
fields, to perform a reclassification of the segmented
DAs. We have carried out experiments on the AMI corpus
of multimodal meeting recordings, using both manually
transcribed speech, and the output of an automatic
speech recognizer, and using different configurations
of the generative model. Our results indicate that the
system performs well both on reference and fully
automatic transcriptions. A further significant
improvement in recognition accuracy is obtained by the
application of the discriminative reranking approach
based on conditional random fields.
|
|
[53]
|
Herve Bourlard and Steve Renals.
Recognition and understanding of meetings: Overview of the European
AMI and AMIDA projects.
In Proc. LangTech 2008, 2008.
[ bib |
.pdf ]
The AMI and AMIDA projects are concerned with the
recognition and interpretation of multiparty
(face-to-face and remote) meetings. Within these
projects we have developed the following: (1) an
infrastructure for recording meetings using multiple
microphones and cameras; (2) a one hundred hour,
manually annotated meeting corpus; (3) a number of
techniques for indexing and summarizing meeting videos
using automatic speech recognition and computer vision;
and (4) an extensible framework for browsing and
searching meeting videos. We give an overview of
the various techniques developed in AMI (mainly
involving face-to-face meetings), their integration
into our meeting browser framework, and future plans
for AMIDA (Augmented Multiparty Interaction with
Distant Access), the follow-up project to AMI.
Technical and business information related to these two
projects can be found at www.amiproject.org,
respectively on the Scientific and Business portals.
|
|
[54]
|
Songfang Huang and Steve Renals.
Hierarchical Pitman-Yor language models for ASR in meetings.
In Proc. IEEE Workshop on Automatic Speech Recognition and
Understanding (ASRU'07), pages 124-129, Kyoto, Japan, December 2007.
[ bib |
.pdf ]
In this paper we investigate the application of a
novel technique for language modeling - a
hierarchical Bayesian language model (LM) based on the
Pitman-Yor process - on automatic speech recognition
(ASR) for multiparty meetings. The hierarchical
Pitman-Yor language model (HPYLM), which was originally
proposed in the machine learning field, provides a
Bayesian interpretation to language modeling. An
approximation to the HPYLM recovers the exact
formulation of the interpolated Kneser-Ney smoothing
method in n-gram models. This paper focuses on the
application and scalability of HPYLM on a practical
large vocabulary ASR system. Experimental results on
NIST RT06s evaluation meeting data verify that HPYLM is
a competitive and promising language modeling
technique, which consistently performs better than
interpolated Kneser-Ney and modified Kneser-Ney n-gram
LMs in terms of both perplexity (PPL) and word error
rate (WER).
|
|
[55]
|
Junichi Yamagishi, Takao Kobayashi, Steve Renals, Simon King, Heiga Zen, Tomoki
Toda, and Keiichi Tokuda.
Improved average-voice-based speech synthesis using gender-mixed
modeling and a parameter generation algorithm considering GV.
In Proc. 6th ISCA Workshop on Speech Synthesis (SSW-6), August
2007.
[ bib |
.pdf ]
For constructing a speech synthesis system which can
achieve diverse voices, we have been developing a
speaker independent approach of HMM-based speech
synthesis in which statistical average voice models are
adapted to a target speaker using a small amount of
speech data. In this paper, we incorporate a
high-quality speech vocoding method STRAIGHT and a
parameter generation algorithm with global variance
into the system for improving quality of synthetic
speech. Furthermore, we introduce a feature-space
speaker adaptive training algorithm and a gender mixed
modeling technique for conducting further normalization
of the average voice model. We build an English
text-to-speech system using these techniques and show
the performance of the system.
|
|
[56]
|
Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira.
Hierarchical dialogue optimization using semi-Markov decision
processes.
In Proc. of INTERSPEECH, August 2007.
[ bib |
.pdf ]
This paper addresses the problem of dialogue
optimization on large search spaces. For such a
purpose, in this paper we propose to learn dialogue
strategies using multiple Semi-Markov Decision
Processes and hierarchical reinforcement learning. This
approach factorizes state variables and actions in
order to learn a hierarchy of policies. Our experiments
are based on a simulated flight booking dialogue system
and compare flat versus hierarchical reinforcement
learning. Experimental results show that the proposed
approach produced a dramatic search space reduction
(99.36%), and converged four orders of magnitude
faster than flat reinforcement learning with a very
small loss in optimality (on average 0.3 system turns).
Results also report that the learnt policies
outperformed a hand-crafted one under three different
conditions of ASR confidence levels. This approach is
appealing to dialogue optimization due to faster
learning, reusable subsolutions, and scalability to
larger problems.
|
|
[57]
|
A. Dielmann and S. Renals.
DBN based joint dialogue act recognition of multiparty meetings.
In Proc. IEEE ICASSP, volume 4, pages 133-136, April 2007.
[ bib |
.pdf ]
Joint Dialogue Act segmentation and classification of
the new AMI meeting corpus has been performed through
an integrated framework based on a switching dynamic
Bayesian network and a set of continuous features and
language models. The recognition process is based on a
dictionary of 15 DA classes tailored for group
decision-making. Experimental results show that a novel
interpolated Factored Language Model results in a low
error rate on the automatic segmentation task, and thus
good recognition results can be achieved on AMI
multiparty conversational speech.
|
|
[58]
|
A. Dielmann and S. Renals.
Automatic dialogue act recognition using a dynamic Bayesian
network.
In S. Renals, S. Bengio, and J. Fiscus, editors, Proc.
Multimodal Interaction and Related Machine Learning Algorithms Workshop
(MLMI-06), pages 178-189. Springer, 2007.
[ bib |
.pdf ]
We propose a joint segmentation and classification
approach for the dialogue act recognition task on
natural multi-party meetings (ICSI Meeting Corpus).
Five broad DA categories are automatically recognised
using a generative Dynamic Bayesian Network based
infrastructure. Prosodic features and a switching
graphical model are used to estimate DA boundaries, in
conjunction with a factored language model which is
used to relate words and DA categories. This easily
generalizable and extensible system promotes a rational
approach to the joint DA segmentation and recognition
task, and is capable of good recognition performance.
|
|
[59]
|
Songfang Huang and Steve Renals.
Modeling prosodic features in language models for meetings.
In A. Popescu-Belis, S. Renals, and H. Bourlard, editors,
Machine Learning for Multimodal Interaction IV, volume 4892 of Lecture
Notes in Computer Science, pages 191-202. Springer, 2007.
[ bib |
.pdf ]
Prosody has been actively studied as an important
knowledge source for speech recognition and
understanding. In this paper, we are concerned with the
question of exploiting prosody for language models to
aid automatic speech recognition in the context of
meetings. Using an automatic syllable detection
algorithm, the syllable-based prosodic features are
extracted to form the prosodic representation for each
word. Two modeling approaches are then investigated.
One is based on a factored language model, which
directly uses the prosodic representation and treats it
as a `word'. Instead of direct association, the second
approach provides a richer probabilistic structure
within a hierarchical Bayesian framework by introducing
an intermediate latent variable to represent similar
prosodic patterns shared by groups of words. Four-fold
cross-validation experiments on the ICSI Meeting Corpus
show that exploiting prosody for language modeling can
significantly reduce the perplexity, and also yields
marginal reductions in word error rate.
|
|
[60]
|
Alejandro Jaimes, Hervé Bourlard, Steve Renals, and Jean Carletta.
Recording, indexing, summarizing, and accessing meeting videos: An
overview of the AMI project.
In Proc IEEE ICIAPW, pages 59-64, 2007.
[ bib |
DOI |
http |
.pdf ]
In this paper we give an overview of the AMI project.
AMI developed the following: (1) an infrastructure for
recording meetings using multiple microphones and
cameras; (2) a one hundred hour, manually annotated
meeting corpus; (3) a number of techniques for
indexing, and summarizing of meeting videos using
automatic speech recognition and computer vision, and
(4) an extensible framework for browsing, and searching
of meeting videos. We give an overview of the various
techniques developed in AMI, their integration into our
meeting browser framework, and future plans for AMIDA
(Augmented Multiparty Interaction with Distant Access),
the follow-up project to AMI.
|
|
[61]
|
Steve Renals, Thomas Hain, and Hervé Bourlard.
Recognition and interpretation of meetings: The AMI and AMIDA
projects.
In Proc. IEEE Workshop on Automatic Speech Recognition and
Understanding (ASRU '07), 2007.
[ bib |
.pdf ]
The AMI and AMIDA projects are concerned with the
recognition and interpretation of multiparty meetings.
Within these projects we have: developed an
infrastructure for recording meetings using multiple
microphones and cameras; released a 100 hour annotated
corpus of meetings; developed techniques for the
recognition and interpretation of meetings based
primarily on speech recognition and computer vision;
and developed an evaluation framework at both component
and system levels. In this paper we present an overview
of these projects, with an emphasis on speech
recognition and content extraction.
|
|
[62]
|
Gabriel Murray and Steve Renals.
Towards online speech summarization.
In Proc. Interspeech '07, 2007.
[ bib |
.PDF ]
The majority of speech summarization research has
focused on extracting the most informative dialogue
acts from recorded, archived data. However, a
potential use case for speech summarization in the
meetings domain is to facilitate a meeting in progress
by providing the participants - whether they are
attending in person or remotely - with an indication of
the most important parts of the discussion so far.
This requires being able to determine whether a
dialogue act is extract-worthy before the global
meeting context is available. This paper introduces a
novel method for weighting dialogue acts using only
very limited local context, and shows that high
summary precision is possible even when information
about the meeting as a whole is lacking. A new
evaluation framework consisting of weighted precision,
recall and f-score is detailed, and the novel online
summarization method is shown to significantly increase
recall and f-score compared with a method using no
contextual information.
|
|
[63]
|
Gabriel Murray and Steve Renals.
Term-weighting for summarization of multi-party spoken dialogues.
In A. Popescu-Belis, S. Renals, and H. Bourlard, editors,
Machine Learning for Multimodal Interaction IV, volume 4892 of Lecture
Notes in Computer Science, pages 155-166. Springer, 2007.
[ bib |
.pdf ]
This paper explores the issue of term-weighting in the
genre of spontaneous, multi-party spoken dialogues,
with the intent of using such term-weights in the
creation of extractive meeting summaries. The field of
text information retrieval has yielded many
term-weighting techniques to import for our purposes;
this paper implements and compares several of these,
namely tf.idf, Residual IDF and Gain. We propose that
term-weighting for multi-party dialogues can exploit
patterns in word usage among participant speakers,
and introduce the su.idf metric as one attempt to do
so. Results for all metrics are reported on both manual
and automatic speech recognition (ASR) transcripts, and
on both the ICSI and AMI meeting corpora.
|
|
[64]
|
J. Cabral, S. Renals, K. Richmond, and J. Yamagishi.
Towards an improved modeling of the glottal source in statistical
parametric speech synthesis.
In Proc.of the 6th ISCA Workshop on Speech Synthesis, Bonn,
Germany, 2007.
[ bib |
.pdf ]
This paper proposes the use of the Liljencrants-Fant
model (LF-model) to represent the glottal source signal
in HMM-based speech synthesis systems. These systems
generally use a pulse train to model the periodicity of
the excitation signal of voiced speech. However, this
model produces a strong and uniform harmonic structure
throughout the spectrum of the excitation which makes
the synthetic speech sound buzzy. The use of a mixed
band excitation and phase manipulation reduces this
effect but it can result in degradation of the speech
quality if the noise component is not weighted
carefully. In turn, the LF-waveform has a decaying
spectrum at higher frequencies, which is more similar
to the real glottal source excitation signal. We
conducted a perceptual experiment to test the
hypothesis that the LF-model can perform as well as or
better than the pulse train in an HMM-based speech
synthesizer. In the synthesis, we used the mean values
of the LF-parameters, calculated by measurements of the
recorded speech. The result of this study is important
not only for improving the speech quality of
this type of system, but also because the LF-model
can be used to model many characteristics of the
glottal source, such as voice quality, which are
important for voice transformation and generation of
expressive speech.
|
|
[65]
|
Alfred Dielmann and Steve Renals.
Automatic meeting segmentation using dynamic Bayesian networks.
IEEE Transactions on Multimedia, 9(1):25-36, 2007.
[ bib |
DOI |
http |
.pdf ]
Multiparty meetings are a ubiquitous feature of
organizations, and there are considerable economic
benefits that would arise from their automatic analysis
and structuring. In this paper, we are concerned with
the segmentation and structuring of meetings (recorded
using multiple cameras and microphones) into sequences
of group meeting actions such as monologue, discussion
and presentation. We outline four families of
multimodal features based on speaker turns, lexical
transcription, prosody, and visual motion that are
extracted from the raw audio and video recordings. We
relate these low-level features to more complex group
behaviors using a multistream modelling framework based
on multistream dynamic Bayesian networks (DBNs). This
results in an effective approach to the segmentation
problem, resulting in an action error rate of 12.2%,
compared with 43% using an approach based on hidden
Markov models. Moreover, the multistream DBN developed
here leaves scope for many further improvements and
extensions.
|
|
[66]
|
Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira.
Reinforcement learning of dialogue strategies with hierarchical
abstract machines.
In Proc. of IEEE/ACL Workshop on Spoken Language Technology
(SLT), December 2006.
[ bib |
.pdf ]
In this paper we propose partially specified dialogue
strategies for dialogue strategy optimization, where
part of the strategy is specified deterministically and
the rest optimized with Reinforcement Learning (RL). To
do this we apply RL with Hierarchical Abstract Machines
(HAMs). We also propose to build simulated users using
HAMs, incorporating a combination of hierarchical
deterministic and probabilistic behaviour. We performed
experiments using a single-goal flight booking dialogue
system, and compare two dialogue strategies
(deterministic and optimized) using three types of
simulated user (novice, experienced and expert). Our
results show that HAMs are promising for both dialogue
optimization and simulation, and provide evidence that
indeed partially specified dialogue strategies can
outperform deterministic ones (on average 4.7 fewer
system turns) with faster learning than the traditional
RL framework.
|
|
[67]
|
Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira.
Learning multi-goal dialogue strategies using reinforcement learning
with reduced state-action spaces.
In Proc. of INTERSPEECH, September 2006.
[ bib |
.pdf ]
Learning dialogue strategies using the reinforcement
learning framework is problematic due to its expensive
computational cost. In this paper we propose an
algorithm that reduces a state-action space to one
which includes only valid state-actions. We performed
experiments on full and reduced spaces using three
systems (with 5, 9 and 20 slots) in the travel domain
using a simulated environment. The task was to learn
multi-goal dialogue strategies optimizing single and
multiple confirmations. Average results using
strategies learnt on reduced spaces reveal the
following benefits over full spaces: 1) less
computer memory (94% reduction), 2) faster learning
(93% faster convergence), and 3) better performance (8.4%
fewer time steps and 7.7% higher reward).
|
|
[68]
|
Le Zhang and Steve Renals.
Phone recognition analysis for trajectory HMM.
In Proc. Interspeech 2006, Pittsburgh, USA, September 2006.
[ bib |
.pdf ]
The trajectory HMM has been shown to be useful for
model-based speech synthesis where a smoothed
trajectory is generated using temporal constraints
imposed by dynamic features. To evaluate the
performance of such a model on an ASR task, we present a
trajectory decoder based on tree search with delayed
path merging. An experiment on a speaker-dependent phone
recognition task using the MOCHA-TIMIT database shows
that the MLE-trained trajectory model, while retaining
the attractive properties of a proper generative
model, tends to favour over-smoothed trajectories among
competing hypotheses, and does not perform better
than a conventional HMM. We use this to argue
that models giving a better fit to the training data
may suffer a reduction in discrimination by being too
faithful to the training data. This partially explains why
alternative acoustic models that try to explicitly
model temporal constraints do not achieve significant
improvements in ASR.
|
|
[69]
|
G. Murray and S. Renals.
Dialogue act compression via pitch contour preservation.
In Proceedings of the 9th International Conference on Spoken
Language Processing, Pittsburgh, USA, September 2006.
[ bib |
.pdf ]
This paper explores the usefulness of prosody in
automatically compressing dialogue acts from meeting
speech. Specifically, this work attempts to compress
utterances by preserving the pitch contour of the
original whole utterance. Two methods of doing this are
described in detail and are evaluated
subjectively using human annotators and
objectively using edit distance with a
human-authored gold-standard. Both metrics show that
such a prosodic approach is much better than the random
baseline approach and significantly better than a
simple text compression method.
|
|
[70]
|
G. Murray, S. Renals, J. Moore, and J. Carletta.
Incorporating speaker and discourse features into speech
summarization.
In Proceedings of the Human Language Technology Conference -
North American Chapter of the Association for Computational Linguistics
Meeting (HLT-NAACL) 2006, New York City, USA, June 2006.
[ bib |
.pdf ]
The research presented herein explores the usefulness
of incorporating speaker and discourse features in an
automatic speech summarization system applied to
meeting recordings from the ICSI Meetings corpus. By
analyzing speaker activity, turn-taking and discourse
cues, it is hypothesized that a system can outperform
solely text-based methods inherited from the field of
text summarization. The summarization methods are
described, two evaluation methods are applied and
compared, and the results clearly show that utilizing
such features is advantageous and efficient. Even
simple methods relying on discourse cues and speaker
activity can outperform text summarization approaches.
|
|
[71]
|
G. Murray, S. Renals, and M. Taboada.
Prosodic correlates of rhetorical relations.
In Proceedings of HLT/NAACL ACTS Workshop, 2006, New York City,
USA, June 2006.
[ bib |
.pdf ]
This paper investigates the usefulness of prosodic
features in classifying rhetorical relations between
utterances in meeting recordings. Five rhetorical
relations of contrast, elaboration,
summary, question and cause
are explored. Three training methods - supervised,
unsupervised, and combined - are compared, and
classification is carried out using support vector
machines. The results of this pilot study are
encouraging but mixed, with pairwise classification
achieving an average of 68% accuracy in discerning
between relation pairs using only prosodic features,
but multi-class classification performing only slightly
better than chance.
|
|
[72]
|
M. Al-Hames, A. Dielmann, D. Gatica-Perez, S. Reiter, S. Renals, G. Rigoll, and
D. Zhang.
Multimodal integration for meeting group action segmentation and
recognition.
In S. Renals and S. Bengio, editors, Proc. Multimodal
Interaction and Related Machine Learning Algorithms Workshop (MLMI-05),
pages 52-63. Springer, 2006.
[ bib ]
We address the problem of segmentation and recognition
of sequences of multimodal human interactions in
meetings. These interactions can be seen as a rough
structure of a meeting, and can be used either as input
for a meeting browser or as a first step towards a
higher semantic analysis of the meeting. A common
lexicon of multimodal group meeting actions, a shared
meeting data set, and a common evaluation procedure
enable us to compare the different approaches. We
compare three different multimodal feature sets and our
modelling infrastructures: a higher semantic feature
approach, multi-layer HMMs, a multistream DBN, as well
as a multi-stream mixed-state DBN for disturbed data.
|
|
[73]
|
P. Hsueh, J. Moore, and S. Renals.
Automatic segmentation of multiparty dialogue.
In Proc. EACL06, 2006.
[ bib |
.pdf ]
In this paper, we investigate the problem of
automatically predicting segment boundaries in spoken
multiparty dialogue. We extend prior work in two ways.
We first apply approaches that have been proposed for
predicting top-level topic shifts to the problem of
identifying subtopic boundaries. We then explore the
impact on performance of using ASR output as opposed to
human transcription. Examination of the effect of
features shows that predicting top-level and predicting
subtopic boundaries are two distinct tasks: (1) for
predicting subtopic boundaries, the lexical
cohesion-based approach alone can achieve competitive
results, (2) for predicting top-level boundaries, the
machine learning approach that combines
lexical cohesion and conversational features performs
best, and (3) conversational cues, such as cue phrases
and overlapping speech, are better indicators for the
top-level prediction task. We also find that the
transcription errors inevitable in ASR output have a
negative impact on models that combine lexical cohesion
and conversational features, but do not change the
general preference of approach for the two tasks.
|
|
[74]
|
Marc Al-Hames, Thomas Hain, Jan Cernocky, Sascha Schreiber, Mannes Poel, Ronald
Mueller, Sebastien Marcel, David van Leeuwen, Jean-Marc Odobez, Sileye Ba,
Hervé Bourlard, Fabien Cardinaux, Daniel Gatica-Perez, Adam Janin, Petr
Motlicek, Stephan Reiter, Steve Renals, Jeroen van Rest, Rutger Rienks,
Gerhard Rigoll, Kevin Smith, Andrew Thean, and Pavel Zemcik.
Audio-video processing in meetings: Seven questions and current AMI
answers.
In S. Renals, S. Bengio, and J. G. Fiscus, editors, Machine
Learning for Multimodal Interaction (Proc. MLMI '06), volume 4299 of
Lecture Notes in Computer Science, pages 24-35. Springer, 2006.
[ bib ]
|
|
[75]
|
Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira.
Human-computer dialogue simulation using hidden markov models.
In Proc. of IEEE Workshop on Automatic Speech Recognition and
Understanding (ASRU), November 2005.
[ bib |
.pdf ]
This paper presents a probabilistic method to simulate
task-oriented human-computer dialogues at the intention
level, that may be used to improve or to evaluate the
performance of spoken dialogue systems. Our method uses
a network of Hidden Markov Models (HMMs) to predict
system and user intentions, where a “language model”
predicts sequences of goals and the component HMMs
predict sequences of intentions. We compare standard
HMMs, Input HMMs and Input-Output HMMs in an effort to
better predict sequences of intentions. In addition, we
propose a dialogue similarity measure to evaluate the
realism of the simulated dialogues. We performed
experiments using the DARPA Communicator corpora and
report results with three different metrics: dialogue
length, dialogue similarity and precision-recall.
|
|
[76]
|
G. Garau, S. Renals, and T. Hain.
Applying vocal tract length normalization to meeting recordings.
In Proc. Interspeech, September 2005.
[ bib |
.pdf ]
Vocal Tract Length Normalisation (VTLN) is a commonly
used technique to normalise for inter-speaker
variability. It is based on the speaker-specific
warping of the frequency axis, parameterised by a
scalar warp factor. This factor is typically estimated
using maximum likelihood. We discuss how VTLN may be
applied to multiparty conversations, reporting a
substantial decrease in word error rate in experiments
using the ICSI meetings corpus. We investigate the
behaviour of the VTLN warping factor and show that a
stable estimate is not obtained. Instead it appears to
be influenced by the context of the meeting, in
particular the current conversational partner. These
results are consistent with predictions made by the
psycholinguistic interactive alignment account of
dialogue, when applied at the acoustic and phonological
levels.
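A minimal sketch of the maximum-likelihood warp factor search described above; warp_features and log_likelihood are placeholders standing in for a real front-end and acoustic model, and the grid of candidate factors is only indicative.

import numpy as np

def estimate_warp_factor(speaker_audio, warp_features, log_likelihood,
                         alphas=np.arange(0.88, 1.13, 0.02)):
    """Grid search for the frequency warp factor that maximises the
    acoustic log likelihood of one speaker's data."""
    best_alpha, best_ll = 1.0, -np.inf
    for alpha in alphas:
        feats = warp_features(speaker_audio, alpha)   # warped front-end features
        ll = log_likelihood(feats)                    # score under current models
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha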
|
|
[77]
|
G. Murray, S. Renals, and J. Carletta.
Extractive summarization of meeting recordings.
In Proc. Interspeech, September 2005.
[ bib |
.pdf ]
Several approaches to automatic speech summarization
are discussed below, using the ICSI Meetings corpus. We
contrast feature-based approaches using prosodic and
lexical features with maximal marginal relevance and
latent semantic analysis approaches to summarization.
While the latter two techniques are borrowed directly
from the field of text summarization, feature-based
approaches using prosodic information are able to
utilize characteristics unique to speech data. We also
investigate how the summarization results might
deteriorate when carried out on ASR output as opposed
to manual transcripts. All of the summaries are of an
extractive variety, and are compared using the software
ROUGE.
|
|
[78]
|
G. Murray, S. Renals, J. Carletta, and J. Moore.
Evaluating automatic summaries of meeting recordings.
In Proceedings of the 43rd Annual Meeting of the Association for
Computational Linguistics, Ann Arbor, MI, USA, June 2005.
[ bib |
.pdf ]
The research below explores schemes for evaluating
automatic summaries of business meetings, using the
ICSI Meeting Corpus. Both automatic and subjective
evaluations were carried out, with a central interest
being whether or not the two types of evaluations
correlate with each other. The evaluation metrics were
used to compare and contrast differing approaches to
automatic summarization, the deterioration of summary
quality on ASR output versus manual transcripts, and to
determine whether manual extracts are rated
significantly higher than automatic extracts.
|
|
[79]
|
H. Christensen, B. Kolluru, Y. Gotoh, and S. Renals.
Maximum entropy segmentation of broadcast news.
In Proc. IEEE ICASSP, 2005.
[ bib |
.ps.gz |
.pdf ]
This paper presents an automatic system for
structuring and preparing a news broadcast for
applications such as speech summarization, browsing,
archiving and information retrieval. This process
comprises transcribing the audio using an automatic
speech recognizer and subsequently segmenting the text
into utterances and topics. A maximum entropy approach
is used to build statistical models for both utterance
and topic segmentation. The experimental work addresses
the effect on performance of the topic boundary
detector of three factors: the information sources
used, the quality of the ASR transcripts, and the
quality of the utterance boundary detector. The results
show that the topic segmentation is not affected
severely by transcript errors, whereas errors in the
utterance segmentation are more devastating.
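(Standard form, included for reference rather than as a description of this particular system.) A maximum entropy model of a boundary decision b given observed features x is the log-linear distribution

\[
P(b \mid x) \;=\; \frac{1}{Z(x)} \exp\Bigl(\sum_i \lambda_i f_i(x, b)\Bigr),
\qquad
Z(x) \;=\; \sum_{b'} \exp\Bigl(\sum_i \lambda_i f_i(x, b')\Bigr),
\]

where the f_i are feature functions over the information sources (lexical cues, pause durations and so on) and the weights \lambda_i are trained so that the model's feature expectations match those observed in the training data.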
|
|
[80]
|
T. Hain, J. Dines, G. Garau, M. Karafiat, D. Moore, V. Wan, R. Ordelman, and
S. Renals.
Transcription of conference room meetings: an investigation.
In Proc. Interspeech, 2005.
[ bib |
.pdf ]
The automatic processing of speech collected in
conference style meetings has attracted considerable
interest with several large scale projects devoted to
this area. In this paper we explore the use of various
meeting corpora for the purpose of automatic speech
recognition. In particular we investigate the
similarity of these resources and how to efficiently
use them in the construction of a meeting transcription
system. The analysis shows distinctive features for
each resource. However the benefit in pooling data and
hence the similarity seems sufficient to speak of a
generic conference meeting domain . In this context
this paper also presents work on development for the
AMI meeting transcription system, a joint effort by
seven sites working on the AMI (augmented multi-party
interaction) project.
|
|
[81]
|
T. Hain, L. Burget, J. Dines, G. Garau, M. Karafiat, M. Lincoln, I. McCowan,
D. Moore, V. Wan, R. Ordelman, and S. Renals.
The 2005 AMI system for the transcription of speech in meetings.
In Proceedings of the Rich Transcription 2005 Spring Meeting
Recognition Evaluation, 2005.
[ bib |
.pdf ]
In this paper we describe the 2005 AMI system for the
transcription of speech in meetings used in the 2005
NIST RT evaluations. The system was designed for
participation in the speech to text part of the
evaluations, in particular for transcription of speech
recorded with multiple distant microphones and
independent headset microphones. System performance was
tested on both conference room and lecture style
meetings. Although input sources are processed using
different frontends, the recognition process is based
on a unified system architecture. The system operates
in multiple passes and makes use of state of the art
technologies such as discriminative training, vocal
tract length normalisation, heteroscedastic linear
discriminant analysis, speaker adaptation with maximum
likelihood linear regression and minimum word error
rate decoding. In this paper we describe the system
performance on the official development and test sets
for the NIST RT05s evaluations. The system was jointly
developed in less than 10 months by a multi-site team
and was shown to achieve competitive performance.
|
|
[82]
|
S. J. Wrigley, G. J. Brown, V. Wan, and S. Renals.
Speech and crosstalk detection in multi-channel audio.
IEEE Trans. on Speech and Audio Processing, 13:84-91, 2005.
[ bib |
.pdf ]
The analysis of scenarios in which a number of
microphones record the activity of speakers, such as in
a roundtable meeting, presents a number of
computational challenges. For example, if each
participant wears a microphone, it can receive speech
from both the microphone's wearer (local speech) and
from other participants (crosstalk). The recorded audio
can be broadly classified in four ways: local speech,
crosstalk plus local speech, crosstalk alone and
silence. We describe two experiments related to the
automatic classification of audio into these four
classes. The first experiment attempted to optimise a
set of acoustic features for use with a Gaussian
mixture model (GMM) classifier. A large set of
potential acoustic features were considered, some of
which have been employed in previous studies. The
best-performing features were found to be kurtosis,
fundamentalness and cross-correlation metrics. The
second experiment used these features to train an
ergodic hidden Markov model classifier. Tests performed
on a large corpus of recorded meetings show
classification accuracies of up to 96%, and automatic
speech recognition performance close to that obtained
using ground truth segmentation.
|
|
[83]
|
Jerry Goldman, Steve Renals, Steven Bird, Franciska de Jong, Marcello
Federico, Carl Fleischhauer, Mark Kornbluh, Lori Lamel, Doug Oard, Clare
Stewart, and Richard Wright.
Accessing the spoken word.
International Journal of Digital Libraries, 5(4):287-298,
2005.
[ bib |
.ps.gz |
.pdf ]
Spoken word audio collections cover many domains,
including radio and television broadcasts, oral
narratives, governmental proceedings, lectures, and
telephone conversations. The collection, access and
preservation of such data is stimulated by political,
economic, cultural and educational needs. This paper
outlines the major issues in the field, reviews the
current state of technology, examines the rapidly
changing policy issues relating to privacy and
copyright, and presents issues relating to the
collection and preservation of spoken audio content.
|
|
[84]
|
Y. Hifny, S. Renals, and N. Lawrence.
A hybrid MaxEnt/HMM based ASR system.
In Proc. Interspeech, 2005.
[ bib |
.pdf ]
The aim of this work is to develop a practical
framework, which extends the classical Hidden Markov
Models (HMM) for continuous speech recognition based on
the Maximum Entropy (MaxEnt) principle. The MaxEnt
models can estimate the posterior probabilities
directly as with Hybrid NN/HMM connectionist speech
recognition systems. In particular, a new acoustic
modelling approach based on discriminative MaxEnt models is
formulated and is being developed to replace the
generative Gaussian Mixture Models (GMM) commonly used
to model acoustic variability. Initial experimental
results using the TIMIT phone task are reported.
|
|
[85]
|
A. Dielmann and S. Renals.
Multistream dynamic Bayesian network for meeting segmentation.
In S. Bengio and H. Bourlard, editors, Proc. Multimodal
Interaction and Related Machine Learning Algorithms Workshop (MLMI-04),
pages 76-86. Springer, 2005.
[ bib |
.ps.gz |
.pdf ]
This paper investigates the automatic analysis and
segmentation of meetings. A meeting is analysed in
terms of individual behaviours and group interactions,
in order to decompose each meeting in a sequence of
relevant phases, named meeting actions. Three feature
families are extracted from multimodal recordings:
prosody from individual lapel microphone signals,
speaker activity from microphone array data and lexical
features from textual transcripts. A statistical
approach is then used to relate low-level features with
a set of abstract categories. In order to provide a
flexible and powerful framework, we have employed a
dynamic Bayesian network based model, characterized by
multiple stream processing and flexible state duration
modelling. Experimental results demonstrate the
strength of this system, providing a meeting action
error rate of 9%.
|
|
[86]
|
V. Wan and S. Renals.
Speaker verification using sequence discriminant support vector
machines.
IEEE Trans. on Speech and Audio Processing, 13:203-210, 2005.
[ bib |
.ps.gz |
.pdf ]
This paper presents a text-independent speaker
verification system using support vector machines
(SVMs) with score-space kernels. Score-space kernels
generalize Fisher kernels, and are based on an
underlying generative model, such as a Gaussian mixture
model (GMM). This approach provides direct
discrimination between whole sequences, in contrast to
the frame-level approaches at the heart of most current
systems. The resultant SVMs have a very high
dimensionality, which is related to the number of
parameters in the underlying generative model. To
ameliorate problems that can arise in the resultant
optimization, we introduce a technique called spherical
normalization that preconditions the Hessian matrix. We
have performed speaker verification experiments using
the PolyVar database. The SVM system presented here
reduces the relative error rates by 34% compared to a
GMM likelihood ratio system.
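(Textbook definition; the notation is ours, not the paper's.) Given a generative model p(X | \theta) such as a GMM, each whole utterance X is mapped to its Fisher score, and the kernel is an inner product in that score space:

\[
U_X \;=\; \nabla_{\theta} \log p(X \mid \theta),
\qquad
K(X, Y) \;=\; U_X^{\top} F^{-1} U_Y,
\]

where F is the Fisher information matrix, often approximated by a diagonal matrix or the identity. The dimension of U_X equals the number of generative model parameters, which is why the spherical normalisation mentioned above is needed to keep the SVM optimisation well conditioned.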
|
|
[87]
|
Konstantinos Koumpis and Steve Renals.
Automatic summarization of voicemail messages using lexical and
prosodic features.
ACM Transactions on Speech and Language Processing, 2(1):1-24,
2005.
[ bib |
.ps.gz |
.pdf ]
This paper presents trainable methods for extracting
principal content words from voicemail messages. The
short text summaries generated are suitable for mobile
messaging applications. The system uses a set of
classifiers to identify the summary words, with each
word being identified by a vector of lexical and
prosodic features. We use an ROC-based algorithm,
Parcel, to select input features (and classifiers). We
have performed a series of objective and subjective
evaluations using unseen data from two different speech
recognition systems, as well as human transcriptions of
voicemail speech.
|
|
[88]
|
Konstantinos Koumpis and Steve Renals.
Content-based access to spoken audio.
IEEE Signal Processing Magazine, 22(5):61-69, 2005.
[ bib |
.pdf ]
"How analysis, retrieval and delivery phases make
spoken audio content more accessible"
|
|
[89]
|
T. Hain, L. Burget, J. Dines, G. Garau, M. Karafiat, M. Lincoln, I. McCowan,
D. Moore, V. Wan, R. Ordelman, and S. Renals.
The development of the AMI system for the transcription of speech
in meetings.
In 2nd Joint Workshop on Multimodal Interaction and Related
Machine Learning Algorithms, 2005.
[ bib |
.pdf ]
The automatic processing of speech collected in
conference style meetings has attracted considerable
interest with several large scale projects devoted to
this area. This paper describes the development of a
baseline automatic speech transcription system for
meetings in the context of the AMI (Augmented
Multiparty Interaction) project. We present several
techniques important to processing of this data and
show the performance in terms of word error rates
(WERs). An important aspect of transcription of this
data is the necessary flexibility in terms of audio
pre-processing. Real world systems have to deal with
flexible input, for example by using microphone arrays
or randomly placed microphones in a room. Automatic
segmentation and microphone array processing techniques
are described and the effect on WERs is discussed. The
system and its components presented in this paper yield
competitive performance and form a baseline for future
research in this domain.
|
|
[90]
|
H. Christensen, B. Kolluru, Y. Gotoh, and S. Renals.
From text summarisation to style-specific summarisation for broadcast
news.
In Proc. ECIR-2004, 2004.
[ bib |
.ps.gz |
.pdf ]
In this paper we report on a series of experiments
investigating the path from text-summarisation to
style-specific summarisation of spoken news stories. We
show that the portability of traditional text
summarisation features to broadcast news is dependent
on the diffusiveness of the information in the
broadcast news story. An analysis of two categories of
news stories (containing only read speech or some
spontaneous speech) demonstrates the importance of the
style and the quality of the transcript, when
extracting the summary-worthy information content.
Further experiments indicate the advantages of doing
style-specific summarisation of broadcast news.
|
|
[91]
|
A. Dielmann and S. Renals.
Dynamic Bayesian networks for meeting structuring.
In Proc. IEEE ICASSP, 2004.
[ bib |
.ps.gz |
.pdf ]
This paper is about the automatic structuring of
multiparty meetings using audio information. We have
used a corpus of 53 meetings, recorded using a
microphone array and lapel microphones for each
participant. The task was to segment meetings into a
sequence of meeting actions, or phases. We have adopted
a statistical approach using dynamic Bayesian networks
(DBNs). Two DBN architectures were investigated: a
two-level hidden Markov model (HMM) in which the
acoustic observations were concatenated; and a
multistream DBN in which two separate observation
sequences were modelled. Additionally we have also
explored the use of counter variables to constrain the
number of action transitions. Experimental results
indicate that the DBN architectures are an improvement
over a simple baseline HMM, with the multistream DBN
with counter constraints producing an action error rate
of 6%.
|
|
[92]
|
A. Dielmann and S. Renals.
Multi-stream segmentation of meetings.
In Proc. IEEE Workshop on Multimedia Signal Processing, 2004.
[ bib |
.ps.gz |
.pdf ]
This paper investigates the automatic segmentation of
meetings into a sequence of group actions or phases.
Our work is based on a corpus of multiparty meetings
collected in a meeting room instrumented with video
cameras, lapel microphones and a microphone array. We
have extracted a set of feature streams, in this case
extracted from the audio data, based on speaker turns,
prosody and a transcript of what was spoken. We have
related these signals to the higher level semantic
categories via a multistream statistical model based on
dynamic Bayesian networks (DBNs). We report on a set of
experiments in which different DBN architectures are
compared, together with the different feature streams.
The resultant system has an action error rate of 9%.
|
|
[93]
|
Y. H. Abdel-Haleem, S. Renals, and N. D. Lawrence.
Acoustic space dimensionality selection and combination using the
maximum entropy principle.
In Proc. IEEE ICASSP, 2004.
[ bib |
.pdf ]
In this paper we propose a discriminative approach to
acoustic space dimensionality selection based on
maximum entropy modelling. We form a set of constraints
by composing the acoustic space with the space of phone
classes, and use a continuous feature formulation of
maximum entropy modelling to select an optimal feature
set. The suggested approach has two steps: (1) the
selection of the best acoustic space that efficiently
and economically represents the acoustic data and its
variability; (2) the combination of selected acoustic
features in the maximum entropy framework to estimate
the posterior probabilities over the phonetic labels
given the acoustic input. Specific contributions of
this paper include a parameter estimation algorithm
(generalized improved iterative scaling) that enables
the use of negative features, the parameterization of
constraint functions using Gaussian mixture models, and
experimental results using the TIMIT database.
|
|
[94]
|
Y. Gotoh and S. Renals.
Language modelling.
In S. Renals and G. Grefenstette, editors, Text and Speech
Triggered Information Access, number 2705 in Lecture Notes in Computer
Science, pages 78-105. Springer-Verlag, 2003.
[ bib ]
This is a preprint of a tutorial on statistical
language modelling, based on Yoshi Gotoh's course at
the
ELSNET-2000 Summer School on Text and Speech
Triggered Information Access.
|
|
[95]
|
K. Koumpis and S. Renals.
Evaluation of extractive voicemail summarization.
In Proc. ISCA Workshop on Multilingual Spoken Document
Retrieval, pages 19-24, 2003.
[ bib |
.ps.gz |
.pdf ]
This paper is about the evaluation of a system that
generates short text summaries of voicemail messages,
suitable for transmission as text messages. Our
approach to summarization is based on a
speech-recognized transcript of the voicemail message,
from which a set of summary words is extracted. The
system uses a classifier to identify the summary words,
with each word being identified by a vector of lexical
and prosodic features. The features are selected using
Parcel, an ROC-based algorithm. Our evaluations of the
system, using a slot error rate metric, have compared
manual and automatic summarization, and manual and
automatic recognition (using two different
recognizers). We also report on two subjective
evaluations using mean opinion score of summaries, and
a set of comprehension tests. The main results from
these experiments were that the perceived difference in
quality of summarization was affected more by errors
resulting from automatic transcription, than by the
automatic summarization process.
|
|
[96]
|
S. Renals and D. Ellis.
Audio information access from meeting rooms.
In Proc. IEEE ICASSP, volume 4, pages 744-747, 2003.
[ bib |
.ps.gz |
.pdf ]
We investigate approaches to accessing information
from the streams of audio data that result from
multi-channel recordings of meetings. The methods
investigated use word-level transcriptions, and
information derived from models of speaker activity and
speaker turn patterns. Our experiments include spoken
document retrieval for meetings, automatic structuring
of meetings based on self-similarity matrices of
speaker turn patterns and a simple model of speaker
activity. Meeting recordings are rich in both lexical
and non-lexical information; our results illustrate
some novel kinds of analysis made possible by a
transcribed corpus of natural meetings.
|
|
[97]
|
B. Kolluru, H. Christensen, Y. Gotoh, and S. Renals.
Exploring the style-technique interaction in extractive summarization
of broadcast news.
In Proc. IEEE Automatic Speech Recognition and Understanding
Workshop, 2003.
[ bib |
.ps.gz |
.pdf ]
In this paper we seek to explore the interaction
between the style of a broadcast news story and its
summarization technique. We report the performance of
three different summarization techniques on broadcast
news stories, which are split into planned speech and
spontaneous speech. The initial results indicate that
some summarization techniques work better for the
documents with spontaneous speech than for those with
planned speech. Even for human beings, some documents
are inherently difficult to summarize. We observe this
correlation between the degree of difficulty in summarizing
and the performance of the three automatic summarizers.
Given the high frequency of named entities in broadcast
news, and the even greater number of references to these
named entities, we also gauge the effect of named
entity and coreference resolution in a news story on
the performance of these summarizers.
|
|
[98]
|
K. Koumpis and S. Renals.
Multi-class extractive voicemail summarization.
In Proc. Eurospeech, pages 2785-2788, 2003.
[ bib |
.pdf ]
This paper is about a system that extracts principal
content words from speech-recognized transcripts of
voicemail messages and classifies them into proper
names, telephone numbers, dates/times and `other'. The
short text summaries generated are suitable for mobile
messaging applications. The system uses a set of
classifiers to identify the summary words, with each
word being identified by a vector of lexical and
prosodic features. The features are selected using
Parcel, an ROC-based algorithm. We visually compare the
role of a large number of individual features and
discuss effective ways to combine them. We finally
evaluate their performance on manual and automatic
transcriptions derived from two different speech
recognition systems.
|
|
[99]
|
V. Wan and S. Renals.
SVMSVM: Support vector machine speaker verification methodology.
In Proc. IEEE ICASSP, volume 2, pages 221-224, 2003.
[ bib |
.ps.gz |
.pdf ]
Support vector machines with the Fisher and
score-space kernels are used for text independent
speaker verification to provide direct discrimination
between complete utterances. This is unlike approaches
such as discriminatively trained Gaussian mixture
models or other discriminative classifiers that
discriminate at the frame-level only. Using the
sequence-level discrimination approach we are able to
achieve error-rates that are significantly better than
the current state-of-the-art on the PolyVar database.
|
|
[100]
|
H. Christensen, Y. Gotoh, B. Kolluru, and S. Renals.
Are extractive text summarisation techniques portable to broadcast
news?
In Proc. IEEE Automatic Speech Recognition and Understanding
Workshop, 2003.
[ bib |
.ps.gz |
.pdf ]
In this paper we report on a series of experiments
which compare the effect of individual features on both
text and speech summarisation, the effect of basing the
speech summaries on automatic speech recognition
transcripts with varying word error rates, and the
effect of summarisation approach and transcript source
on summary quality. We show that classical text
summarisation features (based on stylistic and content
information) are portable to broadcast news. However,
the quality of the speech transcripts as well as the
difference in information structure between broadcast
and newspaper news affect the usability of the
individual features.
|
|
[101]
|
S. Wrigley, G. Brown, V. Wan, and S. Renals.
Feature selection for the classification of crosstalk in
multi-channel audio.
In Proc. Eurospeech, pages 469-472, 2003.
[ bib |
.pdf ]
An extension to the conventional speech / nonspeech
classification framework is presented for a scenario in
which a number of microphones record the activity of
speakers present at a meeting (one microphone per
speaker). Since each microphone can receive speech from
both the participant wearing the microphone (local
speech) and other participants (crosstalk), the
recorded audio can be broadly classified in four ways:
local speech, crosstalk plus local speech, crosstalk
alone and silence. We describe a classifier in which a
Gaussian mixture model (GMM) is used to model each
class. A large set of potential acoustic features are
considered, some of which have been employed in
previous speech / nonspeech classifiers. A combination
of two feature selection algorithms is used to identify
the optimal feature set for each class. Results from
the GMM classifier using the selected features are
superior to those of a previously published approach.
|
|
[102]
|
A. J. Robinson, G. D. Cook, D. P. W. Ellis, E. Fosler-Lussier, S. J. Renals,
and D. A. G. Williams.
Connectionist speech recognition of broadcast news.
Speech Communication, 37:27-45, 2002.
[ bib |
.ps.gz |
.pdf ]
This paper describes connectionist techniques for
recognition of Broadcast News. The fundamental
difference between connectionist systems and more
conventional mixture-of-Gaussian systems is that
connectionist models directly estimate posterior
probabilities as opposed to likelihoods. Access to
posterior probabilities has enabled us to develop a
number of novel approaches to confidence estimation,
pronunciation modelling and search. In addition we have
investigated a new feature extraction technique based
on the modulation-filtered spectrogram, and methods for
combining multiple information sources. We have
incorporated all of these techniques into a system for
the transcription of Broadcast News, and we present
results on the 1998 DARPA Hub-4E Broadcast News
evaluation data.
|
|
[103]
|
O. Pietquin and S. Renals.
ASR system modeling for automatic evaluation and optimization of
dialogue systems.
In Proc IEEE ICASSP, pages 46-49, 2002.
[ bib |
.pdf ]
Though the field of spoken dialogue systems has
developed quickly in the last decade, rapid design of
dialogue strategies remains difficult. Several approaches
to the problem of automatic strategy learning have been
proposed and the use of Reinforcement Learning
introduced by Levin and Pieraccini is becoming part of
the state of the art in this area. However, the quality
of the strategy learned by the system depends on the
definition of the optimization criterion and on the
accuracy of the environment model. In this paper, we
propose to bring a model of an ASR system into the
simulated environment in order to enhance the learned
strategy. To do so, we introduce recognition error
rates and confidence levels produced by ASR systems into
the optimization criterion.
|
|
[104]
|
V. Wan and S. Renals.
Evaluation of kernel methods for speaker verification and
identification.
In Proc IEEE ICASSP, pages 669-672, 2002.
[ bib |
.pdf ]
Support vector machines are evaluated on speaker
verification and speaker identification tasks. We
compare the polynomial kernel, the Fisher kernel, a
likelihood ratio kernel and the pair hidden Markov
model kernel with baseline systems based on a
discriminative polynomial classifier and generative
Gaussian mixture model classifiers. Simulations were
carried out on the YOHO database and some promising
results were obtained.
|
|
[105]
|
K. Koumpis, S. Renals, and M. Niranjan.
Extractive summarization of voicemail using lexical and prosodic
feature subset selection.
In Proc. Eurospeech, pages 2377-2380, Aalborg, Denmark, 2001.
[ bib |
.ps.gz |
.pdf ]
This paper presents a novel data-driven approach to
summarizing spoken audio transcripts utilizing lexical
and prosodic features. The former are obtained from a
speech recognizer and the latter are extracted
automatically from speech waveforms. We employ a
feature subset selection algorithm, based on ROC
curves, which examines different combinations of
features at different target operating conditions. The
approach is evaluated on the IBM Voicemail corpus,
demonstrating that it is possible and desirable to
avoid complete commitment to a single best classifier
or feature set.
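Parcel itself selects classifier/feature combinations on the convex hull of their ROC curves; as a much simpler stand-in that only conveys the flavour, the sketch below ranks individual features by single-feature ROC area using scikit-learn's roc_auc_score. All names are illustrative.

import numpy as np
from sklearn.metrics import roc_auc_score

def rank_features_by_roc(X, y):
    """X: (n_words, n_features) lexical and prosodic features;
    y: 1 if the word was kept in the reference summary, else 0.
    Returns feature indices sorted best-first by ROC area."""
    aucs = []
    for j in range(X.shape[1]):
        auc = roc_auc_score(y, X[:, j])
        aucs.append(max(auc, 1.0 - auc))   # direction-agnostic score
    return np.argsort(aucs)[::-1]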
|
|
[106]
|
K. Koumpis, C. Ladas, and S. Renals.
An advanced integrated architecture for wireless voicemail retrieval.
In Proc. 15th IEEE International Conference on Information
Networking, pages 403-410, 2001.
[ bib |
.ps.gz ]
This paper describes an alternative architecture for
voicemail data retrieval on the move. It comprises
three distinct components: a speech recognizer, a
text summarizer and a WAP push service initiator,
enabling mobile users to receive a text summary of
their voicemail in realtime without an explicit
request. Our approach overcomes the cost and usability
limitations of the conventional voicemail retrieval
paradigm which requires a connection establishment in
order to listen to spoken messages. We report
performance results for all components of the
system, which has been trained on a database containing
1843 North American English messages, as well as on the
duration of the corresponding data path.
architecture can be further customized to meet the
requirements of a complete voicemail value-added
service.
|
|
[107]
|
S. Renals and D. Abberley.
The THISL SDR system at TREC-9.
In Proc. Ninth Text Retrieval Conference (TREC-9), 2001.
[ bib |
.ps.gz |
.pdf ]
This paper describes our participation in the TREC-9
Spoken Document Retrieval (SDR) track. The THISL SDR
system consists of a realtime version of a hybrid
connectionist/HMM large vocabulary speech recognition
system and a probabilistic text retrieval system. This
paper describes the configuration of the speech
recognition and text retrieval systems, including
segmentation and query expansion. We report our results
for development tests using the TREC-8 queries, and for
the TREC-9 evaluation.
|
|
[108]
|
H. Christensen, Y. Gotoh, and S. Renals.
Punctuation annotation using statistical prosody models.
In Proc. ISCA Workshop on Prosody in Speech Recognition and
Understanding, Red Bank, NJ, USA, 2001.
[ bib |
.ps.gz |
.pdf ]
This paper is about the development of statistical
models of prosodic features to generate linguistic
meta-data for spoken language. In particular, we are
concerned with automatically punctuating the output of
a broadcast news speech recogniser. We present a
statistical finite state model that combines prosodic,
linguistic and punctuation class features. Experimental
results are presented using the Hub-4 Broadcast News
corpus, and in the light of our results we discuss the
issue of a suitable method of evaluating the present
task.
|
|
[109]
|
K. Koumpis and S. Renals.
The role of prosody in a voicemail summarization system.
In Proc. ISCA Workshop on Prosody in Speech Recognition and
Understanding, Red Bank, NJ, USA, 2001.
[ bib |
.ps.gz |
.pdf ]
When a speaker leaves a voicemail message there are
prosodic cues that emphasize the important points in
the message, in addition to lexical content. In this
paper we compare and visualize the relative
contribution of these two types of features within a
voicemail summarization system. We describe the
system's ability to generate summaries of two test
sets, having trained and validated using 700 messages
from the IBM Voicemail corpus. Results measuring the
quality of summary artifacts show that combined lexical
and prosodic features are at least as robust as
combined lexical features alone across all operating
conditions.
|
|
[110]
|
Y. Gotoh and S. Renals.
Information extraction from broadcast news.
Philosophical Transactions of the Royal Society of London,
Series A, 358:1295-1310, 2000.
[ bib |
.ps.gz |
.pdf ]
This paper discusses the development of trainable
statistical models for extracting content from
television and radio news broadcasts. In particular we
concentrate on statistical finite state models for
identifying proper names and other named entities in
broadcast speech. Two models are presented: the first
models name class information as a word attribute; the
second explicitly models both word-word and class-class
transitions. A common n-gram based formulation is used
for both models. The task of named entity
identification is characterized by relatively sparse
training data and issues related to smoothing are
discussed. Experiments are reported using the
DARPA/NIST Hub-4E evaluation for North American
Broadcast News.
|
|
[111]
|
S. Renals, D. Abberley, D. Kirby, and T. Robinson.
Indexing and retrieval of broadcast news.
Speech Communication, 32:5-20, 2000.
[ bib |
.ps.gz |
.pdf ]
This paper describes a spoken document retrieval (SDR)
system for British and North American Broadcast News.
The system is based on a connectionist large vocabulary
speech recognizer and a probabilistic information
retrieval system. We discuss the development of a
realtime Broadcast News speech recognizer, and its
integration into an SDR system. Two advances were made
for this task: automatic segmentation and statistical
query expansion using a secondary corpus. Precision and
recall results using the Text Retrieval Conference
(TREC) SDR evaluation infrastructure are reported
throughout the paper, and we discuss the application of
these developments to a large scale SDR task based on
an archive of British English broadcast news.
|
|
[112]
|
M. Carreira-Perpiñán and S. Renals.
Practical identifiability of finite mixtures of multivariate
Bernoulli distributions.
Neural Computation, 12:141-152, 2000.
[ bib |
.ps.gz |
.pdf ]
The class of finite mixtures of multivariate Bernoulli
distributions is known to be nonidentifiable, i.e.,
different values of the mixture parameters can
correspond to exactly the same probability
distribution. In principle, this would mean that sample
estimates using this model would give rise to different
interpretations. We give empirical support to the fact
that estimation of this class of mixtures can still
produce meaningful results in practice, thus lessening
the importance of the identifiability problem. We also
show that the EM algorithm is guaranteed to converge to
a proper maximum likelihood estimate, owing to a
property of the log-likelihood surface. Experiments
with synthetic data sets show that an original
generating distribution can be estimated from a sample.
Experiments with an electropalatography (EPG) data set
show important structure in the data.
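(Standard definition, included only to make the model class concrete.) A finite mixture of M multivariate Bernoulli distributions over binary vectors x in {0,1}^D is

\[
p(\mathbf{x}) \;=\; \sum_{m=1}^{M} \pi_m \prod_{d=1}^{D} p_{md}^{\,x_d} (1 - p_{md})^{\,1 - x_d},
\]

and the nonidentifiability discussed above means that distinct parameter settings (\pi, p) can give rise to exactly the same distribution p(\mathbf{x}).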
|
|
[113]
|
K. Koumpis and S. Renals.
Transcription and summarization of voicemail speech.
In Proc. ICSLP, volume 2, pages 688-691, Beijing, 2000.
[ bib |
.ps.gz |
.pdf ]
This paper describes the development of a system to
transcribe and summarize voicemail messages. The
results of the research presented in this paper are
two-fold. First, a hybrid connectionist approach to the
Voicemail transcription task shows that competitive
performance can be achieved using a context-independent
system with fewer parameters than those based on
mixtures of Gaussian likelihoods. Second, an effective
and robust combination of statistical with prior
knowledge sources for term weighting is used to extract
information from the decoder's output in order to
deliver summaries to the message recipients via a GSM
Short Message Service (SMS) gateway.
|
|
[114]
|
Y. Gotoh and S. Renals.
Variable word rate n-grams.
In Proc IEEE ICASSP, pages 1591-1594, Istanbul, 2000.
[ bib |
.ps.gz |
.pdf ]
The rate of occurrence of words is not uniform but
varies from document to document. Despite this
observation, parameters for conventional n-gram
language models are usually derived using the
assumption of a constant word rate. In this paper we
investigate the use of variable word rate assumption,
modelled by a Poisson distribution or a continuous
mixture of Poissons. We present an approach to
estimating the relative frequencies of words or n-grams
taking prior information of their occurrences into
account. Discounting and smoothing schemes are also
considered. Using the Broadcast News task, the approach
demonstrates a reduction in perplexity of up to 10%.
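(A sketch of the modelling assumption, with the standard conjugate choice of mixing distribution; the paper's own estimator may differ.) If a word occurs at rate \lambda in a document, its count c is modelled as Poisson, and integrating \lambda against a Gamma density gives a continuous mixture of Poissons, the negative binomial:

\[
P(c \mid \lambda) = \frac{\lambda^{c} e^{-\lambda}}{c!},
\qquad
P(c) = \int_0^{\infty} P(c \mid \lambda)\, \mathrm{Gamma}(\lambda; \alpha, \beta)\, d\lambda
     = \binom{c + \alpha - 1}{c} \Bigl(\frac{\beta}{\beta + 1}\Bigr)^{\!\alpha} \Bigl(\frac{1}{\beta + 1}\Bigr)^{\!c}.
\]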
|
|
[115]
|
Y. Gotoh and S. Renals.
Sentence boundary detection in broadcast speech transcripts.
In ISCA ITRW: ASR2000, pages 228-235, Paris, 2000.
[ bib |
.ps.gz |
.pdf ]
This paper presents an approach to identifying
sentence boundaries in broadcast speech transcripts. We
describe finite state models that extract sentence
boundary information statistically from text and audio
sources. An n-gram language model is constructed from a
collection of British English news broadcasts and
scripts. An alternative model is estimated from pause
duration information in speech recogniser outputs
aligned with their programme script counterparts.
Experimental results show that the pause duration model
alone outperforms the language modelling approach, and
that combining the two models improves performance
further; precision and recall scores of over 70%
were attained for the task.
|
|
[116]
|
D. Abberley, S. Renals, D. Ellis, and T. Robinson.
The THISL SDR system at TREC-8.
In Proc. Eighth Text Retrieval Conference (TREC-8), 2000.
[ bib |
.ps.gz |
.pdf ]
This paper describes the participation of the THISL
group at the TREC-8 Spoken Document Retrieval (SDR)
track. The THISL SDR system consists of the realtime
version of the Abbot large vocabulary speech
recognition system and the thislIR text retrieval
system. The TREC-8 evaluation assessed SDR performance
on a corpus of 500 hours of broadcast news material
collected over a five month period. The main test
condition involved retrieval of stories defined by
manual segmentation of the corpus, in which non-news
material, such as commercials, was excluded. An
optional test condition required retrieval of
the same stories from the unsegmented audio stream. The
THISL SDR system participated at both test conditions.
The results show that a system such as THISL can
produce respectable information retrieval performance
on a realistically-sized corpus of unsegmented audio
material.
|
|
[117]
|
G. Cook, K. Al-Ghoneim, D. Ellis, E. Fosler-Lussier, Y. Gotoh, B. Kingsbury,
N. Morgan, S. Renals, T. Robinson, and G. Williams.
The SPRACH system for the transcription of broadcast news.
In Proc. DARPA Broadcast News Workshop, pages 161-166, 1999.
[ bib |
.html |
.ps.gz |
.pdf ]
This paper describes the SPRACH system developed for
the 1998 Hub-4E broadcast news evaluation. The system
is based on the connectionist-HMM framework and uses
both recurrent neural network and multi-layer
perceptron acoustic models. We describe both a system
designed for the primary transcription hub, and a
system for the less-than 10 times real-time spoke. We
then describe recent developments to CHRONOS, a
time-first stack decoder. We show how these
developments have simplified the evaluation system, and
led to significant reductions in the error rate of the
10x real-time system. We also present a system designed
to operate in real-time with negligible search error.
|
|
[118]
|
T. Robinson, D. Abberley, D. Kirby, and S. Renals.
Recognition, indexing and retrieval of British broadcast news with
the THISL system.
In Proc. Eurospeech, pages 1067-1070, Budapest, 1999.
[ bib |
.ps.gz |
.pdf ]
This paper describes the THISL spoken document
retrieval system for British and North American
Broadcast News. The system is based on the Abbot large
vocabulary speech recognizer and a probabilistic text
retrieval system. We discuss the development of a
realtime British English Broadcast News system, and its
integration into a spoken document retrieval system.
Detailed evaluation is performed using a similar North
American Broadcast News system, to take advantage of
the TREC SDR evaluation methodology. We report results
on this evaluation, with particular reference to the
effect of query expansion and of automatic segmentation
algorithms.
|
|
[119]
|
Y. Gotoh and S. Renals.
Statistical annotation of named entities in spoken audio.
In Proc. ESCA Workshop on Accessing Information In Spoken
Audio, pages 43-48, Cambridge, 1999.
[ bib |
.ps.gz |
.pdf ]
In this paper we describe a stochastic finite state
model for named entity (NE) identification, based on
explicit word-level n-gram relations. NE categories are
incorporated in the model as word attributes. We
present an overview of the approach, describing how the
extensible vocabulary model may be used for NE
identification. We report development and evaluation
results on a North American Broadcast News task. This
approach resulted in average precision and recall
scores of around 83% on hand transcribed data, and
73% on the SPRACH recogniser output. We also present
an error analysis and a comparison of our approach with
an alternative statistical approach.
|
|
[120]
|
M. Carreira-Perpiñán and S. Renals.
A latent-variable modelling approach to the acoustic-to-articulatory
mapping problem.
In Proc. 14th Int. Congress of Phonetic Sciences, pages
2013-2016, San Francisco, 1999.
[ bib |
.ps.gz |
.pdf ]
We present a latent variable approach to the
acoustic-to-articulatory mapping problem, where
different vocal tract configurations can give rise to
the same acoustics. In latent variable modelling, the
combined acoustic and articulatory data are assumed to
have been generated by an underlying low-dimensional
process. A parametric probabilistic model is estimated
and mappings are derived from the respective
conditional distributions. This has the advantage over
other methods, such as articulatory codebooks or neural
networks, of directly addressing the nonuniqueness
problem. We demonstrate our approach with
electropalatographic and acoustic data from the ACCOR
database.
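For concreteness, here is a small sketch of how a mapping can be read off a joint probabilistic model: if acoustic and articulatory variables are jointly Gaussian (one component of such a latent variable model), the articulatory prediction given the acoustics is the standard conditional-Gaussian mean. The numbers below are arbitrary; this is not the paper's model or data.

    import numpy as np

    # Joint Gaussian over [acoustic x ; articulatory y]; the conditional mean
    # E[y | x] = mu_y + S_yx S_xx^{-1} (x - mu_x) serves as the inversion mapping.
    mu = np.array([0.0, 0.0, 1.0])            # [x1, x2, y]
    cov = np.array([[1.0, 0.2, 0.5],
                    [0.2, 1.0, 0.3],
                    [0.5, 0.3, 1.0]])
    S_xx, S_yx = cov[:2, :2], cov[2:, :2]

    def predict_articulation(x):
        return mu[2:] + S_yx @ np.linalg.solve(S_xx, x - mu[:2])

    print(predict_articulation(np.array([0.5, -0.2])))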
|
|
[121]
|
S. Renals and M. Hochberg.
Start-synchronous search for large vocabulary continuous speech
recognition.
IEEE Trans. on Speech and Audio Processing, 7:542-553, 1999.
[ bib |
.ps.gz |
.pdf ]
In this paper, we present a novel, efficient search
strategy for large vocabulary continuous speech
recognition. The search algorithm, based on a stack
decoder framework, utilizes phone-level posterior
probability estimates (produced by a connectionist/HMM
acoustic model) as a basis for phone deactivation
pruning - a highly efficient method of reducing the
required computation. The single-pass algorithm is
naturally factored into the time-asynchronous
processing of the word sequence and the
time-synchronous processing of the HMM state sequence.
This enables the search to be decoupled from the
language model while still maintaining the
computational benefits of time-synchronous processing.
The incorporation of the language model in the search
is discussed and computationally cheap approximations
to the full language model are introduced. Experiments
were performed on the North American Business News task
using a 60,000 word vocabulary and a trigram language
model. Results indicate that the computational cost of
the search may be reduced by more than a factor of 40
with a relative search error of less than 2% using the
techniques discussed in the paper.
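A minimal sketch of the phone deactivation pruning idea, assuming the connectionist acoustic model supplies a dictionary of phone posteriors for each frame: phones whose posterior falls below a threshold are dropped from the active set for that frame, so the decoder never extends hypotheses through them. The threshold value is illustrative.

    def active_phones(frame_posteriors, threshold=1e-3):
        # frame_posteriors: list of {phone: posterior} dictionaries, one per frame.
        return [{ph for ph, p in frame.items() if p >= threshold}
                for frame in frame_posteriors]

    frames = [{"ah": 0.70, "k": 0.29, "s": 0.0005},
              {"ah": 0.10, "k": 0.85, "s": 0.0100}]
    print(active_phones(frames))   # [{'ah', 'k'}, {'ah', 'k', 's'}]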
|
|
[122]
|
Y. Gotoh, S. Renals, and G. Williams.
Named entity tagged language models.
In Proc IEEE ICASSP, pages 513-516, Phoenix AZ, 1999.
[ bib |
.ps.gz |
.pdf ]
We introduce Named Entity (NE) Language Modelling, a
stochastic finite state machine approach to identifying
both words and NE categories from a stream of spoken
data. We provide an overview of our approach to NE
tagged language model (LM) generation together with
results of the application of such a LM to the task of
out-of-vocabulary (OOV) word reduction in large
vocabulary speech recognition. Using the Wall Street
Journal and Broadcast News corpora, it is shown that
the tagged LM was able to reduce the overall word error
rate by 14%, detecting up to 70% of previously OOV
words. We also describe an example of the direct
tagging of spoken data with NE categories.
|
|
[123]
|
S. Renals and Y. Gotoh.
Integrated transcription and identification of named entities in
broadcast speech.
In Proc. Eurospeech, pages 1039-1042, Budapest, 1999.
[ bib |
.ps.gz |
.pdf ]
This paper presents an approach to integrating
functions for both transcription and named entity (NE)
identification into a large vocabulary continuous
speech recognition system. It builds on NE tagged
language modelling approach, which was recently applied
for development of the statistical NE annotation
system. We also present results for proper name
identification experiment using the Hub-4E open
evaluation data.
|
|
[124]
|
G. Williams and S. Renals.
Confidence measures from local posterior probability estimates.
Computer Speech and Language, 13:395-411, 1999.
[ bib |
.ps.gz |
.pdf ]
In this paper we introduce a set of related confidence
measures for large vocabulary continuous speech
recognition (LVCSR) based on local phone posterior
probability estimates output by an acceptor HMM
acoustic model. In addition to their computational
efficiency, these confidence measures are attractive as
they may be applied at the state-, phone-, word- or
utterance-levels, potentially enabling discrimination
between different causes of low confidence recognizer
output, such as unclear acoustics or mismatched
pronunciation models. We have evaluated these
confidence measures for utterance verification using a
number of different metrics. Experiments reveal several
trends in `profitability of rejection', as measured by
the unconditional error rate of a hypothesis test.
These trends suggest that crude pronunciation models
can mask the relatively subtle reductions in confidence
caused by out-of-vocabulary (OOV) words and
disfluencies, but not the gross model mismatches
elicited by non-speech sounds. The observation that a
purely acoustic confidence measure can provide improved
performance over a measure based upon both acoustic and
language model information for data drawn from the
Broadcast News corpus, but not for data drawn from the
North American Business News corpus, suggests that the
quality of model fit offered by a trigram language
model is reduced for Broadcast News data. We also argue
that acoustic confidence measures may be used to inform
the search for improved pronunciation models.
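As a concrete example of a purely acoustic confidence measure of this kind (a sketch under an assumed data layout, not the paper's code), the duration-normalised log posterior over the frames aligned to a word is close to zero for confidently decoded words and strongly negative otherwise.

    import math

    def word_confidence(aligned_frame_posteriors):
        # aligned_frame_posteriors: local posterior of the decoded phone at each
        # frame spanned by the word (from the acceptor HMM acoustic model).
        logs = [math.log(max(p, 1e-12)) for p in aligned_frame_posteriors]
        return sum(logs) / len(logs)

    print(word_confidence([0.90, 0.80, 0.95, 0.70]))   # ~ -0.18, high confidence
    print(word_confidence([0.20, 0.10, 0.05, 0.30]))   # ~ -2.0, low confidence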
|
|
[125]
|
Y. Gotoh and S. Renals.
Topic-based mixture language modelling.
Journal of Natural Language Engineering, 5:355-375, 1999.
[ bib |
.ps.gz |
.pdf ]
This paper describes an approach for constructing a
mixture of language models based on simple statistical
notions of semantics using probabilistic models
developed for information retrieval. The approach
encapsulates corpus-derived semantic information and is
able to model varying styles of text. Using such
information, the corpus texts are clustered in an
unsupervised manner and a mixture of topic-specific
language models is automatically created. The principal
contribution of this work is to characterise the
document space resulting from information retrieval
techniques and to demonstrate the approach for mixture
language modelling. A comparison is made between manual
and automatic clustering in order to elucidate how the
global content information is expressed in the space.
We also compare (in terms of association with manual
clustering and language modelling accuracy) alternative
term-weighting schemes and the effect of singular
valued decomposition dimension reduction (latent
semantic analysis). Test set perplexity results using
the British National Corpus indicate that the approach
can improve the potential of statistical language
modelling. Using an adaptive procedure, the
conventional model may be tuned to track text data with
a slight increase in computational cost.
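The mixture construction can be summarised in a few lines (a sketch with assumed interfaces, not the paper's implementation): the probability of a word given its history is a weighted sum of topic-specific n-gram probabilities, and the adaptive procedure mentioned above amounts to re-estimating the topic weights as text is observed.

    def mixture_prob(word, history, topic_models, topic_weights):
        # topic_models: callables p_t(word, history); topic_weights sum to one.
        return sum(w * p(word, history)
                   for p, w in zip(topic_models, topic_weights))

    # Toy usage with two constant "topic models".
    sports = lambda w, h: 0.02 if w == "goal" else 0.001
    finance = lambda w, h: 0.0005 if w == "goal" else 0.002
    print(mixture_prob("goal", (), [sports, finance], [0.7, 0.3]))  # 0.01415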
|
|
[126]
|
S. Renals, D. Abberley, D. Kirby, and T. Robinson.
The THISL system for indexing and retrieval of broadcast news.
In Proc. IEEE Workshop on Multimedia Signal Processing, pages
77-82, Copenhagen, 1999.
[ bib |
http |
.ps.gz |
.pdf ]
This paper describes the THISL news retrieval system
which maintains an archive of BBC radio and television
news recordings. The system uses the Abbot large
vocabulary continuous speech recognition system to
transcribe news broadcasts, and the thislIR text
retrieval system to index and access the transcripts.
Decoding and indexing is performed automatically, and
the archive is updated with three hours of new material
every day. A web-based interface to the retrieval
system has been devised to facilitate access to the
archive.
|
|
[127]
|
D. Abberley, D. Kirby, S. Renals, and T. Robinson.
The THISL broadcast news retrieval system.
In Proc. ESCA Workshop on Accessing Information In Spoken
Audio, pages 19-24, Cambridge, 1999.
[ bib |
http |
.ps.gz |
.pdf ]
This paper describes the THISL spoken document
retrieval system for British and North American
Broadcast News. The system is based on the
Abbot large vocabulary speech recognizer,
using a recurrent network acoustic model, and a
probabilistic text retrieval system. We discuss the
development of a realtime British English Broadcast
News system, and its integration into a spoken document
retrieval system. Detailed evaluation is performed
using a similar North American Broadcast News system,
to take advantage of the TREC SDR evaluation
methodology. We report results on this evaluation, with
particular reference to the effect of query expansion
and of automatic segmentation algorithms.
|
|
[128]
|
S. Renals, Y. Gotoh, R. Gaizauskas, and M. Stevenson.
The SPRACH/LaSIE system for named entity identification in
broadcast news.
In Proc. DARPA Broadcast News Workshop, pages 47-50, 1999.
[ bib |
.html |
.ps.gz |
.pdf ]
We have developed two conceptually different systems
that are able to identify named entities from spoken
audio. One (referred to as SPRACH-S) has a stochastic
finite state machine structure for use with an acoustic
model that identifies both words and named entities
from speech data. The other (referred to as SPRACH-R)
is a rule-based system which uses matching against
stored name lists, part-of-speech tagging, and light
phrasal parsing with specialised named entity grammars.
We provide an overview of the two approaches and
present results on the Hub-4E IE-NE evaluation task.
|
|
[129]
|
D. Abberley, S. Renals, G. Cook, and T. Robinson.
Retrieval of broadcast news documents with the THISL system.
In Proc. Seventh Text Retrieval Conference (TREC-7), pages
181-190, 1999.
[ bib |
.ps.gz |
.pdf ]
This paper describes the THISL system that
participated in the TREC-7 evaluation, Spoken Document
Retrieval (SDR) Track, and presents the results
obtained, together with some analysis. The THISL system
is based on the Abbot speech recognition system
and the thislIR text retrieval system. In this
evaluation we were concerned with investigating the
suitability for SDR of a recognizer running at less
than ten times realtime, the use of multiple
transcriptions and word graphs, the effect of simple
query expansion algorithms and the effect of varying
standard IR parameters.
|
|
[130]
|
D. Abberley, S. Renals, and G. Cook.
Retrieval of broadcast news documents with the THISL system.
In Proc IEEE ICASSP, pages 3781-3784, Seattle, 1998.
[ bib |
.ps.gz |
.pdf ]
This paper describes a spoken document retrieval
system, combining the Abbot large vocabulary continuous
speech recognition (LVCSR) system developed by
Cambridge University, Sheffield University and
SoftSound, and the PRISE information retrieval engine
developed by NIST. The system was constructed to enable
us to participate in the TREC 6 Spoken Document
Retrieval experimental evaluation. Our key aims in this
work were to produce a complete system for the SDR
task, to investigate the effect of a word error rate of
30-50% on retrieval performance and to investigate the
integration of LVCSR and word spotting in a retrieval
task.
|
|
[131]
|
S. Renals and D. Abberley.
The THISL spoken document retrieval system.
In Proc. 14th Twente Workshop on Language Technology, pages
129-140, 1998.
[ bib |
.ps.gz |
.pdf ]
THISL is an ESPRIT Long Term Research Project focused
on the development and construction of a system to
retrieve items from an archive of television and radio
news broadcasts. In this paper we outline our spoken
document retrieval system based on the Abbot speech
recognizer and a text retrieval system based on Okapi
term-weighting. The system has been evaluated as part
of the TREC-6 and TREC-7 spoken document retrieval
evaluations and we report on the results of the TREC-7
evaluation based on a document collection of 100 hours
of North American broadcast news.
|
|
[132]
|
M. Carreira-Perpiñán and S. Renals.
Experimental evaluation of latent variable models for dimensionality
reduction.
In IEEE Proc. Neural Networks for Signal Processing, volume 8,
pages 165-173, Cambridge, 1998.
[ bib |
.ps.gz |
.pdf ]
We use electropalatographic (EPG) data as a test bed
for dimensionality reduction methods based in latent
variable modelling, in which an underlying lower
dimension representation is inferred directly from the
data. Several models (and mixtures of them) are
investigated, including factor analysis and the
generative topographic mapping (GTM). Experiments
indicate that nonlinear latent variable modelling
reveals a low-dimensional structure in the data
inaccessible to the investigated linear models.
|
|
[133]
|
J. Barker, G. Williams, and S. Renals.
Acoustic confidence measures for segmenting broadcast news.
In Proc. ICSLP, pages 2719-2722, Sydney, 1998.
[ bib |
.ps.gz |
.pdf ]
In this paper we define an acoustic confidence measure
based on the estimates of local posterior probabilities
produced by a HMM/ANN large vocabulary continuous
speech recognition system. We use this measure to
segment continuous audio into regions where it is and
is not appropriate to expend recognition effort. The
segmentation is computationally inexpensive and
provides reductions in both overall word error rate and
decoding time. The technique is evaluated using
material from the Broadcast News corpus.
|
|
[134]
|
D. Abberley, S. Renals, G. Cook, and T. Robinson.
The 1997 THISL spoken document retrieval system.
In Proc. Sixth Text Retrieval Conference (TREC-6), pages
747-752, 1998.
[ bib |
.ps.gz |
.pdf ]
The THISL spoken document retrieval system is based on
the Abbot Large Vocabulary Continuous Speech
Recognition (LVCSR) system developed by Cambridge
University, Sheffield University and SoftSound, and
uses PRISE (NIST) for indexing and retrieval. We
participated in full SDR mode. Our approach was to
transcribe the spoken documents at the word level using
Abbot, indexing the resulting text transcriptions using
PRISE. The LVCSR system uses a recurrent network-based
acoustic model (with no adaptation to different
conditions) trained on the 50 hour Broadcast News
training set, a 65,000 word vocabulary and a trigram
language model derived from Broadcast News text. Words
in queries which were out-of-vocabulary (OOV) were word
spotted at query time (utilizing the posterior phone
probabilities output by the acoustic model), added to
the transcriptions of the relevant documents and the
collection was then re-indexed. We generated
pronunciations at run-time for OOV words using the
Festival TTS system (University of Edinburgh).
|
|
[135]
|
G. Williams and S. Renals.
Confidence measures derived from an acceptor HMM.
In Proc. ICSLP, pages 831-834, Sydney, 1998.
[ bib |
.ps.gz |
.pdf ]
In this paper we define a number of confidence
measures derived from an acceptor HMM and evaluate
their performance for the task of utterance
verification using the North American Business News
(NAB) and Broadcast News (BN) corpora. Results are
presented for decodings made at both the word and phone
level which show the relative profitability of
rejection provided by the diverse set of confidence
measures. The results indicate that language model
dependent confidence measures have reduced performance
on BN data relative to that for the more grammatically
constrained NAB data. An explanation linking the
observations that rejection is more profitable for
noisy acoustics, for a reduced vocabulary and at the
phone level is also given.
|
|
[136]
|
M. Carreira-Perpiñán and S. Renals.
Dimensionality reduction of electropalatographic data using latent
variable models.
Speech Communication, 26:259-282, 1998.
[ bib |
.ps.gz |
.pdf ]
We consider the problem of obtaining a reduced
dimension representation of electropalatographic (EPG)
data. An unsupervised learning approach based on latent
variable modelling is adopted, in which an underlying
lower dimension representation is inferred directly
from the data. Several latent variable models are
investigated, including factor analysis and the
generative topographic mapping (GTM). Experiments were
carried out using a subset of the EUR-ACCOR database,
and the results indicate that these automatic methods
capture important, adaptive structure in the EPG data.
Nonlinear latent variable modelling clearly outperforms
the investigated linear models in terms of
log-likelihood and reconstruction error and suggests a
substantially smaller intrinsic dimensionality for the
EPG data than that claimed by previous studies. A
two-dimensional representation is produced with
applications to speech therapy, language learning and
articulatory dynamics.
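The linear part of this comparison can be reproduced in a few lines with off-the-shelf tools (a sketch assuming scikit-learn and random stand-in data, since the EUR-ACCOR EPG frames are not reproduced here); nonlinear models such as GTM would need a separate implementation.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    X = rng.random((500, 62))                  # stand-in for EPG frames

    for dim in (2, 5, 10):
        fa = FactorAnalysis(n_components=dim).fit(X)
        Z = fa.transform(X)                    # latent representation
        X_hat = Z @ fa.components_ + fa.mean_  # linear reconstruction
        rmse = np.sqrt(((X - X_hat) ** 2).mean())
        print(dim, round(fa.score(X), 3), round(rmse, 4))  # log-lik., recon. error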
|
|
[137]
|
G. Williams and S. Renals.
Confidence measures for evaluating pronunciation models.
In ESCA Workshop on Modeling pronunciation variation for
automatic speech recognition, pages 151-155, Kerkrade, Netherlands, 1998.
[ bib |
.ps.gz |
.pdf ]
In this paper, we investigate the use of confidence
measures for the evaluation of pronunciation models and
the employment of these evaluations in an automatic
baseform learning process. The confidence measures and
pronunciation models are obtained from the Abbot hybrid
Hidden Markov Model/Artificial Neural Network Large
Vocabulary Continuous Speech Recognition system.
Experiments were carried out for a number of baseform
learning schemes using the ARPA North American Business
News and the Broadcast News corpora from which it was
found that a confidence measure based scheme provided
the largest reduction in Word Error Rate.
|
|
[138]
|
J. Hennebert, C. Ris, H. Bourlard, S. Renals, and N. Morgan.
Estimation of global posteriors and forward-backward training of
hybrid HMM/ANN systems.
In Proc. Eurospeech, pages 1951-1954, Rhodes, 1997.
[ bib |
.ps.gz |
.pdf ]
The results of our research presented in this paper
are two-fold. First, an estimation of global
posteriors is formalized in the framework of hybrid
HMM/ANN systems. It is shown that hybrid HMM/ANN
systems, in which the ANN part estimates local
posteriors, can be used to model global posteriors. This
formalization provides us with a clear theory in which
both REMAP and “classical” Viterbi trained hybrid
systems are unified. Second, a new forward-backward
training of hybrid HMM/ANN systems is derived from the
previous formulation. Comparisons of performance
between Viterbi and forward-backward hybrid systems are
presented and discussed.
|
|
[139]
|
G. Williams and S. Renals.
Confidence measures for hybrid HMM/ANN speech recognition.
In Proc. Eurospeech, pages 1955-1958, Rhodes, 1997.
[ bib |
.ps.gz |
.pdf ]
In this paper we introduce four acoustic confidence
measures which are derived from the output of a hybrid
HMM/ANN large vocabulary continuous speech recognition
system. These confidence measures, based on local
posterior probability estimates computed by an ANN, are
evaluated at both phone and word levels, using the
North American Business News corpus.
|
|
[140]
|
Y. Gotoh and S. Renals.
Document space models using latent semantic analysis.
In Proc. Eurospeech, pages 1443-1446, Rhodes, 1997.
[ bib |
.ps.gz |
.pdf ]
In this paper, an approach for constructing mixture
language models (LMs) based on some notion of semantics
is discussed. To this end, a technique known as latent
semantic analysis (LSA) is used. The approach
encapsulates corpus-derived semantic information and is
able to model the varying style of the text. Using such
information, the corpus texts are clustered in an
unsupervised manner and mixture LMs are automatically
created. This work builds on previous work in the field
of information retrieval which was recently applied by
Bellegarda et al. to the problem of clustering words
by semantic categories. The principal contribution of
this work is to characterize the document space
resulting from the LSA modeling and to demonstrate the
approach for mixture LM application. Comparison is made
between manual and automatic clustering in order to
elucidate how the semantic information is expressed in
the space. It is shown that, using semantic
information, mixture LMs perform better than a
conventional single LM, with a slight increase in
computational cost.
|
|
[141]
|
B. L. Karlsen, G. J. Brown, M. Cooke, P. Green, and S. Renals.
Analysis of a simultaneous speaker sound corpus.
In D. F. Rosenthal and H. G. Okuno, editors, Computational
Auditory Scene Analysis, pages 321-334. Lawrence Erlbaum Associates, 1997.
[ bib ]
|
|
[142]
|
S. Renals.
Phone deactivation pruning in large vocabulary continuous speech
recognition.
IEEE Signal Processing Letters, 3:4-6, 1996.
[ bib |
.ps.gz ]
In this letter we introduce a new pruning strategy for
large vocabulary continuous speech recognition based on
direct estimates of local posterior phone
probabilities. This approach is well suited to hybrid
connectionist/hidden Markov model systems. Experiments
on the Wall Street Journal task using a 20,000 word
vocabulary and a trigram language model have
demonstrated that phone deactivation pruning can
increase the speed of recognition-time search by up to
a factor of 10, with a relative increase in error rate
of less than 2%.
|
|
[143]
|
D. Kershaw, T. Robinson, and S. Renals.
The 1995 Abbot LVCSR system for multiple unknown microphones.
In Proc. ICSLP, pages 1325-1328, Philadelphia PA, 1996.
[ bib ]
|
|
[144]
|
T. Robinson, M. Hochberg, and S. Renals.
The use of recurrent networks in continuous speech recognition.
In C.-H. Lee, K. K. Paliwal, and F. K. Soong, editors, Automatic
Speech and Speaker Recognition - Advanced Topics, pages 233-258. Kluwer
Academic Publishers, 1996.
[ bib |
.ps.gz ]
This chapter describes the use of recurrent neural
networks (i.e., feedback is incorporated in the
computation) as an acoustic model for continuous speech
recognition. The form of the recurrent neural network
is described, along with an appropriate parameter
estimation procedure. For each frame of acoustic data,
the recurrent network generates an estimate of the
posterior probability of the possible phones given the
observed acoustic signal. The posteriors are then
converted into scaled likelihoods and used as the
observation probabilities within a conventional
decoding paradigm (e.g., Viterbi decoding). The
advantages of using recurrent networks are that
they require a small number of parameters and provide a
fast decoding capability (relative to conventional
large vocabulary HMM systems).
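The posterior-to-likelihood conversion mentioned above has a simple form; the sketch below (illustrative values, not from the chapter) divides each network output P(q|x) by the phone prior P(q), giving a quantity proportional to p(x|q) that can be used as the observation probability in Viterbi decoding.

    import numpy as np

    def scaled_likelihoods(posteriors, priors, floor=1e-8):
        # posteriors: (frames, phones) network outputs; priors: (phones,)
        # relative frequencies of the phones in the training alignment.
        return posteriors / np.maximum(priors, floor)

    post = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1]])
    priors = np.array([0.5, 0.3, 0.2])
    print(scaled_likelihoods(post, priors))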
|
|
[145]
|
S. Renals and M. Hochberg.
Efficient evaluation of the LVCSR search space using the NOWAY
decoder.
In Proc IEEE ICASSP, pages 149-152, Atlanta, 1996.
[ bib |
.ps.gz ]
This work further develops and analyses the large
vocabulary continuous speech recognition search
strategy reported at ICASSP-95. In particular, the
posterior-based phone deactivation pruning approach has
been extended to include phone-dependent thresholds and
an improved estimate of the least upper bound on the
utterance log-probability has been developed. Analysis
of the pruning procedures and of the search's
interaction with the language model has also been
performed. Experiments were carried out using the ARPA
North American Business News task with a 20,000 word
vocabulary and a trigram language model. As a result of
these improvements and analyses, the computational cost
of the recognition process performed by the Noway
decoder has been substantially reduced.
|
|
[146]
|
D. Kershaw, T. Robinson, and S. Renals.
The 1995 Abbot hybrid connectionist-HMM large vocabulary
recognition system.
In Proc. ARPA Spoken Language Technology Conference, pages
93-99, 1996.
[ bib ]
|
|
[147]
|
M. Hochberg, G. Cook, S. Renals, T. Robinson, and R. Schechtman.
The 1994 Abbot hybrid connectionist-HMM large vocabulary
recognition system.
In Proc. ARPA Spoken Language Technology Workshop, pages
170-175, 1995.
[ bib |
.ps.gz ]
|
|
[148]
|
T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals.
WSJCAM0: A British English speech corpus for large vocabulary
continuous speech recognition.
In Proc IEEE ICASSP, pages 81-84, Detroit, 1995.
[ bib ]
|
|
[149]
|
S. Renals and M. Hochberg.
Efficient search using posterior phone probability estimates.
In Proc IEEE ICASSP, pages 596-599, Detroit, 1995.
[ bib |
.ps.gz ]
In this paper we present a novel, efficient search
strategy for large vocabulary continuous speech
recognition (LVCSR). The search algorithm, based on
stack decoding, uses posterior phone probability
estimates to substantially increase its efficiency with
minimal effect on accuracy. In particular, the search
space is dramatically reduced by phone deactivation
pruning where phones with a small local posterior
probability are deactivated. This approach is
particularly well-suited to hybrid connectionist/hidden
Markov model systems because posterior phone
probabilities are directly computed by the acoustic
model. On large vocabulary tasks, using a trigram
language model, this increased the search speed by an
order of magnitude, with 2% or less relative search
error. Results from a hybrid system are presented using
the Wall Street Journal LVCSR database for a 20,000
word task using a backed-off trigram language model.
For this task, our single-pass decoder took around 15
times realtime on an HP735 workstation. At the cost of
7% relative search error, decoding time can be speeded
up to approximately realtime.
|
|
[150]
|
J. Neto, L. Almeida, M. Hochberg, C. Martins, L. Nunes, S. Renals, and
T. Robinson.
Speaker adaptation for hybrid HMM-ANN continuous speech recognition
system.
In Proc. Eurospeech, pages 2171-2174, Madrid, 1995.
[ bib |
.ps.gz ]
It is well known that recognition performance degrades
significantly when moving from a speaker-dependent to
a speaker-independent system. Traditional hidden Markov
model (HMM) systems have successfully applied
speaker-adaptation approaches to reduce this
degradation. In this paper we present and evaluate some
techniques for speaker-adaptation of a hybrid
HMM-artificial neural network (ANN) continuous speech
recognition system. These techniques are applied to a
well trained, speaker-independent, hybrid HMM-ANN
system and the recognizer parameters are adapted to a
new speaker through off-line procedures. The techniques
are evaluated on the DARPA RM corpus using varying
amounts of adaptation material and different ANN
architectures. The results show that speaker-adaptation
within the hybrid framework can substantially improve
system performance.
|
|
[151]
|
M. Hochberg, S. Renals, T. Robinson, and G. Cook.
Recent improvements to the Abbot large vocabulary CSR system.
In Proc IEEE ICASSP, pages 69-72, Detroit, 1995.
[ bib |
.ps.gz ]
ABBOT is the hybrid connectionist-hidden Markov model
(HMM) large-vocabulary continuous speech recognition
(CSR) system developed at Cambridge University. This
system uses a recurrent network to estimate the
acoustic observation probabilities within an HMM
framework. A major advantage of this approach is that
good performance is achieved using context-independent
acoustic models and requiring many fewer parameters
than comparable HMM systems. This paper presents
substantial performance improvements gained from new
approaches to connectionist model combination and
phone-duration modeling. Additional capability has also
been achieved by extending the decoder to handle larger
vocabulary tasks (20,000 words and greater) with a
trigram language model. This paper describes the recent
modifications to the system and experimental results
are reported for various test and development sets from
the November 1992, 1993, and 1994 ARPA evaluations of
spoken language systems.
|
|
[152]
|
M. Hochberg, S. Renals, and T. Robinson.
Abbot: The CUED hybrid connectionist/HMM large vocabulary
recognition system.
In Proc. ARPA Spoken Language Technology Workshop, pages
102-105, 1994.
[ bib ]
|
|
[153]
|
N. Morgan, H. Bourlard, S. Renals, M. Cohen, and H. Franco.
Hybrid neural network/hidden Markov model systems for continuous
speech recognition.
In I. Guyon and P. S. P. Wang, editors, Advances in Pattern
Recognition Systems using Neural Networks Technologies, volume 7 of
Series in Machine Perception and Artificial Intelligence. World Scientific
Publications, 1994.
[ bib ]
|
|
[154]
|
M. Hochberg, S. Renals, T. Robinson, and D. Kershaw.
Large vocabulary continuous speech recognition using a hybrid
connectionist/HMM system.
In Proc. ICSLP, pages 1499-1502, Yokohama, 1994.
[ bib ]
|
|
[155]
|
S. Renals, N. Morgan, H. Bourlard, M. Cohen, and H. Franco.
Connectionist probability estimators in HMM speech recognition.
IEEE Trans. on Speech and Audio Processing, 2:161-175, 1994.
[ bib |
.ps.gz ]
We are concerned with integrating connectionist
networks into a hidden Markov model (HMM) speech
recognition system. This is achieved through a
statistical interpretation of connectionist networks as
probability estimators. We review the basis of HMM
speech recognition and point out the possible benefits
of incorporating connectionist networks. Issues
necessary to the construction of a connectionist HMM
recognition system are discussed, including choice of
connectionist probability estimator. We describe the
performance of such a system, using a multi-layer
perceptron probability estimator, evaluated on the
speaker-independent DARPA Resource Management database.
In conclusion, we show that a connectionist component
improves a state-of-the-art HMM system.
|
|
[156]
|
S. Renals, M. Hochberg, and T. Robinson.
Learning temporal dependencies in connectionist speech recognition.
In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances
in Neural Information Processing Systems, volume 6, pages 1051-1058. Morgan
Kaufmann, 1994.
[ bib |
.ps.gz |
.pdf ]
|
|
[157]
|
S. Renals and M. Hochberg.
Using Gamma filters to model temporal dependencies in speech.
In Proc. ICSLP, pages 1491-1494, Yokohama, 1994.
[ bib |
.ps.gz ]
|
|
[158]
|
T. Robinson, M. Hochberg, and S. Renals.
IPA: Improved phone modelling with recurrent neural networks.
In Proc IEEE ICASSP, pages 37-40, Adelaide, 1994.
[ bib ]
|
|
[159]
|
M. Hochberg, G. Cook, S. Renals, and T. Robinson.
Connectionist model combination for large vocabulary speech
recognition.
In IEEE Proc. Neural Networks for Signal Processing, volume 4,
pages 269-278, 1994.
[ bib |
.ps.gz ]
|
|
[160]
|
N. Morgan, H. Bourlard, S. Renals, M. Cohen, and H. Franco.
Hybrid neural network/hidden Markov model systems for continuous
speech recognition.
Intl. J. Pattern Recog. and Artific. Intell., 7:899-916, 1993.
[ bib ]
|
|
[161]
|
A. J. Robinson, L. Almeida, J.-M. Boite, H. Bourlard, F. Fallside, M. Hochberg,
D. Kershaw, P. Kohn, Y. Konig, N. Morgan, J. P. Neto, S. Renals, M. Saerens,
and C. Wooters.
A neural network based, speaker independent, large vocabulary,
continuous speech recognition system: the Wernicke project.
In Proc. Eurospeech, pages 1941-1944, Berlin, 1993.
[ bib ]
|
|
[162]
|
S. Renals and D. MacKay.
Bayesian regularisation methods in a hybrid MLP-HMM system.
In Proc. Eurospeech, pages 1719-1722, Berlin, 1993.
[ bib |
.ps.gz ]
|
|
[163]
|
H. Bourlard, N. Morgan, and S. Renals.
Neural nets and hidden Markov models: Review and generalizations.
Speech Communication, 11:237-246, 1992.
[ bib ]
|
|
[164]
|
S. Renals, N. Morgan, M. Cohen, and H. Franco.
Connectionist probability estimation in the Decipher speech
recognition system.
In Proc IEEE ICASSP, pages 601-604, San Francisco, 1992.
[ bib |
.ps.gz ]
|
|
[165]
|
S. Renals, H. Bourlard, N. Morgan, H. Franco, and M. Cohen.
Connectionist optimisation of tied mixture hidden Markov models.
In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors,
Advances in Neural Information Processing Systems, volume 4, pages 167-174.
Morgan Kaufmann, 1992.
[ bib ]
|
|
[166]
|
S. Renals, N. Morgan, M. Cohen, H. Franco, and H. Bourlard.
Improving statistical speech recognition.
In Proc. IJCNN, volume 2, pages 301-307, Baltimore MD, 1992.
[ bib |
.ps.gz ]
|
|
[167]
|
H. Bourlard, N. Morgan, C. Wooters, and S. Renals.
CDNN: A context-dependent neural network for continuous speech
recognition.
In Proc IEEE ICASSP, pages 349-352, San Francisco, 1992.
[ bib ]
|
|
[168]
|
S. Renals, D. McKelvie, and F. McInnes.
A comparative study of continuous speech recognition using neural
networks and hidden Markov models.
In Proc IEEE ICASSP, pages 369-372, Toronto, 1991.
[ bib ]
|
|
[169]
|
S. Renals, N. Morgan, and H. Bourlard.
Probability estimation by feed-forward networks in continuous speech
recognition.
In IEEE Proc. Neural Networks for Signal Processing, pages
309-318, Princeton NJ, 1991.
[ bib |
.ps.gz ]
|
|
[170]
|
S. Renals.
Chaos in neural networks.
In L. B. Almeida and C. J. Wellekens, editors, Neural Networks,
number 412 in Lecture Notes in Computer Science, pages 90-99.
Springer-Verlag, 1990.
[ bib ]
|
|
[171]
|
S. Renals and R. Rohwer.
A study of network dynamics.
J. Stat. Phys., 58:825-847, 1990.
[ bib ]
|
|
[172]
|
S. Renals and R. Rohwer.
Phoneme classification experiments using radial basis functions.
In Proc. IJCNN, pages 461-468, Washington DC, 1989.
[ bib ]
|
|
[173]
|
S. Renals and R. Rohwer.
Neural networks for speech pattern classification.
In IEE Conference Publication 313, 1st IEE Conference on
Artificial Neural Networks, pages 292-296, London, 1989.
[ bib ]
|
|
[174]
|
S. Renals and R. Rohwer.
Learning phoneme recognition using neural networks.
In Proc IEEE ICASSP, pages 413-416, Glasgow, 1989.
[ bib ]
|
|
[175]
|
S. Renals and J. Dalby.
Analysis of a neural network model for speech recognition.
In Proc. Eurospeech, volume 1, pages 333-336, Paris, 1989.
[ bib ]
|
|
[176]
|
R. Rohwer and S. Renals.
Training recurrent networks.
In L. Personnaz and G. Dreyfus, editors, Neural networks from
models to applications (Proc. nEuro '88), pages 207-216, Paris, 1988.
I.D.S.E.T.
[ bib ]
|
|
[177]
|
M. Terry, S. Renals, R. Rohwer, and J. Harrington.
A connectionist approach to speech recognition using peripheral
auditory modelling.
In Proc IEEE ICASSP, pages 699-702, New York, 1988.
[ bib ]
|
|
[178]
|
S. Renals, R. Rohwer, and M. Terry.
A comparison of speech recognition front ends using a connectionist
classifier.
In Proc. FASE Speech '88, pages 1381-1388, Edinburgh, 1988.
[ bib ]
|
|
[179]
|
R. Rohwer, S. Renals, and M. Terry.
Unstable connectionist networks in speech recognition.
In Proc IEEE ICASSP, pages 426-428, New York, 1988.
[ bib ]
|
|
[180]
|
S. Renals.
Radial basis functions network for speech pattern classification.
Electronics Letters, 25:437-439, 1988.
[ bib ]
|