|
[1]
|
Dong Wang, Javier Tejedor, Simon King, and Joe Frankel.
Term-dependent confidence normalization for out-of-vocabulary spoken
term detection.
Journal of Computer Science and Technology, 27(2), 2012.
Spoken Term Detection (STD) is a fundamental component
of spoken information retrieval systems. A key task of
an STD system is to determine reliable detections and
reject false alarms based on certain confidence
measures. The detection posterior probability, which is
often computed from lattices, is a widely used
confidence measure. However, a potential problem of
this confidence measure is that the confidence scores
of detections of all search terms are treated
uniformly, regardless of how much they may differ in
terms of phonetic or linguistic properties. This
problem is particularly evident for out-of-vocabulary
(OOV) terms which tend to exhibit high intra-term
diversity. To address the discrepancy on confidence
levels that the same confidence score may convey for
different terms, a term-dependent decision strategy is
desirable – for example, the term-specific threshold
(TST) approach. In this work, we propose a
term-dependent normalisation technique which
compensates for term diversity on confidence
estimation. Particularly, we propose a linear bias
compensation and a discriminative compensation to deal
with the bias problem that is inherent in lattice-based
confidence measuring from which the TST approach
suffers. We tested the proposed technique on speech
data from the multi-party meeting domain with two
state-of-the-art STD systems based on phonemes and
words respectively. The experimental results
demonstrate that the confidence normalisation approach
leads to a significant performance improvement in STD,
particularly for OOV terms with phoneme-based systems.
|
|
[2]
|
Dong Wang, Nicholas Evans, Raphael Troncy, and Simon King.
Handling overlaps in spoken term detection.
In Proc. International Conference on Acoustics, Speech and
Signal Processing, pages 5656-5659, May 2011.
Spoken term detection (STD) systems usually arrive at
many overlapping detections which are often addressed
with some pragmatic approaches, e.g. choosing the best
detection to represent all the overlaps. In this paper
we present a theoretical study based on a concept of
acceptance space. In particular, we present two
confidence estimation approaches based on Bayesian and
evidence perspectives respectively. Analysis shows that
both approaches possess respective advantages and
shortcomings, and that their combination has the
potential to provide an improved confidence estimation.
Experiments conducted on meeting data confirm our
analysis and show considerable performance improvement
with the combined approach, in particular for
out-of-vocabulary spoken term detection with stochastic
pronunciation modeling.
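The pragmatic baseline mentioned in this abstract, keeping only the best detection among a set of overlaps, might be sketched in Python as follows. The `Detection` tuple, its fields, and the greedy time-overlap grouping are assumptions made for illustration only; they are not taken from the paper, and the Bayesian and evidence-based estimators it proposes are not shown.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """Hypothetical detection record for a single search term."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    score: float  # confidence score, higher is better


def keep_best_of_overlaps(detections: List[Detection]) -> List[Detection]:
    """Group time-overlapping detections and keep the top-scoring one per group."""
    best: List[Detection] = []
    group_end = None  # latest end time seen in the current overlap group
    for det in sorted(detections, key=lambda d: d.start):
        if group_end is not None and det.start < group_end:
            # Overlaps the current group: extend it and keep the better score.
            group_end = max(group_end, det.end)
            if det.score > best[-1].score:
                best[-1] = det
        else:
            # No overlap: this detection starts a new group.
            best.append(det)
            group_end = det.end
    return best
```

For example, two overlapping detections with scores 0.4 and 0.9 collapse to the single detection scored 0.9, while a later non-overlapping detection is kept unchanged.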
|
|
[3]
|
Dong Wang and Simon King.
Letter-to-sound pronunciation prediction using conditional random
fields.
IEEE Signal Processing Letters, 18(2):122-125, February 2011.
Pronunciation prediction, or letter-to-sound (LTS)
conversion, is an essential task for speech synthesis,
open-vocabulary spoken term detection and other
applications dealing with novel words. Most current
approaches (at least for English) employ data-driven
methods to learn and represent pronunciation “rules”
using statistical models such as decision trees, hidden
Markov models (HMMs) or joint-multigram models (JMMs).
The LTS task remains challenging, particularly for
languages with a complex relationship between spelling
and pronunciation such as English. In this paper, we
propose to use a conditional random field (CRF) to
perform LTS because it avoids having to model a
distribution over observations and can perform global
inference, suggesting that it may be more suitable for
LTS than decision trees, HMMs or JMMs. One challenge in
applying CRFs to LTS is that the phoneme and grapheme
sequences of a word are generally of different lengths,
which makes CRF training difficult. To solve this
problem, we employed a joint-multigram model to
generate aligned training exemplars. Experiments
conducted with the AMI05 dictionary demonstrate that a
CRF significantly outperforms other models, especially
if n-best lists of predictions are generated.
|
|
[4]
|
Dong Wang, Simon King, Nick Evans, and Raphael Troncy.
Direct posterior confidence for out-of-vocabulary spoken term
detection.
In Proc. ACM Multimedia 2010 Searching Spontaneous
Conversational Speech Workshop, October 2010.
Spoken term detection (STD) is a fundamental task in
spoken information retrieval. Compared to conventional
speech transcription and keyword spotting, STD is an
open-vocabulary task and is necessarily required to
address out-of-vocabulary (OOV) terms. Approaches based
on subword units, e.g. phonemes, are widely used to
solve the OOV issue; however, performance on OOV terms
is still significantly inferior to that for
in-vocabulary (INV) terms. The performance degradation
on OOV terms can be attributed to a multitude of
factors. A particular factor we address in this paper
is that the acoustic and language models used for
speech transcribing are highly vulnerable to OOV terms,
which leads to unreliable confidence measures and
error-prone detections. A direct posterior confidence
measure that is derived from discriminative models has
been proposed for STD. In this paper, we utilize this
technique to tackle the weakness of OOV terms in
confidence estimation. Neither acoustic models nor
language models being included in the computation, the
new confidence avoids the weak modeling problem with
OOV terms. Our experiments, set up on multi-party
meeting speech which is highly spontaneous and
conversational, demonstrate that the proposed technique
improves STD performance on OOV terms significantly;
when combined with conventional lattice-based
confidence, a significant improvement in performance is
obtained on both INVs and OOVs. Furthermore, the new
confidence measure technique can be combined together
with other advanced techniques for OOV treatment, such
as stochastic pronunciation modeling and term-dependent
confidence discrimination, which leads to an integrated
solution for OOV STD with greatly improved performance.
|
|
[5]
|
Dong Wang, Simon King, Nicholas W. D. Evans, and Raphaël Troncy.
Direct posterior confidence for out-of-vocabulary spoken term
detection.
In SSCS 2010, ACM Workshop on Searching Spontaneous
Conversational Speech, September 20-24, 2010, Firenze, Italy.
|
|
[6]
|
Dong Wang, Simon King, Nick Evans, and Raphael Troncy.
CRF-based stochastic pronunciation modelling for out-of-vocabulary
spoken term detection.
In Proc. Interspeech, Makuhari, Chiba, Japan, September 2010.
Out-of-vocabulary (OOV) terms present a significant
challenge to spoken term detection (STD). This
challenge, to a large extent, lies in the high degree
of uncertainty in pronunciations of OOV terms. In
previous work, we presented a stochastic pronunciation
modeling (SPM) approach to compensate for this
uncertainty. A shortcoming of our original work,
however, is that the SPM was based on a joint-multigram
model (JMM), which is suboptimal. In this paper, we
propose to use conditional random fields (CRFs) for
letter-to-sound conversion, which significantly
improves quality of the predicted pronunciations. When
applied to OOV STD, we achieve considerable
performance improvement with both a 1-best system and
an SPM-based system.
|
|
[7]
|
Javier Tejedor, Doroteo T. Toledano, Miguel Bautista, Simon King, Dong Wang,
and Jose Colas.
Augmented set of features for confidence estimation in spoken term
detection.
In Proc. Interspeech, September 2010.
Discriminative confidence estimation along with
confidence normalisation have been shown to construct
robust decision maker modules in spoken term detection
(STD) systems. Discriminative confidence estimation,
making use of term-dependent features, has been shown to
improve the widely used lattice-based confidence
estimation in STD. In this work, we augment the set of
these term-dependent features and show a significant
improvement in the STD performance both in terms of
ATWV and DET curves in experiments conducted on a
Spanish geographical corpus. This work also proposes a
multiple linear regression analysis to carry out the
feature selection. Next, the most informative features
derived from it are used within the discriminative
confidence on the STD system.
|
|
[8]
|
D. Wang, S. King, and J. Frankel.
Stochastic pronunciation modelling for out-of-vocabulary spoken term
detection.
IEEE Transactions on Audio, Speech, and Language Processing,
PP(99), July 2010.
Spoken term detection (STD) is the name given to the
task of searching large amounts of audio for
occurrences of spoken terms, which are typically single
words or short phrases. One reason that STD is a hard
task is that search terms tend to contain a
disproportionate number of out-of-vocabulary (OOV)
words. The most common approach to STD uses subword
units. This, in conjunction with some method for
predicting pronunciations of OOVs from their written
form, enables the detection of OOV terms but
performance is considerably worse than for
in-vocabulary terms. This performance differential can
be largely attributed to the special properties of
OOVs. One such property is the high degree of
uncertainty in the pronunciation of OOVs. We present a
stochastic pronunciation model (SPM) which explicitly
deals with this uncertainty. The key insight is to
search for all possible pronunciations when detecting
an OOV term, explicitly capturing the uncertainty in
pronunciation. This requires a probabilistic model of
pronunciation, able to estimate a distribution over all
possible pronunciations. We use a joint-multigram model
(JMM) for this and compare the JMM-based SPM with the
conventional soft match approach. Experiments using
speech from the meetings domain demonstrate that the
SPM performs better than soft match in most operating
regions, especially at low false alarm probabilities.
Furthermore, SPM and soft match are found to be
complementary: their combination provides further
performance gains.
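The idea of searching over many candidate pronunciations, rather than committing to one, can be illustrated with a small Python sketch. It assumes a hypothetical dictionary of pronunciation probabilities (as a letter-to-sound model might produce) and a hypothetical per-pronunciation match score, and combines them as a probability-weighted sum. This combination rule is an assumption for illustration; the paper's actual formulation may differ.

```python
from typing import Dict


def spm_confidence(pron_probs: Dict[str, float],
                   match_scores: Dict[str, float]) -> float:
    """Expected match score under a (renormalised) pronunciation distribution.

    pron_probs:   hypothetical pronunciation -> probability for one OOV term
    match_scores: hypothetical pronunciation -> detection score at one location
    """
    total = sum(pron_probs.values())
    if total <= 0:
        raise ValueError("pronunciation probabilities must sum to a positive value")
    # Pronunciations with no match at this location contribute a score of 0.
    return sum((p / total) * match_scores.get(pron, 0.0)
               for pron, p in pron_probs.items())
```

With a single pronunciation of probability 1, this reduces to the conventional 1-best score; with several, less likely pronunciations contribute proportionally less.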
|
|
[9]
|
Dong Wang, Simon King, Joe Frankel, and Peter Bell.
Stochastic pronunciation modelling and soft match for
out-of-vocabulary spoken term detection.
In Proc. ICASSP, Dallas, Texas, USA, March 2010.
A major challenge faced by a spoken term detection
(STD) system is the detection of out-of-vocabulary
(OOV) terms. Although a subword-based STD system is
able to detect OOV terms, performance reduction is
always observed compared to in-vocabulary terms. One
challenge that OOV terms bring to STD is the
pronunciation uncertainty. A commonly used approach to
address this problem is a soft matching procedure, and
the other is the stochastic pronunciation modelling
(SPM) proposed by the authors. In this paper we compare
these two approaches, and combine them using a
discriminative decision strategy. Experimental results
demonstrated that SPM and soft match are highly
complementary, and their combination gives significant
performance improvement to OOV term detection.
Keywords: confidence estimation, spoken term detection, speech
recognition
|
|
[10]
|
Dong Wang, Simon King, and Joe Frankel.
Stochastic pronunciation modelling for spoken term detection.
In Proc. Interspeech, pages 2135-2138, Brighton, UK,
September 2009.
A major challenge faced by a spoken term detection
(STD) system is the detection of out-of-vocabulary
(OOV) terms. Although a subword-based STD system is
able to detect OOV terms, performance reduction is
always observed compared to in-vocabulary terms.
Current approaches to STD do not acknowledge the
particular properties of OOV terms, such as
pronunciation uncertainty. In this paper, we use a
stochastic pronunciation model to deal with the
uncertain pronunciations of OOV terms. By considering
all possible term pronunciations, predicted by a
joint-multigram model, we observe a significant
performance improvement.
|
|
[11]
|
Dong Wang, Simon King, Joe Frankel, and Peter Bell.
Term-dependent confidence for out-of-vocabulary term detection.
In Proc. Interspeech, pages 2139-2142, Brighton, UK, September
2009.
Within a spoken term detection (STD) system, the
decision maker plays an important role in retrieving
reliable detections. Most of the state-of-the-art STD
systems make decisions based on a confidence measure
that is term-independent, which poses a serious problem
for out-of-vocabulary (OOV) term detection. In this
paper, we study a term-dependent confidence measure
based on confidence normalisation and discriminative
modelling, particularly focusing on its remarkable
effectiveness for detecting OOV terms. Experimental
results indicate that the term-dependent confidence
provides much more significant improvement for OOV
terms than terms in-vocabulary.
|
|
[12]
|
Javier Tejedor, Dong Wang, Simon King, Joe Frankel, and Jose Colas.
A posterior probability-based system hybridisation and combination
for spoken term detection.
In Proc. Interspeech, pages 2131-2134, Brighton, UK, September
2009.
Spoken term detection (STD) is a fundamental task for
multimedia information retrieval. To improve the
detection performance, we have presented a direct
posterior-based confidence measure generated from a
neural network. In this paper, we propose a
detection-independent confidence estimation based on
the direct posterior confidence measure, in which the
decision making is totally separated from the term
detection. Based on this idea, we first present a
hybrid system which conducts the term detection and
confidence estimation based on different sub-word
units, and then propose a combination method which
merges detections from heterogeneous term detectors
based on the direct posterior-based confidence.
Experimental results demonstrated that the proposed
methods improved system performance considerably for
both English and Spanish.
|
|
[13]
|
Dong Wang, Javier Tejedor, Joe Frankel, and Simon King.
Posterior-based confidence measures for spoken term detection.
In Proc. ICASSP, Taiwan, April 2009.
Confidence measures play a key role in spoken term
detection (STD) tasks. The confidence measure expresses
the posterior probability of the search term appearing
in the detection period, given the speech. Traditional
approaches are based on the acoustic and language model
scores for candidate detections found using automatic
speech recognition, with Bayes' rule being used to
compute the desired posterior probability. In this
paper, we present a novel direct posterior-based
confidence measure which, instead of resorting to the
Bayesian formula, calculates posterior probabilities
from a multi-layer perceptron (MLP) directly. Compared
with traditional Bayesian-based methods, the
direct-posterior approach is conceptually and
mathematically simpler. Moreover, the MLP-based model
does not require assumptions to be made about the
acoustic features such as their statistical
distribution and the independence of static and dynamic
co-efficients. Our experimental results in both English
and Spanish demonstrate that the proposed direct
posterior-based confidence improves STD performance.
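The contrast drawn in this abstract can be made concrete with a short Python sketch. The first function applies Bayes' rule to likelihoods and priors of competing hypotheses; the second derives a confidence directly from per-frame posteriors, here taken as their geometric mean over the detection. The function names, the inputs, and the geometric-mean aggregation are all assumptions for illustration, not the paper's definition.

```python
import math
from typing import Dict, Sequence


def bayes_posterior(likelihoods: Dict[str, float],
                    priors: Dict[str, float],
                    target: str) -> float:
    """P(target | speech) from likelihoods and priors via Bayes' rule."""
    evidence = sum(likelihoods[h] * priors[h] for h in likelihoods)
    return likelihoods[target] * priors[target] / evidence


def direct_confidence(frame_posteriors: Sequence[float]) -> float:
    """Geometric mean of per-frame posteriors (e.g. hypothetical MLP outputs)."""
    if not frame_posteriors:
        raise ValueError("at least one frame posterior is required")
    if any(p <= 0.0 for p in frame_posteriors):
        return 0.0  # a zero posterior on any frame drives the mean to zero
    log_sum = sum(math.log(p) for p in frame_posteriors)
    return math.exp(log_sum / len(frame_posteriors))
```

The direct variant needs no likelihoods or priors, only the frame-level posteriors across the detection period.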
|
|
[14]
|
Javier Tejedor, Dong Wang, Joe Frankel, Simon King, and José Colás.
A comparison of grapheme and phoneme-based units for Spanish spoken
term detection.
Speech Communication, 50(11-12):980-991, November-December
2008.
The ever-increasing volume of audio data available
online through the world wide web means that automatic
methods for indexing and search are becoming essential.
Hidden Markov model (HMM) keyword spotting and lattice
search techniques are the two most common approaches
used by such systems. In keyword spotting, models or
templates are defined for each search term prior to
accessing the speech and used to find matches. Lattice
search (referred to as spoken term detection), uses a
pre-indexing of speech data in terms of word or
sub-word units, which can then quickly be searched for
arbitrary terms without referring to the original
audio. In both cases, the search term can be modelled
in terms of sub-word units, typically phonemes. For
in-vocabulary words (i.e. words that appear in the
pronunciation dictionary), the letter-to-sound
conversion systems are accepted to work well. However,
for out-of-vocabulary (OOV) search terms,
letter-to-sound conversion must be used to generate a
pronunciation for the search term. This is usually a
hard decision (i.e. not probabilistic and with no
possibility of backtracking), and errors introduced at
this step are difficult to recover from. We therefore
propose the direct use of graphemes (i.e., letter-based
sub-word units) for acoustic modelling. This is
expected to work particularly well in languages such as
Spanish, where despite the letter-to-sound mapping
being very regular, the correspondence is not
one-to-one, and there will be benefits from avoiding
hard decisions at early stages of processing. In this
article, we compare three approaches for Spanish
keyword spotting or spoken term detection, and within
each of these we compare acoustic modelling based on
phone and grapheme units. Experiments were performed
using the Spanish geographical-domain Albayzin corpus.
Results achieved in the two approaches proposed for
spoken term detection show us that trigrapheme units
for acoustic modelling match or exceed the performance
of phone-based acoustic models. In the method proposed
for keyword spotting, the results achieved with each
acoustic model are very similar.
|
|
[15]
|
Dong Wang, Ivan Himawan, Joe Frankel, and Simon King.
A posterior approach for microphone array based speech recognition.
In Proc. Interspeech, pages 996-999, September 2008.
Automatic speech recognition (ASR) becomes rather
difficult in meetings domains because of the adverse
acoustic conditions, including more background noise,
more echo and reverberation and frequent cross-talking.
Microphone arrays have been shown to boost
ASR performance dramatically in such noisy and
reverberant environments, with various beamforming
algorithms. However, almost all existing beamforming
measures work in the acoustic domain, resorting to
signal processing theories and geometric explanation.
This limits their application, and induces significant
performance degradation when the geometric property is
unavailable or hard to estimate, or if heterogeneous
channels exist in the audio system. In this paper, we
present a new posterior-based approach for array-based
speech recognition. The main idea is, instead of
enhancing speech signals, we try to enhance the
posterior probabilities of frames belonging to
recognition units, e.g., phones. These enhanced
posteriors are then transferred to posterior
probability based features and are modeled by HMMs,
leading to a tandem ANN-HMM hybrid system presented by
Hermansky et al. Experimental results demonstrated the
validity of this posterior approach. With the posterior
accumulation or enhancement, significant improvement
was achieved over the single channel baseline.
Moreover, we can combine the acoustic enhancement and
posterior enhancement together, leading to a hybrid
acoustic-posterior beamforming approach, which works
significantly better than just the acoustic
beamforming, especially in the scenario with
moving-speakers.
|
|
[16]
|
Joe Frankel, Dong Wang, and Simon King.
Growing bottleneck features for tandem ASR.
In Proc. Interspeech, page 1549, September 2008.
We present a method for training bottleneck MLPs for
use in tandem ASR. Experiments on meetings data show
that this approach leads to improved performance
compared with training MLPs from a random
initialization.
|
|
[17]
|
Dong Wang, Joe Frankel, Javier Tejedor, and Simon King.
A comparison of phone and grapheme-based spoken term detection.
In Proc. ICASSP, pages 4969-4972, March-April 2008.
We propose grapheme-based sub-word units for spoken
term detection (STD). Compared to phones, graphemes
have a number of potential advantages. For
out-of-vocabulary search terms, phone-based approaches
must generate a pronunciation using letter-to-sound
rules. Using graphemes obviates this potentially
error-prone hard decision, shifting pronunciation
modelling into the statistical models describing the
observation space. In addition, long-span grapheme
language models can be trained directly from large text
corpora. We present experiments on Spanish and English
data, comparing phone and grapheme-based STD. For
Spanish, where phone and grapheme-based systems give
similar transcription word error rates (WERs),
grapheme-based STD significantly outperforms a
phone-based approach. The converse is found for English,
where the phone-based system outperforms a grapheme
approach. However, we present additional analysis which
suggests that phone-based STD performance levels may be
achieved by a grapheme-based approach despite lower
transcription accuracy, and that the two approaches may
usefully be combined. We propose a number of directions
for future development of these ideas, and suggest that
if grapheme-based STD can match phone-based
performance, the inherent flexibility in dealing with
out-of-vocabulary terms makes this a desirable
approach.
|