[1]
K. Livescu, Ö. Çetin, M. Hasegawa-Johnson, S. King, C. Bartels, N. Borges,
A. Kantor, P. Lal, L. Yung, S. Bezman, Dawson-Haggerty, B. Woods, J. Frankel,
M. Magimai-Doss, and K. Saenko.
Articulatory feature-based methods for acoustic and audio-visual
speech recognition: Summary from the 2006 JHU Summer Workshop.
In Proc. ICASSP, Honolulu, April 2007.
We report on investigations, conducted at the 2006
Johns Hopkins Workshop, into the use of articulatory
features (AFs) for observation and pronunciation models
in speech recognition. In the area of observation
modeling, we use the outputs of AF classifiers both
directly, in an extension of hybrid HMM/neural network
models, and as part of the observation vector, an
extension of the tandem approach. In the area of
pronunciation modeling, we investigate a model having
multiple streams of AF states with soft synchrony
constraints, for both audio-only and audio-visual
recognition. The models are implemented as dynamic
Bayesian networks, and tested on tasks from the
Small-Vocabulary Switchboard (SVitchboard) corpus and
the CUAVE audio-visual digits corpus. Finally, we
analyze AF classification and forced alignment using a
newly collected set of feature-level manual
transcriptions.
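
As a rough illustration of the tandem idea mentioned in the abstract (appending AF classifier outputs to the acoustic observation vector), the following Python sketch uses hypothetical MFCC and AF-posterior arrays; it is not the workshop's actual pipeline, and the log/PCA steps are only a common tandem-style convention assumed here.

import numpy as np

def tandem_features(mfcc, af_posteriors, eps=1e-10):
    """Append log-transformed, decorrelated AF classifier posteriors
    to the base acoustic features, tandem-style.

    mfcc:          (T, D) per-frame acoustic features
    af_posteriors: (T, K) per-frame AF classifier posteriors
    """
    # Log posteriors behave better with Gaussian observation models.
    log_post = np.log(af_posteriors + eps)
    # Decorrelate with PCA (keeping all components in this sketch).
    centered = log_post - log_post.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    decorrelated = centered @ vt.T
    # Concatenate to form the extended observation vector.
    return np.concatenate([mfcc, decorrelated], axis=1)

# Toy usage with stand-in data: 10 frames, 13 MFCCs, 5 AF classes.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(10, 13))
post = rng.dirichlet(np.ones(5), size=10)
obs = tandem_features(mfcc, post)
print(obs.shape)  # (10, 18)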