Reproducible Research
Convolutional Neural Networks for Distant Speech Recognition
Paper information and status
P. Swietojanski, A. Ghoshal, and S. Renals. "Convolutional Neural Networks for Distant Speech Recognition". Signal Processing Letters, IEEE, Volume 21, Issue 9, 2014.
[pdf | IEEE Xplore | bibtex]
Abstract
We investigate convolutional neural networks (CNNs) for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM). In the MDM case we explore a beamformed signal input representation compared with the direct use of multiple acoustic channels as a parallel input to the CNN. We have explored different weight sharing approaches, and propose a channel-wise convolution with two-way pooling. Our experiments, using the AMI meeting corpus, found that CNNs improve the word error rate (WER) by 6.5% relative compared to conventional deep neural network (DNN) models and 15.7% over a discriminatively trained Gaussian mixture model (GMM) baseline. For cross-channel CNN training, the WER improves by 3.5% relative over the comparable DNN structure. Compared with the best beamformed GMM system, cross-channel convolution reduces the WER by 9.7% relative, and matches the accuracy of a beamformed DNN.
Data
The train, development, and eval sets are defined below. These are the same as the sets called "Full-corpus-ASR partition of meetings" on the AMI Corpus page.
- Train set: ES2002, ES2003, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2014, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3006, TS3007, TS3008, TS3009, TS3010, TS3011, TS3012; EN2001, EN2003, EN2004a, EN2005a, EN2006, EN2009; IN1001, IN1002, IN1005, IN1007, IN1008, IN1009, IN1012, IN1013, IN1014, IN1016
- Dev set: ES2011, IS1008, TS3004, IB4001, IB4002, IB4003, IB4004, IB4010, IB4011
- Eval set: ES2004, IS1009, TS3003, EN2002
We use the AMI Annotations v1.6.
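For scripting, the partition listed above could be written down as a small lookup. This is only a sketch, not part of the original recipe: the names `TRAIN`, `DEV`, `EVAL`, `EXCLUDED`, and `partition_of` are hypothetical. It assumes that a group ID such as ES2002 stands for every session whose ID starts with that prefix (for example ES2002a), and that "(no a)" and "(no d)" mean the sessions IS1002a and IS1005d are left out.

```python
# Meeting/group IDs copied from the lists above. The variable and function
# names are illustrative assumptions, not part of the original setup.
TRAIN = """
ES2002 ES2003 ES2005 ES2006 ES2007 ES2008 ES2009 ES2010 ES2012 ES2013
ES2014 ES2015 ES2016
IS1000 IS1001 IS1002 IS1003 IS1004 IS1005 IS1006 IS1007
TS3005 TS3006 TS3007 TS3008 TS3009 TS3010 TS3011 TS3012
EN2001 EN2003 EN2004a EN2005a EN2006 EN2009
IN1001 IN1002 IN1005 IN1007 IN1008 IN1009 IN1012 IN1013 IN1014 IN1016
""".split()

DEV = "ES2011 IS1008 TS3004 IB4001 IB4002 IB4003 IB4004 IB4010 IB4011".split()

EVAL = "ES2004 IS1009 TS3003 EN2002".split()

# Assumption: "IS1002 (no a)" and "IS1005 (no d)" exclude these two sessions.
EXCLUDED = {"IS1002a", "IS1005d"}

PARTITION = {"train": TRAIN, "dev": DEV, "eval": EVAL}


def partition_of(meeting_id):
    """Return 'train', 'dev', 'eval', or None for an ID such as 'ES2011a'.

    Assumption: a listed ID matches every session ID that starts with it.
    """
    if meeting_id in EXCLUDED:
        return None
    for name, prefixes in PARTITION.items():
        if any(meeting_id.startswith(prefix) for prefix in prefixes):
            return name
    return None


# Sanity check: no ID may appear in more than one set.
assert not (set(TRAIN) & set(DEV))
assert not (set(TRAIN) & set(EVAL))
assert not (set(DEV) & set(EVAL))
```

A call such as `partition_of("TS3003c")` then looks up which set a given session belongs to; IDs that match no listed prefix return `None`.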