The Centre for Speech Technology Research, The University of Edinburgh

Reproducible Research

Work on unsupervised speaker adaptation of neural network acoustic models (SLT 2014 and ICASSP 2015 papers)

Learning Hidden Unit Contributions for Unsupervised Speaker Adaptation of Neural Network Acoustic Models

Paper information and status

P Swietojanski and S Renals. "Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models". Proc IEEE SLT, 2014.

[ pdf | article on IEEE Xplore | bibtex ]

Abstract

This paper proposes a simple yet effective model-based neural network speaker adaptation technique that learns speaker-specific hidden unit contributions given adaptation data, without requiring any form of speaker-adaptive training, or labelled adaptation data. An additional amplitude parameter is defined for each hidden unit; the amplitude parameters are tied for each speaker, and are learned using unsupervised adaptation. We conducted experiments on the TED talks data, as used in the International Workshop on Spoken Language Translation (IWSLT) evaluations. Our results indicate that the approach can reduce word error rates on standard IWSLT test sets by about 8–15% relative compared to unadapted systems, with a further reduction of 4–6% relative when combined with feature-space maximum likelihood linear regression (fMLLR). The approach can be employed in most existing feed-forward neural network architectures, and we report results using various hidden unit activation functions: sigmoid, maxout, and rectifying linear units (ReLU).
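
To make the adaptation scheme concrete, below is a minimal numpy sketch of the LHUC forward pass for one sigmoid hidden layer. The 2*sigmoid re-parameterisation, which keeps each amplitude in (0, 2), is one standard choice used in later LHUC work; the names (lhuc_layer, r_spk) and the layer dimensions are illustrative assumptions, not details taken from the released system.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lhuc_layer(x, W, b, r_spk):
        # Speaker-independent hidden activations; W and b stay frozen
        # during adaptation.
        h = sigmoid(W @ x + b)
        # Per-unit amplitudes, constrained to (0, 2) via 2*sigmoid
        # (an assumed, commonly used parameterisation). Only r_spk,
        # one vector per speaker, is learned by backpropagation
        # against unsupervised first-pass targets.
        a = 2.0 * sigmoid(r_spk)
        return a * h

    # Example: r_spk = 0 gives a = 1 everywhere, i.e. the unadapted network.
    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(2048, 440)), np.zeros(2048)
    x, r_spk = rng.normal(size=440), np.zeros(2048)
    h_adapted = lhuc_layer(x, W, b, r_spk)

Because only the amplitude vector is speaker-dependent, adaptation adds one parameter per hidden unit per speaker and leaves the speaker-independent network untouched.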

Differentiable Pooling for Unsupervised Speaker Adaptation

Paper information and status

P Swietojanski and S Renals. "Differentiable Pooling for Unsupervised Speaker Adaptation". Proc IEEE ICASSP, 2015.

[ pdf | article on IEEE Xplore | bibtex ]

Abstract

This paper proposes a differentiable pooling mechanism to perform model-based neural network speaker adaptation. The proposed technique learns a speaker-dependent combination of activations within pools of hidden units, works well in an unsupervised setting, and does not require speaker-adaptive training. We have conducted a set of experiments on the TED talks data, as used in the IWSLT evaluations. Our results indicate that the approach can reduce word error rates (WERs) on standard IWSLT test sets by about 5–11% relative compared to speaker-independent systems, and was found complementary to the recently proposed learning hidden unit contributions (LHUC) approach, reducing WER by 6–13% relative. Both methods were also found to work well when adapting with small amounts of unsupervised data: as little as 10 seconds can decrease the WER by 5% relative compared to the baseline speaker-independent system.
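
As a sketch of what a differentiable pooling layer can look like, the snippet below implements L_p pooling over non-overlapping pools of hidden units with a learnable order p per pool. The softplus re-parameterisation keeping p > 1 and the names (diff_lp_pool, rho) are our assumptions for illustration; the paper's exact parameterisation and pooling variants may differ.

    import numpy as np

    def diff_lp_pool(h, p, pool_size):
        # Group hidden activations into non-overlapping pools and take
        # an L_p mean within each pool. Because the expression is
        # differentiable in p, the per-pool orders can be learned per
        # speaker by backpropagation, analogously to LHUC amplitudes.
        pools = np.abs(h).reshape(-1, pool_size)
        return np.mean(pools ** p[:, None], axis=1) ** (1.0 / p)

    # Example: 3000 hidden units pooled in groups of 3.
    rng = np.random.default_rng(0)
    h = rng.normal(size=3000)
    rho = np.zeros(1000)
    p = 1.0 + np.log1p(np.exp(rho))          # softplus keeps every p > 1
    pooled = diff_lp_pool(h, p, pool_size=3)  # shape (1000,)

As with LHUC, only the small set of per-speaker pooling parameters would be re-estimated during unsupervised adaptation, with the speaker-independent weights frozen.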

Data

We work with the publicly available TED talks transcription task, following the rules of the IWSLT evaluation campaigns.

Code

We are working on releasing the system-building scripts as part of the Kaldi toolkit. More information coming soon.

Contact

Contact Pawel Swietojanski (p.swietojanski@ed.ac.uk) for any further information.