MC-WSJ-AV
The MC-WSJ-AV (multi-channel Wall Street Journal Audio-Visual) corpus is a corpus of read speech (WSJ) recorded with close-talking and distant microphones (arrays), enabling research in speaker localisation, (blind) speech separation and speech recognition.
Authors
Author | Affiliation
Mike Lincoln | The University of Edinburgh
Erich Zwyssig | The University of Edinburgh
The second author carried out only minimal data preparation in the final stages of releasing the corpus.
Data Type
Speech from single, overlapping and moving speakers.
Data Source
The MC-WSJ-AV corpus consists of audio recordings made on the premises of:
The Centre for Speech Technology Research
University of Edinburgh
Informatics Forum
10 Crichton Street
Edinburgh
EH8 9AB
United Kingdom
Recordings were carried out using a headset and lapel microphone and an eight-channel microphone array. Despite the "AV" in the title, no video data is provided. Speaker locations are fixed and can be derived from [1].
Languages and dialects
(Mostly) UK English.
Narrative Description
The MC-WSJ-AV corpus offers researchers an intermediate task between simple digit recognition and large vocabulary conversational speech recognition. It consists of sentences read from the Wall Street Journal (WSJ) taken from the test set of the WSJCAM0 database.
A total of about 45 speakers, male and female, were recorded in three different scenarios:
- single (stationary) speaker
- two (stationary) overlapping speakers
- single moving speaker
The speakers were recorded using a headset and lapel microphone and an eight-channel microphone array while reading WSJ sentences from prompts. In the single-speaker scenario, participants were asked to read from 6 fixed positions; in the overlapping scenario, each speaker was assigned a fixed position for the entire recording; and in the moving scenario, speakers moved from one position to the next while reading.
15 participants were recorded for the single-speaker scenario, 9 pairs for the overlapping scenario and 9 for the moving scenario. Each read about 90 sentences, which are available for speech separation and recognition experiments.
Task | Applications | Comments
Single speaker | Distant (automatic) speech recognition |
Overlapping speakers | Speaker localisation, speech separation and distant (automatic) speech recognition |
Moving speaker | Speaker localisation, speech separation and distant (automatic) speech recognition |
Each speaker (pair) read WSJ sentences (WSJCAM0) from a script, divided into the following data sets:
Data set (name) | Number of sentences | Description
adapt | Approx. 17 | TIMIT style, for adaptation
5k | Approx. 40 | 5,000-word (closed vocabulary) sub-corpus of WSJCAM0
20k | Approx. 40 | 20,000-word (open vocabulary) sub-corpus of WSJCAM0
Each sentence is split out of the recording into its own file for recognition and stored in folders with the following structure:
- ./MC-WSJ-AV/audio/<task>/T<#1>[_T<#2>]/<mic_type>/{adap|5k|20k}/*.wav
... with
- <task> defining the task, i.e. stat (single static), move (single moving) or olap (two overlapping)
- T<#> defining the participant, where # is the participant number
- <mic_type> defining the microphone type, e.g. array1/2, headset1/2 or lapel1/2
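The folder structure above can be sketched as a small path helper. This is a minimal sketch, not part of the corpus tooling: the function names `sentence_dir` and `list_sentences` are hypothetical, and it assumes that the folder names on disk match the description literally (task names `stat`, `move`, `olap`; data set folders `adap`, `5k`, `20k`; participant numbers without zero padding).

```python
from pathlib import Path

# Values read from the folder structure described above. Whether the names on
# disk match them exactly (case, zero padding) is an assumption.
TASKS = {"stat", "move", "olap"}
DATA_SETS = {"adap", "5k", "20k"}


def sentence_dir(root, task, participants, mic_type, data_set):
    """Return the folder holding the split sentence files of one recording."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if data_set not in DATA_SETS:
        raise ValueError(f"unknown data set: {data_set!r}")
    # Only the overlapping task has two participants (T<#1>_T<#2>).
    expected = 2 if task == "olap" else 1
    if len(participants) != expected:
        raise ValueError(f"task {task!r} expects {expected} participant(s)")
    speakers = "_".join(f"T{number}" for number in participants)
    return Path(root) / "MC-WSJ-AV" / "audio" / task / speakers / mic_type / data_set


def list_sentences(root, task, participants, mic_type, data_set):
    """List the *.wav files of one recording, sorted by name."""
    folder = sentence_dir(root, task, participants, mic_type, data_set)
    return sorted(folder.glob("*.wav"))
```

For example, `sentence_dir(".", "olap", [3, 7], "array1", "5k")` points at `MC-WSJ-AV/audio/olap/T3_T7/array1/5k`; the participant numbers 3 and 7 are made up for illustration.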
... and *.wav is defined as
- AMI_WSJ<#1>[_<#2>]_<mic_type>_T<#1><ref#1>_T<#2><ref#2>.wav
... where <ref#> identifies the reference transcription (the correct answer) for the sentence in the MLF file stored in ./MC-WSJ-AV/mlf/MC_WSJ_AV.mlf
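To illustrate how the file name pattern expands, the sketch below composes a name from its parts. It takes the pattern literally and is not taken from the corpus tooling: the function name `sentence_filename` is hypothetical, the reference ids in the example are invented, and dropping the second `T<#><ref#>` part for single-speaker recordings is an assumption, since the pattern does not mark it as optional. Real file names may differ in separators or padding.

```python
def sentence_filename(mic_type, speakers):
    """Compose a file name from (participant number, reference id) pairs.

    `speakers` holds one pair for the single-speaker tasks and two pairs for
    the overlapping task. Omitting the second T<#><ref#> part for a single
    speaker is an assumption, not something the pattern states.
    """
    if len(speakers) not in (1, 2):
        raise ValueError("expected one or two (number, ref) pairs")
    # AMI_WSJ<#1>[_<#2>]
    numbers = "_".join(str(number) for number, _ in speakers)
    # T<#1><ref#1>[_T<#2><ref#2>]
    parts = "_".join(f"T{number}{ref}" for number, ref in speakers)
    return f"AMI_WSJ{numbers}_{mic_type}_{parts}.wav"


# Hypothetical participant numbers and reference ids, for illustration only.
print(sentence_filename("array1", [(3, "c0201"), (7, "c0305")]))
```

The composed name can then be matched against the entries of the MLF file to look up the reference transcription for each speaker.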
The speaker locations are stored in ./MC-WSJ-AV/etc/sentencelocation/T<>.txt
References
[1] M. Lincoln, I. McCowan, J. Vepa and H.K. Maganti, "The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments", IEEE Workshop on Automatic Speech Recognition and Understanding, 2005