The Centre for Speech Technology Research, The University of Edinburgh


The MC-WSJ-AV (Multi-Channel Wall Street Journal Audio-Visual) corpus is a corpus of read speech (WSJ) recorded with close-talking microphones and distant microphone arrays, enabling research in speaker localisation, (blind) speech separation and speech recognition.




Mike Lincoln

The University of Edinburgh

Erich Zwyssig

The University of Edinburgh

The second author carried out only minimal data preparation in the final stages of releasing the corpus.

Data Type

Speech from single, overlapping and moving speakers.

Data Source

The MC-WSJ-AV corpus consists of audio recordings made at the premises of:

The Centre for Speech Technology Research
University of Edinburgh
Informatics Forum
10 Crichton Street
United Kingdom

Recordings were made using a headset microphone, a lapel microphone and an eight-channel microphone array. Despite the "AV" in the title, no video data is provided. Speaker locations are fixed and can be derived from [1].

Languages and dialects

(Mostly) UK English.

Narrative Description

The MC-WSJ-AV corpus offers researchers an intermediate task between simple digit recognition and large vocabulary conversational speech recognition. It consists of sentences read from the Wall Street Journal (WSJ) taken from the test set of the WSJCAM0 database.

A total of about 45 speakers, male and female, were recorded in three different scenarios: single speaker, overlapping speakers and moving speaker.

The speakers were recorded reading WSJ sentences from prompts, using a headset microphone, a lapel microphone and an eight-channel microphone array. In the single speaker scenario, participants were asked to read from six fixed positions. In the overlapping scenario, each speaker was assigned one fixed position for the entire recording. In the moving scenario, speakers moved from one position to the next while reading.

15 participants were recorded for the single speaker scenario, 9 pairs for the overlapping scenario and 9 participants for the moving scenario. Each read about 90 sentences, which are available for speech separation and recognition experiments.




Scenario               Research tasks
Single speaker         Distant (automatic) speech recognition
Overlapping speakers   Speaker localisation, speech separation and distant (automatic) speech recognition
Moving speaker         Speaker localisation, speech separation and distant (automatic) speech recognition


Each speaker (or speaker pair) read WSJ sentences (WSJCAM0) from a script, as follows:

Number of sentences    Data set (name)
Approx. 17             TIMIT style, for adaptation
Approx. 40             5,000 word (closed vocabulary) sub corpus of WSJCAM0
Approx. 40             20,000 word (open vocabulary) sub corpus of WSJCAM0
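The participant and sentence counts quoted in this description are approximate, and a short tally shows how they relate to each other. The sketch below uses only the figures stated in the text. The dictionary keys are hypothetical labels chosen for illustration and are not the corpus's own data set names.

```python
# Participant counts per scenario, as stated in this description.
single_speakers = 15
overlapping_pairs = 9      # each pair contributes two speakers
moving_speakers = 9

# Total number of individual speakers across the three scenarios.
total_speakers = single_speakers + 2 * overlapping_pairs + moving_speakers

# Approximate number of sentences per data set.
# The keys are hypothetical labels, not names used in the corpus.
sentences_per_set = {
    "adaptation_timit_style": 17,
    "closed_vocab_5k": 40,
    "open_vocab_20k": 40,
}
sentences_per_script = sum(sentences_per_set.values())

# Number of recording sessions (one per speaker or speaker pair).
sessions = single_speakers + overlapping_pairs + moving_speakers

print(f"speakers: {total_speakers}")
print(f"sentences per script (approx.): {sentences_per_script}")
print(f"recording sessions: {sessions}")
```

The tally gives 42 speakers, in line with the "about 45" quoted above, and roughly 97 sentences per script, which is of the same order as the "about 90" quoted above given that every per-set figure is approximate.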

Each sentence is individually split from the recording for recognition and stored in folders following the structure

... with

..  and *.wav is defined as

... where <ref#> determines the correct answer from the mlf file stored in ./MC-WSJ-AV/mlf/MC_WSJ_AV.mlf
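As an illustration of how a reference transcription might be looked up, the following is a minimal sketch that assumes the mlf file follows the common HTK master label file layout: a `#!MLF!#` header, a quoted label name per utterance, one word per line, and a single `.` closing each entry. Whether MC_WSJ_AV.mlf matches this layout exactly, and how <ref#> maps onto the label names, are assumptions to check against the file itself. The function name `parse_mlf` and the sample label names are hypothetical.

```python
from typing import Dict, List, Optional


def parse_mlf(text: str) -> Dict[str, List[str]]:
    """Parse HTK-style MLF text into {label name: [words]}.

    Assumes word-only label lines (no start/end times or scores).
    """
    transcripts: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue  # skip blank lines and the file header
        if current is None:
            # Entry header such as "*/name.lab": keep the base name only.
            if line.startswith('"') and line.endswith('"'):
                name = line.strip('"').split("/")[-1]
                current = name.rsplit(".", 1)[0]
                transcripts[current] = []
        elif line == ".":
            current = None  # a lone "." terminates the entry
        else:
            transcripts[current].append(line)
    return transcripts


# Hypothetical sample content; not taken from the corpus.
sample = '''#!MLF!#
"*/utt_0001.lab"
THE
MARKET
ROSE
.
"*/utt_0002.lab"
PRICES
FELL
.
'''

refs = parse_mlf(sample)
print(" ".join(refs["utt_0001"]))
```

In practice the text would be read from ./MC-WSJ-AV/mlf/MC_WSJ_AV.mlf, and the dictionary would be indexed by whatever key the <ref#> part of the file name corresponds to.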


The speaker locations are stored in ./MC-WSJ-AV/etc/sentencelocation/T<>.txt


[1] M. Lincoln, I. McCowan, J. Vepa and H.K. Maganti, "The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments", IEEE Workshop on Automatic Speech Recognition and Understanding, 2005