The Centre for Speech Technology Research, The University of Edinburgh


The 2012_MMA (multiple microphone array) corpus is a corpus of read speech (WSJ) recorded with multiple distant microphone arrays, enabling research in speaker localisation, (blind) speech separation and speech recognition.




Erich Zwyssig

The University of Edinburgh

Data Type

The 2012_MMA corpus contains speech from single and overlapping speakers, following the ideas and setup of the MC-WSJ-AV corpus [1].

Data Source

The 2012_MMA corpus consists of audio recordings made at the premises of:

The Centre for Speech Technology Research
University of Edinburgh
Informatics Forum
10 Crichton Street
Edinburgh
United Kingdom

Recordings were carried out using five microphone arrays, as shown in Figure 1.

Figure 1 - 2012_MMA corpus recording setup

The configurations of the five circular microphone arrays (diameter d, sampling rate Fs) are:

With the analogue microphone (array) containing

and the digital microphone (arrays) being

built from

Based on the USBPAL from Rigisystems and using eight (8) Analog Devices ADMP441 digital MEMS microphones, the DMMA.2 and DMMA.3 are circular microphone arrays with diameters of 20 cm and 4 cm, respectively. A detailed datasheet and sound samples for the DMMA.3 are available here (DMMA.3).

Recording setup

The recording setups in the IMR (instrumented meeting room) and the hemi-anechoic chamber are identical within the constraints of the two rooms.

Figures 2, 3, 4 and 5 show the microphone array placement and adjustment, as well as the dimensions of the setup with respect to the position of the speaker(s).

Figure 2 - Recording setup (IMR - one speaker)

Figure 3 - Recording setup (IMR - two speakers)

Figure 4 - Recording setup (hemi-anechoic chamber - one speaker)

Figure 5 - Recording setup (hemi-anechoic chamber - two speakers)

Note: The red triangle indicates microphone channel 1; channel numbers increase counter-clockwise (in the *.wav audio files provided).
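The channel ordering above is enough to pull a single microphone's signal out of an eight-channel recording. The following is a minimal sketch using only the Python standard library; it assumes the corpus files are interleaved 16-bit PCM WAV files (the function name and defaults are illustrative, not part of any corpus tooling):

```python
import struct
import wave

def read_channel(path, channel, num_channels=8):
    """Return one channel (0-based) from an interleaved multi-channel
    16-bit PCM WAV file. Channel 0 corresponds to microphone channel 1
    (the red triangle); later channels proceed counter-clockwise."""
    with wave.open(path, "rb") as w:
        if w.getnchannels() != num_channels:
            raise ValueError("unexpected channel count: %d" % w.getnchannels())
        if w.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        frames = w.readframes(w.getnframes())
    # Samples are interleaved frame by frame: ch1, ch2, ..., ch8, ch1, ...
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return samples[channel::num_channels]
```

For beamforming or separation experiments one would normally load all eight channels at once; the slice step above simply de-interleaves one of them.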

Photos of the setup are shown in Figures 6 and 7.

Figure 6 - Recording setup (IMR)

Figure 7 - Recording setup (hemi-anechoic chamber)

Languages and dialects

UK English.

Narrative Description

The 2012_MMA corpus offers researchers an intermediate task between simple digit recognition and large vocabulary conversational speech recognition. It consists of sentences read from the Wall Street Journal (WSJ) taken from the test set of the WSJCAM0 database.

A total of 24 speakers, 12 male and 12 female, were recorded in two different scenarios:

The participants' recordings were split equally between the IMR and the hemi-anechoic chamber.

Two same-gender speakers were paired for recording of the overlapping speech.

The speakers were recorded using five different eight-channel microphone arrays while reading WSJ sentences from prompts. In the single-speaker scenario one speaker reads from a fixed position; in the overlapping scenario two speakers read from two fixed positions.

Twice twelve participants were recorded for the single-speaker scenario, and twice six pairs for the overlapping scenario. Each read about 90 sentences, which are available for speech separation and recognition experiments.




Single speaker: distant (automatic) speech recognition

Overlapping speakers: speaker localisation, speech separation and distant (automatic) speech recognition


Each speaker (pair) read WSJ sentences (WSJCAM0) from a script, i.e.:

Data set (name)                                         Number of sentences

TIMIT style, for adaptation                             approx. 17
5,000 word (closed vocabulary) sub-corpus of WSJCAM0    approx. 40
20,000 word (open vocabulary) sub-corpus of WSJCAM0     approx. 40

Each sentence is individually split from the recording for recognition and stored in folders following the structure:

... with

... and *.wav is defined as

... with

... and where <ref#> identifies the reference transcription in the MLF file stored in ./mlf/WSJ.mlf.
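WSJ.mlf follows the HTK master label file (MLF) convention: a `#!MLF!#` header, then for each utterance a quoted pattern line naming the label file, the labels one per line, and a terminating `.`. A minimal reader for that generic format might look as follows; how <ref#> maps onto the utterance names is an assumption here, since the exact naming scheme is not reproduced above:

```python
import os

def parse_mlf(path):
    """Parse an HTK master label file into a dict mapping utterance id
    (file name without extension) to its label sequence."""
    transcripts = {}
    with open(path) as f:
        if f.readline().strip() != "#!MLF!#":
            raise ValueError("not an MLF file")
        utt, labels = None, []
        for line in f:
            line = line.strip()
            if line.startswith('"'):      # e.g. "*/T10c0201.lab"
                utt = os.path.splitext(os.path.basename(line.strip('"')))[0]
                labels = []
            elif line == ".":             # end of the current entry
                transcripts[utt] = labels
            elif line:
                # label lines may carry start/end times before the label
                parts = line.split()
                labels.append(parts[2] if len(parts) >= 3 else parts[0])
    return transcripts
```

The returned word sequences can then serve directly as reference transcriptions when scoring recognition output.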

The speaker locations are stored in ./<corpus>/SentenceLocation/T<>.txt.


[1] M. Lincoln, I. McCowan, J. Vepa and H.K. Maganti, "The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments", IEEE Workshop on Automatic Speech Recognition and Understanding, 2005

[2] E. Zwyssig, S. Renals and M. Lincoln, "On the effect of SNR and superdirective beamforming in speaker diarisation in meetings", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012

[3] E. Zwyssig, F. Faubel, S. Renals and M. Lincoln, "Recognition of overlapping speech using digital MEMS microphone arrays", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013