2012_MMA

The 2012_MMA (multi-microphone array) corpus is a corpus of read speech (WSJ) recorded with multiple distant microphone arrays, enabling research in speaker localisation, (blind) speech separation and speech recognition.

Authors

Author	Affiliation
Erich Zwyssig	The University of Edinburgh

Data Type

The 2012_MMA corpus contains speech from single and overlapping speakers, following the ideas and setup of the MC-WSJ-AV corpus [1].

Data Source

The 2012_MMA corpus consists of audio recordings made at the premises of

The Centre for Speech Technology Research
University of Edinburgh
Informatics Forum
10 Crichton Street
Edinburgh
EH8 9AB
United Kingdom

Recordings were carried out using five microphone arrays, as shown in Figure 1.

Figure 1 - 2012_MMA corpus recording setup

The configurations of the five circular microphone arrays (diameter d, sampling rate Fs) are:

Analogue, d = 20 cm, Fs = 48 kHz, 8 channels
Analogue, d = 4 cm, Fs = 96 kHz, 8 channels
Digital, d = 20 cm, Fs = 48 kHz, 8 channels
Digital, d = 4 cm, Fs = 96 kHz, 8 channels
Digital, d = 4 cm, Fs = 48 kHz, 8 channels

With the analogue microphone (array) containing

(8 x) Sennheiser MKE 2-P-C microphones

and the digital microphone (arrays) being

digital MEMS microphone array version 2 (DMMA.2) [2]
digital MEMS microphone array version 3 (DMMA.3) [3]

built from

(8 x) ADI ADMP441 digital MEMS microphones
(1 x) USBPAL

The DMMA.2 and DMMA.3 are circular microphone arrays with diameters of 20 cm and 4 cm, respectively. Both are based on the USBPAL from Rigisystems and use eight (8) Analog Devices ADMP441 digital MEMS microphones. A detailed datasheet and sound samples for the DMMA.3 are available here (DMMA.3).

Recording setup

The recording setups in the instrumented meeting room (IMR) and the hemi-anechoic chamber are identical within the constraints of each room.

Figures 2, 3, 4 and 5 show the microphone array placement and adjustment as well as the dimensions of the setup with respect to the position of the speaker(s).

Figure 2 - Recording setup (IMR - one speaker)

Figure 3 - Recording setup (IMR - two speakers)

Figure 4 - Recording setup (hemi-anechoic chamber - one speaker)

Figure 5 - Recording setup (hemi-anechoic chamber - two speakers)

Note: The red triangle indicates microphone channel 1, channel numbers increase counter-clockwise (in the *.wav audio file provided).
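The channel numbering convention in the note above can be illustrated with a short Python sketch that computes nominal microphone coordinates for a circular eight-channel array. This is a minimal sketch under assumptions that are not stated in this description: the microphones are taken to be equally spaced on the circle, channel 1 is placed at angle 0 of a local coordinate frame, and the array is viewed from above so that counter-clockwise corresponds to increasing angle. The function name mic_positions is hypothetical.

```python
import math


def mic_positions(diameter_m, n_channels=8, first_channel_angle_rad=0.0):
    """Return nominal (x, y) positions in metres for channels 1..n_channels.

    Assumptions (not taken from the corpus description): microphones are
    equally spaced on the circle, and channel 1 sits at
    first_channel_angle_rad in a local frame viewed from above.
    Channel numbers increase counter-clockwise, i.e. with increasing angle.
    """
    radius = diameter_m / 2.0
    step = 2.0 * math.pi / n_channels
    positions = []
    for k in range(n_channels):
        angle = first_channel_angle_rad + k * step
        positions.append((radius * math.cos(angle), radius * math.sin(angle)))
    return positions


# Nominal geometries for the two array diameters used in the corpus.
positions_20cm = mic_positions(0.20)
positions_4cm = mic_positions(0.04)
```

The actual orientation of channel 1 relative to the speakers should be taken from Figures 2 to 5 and passed in as first_channel_angle_rad.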

Photos of the setup are shown in Figures 6 and 7.

Figure 6 - Recording setup (IMR)

Figure 7 - Recording setup (hemi-anechoic chamber)

Languages and dialects

UK English.

Narrative Description

The 2012_MMA corpus offers researchers an intermediate task between simple digit recognition and large vocabulary conversational speech recognition. It consists of sentences read from the Wall Street Journal (WSJ) taken from the test set of the WSJCAM0 database.

A total of 24 speakers, 12 male and 12 female, were recorded in two different scenarios:

single (stationary) speaker
two (stationary) overlapping speakers

The participants' recordings were split equally between an

instrumented meeting room (6 male/6 female)
hemi-anechoic chamber (6 male/6 female)

Two same-gender speakers were paired for recording of the overlapping speech.

The speakers were recorded using five different eight-channel microphone arrays while reading WSJ sentences from prompts. In the single-speaker scenario one speaker reads from a fixed position; in the overlapping scenario two speakers read from two fixed positions.

Twelve participants per room (24 in total) were recorded in the single-speaker scenario, and six pairs per room (12 in total) in the overlapping scenario. Each participant read about 90 sentences, which are available for speech separation and recognition experiments.

Task	Applications	Comments
Single speaker	Distant (automatic) speech recognition
Overlapping speakers	Speaker localisation, speech separation and distant (automatic) speech recognition

Each speaker (pair) read WSJ sentences (WSJCAM0) from script, i.e.

Data set (name)	Number of sentences	Description
adapt	Approx. 17	TIMIT style, for adaptation
5k	Approx. 40	5,000 word (closed vocabulary) sub corpus of WSJCAM0
20k	Approx. 40	20,000 word (open vocabulary) sub corpus of WSJCAM0

Each sentence is individually split from the recording for recognition and stored in folders following the structure

./<corpus>/audio/T<#1>[T<#2>]/<mic_type>/{adap|5k|20k}/*.wav

... with

<corpus> defining the corpus, i.e. WSJ, WSJ_anechoic, MSWSJ or MSWSJ_anechoic
T<#> defining the participant and his/her number #
<mic_type> defining the microphone type, i.e. analogue_20cm_48k, analogue_4cm_96k, digital_20cm_48k, digital_4cm_96k, digital_4cm_48k

... and *.wav is defined as

T<#1><ref#1>_T<#2><ref#2>-<smic_type>.wav

... with

<smic_type> defining the microphone type (short form), i.e. a_20_48, a_4_96, d_20_48, d_4_, d_4_48

... and where <ref#> identifies the reference transcription (the correct answer) in the MLF file stored in ./mlf/WSJ.mlf
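The folder and file naming scheme above can be sketched in Python as follows. This is an illustrative sketch, not a tool shipped with the corpus: the function names utterance_dir and split_filename are hypothetical, the pairing of long and short microphone names follows the order in which they are listed above, and the short form d_4_96 for digital_4cm_96k is assumed by analogy with a_4_96 (the list above gives it only as d_4_). The subset folder name adap is taken from the path pattern. The sketch also assumes that <ref#> contains neither "_" nor "-".

```python
from pathlib import PurePosixPath

CORPORA = ("WSJ", "WSJ_anechoic", "MSWSJ", "MSWSJ_anechoic")
SUBSETS = ("adap", "5k", "20k")

# Long folder name -> short form used in file names.
MIC_TYPES = {
    "analogue_20cm_48k": "a_20_48",
    "analogue_4cm_96k": "a_4_96",
    "digital_20cm_48k": "d_20_48",
    "digital_4cm_96k": "d_4_96",  # assumed; the description lists "d_4_"
    "digital_4cm_48k": "d_4_48",
}


def utterance_dir(corpus, participants, mic_type, subset):
    """Build <corpus>/audio/T<#1>[T<#2>]/<mic_type>/<subset>.

    participants is a list of one or two participant numbers, written
    exactly as they appear in the folder names.
    """
    if corpus not in CORPORA:
        raise ValueError(f"unknown corpus: {corpus}")
    if mic_type not in MIC_TYPES:
        raise ValueError(f"unknown microphone type: {mic_type}")
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset}")
    if len(participants) not in (1, 2):
        raise ValueError("expected one or two participants")
    speakers = "".join(f"T{n}" for n in participants)
    return PurePosixPath(corpus, "audio", speakers, mic_type, subset)


def split_filename(name):
    """Split a file name into its speaker tokens and short microphone type.

    Each speaker token is T<#><ref#>; it is returned unsplit because the
    format of <ref#> is not specified in this description.
    """
    stem = name[: -len(".wav")] if name.endswith(".wav") else name
    speakers, _, short_mic = stem.rpartition("-")
    return speakers.split("_"), short_mic


# Hypothetical participant numbers and reference ids, for illustration only.
example_dir = utterance_dir("MSWSJ", ["7", "8"], "digital_4cm_48k", "5k")
example_tokens, example_mic = split_filename("T7refA_T8refB-d_4_48.wav")
```

The returned speaker tokens still contain both the participant number and <ref#>; separating the two requires knowledge of the reference id format used in ./mlf/WSJ.mlf.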

The speaker locations are stored in ./<corpus>/SentenceLocation/T<>.txt

References

[1] M. Lincoln, I. McCowan, J. Vepa and H.K. Maganti, "The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments", IEEE Workshop on Automatic Speech Recognition and Understanding, 2005

[2] E. Zwyssig, S. Renals and M. Lincoln, "On the effect of SNR and superdirective beamforming in speaker diarisation in meetings", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012

[3] E. Zwyssig, F. Faubel, S. Renals and M. Lincoln, "Recognition of overlapping speech using digital MEMS microphone arrays", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013