[1]
A. Dielmann and S. Renals.
Recognition of dialogue acts in multiparty meetings using a switching
DBN.
IEEE Transactions on Audio, Speech, and Language Processing,
16(7):1303-1314, 2008.
[ bib | DOI | http | .pdf ]
This paper is concerned with the automatic recognition
of dialogue acts (DAs) in multiparty conversational
speech. We present a joint generative model for DA
recognition in which segmentation and classification of
DAs are carried out in parallel. Our approach to DA
recognition is based on a switching dynamic Bayesian
network (DBN) architecture. This generative approach
models a set of features, related to lexical content
and prosody, and incorporates a weighted interpolated
factored language model. The switching DBN coordinates
the recognition process by integrating the component
models. The factored language model, which is estimated
from multiple conversational data corpora, is used in
conjunction with additional task-specific language
models. In conjunction with this joint generative
model, we have also investigated the use of a
discriminative approach, based on conditional random
fields, to perform a reclassification of the segmented
DAs. We have carried out experiments on the AMI corpus
of multimodal meeting recordings, using both manually
transcribed speech and the output of an automatic
speech recognizer, with different configurations
of the generative model. Our results indicate that the
system performs well both on reference and fully
automatic transcriptions. A further significant
improvement in recognition accuracy is obtained by the
application of the discriminative reranking approach
based on conditional random fields.
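The weighted interpolated language model mentioned in this abstract can be sketched as a linear combination of component model probabilities. The function, component models, weights, and probabilities below are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch (not the paper's code): linear interpolation of
# several language-model estimates, as in a weighted interpolated
# factored language model. Component models and weights are invented.

def interpolate(prob_estimates, weights):
    """Combine per-model probabilities P_i(w | context) using fixed
    interpolation weights that sum to one."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * p for w, p in zip(weights, prob_estimates))

# e.g. a general conversational LM, a task-specific LM, and a factored
# (word + DA-class) model, with weights tuned on held-out data
p_interp = interpolate([0.012, 0.020, 0.016], [0.5, 0.3, 0.2])
```

In practice the interpolation weights would be estimated on held-out conversational data rather than fixed by hand.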
[2]
A. Dielmann and S. Renals.
DBN based joint dialogue act recognition of multiparty meetings.
In Proc. IEEE ICASSP, volume 4, pages 133-136, April 2007.
[ bib | .pdf ]
Joint Dialogue Act segmentation and classification of
the new AMI meeting corpus has been performed through
an integrated framework based on a switching dynamic
Bayesian network and a set of continuous features and
language models. The recognition process is based on a
dictionary of 15 DA classes tailored for group
decision-making. Experimental results show that a novel
interpolated factored language model yields a low
error rate on the automatic segmentation task, and thus
good recognition results can be achieved on AMI
multiparty conversational speech.
[3]
A. Dielmann and S. Renals.
Automatic dialogue act recognition using a dynamic Bayesian
network.
In S. Renals, S. Bengio, and J. Fiscus, editors, Proc.
Multimodal Interaction and Related Machine Learning Algorithms Workshop
(MLMI-06), pages 178-189. Springer, 2007.
[ bib | .pdf ]
We propose a joint segmentation and classification
approach for the dialogue act recognition task on
natural multi-party meetings (ICSI Meeting Corpus).
Five broad DA categories are automatically recognised
using a generative Dynamic Bayesian Network based
infrastructure. Prosodic features and a switching
graphical model are used to estimate DA boundaries, in
conjunction with a factored language model which is
used to relate words and DA categories. This easily
generalizable and extensible system promotes a rational
approach to the joint DA segmentation and recognition
task, and is capable of good recognition performance.
[4]
A. Dielmann and S. Renals.
Automatic meeting segmentation using dynamic Bayesian networks.
IEEE Transactions on Multimedia, 9(1):25-36, 2007.
[ bib | DOI | http | .pdf ]
Multiparty meetings are a ubiquitous feature of
organizations, and there are considerable economic
benefits that would arise from their automatic analysis
and structuring. In this paper, we are concerned with
the segmentation and structuring of meetings (recorded
using multiple cameras and microphones) into sequences
of group meeting actions such as monologue, discussion
and presentation. We outline four families of
multimodal features based on speaker turns, lexical
transcription, prosody, and visual motion that are
extracted from the raw audio and video recordings. We
relate these low-level features to more complex group
behaviors using a multistream modelling framework based
on multistream dynamic Bayesian networks (DBNs). This
yields an effective approach to the segmentation
problem, with an action error rate of 12.2%,
compared with 43% for an approach based on hidden
Markov models. Moreover, the multistream DBN developed
here leaves scope for many further improvements and
extensions.
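The multistream combination described in this abstract can be sketched as a weighted sum of per-stream log-likelihoods. The stream weights and scores below are toy assumptions for illustration, not values from the paper:

```python
# Sketch of the multistream idea: each feature stream (speaker turns,
# lexical, prosody, visual motion) has its own observation model, and
# per-state scores are combined by weighting the streams'
# log-likelihoods. Weights and numbers here are illustrative only.

def combined_log_likelihood(stream_loglikes, stream_weights):
    """log P(obs | state) approximated as sum_s w_s * log P_s(obs_s | state)."""
    return sum(w * ll for w, ll in zip(stream_weights, stream_loglikes))

# four streams with toy per-state log-likelihoods
score = combined_log_likelihood([-2.1, -3.4, -1.7, -2.9],
                                [0.4, 0.3, 0.2, 0.1])
```

The stream weights let less reliable modalities (e.g. noisy visual motion) contribute less to the per-state score than the dominant audio streams.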
[5]
M. Al-Hames, A. Dielmann, D. Gatica-Perez, S. Reiter, S. Renals, G. Rigoll, and
D. Zhang.
Multimodal integration for meeting group action segmentation and
recognition.
In S. Renals and S. Bengio, editors, Proc. Multimodal
Interaction and Related Machine Learning Algorithms Workshop (MLMI-05),
pages 52-63. Springer, 2006.
[ bib ]
We address the problem of segmentation and recognition
of sequences of multimodal human interactions in
meetings. These interactions can be seen as a rough
structure of a meeting, and can be used either as input
for a meeting browser or as a first step towards a
higher semantic analysis of the meeting. A common
lexicon of multimodal group meeting actions, a shared
meeting data set, and a common evaluation procedure
enable us to compare the different approaches. We
compare three different multimodal feature sets and our
modelling infrastructures: a higher semantic feature
approach, multi-layer HMMs, a multistream DBN, and
a multistream mixed-state DBN for disturbed data.
[6]
A. Dielmann and S. Renals.
Multistream dynamic Bayesian network for meeting segmentation.
In S. Bengio and H. Bourlard, editors, Proc. Multimodal
Interaction and Related Machine Learning Algorithms Workshop (MLMI-04),
pages 76-86. Springer, 2005.
[ bib | .ps.gz | .pdf ]
This paper investigates the automatic analysis and
segmentation of meetings. A meeting is analysed in
terms of individual behaviours and group interactions,
in order to decompose each meeting into a sequence of
relevant phases, named meeting actions. Three feature
families are extracted from multimodal recordings:
prosody from individual lapel microphone signals,
speaker activity from microphone array data and lexical
features from textual transcripts. A statistical
approach is then used to relate low-level features with
a set of abstract categories. In order to provide a
flexible and powerful framework, we have employed a
dynamic Bayesian network based model, characterized by
multiple stream processing and flexible state duration
modelling. Experimental results demonstrate the
strength of this system, providing a meeting action
error rate of 9%.
[7]
A. Dielmann and S. Renals.
Dynamic Bayesian networks for meeting structuring.
In Proc. IEEE ICASSP, 2004.
[ bib | .ps.gz | .pdf ]
This paper is about the automatic structuring of
multiparty meetings using audio information. We have
used a corpus of 53 meetings, recorded using a
microphone array and lapel microphones for each
participant. The task was to segment meetings into a
sequence of meeting actions, or phases. We have adopted
a statistical approach using dynamic Bayesian networks
(DBNs). Two DBN architectures were investigated: a
two-level hidden Markov model (HMM) in which the
acoustic observations were concatenated; and a
multistream DBN in which two separate observation
sequences were modelled. We have also explored the
use of counter variables to constrain the
number of action transitions. Experimental results
indicate that the DBN architectures are an improvement
over a simple baseline HMM, with the multistream DBN
with counter constraints producing an action error rate
of 6%.
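The counter-variable constraint can be illustrated with a small Viterbi search that carries an extra counter dimension capping the number of action changes. This is a simplified sketch under assumed toy models, not the paper's DBN implementation:

```python
# Illustrative sketch (assumptions, not the paper's code): Viterbi
# decoding over meeting actions with a counter variable that limits
# how many action transitions the decoded sequence may contain.

def viterbi_with_counter(log_obs, log_trans, max_transitions):
    """log_obs[t][s]: log-likelihood of frame t under action s;
    log_trans[s][s2]: action-transition log-probability;
    max_transitions: cap on the number of action changes.
    Returns the best path score respecting the cap."""
    T, S = len(log_obs), len(log_obs[0])
    NEG = float("-inf")
    # delta[s][c]: best score ending in action s after c transitions
    delta = [[NEG] * (max_transitions + 1) for _ in range(S)]
    for s in range(S):
        delta[s][0] = log_obs[0][s]
    for t in range(1, T):
        new = [[NEG] * (max_transitions + 1) for _ in range(S)]
        for s in range(S):
            for c in range(max_transitions + 1):
                if delta[s][c] == NEG:
                    continue
                # stay in the same action: counter unchanged
                new[s][c] = max(new[s][c], delta[s][c] + log_obs[t][s])
                # switch action: counter increments, capped
                if c + 1 <= max_transitions:
                    for s2 in range(S):
                        if s2 != s:
                            cand = (delta[s][c] + log_trans[s][s2]
                                    + log_obs[t][s2])
                            new[s2][c + 1] = max(new[s2][c + 1], cand)
        delta = new
    return max(delta[s][c]
               for s in range(S) for c in range(max_transitions + 1))
```

Tightening `max_transitions` prunes hypotheses that oscillate between actions, which is the effect the counter variables have inside the DBN.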
[8]
A. Dielmann and S. Renals.
Multi-stream segmentation of meetings.
In Proc. IEEE Workshop on Multimedia Signal Processing, 2004.
[ bib | .ps.gz | .pdf ]
This paper investigates the automatic segmentation of
meetings into a sequence of group actions or phases.
Our work is based on a corpus of multiparty meetings
collected in a meeting room instrumented with video
cameras, lapel microphones and a microphone array. We
have extracted a set of feature streams from the
audio data, based on speaker turns, prosody, and a
transcript of what was spoken. We have
related these signals to the higher level semantic
categories via a multistream statistical model based on
dynamic Bayesian networks (DBNs). We report on a set of
experiments in which different DBN architectures are
compared, together with the different feature streams.
The resultant system has an action error rate of 9%.