The Centre for Speech Technology Research, The University of Edinburgh

Publications by Catherine Lai

[1] Leimin Tian, Johanna Moore, and Catherine Lai. Recognizing Emotions in Spoken Dialogue with Acoustic and Lexical Cues. In ICMI 2017 Satellite Workshop Investigating Social Interactions with Artificial Agents, November 2017. [ bib | .pdf ]
Emotions play a vital role in human communication. Therefore, it is desirable for virtual agent dialogue systems to recognize and react to users' emotions. However, current automatic emotion recognizers have limited performance compared to humans. Our work attempts to improve the performance of recognizing emotions in spoken dialogue by identifying dialogue cues predictive of emotions, and by building multimodal recognition models with a knowledge-inspired hierarchy. We conduct experiments on both spontaneous and acted dialogue data to study the efficacy of the proposed approaches. Our results show that including prior knowledge on emotions in dialogue in either the feature representation or the model structure is beneficial for automatic emotion recognition.

[2] Peter Bell, Joachim Fainberg, Catherine Lai, and Mark Sinclair. A system for real-time collaborative transcription correction. In Proc. Interspeech (demo session), August 2017. [ bib | .pdf ]
We present a system to enable efficient, collaborative human correction of ASR transcripts, designed to operate in real-time situations, for example, when post-editing live captions generated for news broadcasts. In the system, confusion networks derived from ASR lattices are used to highlight low-confidence words and present alternatives to the user for quick correction. The system uses a client-server architecture, whereby information about each manual edit is posted to the server. Such information can be used to dynamically update the one-best ASR output for all utterances currently in the editing pipeline. We propose to make updates in three different ways: by finding a new one-best path through an existing ASR lattice consistent with the correction received; by identifying further instances of out-of-vocabulary terms entered by the user; and by adapting the language model on the fly. Updates are received asynchronously by the client.
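
The abstract describes the correction workflow built around confusion networks. As a rough illustration only (none of these names or thresholds come from the paper's system), the following Python sketch flags low-confidence slots in a toy confusion network and collects the alternatives a human corrector could be shown:

    # Minimal sketch, not the authors' code: a confusion network is modelled as a
    # list of slots, each slot a list of (word, posterior) pairs sorted by posterior.
    CONFIDENCE_THRESHOLD = 0.8  # assumed value, purely for illustration

    def flag_low_confidence(confusion_network, threshold=CONFIDENCE_THRESHOLD):
        """Return the current one-best words and, for slots whose top posterior
        falls below the threshold, the ranked alternatives to offer for correction."""
        one_best, flags = [], []
        for i, slot in enumerate(confusion_network):
            best_word, best_posterior = slot[0]
            one_best.append(best_word)
            if best_posterior < threshold:
                flags.append((i, best_word, [w for w, _ in slot[1:]]))
        return one_best, flags

    if __name__ == "__main__":
        # Toy network: the second slot is uncertain between near-homophones.
        cn = [
            [("the", 0.98), ("a", 0.02)],
            [("whether", 0.55), ("weather", 0.40), ("wetter", 0.05)],
            [("report", 0.97), ("rapport", 0.03)],
        ]
        hypothesis, to_review = flag_low_confidence(cn)
        print(" ".join(hypothesis))
        for slot_index, word, alternatives in to_review:
            print("slot %d: '%s' is uncertain; alternatives: %s" % (slot_index, word, alternatives))

In the system described above, an accepted correction would then be posted to the server so that the one-best output of other utterances in the editing pipeline can be updated.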

[3] Peter Bell, Joachim Fainberg, Catherine Lai, and Mark Sinclair. A system for real time collaborative transcription correction. In Proceedings of Interspeech 2017, pages 817-818, 2017. [ bib | .pdf ]
We present a system to enable efficient, collaborative human correction of ASR transcripts, designed to operate in real-time situations, for example, when post-editing live captions generated for news broadcasts. In the system, confusion networks derived from ASR lattices are used to highlight low-confidence words and present alternatives to the user for quick correction. The system uses a client-server architecture, whereby information about each manual edit is posted to the server. Such information can be used to dynamically update the one-best ASR output for all utterances currently in the editing pipeline. We propose to make updates in three different ways: by finding a new one-best path through an existing ASR lattice consistent with the correction received; by identifying further instances of out-of-vocabulary terms entered by the user; and by adapting the language model on the fly. Updates are received asynchronously by the client.

[4] Leimin Tian, Michal Muszynski, Catherine Lai, Johanna Moore, Theodoros Kostoulas, Patrizia Lombardo, Thierry Pun, and Guillaume Chanel. Recognizing Induced Emotions of Movie Audiences: Are Induced and Perceived Emotions the Same? In Seventh International Conference on Affective Computing and Intelligent Interaction (ACII 2017), 2017. [ bib | .pdf ]
Predicting the emotional response of movie audiences to affective movie content is a challenging task in affective computing. Previous work has focused on using audiovisual movie content to predict movie induced emotions. However, the relationship between the audience’s perceptions of the affective movie content (perceived emotions) and the emotions evoked in the audience (induced emotions) remains unexplored. In this work, we address the relationship between perceived and induced emotions in movies, and identify features and modelling approaches effective for predicting movie induced emotions. First, we extend the LIRIS-ACCEDE database by annotating perceived emotions in a crowd-sourced manner, and find that perceived and induced emotions are not always consistent. Second, we show that dialogue events and aesthetic highlights are effective predictors of movie induced emotions. In addition to movie based features, we also study physiological and behavioural measurements of audiences. Our experiments show that induced emotion recognition can benefit from including temporal context and from including multimodal information. Our study bridges the gap between affective content analysis and induced emotion prediction.

[5] Janine Kleinhans, Mireia Farrús, Agustín Gravano, Juan Manuel Pérez, Catherine Lai, and Leo Wanner. Using prosody to classify discourse relations. In Proceedings of Interspeech 2017, pages 3201-3205, 2017. [ bib | .pdf ]
This work aims to explore the correlation between the discourse structure of a spoken monologue and its prosody by predicting discourse relations from different prosodic attributes. For this purpose, a corpus of semi-spontaneous monologues in English has been automatically annotated according to Rhetorical Structure Theory, which models coherence in text via rhetorical relations. From the corresponding audio files, prosodic features such as pitch, intensity, and speech rate have been extracted from different contexts of a relation. Supervised classification tasks using Support Vector Machines have been performed to find relationships between prosodic features and rhetorical relations. Preliminary results show that intensity combined with other features extracted from intra- and intersegmental environments is the feature with the highest predictability for a discourse relation. The prediction of rhetorical relations from prosodic features and their combinations is straightforwardly applicable to several tasks such as speech understanding or generation. Moreover, the knowledge of how rhetorical relations should be marked in terms of prosody will serve as a basis to improve speech synthesis applications and make voices sound more natural and expressive.
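
As a sketch of the kind of supervised classification experiment described above, the Python snippet below trains a scikit-learn SVM over utterance-level prosodic features. The feature layout, labels and data are invented placeholders, not the paper's corpus or feature set:

    # Sketch only: SVM classification of discourse relations from prosodic features.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    # Assumed feature layout per relation instance:
    # [mean F0, F0 range, mean intensity, intensity range, speech rate, pause duration]
    X = rng.normal(size=(200, 6))
    # Toy binary task; the labels stand in for rhetorical relation classes.
    y = rng.integers(0, 2, size=200)

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    scores = cross_val_score(clf, X, y, cv=5)
    print("cross-validated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))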

[6] Mireia Farrús, Catherine Lai, and Johanna D. Moore. Paragraph-based prosodic cues for speech synthesis applications. In Proceedings of Speech Prosody 2016, pages 1143-1147, Boston, MA, USA, 2016. [ bib | DOI | .pdf ]
Speech synthesis has improved in both expressiveness and voice quality in recent years. However, obtaining full expressiveness when dealing with large multi-sentential synthesized discourse is still a challenge, since speech synthesizers do not take into account the prosodic differences that have been observed in discourse units such as paragraphs. The current study validates and extends previous work by analyzing the prosody of paragraph units in a large and diverse corpus of TED Talks using automatically extracted F0, intensity and timing features. In addition, a series of classification experiments was performed in order to identify which features are consistently used to distinguish paragraph breaks. The results show significant differences in prosody related to paragraph position. Moreover, the classification experiments show that boundary features such as pause duration and differences in F0 and intensity levels are the most consistent cues in marking paragraph boundaries. This suggests that these features should be taken into account when generating spoken discourse in order to improve naturalness and expressiveness.

[7] Catherine Lai, Mireia Farrús, and Johanna Moore. Automatic Paragraph Segmentation with Lexical and Prosodic Features. In Proceedings of Interspeech 2016, San Francisco, CA, USA, 2016. [ bib | .pdf ]
As long-form spoken documents become more ubiquitous in everyday life, so does the need for automatic discourse segmentation in spoken language processing tasks. Although previous work has focused on broad topic segmentation, detection of finer-grained discourse units, such as paragraphs, is highly desirable for presenting and analyzing spoken content. To better understand how different aspects of speech cue these subtle discourse transitions, we investigate automatic paragraph segmentation of TED talks. We build lexical and prosodic paragraph segmenters using Support Vector Machines, AdaBoost, and Long Short Term Memory (LSTM) recurrent neural networks. In general, we find that induced cue words and supra-sentential prosodic features outperform features based on topical coherence, syntactic form and complexity. However, our best performance is achieved by combining a wide range of individually weak lexical and prosodic features, with the sequence modelling LSTM generally outperforming the other classifiers by a large margin. Moreover, we find that models that allow lower level interactions between different feature types produce better results than treating lexical and prosodic contributions as separate, independent information sources.
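
A minimal sketch of the LSTM variant, assuming one feature vector of combined lexical and prosodic cues per sentence and a binary paragraph-initial label, might look as follows. All dimensions, layer sizes and data are invented; this is not the paper's implementation:

    # Illustrative sequence labeller: mark each sentence as paragraph-initial or not.
    import numpy as np
    import tensorflow as tf

    NUM_TALKS, MAX_SENTS, FEAT_DIM = 32, 50, 40   # assumed sizes for the toy example

    # X: per-sentence features (e.g. cue-word indicators, preceding pause length,
    # F0/intensity resets); y: 1 if the sentence starts a new paragraph.
    X = np.random.randn(NUM_TALKS, MAX_SENTS, FEAT_DIM).astype("float32")
    y = (np.random.rand(NUM_TALKS, MAX_SENTS, 1) < 0.2).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(MAX_SENTS, FEAT_DIM)),
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1, activation="sigmoid")),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    model.fit(X, y, epochs=2, batch_size=8, verbose=0)
    print(model.evaluate(X, y, verbose=0))   # [loss, AUC] on the toy data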

[8] Leimin Tian, Johanna Moore, and Catherine Lai. Recognizing emotions in spoken dialogue with hierarchically fused acoustic and lexical features. In Spoken Language Technology Workshop (SLT), 2016 IEEE, pages 565-572. IEEE, 2016. [ bib | .pdf ]
Automatic emotion recognition is vital for building natural and engaging human-computer interaction systems. Combining information from multiple modalities typically improves emotion recognition performance. In previous work, features from different modalities have generally been fused at the same level with two types of fusion strategies: Feature-Level fusion, which concatenates feature sets before recognition; and Decision-Level fusion, which makes the final decision based on outputs of the unimodal models. However, different features may describe data at different time scales or have different levels of abstraction. Cognitive Science research also indicates that when perceiving emotions, humans use information from different modalities at different cognitive levels and time steps. Therefore, we propose a Hierarchical fusion strategy for multimodal emotion recognition, which incorporates global or more abstract features at higher levels of its knowledge-inspired structure. We build multimodal emotion recognition models combining state-of-the-art acoustic and lexical features to study the performance of the proposed Hierarchical fusion. Experiments on two emotion databases of spoken dialogue show that this fusion strategy consistently outperforms both Feature-Level and Decision-Level fusion. The multimodal emotion recognition models using the Hierarchical fusion strategy achieved state-of-the-art performance on recognizing emotions in both spontaneous and acted dialogue.
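
A minimal sketch of the hierarchical-fusion idea, assuming the more abstract lexical features are injected above an acoustic encoding rather than concatenated at the input, is given below. Layer sizes, feature dimensions and the optimiser are assumptions for illustration, not the paper's configuration:

    # Hierarchical fusion sketch: acoustic features are encoded first, and the
    # lexical features are fused in at a higher level of the network.
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    ACOUSTIC_DIM, LEXICAL_DIM, NUM_CLASSES = 88, 30, 4   # illustrative sizes

    acoustic_in = layers.Input(shape=(ACOUSTIC_DIM,), name="acoustic")
    lexical_in = layers.Input(shape=(LEXICAL_DIM,), name="lexical")

    # Level 1: encode the low-level acoustic features on their own.
    h_acoustic = layers.Dense(64, activation="relu")(acoustic_in)

    # Level 2: fuse the acoustic encoding with the more abstract lexical features.
    h_fused = layers.Dense(32, activation="relu")(
        layers.Concatenate()([h_acoustic, lexical_in]))

    output = layers.Dense(NUM_CLASSES, activation="softmax")(h_fused)
    model = Model(inputs=[acoustic_in, lexical_in], outputs=output)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # Toy forward pass with random inputs, just to show the expected shapes.
    a = np.random.randn(8, ACOUSTIC_DIM).astype("float32")
    l = np.random.randn(8, LEXICAL_DIM).astype("float32")
    print(model.predict([a, l], verbose=0).shape)   # (8, NUM_CLASSES)

By contrast, Feature-Level fusion would concatenate the two feature vectors before the first layer, and Decision-Level fusion would combine the outputs of separately trained unimodal models.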

[9] Peter Bell, Catherine Lai, Clare Llewellyn, Alexandra Birch, and Mark Sinclair. A system for automatic broadcast news summarisation, geolocation and translation. In Proc. Interspeech (demo session), Dresden, Germany, September 2015. [ bib | .pdf ]
An increasing amount of news content is produced in audio-video form every day. To effectively analyse and monitor this multilingual data stream, we require methods to extract and present audio content in accessible ways. In this paper, we describe an end-to-end system for processing and browsing audio news data. This fully automated system brings together our recent research on audio scene analysis, speech recognition, summarisation, named entity detection, geolocation, and machine translation. The graphical interface allows users to visualise the distribution of news content by entity names and story location. Browsing of news events is facilitated through extractive summaries and the ability to view transcripts in multiple languages.

[10] Alessandra Cervone, Catherine Lai, Silvia Pareti, and Peter Bell. Towards automatic detection of reported speech in dialogue using prosodic cues. In Proc. Interspeech, Dresden, Germany, September 2015. [ bib | .pdf ]
The phenomenon of reported speech - whereby we quote the words, thoughts and opinions of others, or recount past dialogue - is widespread in conversational speech. Detecting such quotations automatically has numerous applications: for example, in enhancing automatic transcription or spoken language understanding applications. However, the task is challenging, not least because lexical cues of quotations are frequently ambiguous or not present in spoken language. The aim of this paper is to identify potential prosodic cues of reported speech which could be used, along with the lexical ones, to automatically detect quotations and ascribe them to their rightful source, that is, reconstructing their Attribution Relations. In order to do so, we analyze SARC, a small corpus of telephone conversations that we have annotated with Attribution Relations. The results of the statistical analysis performed on the data show how variations in pitch, intensity, and timing features can be exploited as cues of quotations. Furthermore, we build an SVM classifier integrating lexical and prosodic cues that detects quotations in speech significantly better than chance.

[11] Leimin Tian, Catherine Lai, and Johanna D. Moore. Recognizing emotions in dialogue with disfluencies and non-verbal vocalisations. In Proceedings of the 4th Interdisciplinary Workshop on Laughter and Other Non-verbal Vocalisations in Speech, volume 14, page 15, 2015. [ bib | .pdf ]
We investigate the usefulness of DISfluencies and Non-verbal Vocalisations (DIS-NV) for recognizing human emotions in dialogues. The proposed features measure filled pauses, fillers, stutters, laughter, and breath in utterances. The predictiveness of DIS-NV features is compared with lexical features and state-of-the-art low-level acoustic features. Our experimental results show that using DIS-NV features alone is not as predictive as using lexical or acoustic features. However, adding them to the lexical or acoustic feature set yields an improvement compared to using lexical or acoustic features alone. This indicates that disfluencies and non-verbal vocalisations provide useful information overlooked by the other two types of features for emotion recognition.
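
As a rough illustration of the feature type (not the paper's exact definitions), the sketch below computes the proportion of DIS-NV tokens per utterance for the five categories named in the abstract. The token markers are invented placeholders, since real corpora use their own annotation conventions:

    # Sketch: proportion of filled pauses, fillers, stutters, laughter and
    # breath tokens in an annotated utterance.
    DISNV_MARKERS = {
        "filled_pause": {"uh", "um", "er"},
        "filler": {"like", "you_know", "i_mean"},
        "stutter": {"<stutter>"},
        "laughter": {"<laugh>"},
        "breath": {"<breath>"},
    }

    def disnv_features(tokens):
        """Map each DIS-NV category to its proportion of the utterance's tokens."""
        n = max(len(tokens), 1)
        return {
            category: sum(token.lower() in markers for token in tokens) / n
            for category, markers in DISNV_MARKERS.items()
        }

    if __name__ == "__main__":
        utterance = "um i i <stutter> think it was <laugh> you_know kind of strange".split()
        print(disnv_features(utterance))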

[12] Leimin Tian, Johanna D. Moore, and Catherine Lai. Emotion Recognition in Spontaneous and Acted Dialogues. In Proceedings of ACII 2015, Xi'an, China, 2015. [ bib | .pdf ]
In this work, we compare emotion recognition on two types of speech: spontaneous and acted dialogues. Experiments were conducted on the AVEC 2012 database of spontaneous dialogues and the IEMOCAP database of acted dialogues. We studied the performance of two types of acoustic features for emotion recognition: knowledge-inspired disfluency and non-verbal vocalisation (DIS-NV) features, and statistical Low-Level Descriptor (LLD) based features. Both Support Vector Machines (SVM) and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN) were built using each feature set on each emotional database. Our work aims to identify aspects of the data that constrain the effectiveness of models and features. Our results show that the performance of different types of features and models is influenced by the type of dialogue and the amount of training data. Because DIS-NVs are less frequent in acted dialogues than in spontaneous dialogues, the DIS-NV features perform better than the LLD features when recognizing emotions in spontaneous dialogues, but not in acted dialogues. The LSTM-RNN model gives better performance than the SVM model when there is enough training data, but the complex structure of an LSTM-RNN model may limit its performance when there is less training data available, and may also risk over-fitting. Additionally, we find that long-distance contexts may be more useful when performing emotion recognition at the word level than at the utterance level.

[13] Johanna D. Moore, Leimin Tian, and Catherine Lai. Word-level emotion recognition using high-level features. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 8404 of Lecture Notes in Computer Science, pages 17-31. Springer Berlin Heidelberg, 2014. [ bib | DOI | .pdf ]
In this paper, we investigate the use of high-level features for recognizing human emotions at the word level in natural conversations with virtual agents. Experiments were carried out on the 2012 Audio/Visual Emotion Challenge (AVEC2012) database, where emotions are defined as vectors in the Arousal-Expectancy-Power-Valence emotional space. Our model using 6 novel disfluency features yields significant improvements compared to those using a large number of low-level spectral and prosodic features, and the overall performance difference between it and the best model of the AVEC2012 Word-Level Sub-Challenge is not significant. Our visual model using the Active Shape Model visual features also yields significant improvements compared to models using the low-level Local Binary Patterns visual features. We built a bimodal model by combining our disfluency and visual feature sets and applying Correlation-based Feature-subset Selection. Considering overall performance on all emotion dimensions, our bimodal model outperforms the second-best model of the challenge, and comes close to the best model. It also gives the best result when predicting Expectancy values.

[14] Catherine Lai. Interpreting final rises: Task and role factors. In Proceedings of Speech Prosody 7, Dublin, Ireland, 2014. [ bib | .pdf ]
This paper examines the distribution of utterance final pitch rises in dialogues with different task structures. More specifically, we examine map-task and topical conversation dialogues of Southern Standard British English speakers in the IViE corpus. Overall, we find that the map-task dialogues contain more rising features, where these mainly arise from instructions and affirmatives. While rise features were somewhat predictive of turn-changes, these effects were swamped by task and role effects. Final rises were not predictive of affirmative responses. These findings indicate that while rises can be interpreted as indicating some sort of contingency, it is with respect to the higher level discourse structure rather than the specific utterance bearing the rise. We explore the relationship between rises and the need for co-ordination in dialogue, and hypothesize that the more speakers have to co-ordinate in a dialogue, the more rising features we will see on non-question utterances. In general, these sorts of contextual conditions need to be taken into account when we collect and analyze intonational data, and when we link them to speaker states such as uncertainty or submissiveness.

[15] Catherine Lai and Steve Renals. Incorporating lexical and prosodic information at different levels for meeting summarization. In Proc. Interspeech 2014, 2014. [ bib | .pdf ]
This paper investigates how prosodic features can be used to augment lexical features for meeting summarization. Automatic detection of summary-worthy content using non-lexical features, like prosody, has generally focused on features calculated over dialogue acts. However, a salient role of prosody is to distinguish important words within utterances. To examine whether including more fine-grained prosodic information can help extractive summarization, we perform experiments incorporating lexical and prosodic features at different levels. For ICSI and AMI meeting corpora, we find that combining prosodic and lexical features at a lower level has better AUROC performance than adding in prosodic features derived over dialogue acts. ROUGE F-scores also show the same pattern for the ICSI data. However, the differences are less clear for the AMI data, where the range of scores is much more compressed. In order to understand the relationship between the generated summaries and differences in standard measures, we look at the distribution of extracted content over the meeting, as well as summary redundancy. We find that summaries based on dialogue-act-level prosody better reflect the amount of human-annotated summary content in meeting segments, while summaries derived from prosodically augmented lexical features exhibit less redundancy.
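
The experimental setup can be pictured with a small sketch: train an extractive classifier on dialogue-act instances with lexical features plus prosody derived either over whole dialogue acts or at a finer level, and compare AUROC. Everything below is a random placeholder, intended only to show the shape of such a comparison, not to reproduce the paper's features or results:

    # Sketch of comparing feature-combination levels for extractive summarisation.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    N_DA, LEX_DIM, PROS_DIM = 500, 20, 6

    lexical = rng.normal(size=(N_DA, LEX_DIM))
    da_prosody = rng.normal(size=(N_DA, PROS_DIM))    # prosody over whole dialogue acts
    word_prosody = rng.normal(size=(N_DA, PROS_DIM))  # word-level prosody aggregated per DA
    labels = rng.integers(0, 2, size=N_DA)            # 1 = dialogue act is summary-worthy

    def auroc(features):
        X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

    print("DA-level combination:   %.3f" % auroc(np.hstack([lexical, da_prosody])))
    print("word-level combination: %.3f" % auroc(np.hstack([lexical, word_prosody])))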

[16] Catherine Lai, Jean Carletta, and Steve Renals. Detecting summarization hot spots in meetings using group level involvement and turn-taking features. In Proc. Interspeech 2013, Lyon, France, 2013. [ bib | .pdf ]
In this paper we investigate how participant involvement and turn-taking features relate to extractive summarization of meeting dialogues. In particular, we examine whether automatically derived measures of group-level involvement, such as participation equality and turn-taking freedom, can help detect where summarization-relevant meeting segments will be. Results show that classification using turn-taking features performed better than the majority class baseline for data from both AMI and ICSI meeting corpora in identifying whether meeting segments contain extractive summary dialogue acts. The feature-based approach also provided better recall than using manual ICSI involvement hot spot annotations. Turn-taking features were additionally found to be predictive of the amount of extractive summary content in a segment. In general, we find that summary content decreases with higher participation equality and overlap, while it increases with the number of very short utterances. Differences in results between the AMI and ICSI data sets suggest how group participatory structure can be used to understand what makes meetings easy or difficult to summarize.
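
The group-level features named above are straightforward to compute from turn timings. The sketch below uses plausible but assumed definitions (normalised entropy of speaking time for participation equality, the fraction of the segment spent in overlapping speech, and a count of very short utterances); it is illustrative only and may not match the paper's exact formulations:

    # Group-level turn-taking features from (speaker, start_sec, end_sec) turns.
    import math
    from collections import defaultdict

    def turn_taking_features(turns, short_utt_max=1.0):
        speaking_time = defaultdict(float)
        very_short = 0
        boundaries = set()
        for speaker, start, end in turns:
            speaking_time[speaker] += end - start
            very_short += (end - start) <= short_utt_max
            boundaries.update((start, end))

        # Participation equality: normalised entropy of per-speaker speaking time
        # (1.0 = perfectly equal participation, 0.0 = one speaker holds the floor).
        total = sum(speaking_time.values())
        probs = [t / total for t in speaking_time.values() if t > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        equality = entropy / math.log(len(probs)) if len(probs) > 1 else 0.0

        # Overlap: fraction of the segment's span where more than one turn is active.
        points = sorted(boundaries)
        overlap_time = sum(
            right - left
            for left, right in zip(points, points[1:])
            if sum(1 for _, s, e in turns if s <= left and e >= right) > 1
        )
        overlap = overlap_time / (points[-1] - points[0]) if len(points) > 1 else 0.0

        return {"equality": equality, "overlap": overlap, "very_short_utts": very_short}

    if __name__ == "__main__":
        segment = [("A", 0.0, 4.0), ("B", 3.5, 6.0), ("C", 6.2, 6.8), ("A", 7.0, 9.0)]
        print(turn_taking_features(segment))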

[17] Catherine Lai, Jean Carletta, and Steve Renals. Modelling participant affect in meetings with turn-taking features. In Proceedings of WASSS 2013, Grenoble, France, 2013. [ bib | .pdf ]
This paper explores the relationship between turn-taking and meeting affect. To investigate this, we model post-meeting ratings of satisfaction, cohesion and leadership from participants of AMI corpus meetings using group and individual turn-taking features. The results indicate that participants gave higher satisfaction and cohesiveness ratings to meetings with greater group turn-taking freedom and individual very short utterance rates, while lower ratings were associated with more silence and speaker overlap. Besides broad applicability to satisfaction ratings, turn-taking freedom was found to be a better predictor than equality of speaking time when considering whether participants felt that everyone had a chance to contribute. If we include dialogue act information, we see that substantive feedback-type turns such as assessments are more predictive of meeting affect than information-giving acts or backchannels. This work highlights the importance of feedback turns and modelling group-level activity in multiparty dialogue for understanding the social aspects of speech.

[18] Catherine Lai, Keelan Evanini, and Klaus Zechner. Applying rhythm metrics to non-native spontaneous speech. In Proceedings of SLaTE 2013, Grenoble, France, 2013. [ bib | .pdf ]
This study investigates a variety of rhythm metrics on two corpora of non-native spontaneous speech and compares the non-native distributions to values from a corpus of native speech. Several of the metrics are shown to differentiate well between native and non-native speakers and to also have moderate correlations with English proficiency scores that were assigned to the non-native speech. The metric that had the highest correlation with English proficiency scores (apart from speaking rate) was rPVIsyl (the raw Pairwise Variability Index for syllables), with r = 0.43.
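
The raw Pairwise Variability Index mentioned above has a simple closed form: the mean absolute difference between successive interval durations. A short worked example follows, with invented syllable durations rather than data from the corpora used in the paper:

    # rPVI = mean of |d_k - d_(k+1)| over successive syllable durations d_k.
    def raw_pvi(durations):
        """Raw Pairwise Variability Index over a sequence of interval durations."""
        if len(durations) < 2:
            raise ValueError("need at least two intervals")
        diffs = [abs(a - b) for a, b in zip(durations, durations[1:])]
        return sum(diffs) / len(diffs)

    if __name__ == "__main__":
        syllable_durations = [0.18, 0.24, 0.12, 0.30, 0.21, 0.15]   # toy values, in seconds
        print("rPVIsyl = %.3f s" % raw_pvi(syllable_durations))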