The Centre for Speech Technology Research, The University of Edinburgh

Publications by Sasha Calhoun

[1] Ani Nenkova, Jason Brenier, Anubha Kothari, Sasha Calhoun, Laura Whitton, David Beaver, and Dan Jurafsky. To memorize or to predict: Prominence labeling in conversational speech. In NAACL Human Language Technology Conference, Rochester, NY, 2007. [ bib | .pdf ]
The immense prosodic variation of natural conversational speech makes it challenging to predict which words are prosodically prominent in this genre. In this paper, we examine a new feature, accent ratio, which captures how likely it is that a word will be realized as prominent or not. We compare this feature with traditional accent-prediction features (based on part of speech and N-grams) as well as with several linguistically motivated and manually labeled information structure features, such as whether a word is given, new, or contrastive. Our results show that the linguistic features do not lead to significant improvements, while accent ratio alone can yield prediction performance almost as good as the combination of any other subset of features. Moreover, this feature is useful even across genres; an accent-ratio classifier trained only on conversational speech predicts prominence with high accuracy in broadcast news. Our results suggest that carefully chosen lexicalized features can outperform less fine-grained features.

[2] Sasha Calhoun. Predicting focus through prominence structure. In Proc. Interspeech, Antwerp, Belgium, 2007. [ bib | .pdf ]
Focus is central to our control of information flow in dialogue. Spoken language understanding systems therefore need to be able to detect focus automatically. It is well known that prominence is a key marker of focus in English; however, the relationship is not straightforward. We present focus prediction models built using the NXT Switchboard corpus. We claim that a focus is more likely if a word is more prominent than expected given its syntactic, semantic and discourse properties. Crucially, the perception of prominence arises not only from acoustic cues, but also the position in prosodic structure. Our focus prediction results, along with a study showing the acoustic properties of focal accents vary by structural position, support our claims. As a largely novel task, these results are an important first step in detecting focus for spoken language applications.

[3] Sasha Calhoun. Information Structure and the Prosodic Structure of English: a Probabilistic Relationship. PhD thesis, University of Edinburgh, 2006. [ bib ]
This thesis looks at how information structure is signalled prosodically in English. It has been standardly held that information structure is primarily signalled by the distribution of pitch accents within syntax structure, as well as intonation event type. Rather, it is argued that previous work has underestimated the importance, and richness, of metrical prosodic structure and its role in signalling information structure. A new approach is proposed: to view information structure as a strong constraint on the mapping of words onto metrical prosodic structure. Focal elements (kontrast) align with nuclear prominence, while accents on other words are not usually directly 'meaningful'. Information units (theme/rheme) try to align with prosodic phrases. This mapping is probabilistic, so it is also influenced by lexical and syntactic effects, as well as rhythmical constraints and other features including emphasis. Qualitative and quantitative analysis is presented in support of these claims using the NXT Switchboard corpus which has been annotated with substantial new layers of semantic and prosodic features.

[4] Sasha Calhoun, Malvina Nissim, Mark Steedman, and Jason Brenier. A framework for annotating information structure in discourse. In Frontiers in Corpus Annotation II: Pie in the Sky, ACL2005 Conference Workshop, Ann Arbor, Michigan, June 2005. [ bib | .pdf ]
We present a framework for the integrated analysis of the textual and prosodic characteristics of information structure in the Switchboard corpus of conversational English. Information structure describes the availability, organisation and salience of entities in a discourse model. We present standards for the annotation of information status (old, mediated and new), and give guidelines for annotating information structure, i.e. theme/rheme and background/kontrast. We show that information structure in English can only be analysed concurrently with prosodic prominence and phrasing. Along with existing annotations which we have integrated using NXT technology, the corpus will be unique in the field of conversational speech in terms of size and richness of annotation, vital for many NLP applications.

[5] Sasha Calhoun. It's the difference that matters: An argument for contextually-grounded acoustic intonational phonology. In Linguistics Society of America Annual Meeting, Oakland, California, January 2005. [ bib | .pdf ]
Standardly, the link between intonation and discourse meaning is described in terms of perceptual intonation categories, e.g. ToBI. We argue that this approach needs to be refined to explicitly recognise: firstly, that perception is affected by multiple acoustic cues, including duration and intensity, as well as F0; and secondly that the interpretation of these cues is directly linked to the phonetic and discourse context. Investigating the marking of topic status in a small game task corpus, we found that although topic status is not consistently marked by ToBI pitch accent, it is by the F0 mean, intensity and duration of the topic word. Using regression analysis, we found that when factoring out the F0 mean and intensity of key parts of the preceding discourse, intensity and duration become stronger predictors of topic status than F0.

[6] Sasha Calhoun. Phonetic dimensions of intonational categories: the case of L+H* and H*. In Prosody 2004, Nara, Japan, March 2004. poster. [ bib | .ps | .pdf ]
ToBI, in its conception, was an attempt to describe intonation in terms of phonological categories. An effect of the success of ToBI in doing this has been to make it standard to try to characterise all intonational phonological distinctions in terms of ToBI distinctions, i.e. segmental alignment of pitch targets and pitch height as either High or Low. Here we report a series of experiments which attempted to do this, linking two supposed phonological categories, theme and rheme accents, to two controversial ToBI pitch accents L+H* and H* respectively. Our results suggest a reanalysis of the dimensions of phonological intonational distinctions. It is suggested that there are three layers affecting the intonational contour: global extrinsic, local extrinsic and intrinsic; and the theme-rheme distinction may lie in the local extrinsic layer. It is the similarity both of the phonetic effects and the semantic information conveyed by the last two layers that has led to the confusion in results such as those reported here.

[7] Sasha Calhoun. The nature of theme and rheme accents. In One-Day Meeting for Young Speech Researchers, University College, London, April 2003. [ bib | .ps | .pdf ]
It has increasingly been recognised that appropriate intonation is essential to create believable voices for speech synthesis. This is particularly true in dialogue, where the link between intonation and meaning is especially important. Here we report two experiments, a production and perception study, which test an aspect of Steedman's (2000) theory relating information and intonation structure with a view to specifying intonation in a speech synthesis system. He claims that themes and rhemes, the basic building blocks of information structure, are marked by distinctive pitch accents in English, which he identifies with L+H* and H* in the ToBI system respectively. After reviewing problems with the identification of these ToBI accents, we show that speakers do produce and listeners do distinguish different pitch accents in these discourse contexts, but that the ToBI labels may not be helpful to characterise the distinction. The exact phonetic nature of theme and rheme accents remains unclear, but the alignment of the start of the rise, pitch height and the fall after the pitch peak all appear to be factors. Speakers also appear to be more sensitive to the distinction at the end of an utterance than utterance-medially.

[8] Sasha Calhoun. Using prosody in ASR: the segmentation of broadcast radio news. Master's thesis, University of Edinburgh, 2002. [ bib | .pdf ]
This study explores how prosodic information can be used in Automatic Speech Recognition (ASR). A system was built which automatically identifies topic boundaries in a corpus of broadcast radio news. We evaluate the effectiveness of different types of features, including textual, durational, F0, Tilt and ToBI features in that system. These features were suggested by a review of the literature on how topic structure is indicated by humans and recognised by both humans and machines from both a linguistic and natural language processing standpoint. In particular, we investigate whether acoustic cues to prosodic information can be used directly to indicate topic structure, or whether it is better to derive discourse structure from intonational events, such as ToBI events, in a manner suggested by Steedman's (2000) theory, among others. It was found that the global properties of an utterance (mean and maximum F0) and textual features (based on Hearst's (1997) lexical scores and cue phrases) were effective in recognising topic boundaries on their own whereas all other features investigated were not. Performance using Tilt and ToBI features was disappointing, although this could have been because of inaccuracies in estimating these parameters. We suggest that different acoustic cues to prosody are more effective in recognising discourse information at certain levels of discourse structure than others. The identification of higher level structure is informed by the properties of lower level structure. Although the findings of this study were not conclusive on this issue, we propose that prosody in ASR and synthesis should be represented in terms of the intonational events relevant to each level of discourse structure. Further, at the level of topic structure, a taxonomy of events is needed to describe the global F0 properties of each utterance that makes up that structure.