The Centre for Speech Technology Research, The University of Edinburgh

PWorkshop Archives: Summer Term 2004

13 Apr 2004

Alice Turk

Beyond final lengthening


27 Apr 2004

Mike Lincoln

TESSA - The TExt and Sign Support Assistant

TESSA is a system designed to aid transactions between a deaf person and a counter clerk in a Post Office. The system uses speech recognition to recognise the counter clerk's speech and then synthesises the recognised phrase in British Sign Language using a specially developed avatar. I will describe the design, development and evaluation of the system and (ignoring the old adage of 'never work with children, animals or speech recognition systems') give a demonstration of TESSA in action.



04 May 2004

David Feinberg (University of St Andrews)

Female preferences for male voices

By manipulating fundamental and formant frequencies independently in real voices, we can apply experimental paradigms to vocal attractiveness. Here we find that in general females prefer masculinised male voices, and prefer them more at the late follicular (most fertile) phase of the menstrual cycle. These findings imply that vocal attractiveness has a biological basis.



11 May 2004

Shu-chen Ou and Mits Ota

The Acquisition of Weight-to-Stress Principle: Evidence from Chinese-English Interlanguage

Much research on the acquisition of second language (L2) stress by speakers of another stress system indicates that development is constrained by principles of metrical phonology (e.g. Archibald 1993, 1995; Pater 2000). However, it remains to be seen whether this generalization extends to the acquisition of L2 stress by learners whose native language does not exhibit dynamic stress. The current study addresses this issue by examining Chinese learners' sensitivity to weight-stress mapping in L2 English. The main experiment was designed to test whether Chinese learners know that English stress shifts from the antepenultimate to the penultimate syllable when the penult is heavy (i.e., (C)VV or (C)VC). Twenty Mandarin/Taiwanese speakers who passed a screening test on identification of the V-VV contrast in monosyllabic words (e.g., BIT vs. BEAT) performed a preference task. The experimental items were trisyllabic nonwords with antepenultimate or penultimate stress. The penultimate syllable was either (i) open with a short nucleus (CV), (ii) open with a long nucleus (CVV), (iii) closed with a short nucleus (CVC) or (iv) closed with a long nucleus (CVVC). In support of our prediction, the Chinese learners showed a preference for penultimate stress when the syllable was CVV, CVC or CVVC. However, unlike the English control subjects, the Chinese subjects also favored CVC and CVVC over CVV as stressed syllables, indicating that in addition to following a general weight-to-stress principle, they have a tendency to assign stress to closed syllables. To verify this, we ran another experiment in which the same subjects performed an identification task with initially-stressed disyllabic words (e.g., BITTER vs. BEATER). We hypothesized that if Chinese learners preferred a stressed syllable to be closed, their perception of these stimuli would be biased toward BITTER-type words (with a closed stressed syllable) as opposed to BEATER-type words (with an open stressed syllable).
The results supported this prediction. The subjects committed more identification errors with the disyllabic stimuli than with the monosyllabic stimuli, and more misidentification of BEATER-type words as BITTER-type words than the other way around.



18 May 2004

Christine Haunz

Language-specific and universal factors influencing perceived similarity

Perceived similarity may be influenced by different factors, which fall into two major groups. The first group comprises universal factors, which correlate with acoustic-phonetic characteristics such as the availability of cues signalling sound contrasts in specific positions and environments. These factors are expected to influence judgments of similarity cross-linguistically, as argued by Steriade (2001), who describes universal hierarchies of relative perceptibility. The second group of factors concerns the effects of the language-specific organization of the phoneme inventory and phonotactics on perception, which can be observed in the perceptual difficulties that language learners have with phonemes of a second language. An influence of native phonotactics on perception is contested by Silverman (1992), but supported by Dupoux et al. (1998). The latter view predicts that non-native speakers rate the similarity between forms that are illegal in their native language (L1) and the forms they perceptually assimilate them to significantly higher than do listeners of a language in which both forms are legal.

To determine the relative influence of universal and specific factors, an experiment comparing the similarity of pairs of sound sequences as perceived by native speakers of English and Russian was conducted. Listeners rated the similarity of pairs of Russian pseudo-words differing word-initially in the following ways:

  1. C1C2 vs. C1@C2
  2. C1C2 vs. C1 / C2
  3. C1C2 vs. C1C3 / C4C2 (one feature change per comparison)
  4. C1@C2 vs. C1@C3 / C4@C2

Results showed evidence in favour of both sides. A number of findings suggest effects of universal perceptibility hierarchies: for example, pairs differing in voicing were rated more similar than pairs differing in nasality by both groups. However, there is also support for the claim that language background is relevant, as in a comparison of the ratings of cluster pairs and the corresponding epenthesised onset pairs (3 and 4): only Russian listeners showed the predicted lower similarity ratings for the case of two released consonants (4), and thus made use of the added acoustic cues. The findings suggest that there is a division of labour between universal acoustic-phonetic and language-specific factors, and that they interact to determine perceived similarity.



25 May 2004

Julia Simner

Why ebony is sky-blue, and society tastes of onions: Mechanisms of linguistically triggered synaesthesia

People with synaesthesia involuntarily experience certain percepts (e.g., taste, colour) when engaged in perceptual or cognitive activities (e.g. reading, listening to music) that would not elicit that response in non-synaesthetic people. For example, colours may be experienced in response to spoken words (Marks, 1975) and shapes may be experienced in response to taste (Cytowic, 1993; Cytowic & Wood, 1982). The aim of my research is to understand the cognitive and developmental basis of synaesthesia and what, if anything, this might tell us about the ordinary functioning of memory and language.

I describe an unusual case of developmental synaesthesia, in which speech sounds induce an involuntary sensation of taste that is subjectively located in the mouth (Ward & Simner, 2003). Our subject, JIW, shows a highly structured, non-random relationship between particular combinations of phonemes and the resultant taste, and this is influenced by a number of fine-grain linguistic properties. Functional neuroimaging studies of JIW (by David Parslow and colleagues) support the genuineness of the case, as does JIW's consistency over time. Our results suggest that JIW's synaesthesia does not simply reflect innate connections from one perceptual system to another, but that it can be mediated by a symbolic/conceptual level of representation. I describe findings from two further studies (Simner, 2003, Ward, Simner & Auyeung, 2003) which compare two different profiles of synaesthesia: word-colour and word-taste. It is argued that different cognitive mechanisms are responsible for the synaesthetic percepts in each group, and that these might inform us of the functioning of ordinary language comprehension.



08 Jun 2004

Moritz Neugebauer (University College Dublin)

Tree-based Acoustic Modelling with Phonological Constraints

Decision-tree based state tying has become increasingly popular for modelling context dependency in large vocabulary speech recognition. Firstly, the classification and prediction property of decision trees allows models to be provided for units or contexts which do not occur in the training data. Secondly, the node splitting procedure of decision-tree based state tying is a model selection process: it maintains a balance between model complexity and the number of parameters, so that model parameters can be estimated robustly from the limited amount of training data.

Decision trees are built from a set of phonetic questions which refer to classes such as vowels or plosives in order to assign triphone states to appropriate acoustic models. The assumption behind the choice of phoneme classes is that phonemes which belong to the same class have a similar acoustic effect on neighbouring sounds. The standard approach in deriving the phonetic questions for a particular task with a specific phoneme set is to use a human expert.

In this presentation a new method is presented which automatically defines this question set. To this end, tree learning algorithms are interleaved with a knowledge representation component. Decision trees are explicitly linked to deduced phonological feature descriptions which then provide a classification scheme of phoneme contexts. The following components will be presented in detail: (a) the automatic learning of paradigmatic phonological constraints, (b) their formal representation and (c) their application to tree-based state tying for speech recognition.
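The node-splitting step common to these tree-based approaches can be sketched in a few lines. The sketch below is not from the talk: all state names, phone classes and statistics are invented, the acoustic feature is one-dimensional for simplicity, and the criterion shown (choose the phonetic question whose yes/no split of the triphone states gives the largest log-likelihood gain over a single pooled Gaussian) is the standard likelihood-based criterion used in decision-tree state tying generally, not the paper's specific method.

```python
import math

# Toy triphone states of /t/: left-context phone, occupancy count,
# and ML mean/variance of a single acoustic feature (all values invented)
STATES = {
    "a-t+s": ("a", 50, 1.2, 0.30),
    "i-t+s": ("i", 40, 1.1, 0.25),
    "e-t+s": ("e", 20, 1.3, 0.28),
    "p-t+s": ("p", 30, 3.0, 0.40),
    "k-t+s": ("k", 35, 3.2, 0.35),
}

# Phonetic questions: class name -> set of phones in the class
QUESTIONS = {
    "Left-Vowel": {"a", "e", "i", "o", "u"},
    "Left-Front": {"i", "e"},
}

def pooled_stats(states):
    """Pooled ML mean/variance from per-state sufficient statistics."""
    n = sum(c for _, c, _, _ in states)
    mean = sum(c * m for _, c, m, _ in states) / n
    # per-state E[x^2] is var + mean^2; pool, then subtract pooled mean^2
    ex2 = sum(c * (v + m * m) for _, c, m, v in states) / n
    return n, mean, ex2 - mean * mean

def log_likelihood(states):
    """Log-likelihood of the pooled data under one shared Gaussian."""
    n, _, var = pooled_stats(states)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(states):
    """Greedy node splitting: question with the largest likelihood gain."""
    base = log_likelihood(states)
    best = None
    for q, phones in QUESTIONS.items():
        yes = [s for s in states if s[0] in phones]
        no = [s for s in states if s[0] not in phones]
        if len(yes) < 2 or len(no) < 2:   # need data on both sides
            continue
        gain = log_likelihood(yes) + log_likelihood(no) - base
        if best is None or gain > best[1]:
            best = (q, gain, yes, no)
    return best

q, gain, yes, no = best_split(list(STATES.values()))
print(q, round(gain, 2))
```

Here "Left-Vowel" wins, since it separates the two acoustically distinct clusters of states; in a full system the split recurses on each child node until the gain falls below a threshold, and each leaf becomes one tied state.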



15 Jun 2004

Mariko Sugahara & Alice Turk

Phonetic Reflexes of Morphemic Structure in English at Different Speech Rates

Prosodic constituent structure below the word level is a matter of considerable debate. One controversial case in English is the internal prosodic organization of words consisting of a stem and a Level II suffix: supporters of the Morphology-Phonology Edge Alignment approach argue that there is a prosodic word boundary between the stem and the following Level II suffix, while lexical phonologists argue that there is not. The main goal of this study is to compare these two views by observing the durational patterns of segments in words consisting of a stem and a Level II suffix at normal and slower speech rates.



13 Jul 2004

Simone Ashby Hanna (University College Dublin)

Is Infant-directed Speech Hyperspeech? An Acoustical Analysis of Speech to Infants and other Accommodative Speech Styles

This research investigates whether properties of speech to infants may be classified as hyperspeech modifications, as defined by Lindblom's H&H Theory. Hyperspeech is defined as an attempt by speakers to meet listeners' communicative and situational demands by increasing the distinctiveness of speech sounds for enhanced lexical identification. Previous research has presented infant-directed (ID) speech as an accommodative style whose acoustic features provide babies with structured input for making sense of speech and acquiring mappings between sound and meaning. Implicit in some of these claims is the notion that ID speech is better formed---i.e. it involves hyperarticulation as per Lindblom's definition---compared to adult-directed speech. Yet, with most investigations focused on effects of this style on infants' perceptual abilities, little quantitative information is available for the ID speech signal itself. This thesis aims to describe the acoustic properties of ID speech, and evaluate the systematicity of such variations with respect to other accommodative styles.

By comparing ID speech with other listener-oriented styles, two questions are addressed. First, do different listener constraints elicit characteristic forms of modulation by speakers? Few studies have examined how listener constraints interact with modulation of the signal. Results from this investigation show that for experimental conditions representing computer-directed speech, 'foreigner talk', and Lombard speech, subjects modified their speech systematically depending on the type of listener being addressed. In contrast, for both simulated and real ID speech conditions, speakers exercised considerable flexibility in the manipulation of selected acoustic parameters. Second, how does ID speech compare with other accommodative styles in the use of hyperspeech? I conclude that ID speech may involve production of clear speech features, but that unlike speech to an artificial or non-native addressee (i.e. other modalities involving conceptual and/or linguistic constraints on the part of the interlocutor), use of such features is far less consistent across speakers.



27 Jul 2004

Erich Round (Yale University)

Utterance Rhythm: Results and critical assessment of one `auditory' model for English and Swedish

Last year, in a study of the polysemy and prosody of 'some' in spontaneous English and Swedish, I employed a minimally complex transcription system for what might be termed 'auditory utterance rhythm'. Somewhat surprisingly, this system produced some very strong results in terms of form-meaning correspondences and enabled me to describe a few characteristics of (Australian) English and (Gothenburg) Swedish which might be challenging within other frameworks. In this talk, therefore, I present the system used and mention the research on which it was based, as well as the main results it led to. Within this, I draw comparisons between 'auditory utterance rhythm' and other ways of investigating speech rhythm, with a view towards reflecting critically on whether one would wish to make further studies in the same vein, and what refinements and changes one would want to make to the system used.



02 Sep 2004

Bob Ladd & Ivan Yuen

Practice Talks

Bob Ladd
Alignment allophony and the European "pitch accent" languages

Recent work on the way F0 target points are aligned with the segmental string has shown that there are consistent patterns of structurally conditioned variation. For example, our work on Dutch finds that H accent peaks are aligned earlier with long vowels than with short vowels, and earlier with nuclear accents than with prenuclear accents. This paper proposes that such "alignment allophony" can be preserved and phonologised when the conditioning factor is lost through historical change, and that phonologisation of alignment allophony is a plausible mechanism for the genesis of the accentual contrasts and quasi-contrasts seen in European pitch accent systems.

This proposal remedies the only conspicuous weak point in Gussenhoven's account of the Central Franconian tonogenesis (which proposes that tonal distinctions somehow arose AFTER the loss of final vowels, in order to eliminate homophony of grammatically distinct forms). The plausibility of our proposal is bolstered by data from contemporary German showing consistently different alignment patterns in pairs where final /-n/ contrasts with a "syllabic" [n] representing /-n@n/, like _den Stein_ (the stone, acc. sg.) vs. _den Steinen_ (the stones, dat. pl.). The disyllabic alignment is preserved in _den Steinen_ even though there is only a single [n] segment, and the two forms are therefore distinguished acoustically by subtle differences of duration and by the pitch contour.

This proposal explains the otherwise puzzling independent development of strikingly similar "complementary distribution" of accentual features in Scandinavia, Central Franconia, and the Scots Gaelic of the Hebrides: in each of these three areas some varieties mark "Accent 1" with some sort of glottal feature (stød, Schärfung, etc.) that causes a rapid drop in F0 on the stressed vowel (thereby phonologising early alignment), while others mark "Accent 2" with a lexically attached H tone (thereby phonologising late alignment). More generally, the fact that the European languages - and very few others - seem to keep producing these word accent systems is explained by their phonological structures: they often have vowel quantity contrasts, complex phonotactics, and strong dynamic stress, all of which favour allophonic differences of alignment that are acoustically salient enough to be phonologised in historical change.

Ivan Yuen
Downtrend and the perception of lexical tones

Downtrend results in different f0 values for a phonologically equivalent accent or tone (Pierrehumbert 1979, Prieto 1996, 1998, Shih 2000): an accent occurring late in an utterance is realised with a lower f0 than the same accent occurring early. In equating these different f0 values with the same perceived prominence, listeners have been shown to compensate for downtrend with respect to a global reference line (Pierrehumbert 1979, Gussenhoven and Rietveld 1988, Terken 1991, Gussenhoven et al 1997). But it is not clear what information in the f0 contour listeners employ in constructing this frame of reference when normalising for downtrend. The current study investigated normalisation of downtrend in a tone language (Cantonese), which allows us to examine what f0 information in the contour can serve as a reference frame in normalising for downtrend.

The results of a perception experiment showed that downtrend was compensated for in identifying Cantonese tones and that there was a strong local f0 context effect in the normalisation.
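The "global reference line" idea invoked above can be made concrete with a small sketch. This is not the study's model: the numbers are invented, and the method shown (fit a least-squares declination line through accent peaks on a semitone scale, then express each f0 target as its height relative to that line) is simply one common way of operationalising a declination reference for normalisation.

```python
import math

def semitones(f0_hz, ref=100.0):
    """Convert Hz to semitones relative to a reference frequency."""
    return 12.0 * math.log2(f0_hz / ref)

def fit_decline(times, f0_hz):
    """Least-squares declination line (semitones vs. time) through f0 targets."""
    st = [semitones(f) for f in f0_hz]
    n = len(times)
    tm = sum(times) / n
    sm = sum(st) / n
    slope = (sum((t - tm) * (s - sm) for t, s in zip(times, st))
             / sum((t - tm) ** 2 for t in times))
    intercept = sm - slope * tm
    return slope, intercept

def normalised(t, f0_hz, slope, intercept):
    """Height of a target above/below the reference line (semitones)."""
    return semitones(f0_hz) - (slope * t + intercept)

# Invented example: three accent peaks whose f0 falls across the utterance
times = [0.2, 1.0, 1.8]          # seconds
peaks = [180.0, 168.0, 157.0]    # Hz

slope, intercept = fit_decline(times, peaks)
print(round(slope, 2))           # negative: f0 declines over the utterance
print(round(normalised(1.0, 168.0, slope, intercept), 2))
```

On this view, two peaks with different raw f0 can be perceptually equivalent if they sit at the same height relative to the declining reference line; the study's question is which parts of the contour (global trend vs. local context) listeners actually use to construct it.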




<owner-pworkshop@ling.ed.ac.uk>