M. Wester, J. Frankel, and S. King. Asynchronous articulatory feature recognition using dynamic Bayesian networks. In Proc. IEICE Beyond HMM Workshop, Kyoto, December 2004. [ bib | .ps | .pdf ]

This paper builds on previous work in which dynamic Bayesian networks (DBNs) were proposed as a model for articulatory feature recognition. Using DBNs makes it possible to model the dependencies between features, an addition to previous approaches which was found to improve feature recognition performance. The DBN results were promising, approaching the accuracy of artificial neural networks (ANNs). However, the system was trained on canonical labels, leading to an overly strong set of constraints on feature co-occurrence. In this study, we describe an embedded training scheme which learns a set of data-driven asynchronous feature changes where these are supported by the data. Using a subset of the OGI Numbers corpus, we describe articulatory feature recognition experiments using both canonically-trained and asynchronous DBNs. Performance using DBNs is found to exceed that of ANNs trained on an identical task. Furthermore, inter-feature dependencies result in a more structured model, giving rise to fewer feature combinations in the recognition output. In addition to an empirical evaluation of this modelling approach, we give a qualitative analysis, comparing the asynchrony found through our data-driven methods with the asynchrony that would be expected on the basis of linguistic knowledge.
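To make concrete how inter-feature dependencies yield fewer feature combinations in the output, the toy Python sketch below enumerates combinations with and without a simple co-occurrence constraint. The feature inventories and the constraint are invented for illustration; they are not the six features or the learned dependencies used in the paper.

```python
from itertools import product

# Hypothetical articulatory feature inventories (illustrative only).
features = {
    "voicing": ["voiced", "voiceless"],
    "manner":  ["stop", "fricative", "vowel"],
    "place":   ["labial", "alveolar", "velar", "nil"],
}

# Without inter-feature dependencies, every combination is possible.
unconstrained = list(product(*features.values()))

# One linguistic dependency (here: vowels carry no consonantal place)
# rules some combinations out, mirroring how DBN inter-feature arcs
# concentrate probability mass on plausible feature bundles.
constrained = [c for c in unconstrained
               if not (c[1] == "vowel" and c[2] != "nil")]

print(len(unconstrained), "->", len(constrained))  # 24 -> 18
```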

Yoshinori Shiga and Simon King. Source-filter separation for articulation-to-speech synthesis. In Proc. ICSLP, Jeju, Korea, October 2004. [ bib | .ps | .pdf ]

In this paper we examine a method for separating the vocal-tract filter response from the voice source characteristic using a large articulatory database. The method realises this separation for voiced speech using an iterative approximation procedure, under the assumption that the speech production process is a linear system composed of a voice source and a vocal-tract filter, and that each component is controlled independently by a different set of factors. Experimental results show that the spectral variation is clearly influenced by the fundamental frequency or the power of speech, and that the pattern of this variation may be closely related to speaker identity. The method enables independent control over the voice source characteristic in our articulation-to-speech synthesis.

Jithendra Vepa and Simon King. Subjective evaluation of join cost functions used in unit selection speech synthesis. In Proc. 8th International Conference on Spoken Language Processing (ICSLP), Jeju, Korea, October 2004. [ bib | .pdf ]

In our previous papers, we have proposed join cost functions derived from spectral distances, which have good correlations with perceptual scores obtained for a range of concatenation discontinuities. To further validate their ability to predict concatenation discontinuities, we have chosen the best three spectral distances and evaluated them subjectively in a listening test. The unit sequences for the synthesis stimuli are obtained from a state-of-the-art unit selection text-to-speech system: rVoice from Rhetorical Systems Ltd. In this paper, we report listeners' preferences for each of the three join cost functions.

Yoshinori Shiga and Simon King. Estimating detailed spectral envelopes using articulatory clustering. In Proc. ICSLP, Jeju, Korea, October 2004. [ bib | .ps | .pdf ]

This paper presents an articulatory-acoustic mapping in which detailed spectral envelopes are estimated. During the estimation, the harmonics of a range of F0 values are derived from the spectra of multiple voiced speech signals vocalized with similar articulator settings. The envelope formed by these harmonics is represented by a cepstrum, which is computed by fitting the peaks of all the harmonics using the weighted least squares method in the frequency domain. The experimental results show that the spectral envelopes are estimated with the highest accuracy when the cepstral order is 48-64 for a female speaker, which suggests that representing the real response of the vocal tract requires high-quefrency elements that conventional speech synthesis methods are forced to discard in order to eliminate the pitch component of speech.
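As a rough illustration of the envelope-fitting step, here is a minimal numpy sketch of a weighted least-squares fit of cepstral coefficients to harmonic peaks. The function name, its interface, and the log-envelope parameterisation are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def fit_cepstral_envelope(freqs_hz, log_amps, weights, order, fs):
    """Weighted least-squares fit of cepstral coefficients to harmonic
    peaks (hypothetical interface; illustrates the idea only).

    The log envelope is modelled as L(w) = c0 + 2 * sum_n c_n cos(n*w),
    with w the normalised angular frequency of each harmonic.
    """
    w = 2.0 * np.pi * np.asarray(freqs_hz) / fs
    n = np.arange(order + 1)
    basis = np.cos(np.outer(w, n))     # shape: (n_harmonics, order + 1)
    basis[:, 1:] *= 2.0                # symmetric cepstrum counts twice
    sw = np.sqrt(np.asarray(weights))
    c, *_ = np.linalg.lstsq(sw[:, None] * basis,
                            sw * np.asarray(log_amps), rcond=None)
    return c                           # cepstral coefficients c[0..order]
```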

Alexander Gutkin and Simon King. Phone classification in pseudo-Euclidean vector spaces. In Proc. 8th International Conference on Spoken Language Processing (ICSLP), volume II, pages 1453-1457, Jeju Island, Korea, October 2004. [ bib | .ps.gz | .pdf ]

Recently we have proposed a structural framework for modelling speech, based on patterns of phonological distinctive features, a linguistically well-motivated alternative to standard vector-space acoustic models such as HMMs. This framework gives considerable representational freedom by working with features that have an explicit linguistic interpretation, but at the expense of the ability to apply the wide range of analytical decision algorithms available in vector spaces, restricting us to more computationally expensive and less well-developed symbolic metric tools. In this paper we show that a dissimilarity-based, distance-preserving transition from the original structural representation to a corresponding pseudo-Euclidean vector space is possible. Promising results of phone classification experiments conducted on the TIMIT database are reported.
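For readers unfamiliar with the construction, a pseudo-Euclidean embedding of a dissimilarity matrix can be obtained by double centering followed by an eigendecomposition, keeping both positive and negative eigenvalues. The numpy sketch below shows this standard construction under stated assumptions (a symmetric dissimilarity matrix, illustrative names and parameters); it is not the authors' code.

```python
import numpy as np

def pseudo_euclidean_embedding(D, k_pos, k_neg):
    """Embed a symmetric dissimilarity matrix D into a pseudo-Euclidean
    space via double centering and eigendecomposition (standard
    construction; interface is illustrative)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # "Gram" matrix, may be indefinite
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(-np.abs(evals))       # largest magnitude first
    evals, evecs = evals[order], evecs[:, order]
    pos = [i for i in range(n) if evals[i] > 0][:k_pos]
    neg = [i for i in range(n) if evals[i] < 0][:k_neg]
    idx = pos + neg
    X = evecs[:, idx] * np.sqrt(np.abs(evals[idx]))
    # Signature: the first len(pos) axes are "positive", the rest "negative".
    return X, (len(pos), len(neg))
```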

D. Toney, D. Feinberg, and K. Richmond. Acoustic features for profiling mobile users of conversational interfaces. In S. Brewster and M. Dunlop, editors, 6th International Symposium on Mobile Human-Computer Interaction - MobileHCI 2004, pages 394-398, Glasgow, Scotland, September 2004. Springer. [ bib ]

Conversational interfaces allow human users to use spoken language to interact with computer-based information services. In this paper, we examine the potential for personalizing speech-based human-computer interaction according to the user's gender and age. We describe a system that uses acoustic features of the user's speech to automatically estimate these physical characteristics. We discuss the difficulties of implementing this process in relation to the high level of environmental noise that is typical of mobile human-computer interaction.

J. Frankel, M. Wester, and S. King. Articulatory feature recognition using dynamic Bayesian networks. In Proc. ICSLP, September 2004. [ bib | .ps | .pdf ]

This paper describes the use of dynamic Bayesian networks for the task of articulatory feature recognition. We show that by modeling the dependencies between a set of six multi-level articulatory features, recognition accuracy is increased over an equivalent system in which the features are considered independent. Results are compared with those obtained using artificial neural networks on an identical task.

Alexander Gutkin and Simon King. Structural Representation of Speech for Phonetic Classification. In Proc. 17th International Conference on Pattern Recognition (ICPR), volume 3, pages 438-441, Cambridge, UK, August 2004. IEEE Computer Society Press. [ bib | .ps.gz | .pdf ]

This paper explores the issues involved in using symbolic metric algorithms for automatic speech recognition (ASR), via a structural representation of speech. This representation is based on a set of phonological distinctive features which is a linguistically well-motivated alternative to the “beads-on-a-string” view of speech that is standard in current ASR systems. We report the promising results of phoneme classification experiments conducted on a standard continuous speech task.

Alexander Gutkin, David Gay, Lev Goldfarb, and Mirjam Wester. On the Articulatory Representation of Speech within the Evolving Transformation System Formalism. In Lev Goldfarb, editor, Pattern Representation and the Future of Pattern Recognition (Proc. Satellite Workshop of 17th International Conference on Pattern Recognition), pages 57-76, Cambridge, UK, August 2004. [ bib | .ps.gz | .pdf ]

This paper deals with the formulation of an alternative, structural approach to the speech representation and recognition problem. In this approach, we require both the representation and the learning algorithms to be linguistically meaningful and to represent the linguistic data at hand naturally. This allows the speech recognition system to discover the emergent combinatorial structure of the linguistic classes. The proposed approach is developed within the ETS formalism, the first formalism in applied mathematics specifically designed to address the issues of class and object/event representation. We present an initial application of ETS to the articulatory modelling of speech, based on elementary physiological gestures that can be reliably represented as ETS primitives. We discuss the advantages of this gestural approach over prevalent methods and its promising potential for mathematical modelling and representation in linguistics.

J. Vepa and S. King. Subjective evaluation of join cost and smoothing methods. In Proc. 5th ISCA speech synthesis workshop, Pittsburgh, USA, June 2004. [ bib | .pdf ]

In our previous papers, we have proposed join cost functions derived from spectral distances, which have good correlations with perceptual scores obtained for a range of concatenation discontinuities. To further validate their ability to predict concatenation discontinuities, we have chosen the best three spectral distances and evaluated them subjectively in a listening test. The units for the synthesis stimuli are obtained from a state-of-the-art unit selection text-to-speech system: `rVoice' from Rhetorical Systems Ltd. We also compared three different smoothing methods in this listening test. In this paper, we report listeners' preferences for each join cost in combination with each smoothing method.

Yoshinori Shiga and Simon King. Accurate spectral envelope estimation for articulation-to-speech synthesis. In Proc. 5th ISCA Speech Synthesis Workshop, pages 19-24, CMU, Pittsburgh, USA, June 2004. [ bib | .ps | .pdf ]

This paper introduces a novel articulatory-acoustic mapping in which detailed spectral envelopes are estimated based on the cepstrum, inclusive of the high-quefrency elements which are discarded in conventional speech synthesis to eliminate the pitch component of speech. For this estimation, the method deals with the harmonics of multiple voiced-speech spectra, so that several sets of harmonics can be obtained at various pitch frequencies to form a spectral envelope. The experimental result shows that the method estimates spectral envelopes with the highest accuracy when the cepstral order is 48-64, which suggests that the higher order coefficients are required to represent detailed envelopes reflecting the real vocal-tract responses.

Yoshinori Shiga. Source-filter separation based on an articulatory corpus. In One day meeting for young speech researchers (UK meeting), University College London, London, United Kingdom, April 2004. [ bib ]

A new approach is presented for estimating voice source and vocal-tract filter characteristics based on an articulatory database. From the viewpoint of acoustics, in order to estimate the transfer function of a system, both its input and its output need to be observed. In the case of the source-filter separation problem, however, only the output (i.e. speech) is observable, and the response of the system (the vocal tract) and the input (the voice source) must be estimated simultaneously. Exact estimation is hence theoretically impossible, and the problem is generally solved approximately by applying rather oversimplified models. The proposed approach separates the two characteristics under the assumption that each is controlled independently by a different set of factors. The separation is achieved by iterative approximation based on this assumption, using a large speech corpus that includes electromagnetic articulograph data. The proposed approach enables independent control of the source and filter characteristics, and thus contributes toward improving speech quality in speech synthesis.
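The alternating re-estimation described in the abstract can be caricatured as follows: a minimal sketch assuming frames are pre-clustered by articulatory configuration and by source-related factors (F0, power). The function, its interface, and the log-domain additive split are assumptions for illustration, not the author's algorithm.

```python
import numpy as np

def separate_source_filter(log_spectra, artic_cluster, source_cluster, n_iter=10):
    """Iteratively split log-spectra S into a filter part F (tied to
    articulation) and a source part G (tied to F0/power), assuming
    S ~ F[artic] + G[source] in the log domain. Assumes every cluster
    index occurs at least once; sketch of the alternating idea only.
    """
    n_frames, n_bins = log_spectra.shape
    F = np.zeros((artic_cluster.max() + 1, n_bins))
    G = np.zeros((source_cluster.max() + 1, n_bins))
    for _ in range(n_iter):
        # Re-estimate each articulatory cluster's filter from the
        # residual after removing the current source estimate.
        for a in range(F.shape[0]):
            idx = artic_cluster == a
            F[a] = (log_spectra[idx] - G[source_cluster[idx]]).mean(axis=0)
        # Then re-estimate each source cluster from the filter residual.
        for s in range(G.shape[0]):
            idx = source_cluster == s
            G[s] = (log_spectra[idx] - F[artic_cluster[idx]]).mean(axis=0)
    return F, G
```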

Sasha Calhoun. Phonetic dimensions of intonational categories: the case of L+H* and H*. In Prosody 2004, Nara, Japan, March 2004. poster. [ bib | .ps | .pdf ]

ToBI was, in its conception, an attempt to describe intonation in terms of phonological categories. One effect of its success has been to make it standard to try to characterise all intonational phonological distinctions in terms of ToBI distinctions, i.e. the segmental alignment of pitch targets, and pitch height as either High or Low. Here we report a series of experiments which attempted to do this, linking two supposed phonological categories, theme and rheme accents, to the two controversial ToBI pitch accents L+H* and H* respectively. Our results suggest a reanalysis of the dimensions of phonological intonational distinctions. We propose that three layers affect the intonational contour: global extrinsic, local extrinsic and intrinsic; the theme-rheme distinction may lie in the local extrinsic layer. It is the similarity of both the phonetic effects and the semantic information conveyed by the last two layers that has led to the confusion in results such as those reported here.

Enrico Zovato, Stefano Sandri, Silvia Quazza, and Leonardo Badino. Prosodic analysis of a multi-style corpus in the perspective of emotional speech synthesis. In Proc. ICSLP 2004, Jeju, Korea, 2004. [ bib | .pdf ]

A. Wray, S.J. Cox, M. Lincoln, and J. Tryggvason. A formulaic approach to translation at the post office: Reading the signs. Language and Communication, 24(1):59-75, 2004. [ bib | .pdf ]

TESSA is an interactive translation system designed to support transactions between a post office clerk and a deaf customer. The system translates the clerk's speech into British Sign Language (BSL), displayed on a screen, using a specially-developed avatar (virtual human). TESSA is a context-constrained exemplification of one of two basic approaches to machine translation, neither of which can currently fulfil all of the demands of successful automatic translation. Drawing on recent research in theoretical psycholinguistics, we show how TESSA is a convincing prototype model of one aspect of real human language processing. Ways are suggested of exploiting this parallel, potentially offering new possibilities for the future design of artificial language systems.

H. Christensen, B. Kolluru, Y. Gotoh, and S. Renals. From text summarisation to style-specific summarisation for broadcast news. In Proc. ECIR-2004, 2004. [ bib | .ps.gz | .pdf ]

In this paper we report on a series of experiments investigating the path from text summarisation to style-specific summarisation of spoken news stories. We show that the portability of traditional text summarisation features to broadcast news depends on the diffuseness of the information in the broadcast news story. An analysis of two categories of news stories (containing only read speech, or containing some spontaneous speech) demonstrates the importance of the style and the quality of the transcript when extracting the summary-worthy information content. Further experiments indicate the advantages of style-specific summarisation of broadcast news.

A. Dielmann and S. Renals. Dynamic Bayesian networks for meeting structuring. In Proc. IEEE ICASSP, 2004. [ bib | .ps.gz | .pdf ]

This paper is about the automatic structuring of multiparty meetings using audio information. We have used a corpus of 53 meetings, recorded using a microphone array and lapel microphones for each participant. The task was to segment meetings into a sequence of meeting actions, or phases. We have adopted a statistical approach using dynamic Bayesian networks (DBNs). Two DBN architectures were investigated: a two-level hidden Markov model (HMM) in which the acoustic observations were concatenated; and a multistream DBN in which two separate observation sequences were modelled. Additionally we have also explored the use of counter variables to constrain the number of action transitions. Experimental results indicate that the DBN architectures are an improvement over a simple baseline HMM, with the multistream DBN with counter constraints producing an action error rate of 6%.
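As a rough illustration of the counter idea, the sketch below scores a best action sequence while capping the number of action changes. It is a simplified Viterbi-style caricature with invented names, not the paper's DBN implementation.

```python
import numpy as np

def best_score_with_counter(log_lik, log_A, max_switches):
    """Viterbi-style scoring over meeting actions with a counter that
    caps the number of action-to-action transitions (a crude stand-in
    for the counter variables in the multistream DBN).

    log_lik: (T, S) per-frame action log-likelihoods
    log_A:   (S, S) action transition log-probabilities
    """
    T, S = log_lik.shape
    C = max_switches + 1
    delta = np.full((T, S, C), -np.inf)
    delta[0, :, 0] = log_lik[0]             # start: no switches used yet
    for t in range(1, T):
        for s in range(S):
            for c in range(C):
                best = delta[t - 1, s, c] + log_A[s, s]   # stay in action s
                if c > 0:                    # switching consumes the counter
                    prev = delta[t - 1, :, c - 1] + log_A[:, s]
                    prev[s] = -np.inf        # self-loop handled above
                    best = max(best, prev.max())
                delta[t, s, c] = best + log_lik[t, s]
    return delta[-1].max()                   # best terminal log-score
```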

Jithendra Vepa and Simon King. Join cost for unit selection speech synthesis. In Abeer Alwan and Shri Narayanan, editors, Speech Synthesis. Prentice Hall, 2004. [ bib | .ps ]

C. Mayo and A. Turk. The development of perceptual cue weighting within and across monosyllabic words. In LabPhon 9, University of Illinois at Urbana-Champaign, 2004. [ bib ]

Robert A.J. Clark, Korin Richmond, and Simon King. Festival 2 - build your own general purpose unit selection speech synthesiser. In Proc. 5th ISCA workshop on speech synthesis, 2004. [ bib | .ps | .pdf ]

This paper describes version 2 of the Festival speech synthesis system. Festival 2 provides a development environment for concatenative speech synthesis, and now includes a general purpose unit selection speech synthesis engine. We discuss various aspects of unit selection speech synthesis, focusing on the research issues that relate to voice design and the automation of the voice development process.

Rachel Baker, Robert A.J. Clark, and Michael White. Synthesising contextually appropriate intonation in limited domains. In Proc. 5th ISCA workshop on speech synthesis, Pittsburgh, USA, 2004. [ bib | .ps | .pdf ]

Leonardo Badino. Chinese text word segmentation considering semantic links among sentences. In Proc. ICSLP 2004, Jeju, Korea, 2004. [ bib | .pdf ]

A. Dielmann and S. Renals. Multi-stream segmentation of meetings. In Proc. IEEE Workshop on Multimedia Signal Processing, 2004. [ bib | .ps.gz | .pdf ]

This paper investigates the automatic segmentation of meetings into a sequence of group actions or phases. Our work is based on a corpus of multiparty meetings collected in a meeting room instrumented with video cameras, lapel microphones and a microphone array. We have extracted a set of feature streams, in this case extracted from the audio data, based on speaker turns, prosody and a transcript of what was spoken. We have related these signals to the higher level semantic categories via a multistream statistical model based on dynamic Bayesian networks (DBNs). We report on a set of experiments in which different DBN architectures are compared, together with the different feature streams. The resultant system has an action error rate of 9%.

Leonardo Badino, Claudia Barolo, and Silvia Quazza. Language independent phoneme mapping for foreign TTS. In Proc. 5th ISCA Speech Synthesis Workshop, Pittsburgh, USA, 2004. [ bib | .pdf ]

Y. H. Abdel-Haleem, S. Renals, and N. D. Lawrence. Acoustic space dimensionality selection and combination using the maximum entropy principle. In Proc. IEEE ICASSP, 2004. [ bib | .pdf ]

In this paper we propose a discriminative approach to acoustic space dimensionality selection based on maximum entropy modelling. We form a set of constraints by composing the acoustic space with the space of phone classes, and use a continuous feature formulation of maximum entropy modelling to select an optimal feature set. The suggested approach has two steps: (1) the selection of the best acoustic space that efficiently and economically represents the acoustic data and its variability; (2) the combination of selected acoustic features in the maximum entropy framework to estimate the posterior probabilities over the phonetic labels given the acoustic input. Specific contributions of this paper include a parameter estimation algorithm (generalized improved iterative scaling) that enables the use of negative features, the parameterization of constraint functions using Gaussian mixture models, and experimental results using the TIMIT database.
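For orientation, the posterior computation in such a model takes the standard log-linear form p(y|x) ∝ exp(Σ_i λ_i f_i(x, y)). The sketch below shows only that form; it is not the paper's parameter estimation algorithm (generalized improved iterative scaling), and the names are illustrative.

```python
import numpy as np

def maxent_posteriors(feats, lambdas):
    """Posterior over phone classes under a maximum entropy model.

    feats:   (n_classes, n_features) constraint-function values
             f_i(x, y) for a single acoustic frame x
    lambdas: (n_features,) learned weights
    """
    scores = feats @ lambdas
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()              # normalised posterior p(y | x)
```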

Leonardo Badino, Claudia Barolo, and Silvia Quazza. A general approach to TTS reading of mixed-language texts. In Proc. ICSLP 2004, Jeju, Korea, 2004. [ bib | .pdf ]

C. Mayo and A. Turk. Adult-child differences in acoustic cue weighting are influenced by segmental context: Children are not always perceptually biased towards transitions. Journal of the Acoustical Society of America, 115:3184-3194, 2004. [ bib | .pdf ]