Haruto Takeda, Naoki Saito, Tomoshi Otsuki, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Hidden Markov Model for Automatic Transcription of MIDI Signals. In 2002 International Workshop on Multimedia Signal Processing, December 2002. [ bib | .pdf ]

J. Vepa, S. King, and P. Taylor. Objective distance measures for spectral discontinuities in concatenative speech synthesis. In Proc. ICSLP, Denver, USA, September 2002. [ bib | .pdf ]

In unit selection based concatenative speech systems, `join cost', which measures how well two units can be joined together, is one of the main criteria for selecting appropriate units from the inventory. The ideal join cost will measure `perceived' discontinuity, based on easily measurable spectral properties of the units being joined, in order to ensure smooth and natural-sounding synthetic speech. In this paper we report a perceptual experiment conducted to measure the correlation between `subjective' human perception and various `objective' spectrally-based measures proposed in the literature. Our experiments used a state-of-the-art unit-selection text-to-speech system: `rVoice' from Rhetorical Systems Ltd.

J. Vepa, S. King, and P. Taylor. New objective distance measures for spectral discontinuities in concatenative speech synthesis. In Proc. IEEE 2002 workshop on speech synthesis, Santa Monica, USA, September 2002. [ bib | .pdf ]

The quality of unit selection based concatenative speech synthesis mainly depends on how well two successive units can be joined together to minimise audible discontinuities. The objective measure of discontinuity used when selecting units is known as the `join cost'. The ideal join cost will measure `perceived' discontinuity, based on easily measurable spectral properties of the units being joined, in order to ensure smooth and natural-sounding synthetic speech. In this paper we describe a perceptual experiment conducted to measure the correlation between `subjective' human perception and various `objective' spectrally-based measures proposed in the literature. We also report new objective distance measures, derived from various distance metrics based on these spectral features, which correlate well with human perception of concatenation discontinuities. Our experiments used a state-of-the-art unit-selection text-to-speech system: `rVoice' from Rhetorical Systems Ltd.
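
For readers unfamiliar with the term, the sketch below illustrates the general idea of an objective join cost: a spectral distance computed across a proposed concatenation point. The MFCC-based Euclidean distance shown is one of the simple measures considered in this line of work; the function and variable names are illustrative and are not taken from the rVoice system.

    import numpy as np

    def join_cost(left_unit_mfccs, right_unit_mfccs, weights=None):
        """Illustrative join cost: spectral distance between the frames on
        either side of a proposed concatenation point.

        left_unit_mfccs, right_unit_mfccs: arrays of shape (frames, coeffs)
        holding MFCC vectors for the two candidate units."""
        left_edge = left_unit_mfccs[-1]    # last frame of the left unit
        right_edge = right_unit_mfccs[0]   # first frame of the right unit
        diff = left_edge - right_edge
        if weights is not None:
            diff = diff * weights          # optional per-coefficient weights
        return float(np.sqrt(np.sum(diff ** 2)))  # Euclidean spectral distance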

Kanad Keeni and Hiroshi Shimodaira. On Selection of Training Data for Fast Learning of Neural Networks Using Back Propagation. In IASTED International Conference on Artificial Intelligence and Application (AIA2002), pages 474-478, September 2002. [ bib ]

Junko Tokuno, Nobuhito Inami, Shigeki Matsuda, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Context-Dependent Substroke Model for HMM-based On-line Handwriting Recognition. In Proc. IWFHR-8, pages 78-83, August 2002. [ bib | .pdf ]

This paper describes an effective modeling technique for on-line recognition of cursive Kanji and Hiragana handwriting. Our conventional recognition system based on substroke HMMs (hidden Markov models) employs straight-type substrokes as primary models and has achieved a high recognition rate for carefully written Kanji. On the other hand, the recognition rate for cursive handwriting is comparatively low, since it consists mainly of curved strokes. We therefore propose a technique of using multiple models for each substroke by considering the substroke context, that is, the preceding and following substrokes. In order to construct these context-dependent models efficiently, we use the SSS (Successive State Splitting) algorithm developed in speech recognition. In our experiments, the recognition rate improved from 88% to 92% for cursive Kanji handwriting and from 90% to 98% for Hiragana handwriting.

Mitsuru Nakai, Takashi Sudo, Hiroshi Shimodaira, and Shigeki Sagayama. Pen Pressure Features for Writer-Independent On-Line Handwriting Recognition Based on Substroke HMM. In Proc. ICPR2002, III, pages 220-223, August 2002. [ bib | .pdf ]

Shin-ichi Kawamoto, Hiroshi Shimodaira, Tsuneo Nitta, Takuya Nishimoto, Satoshi Nakamura, Katsunobu Itou, Shigeo Morishima, Tatsuo Yotsukura, Atsuhiko Kai, Akinobu Lee, Yoichi Yamashita, Takao Kobayashi, Keiichi Tokuda, Keikichi Hirose, Nobuaki Minematsu, Atsushi Yamada, Yasuharu Den, Takehito Utsuro, and Shigeki Sagayama. Open-source software for developing anthropomorphic spoken dialog agent. In Proc. PRICAI-02, International Workshop on Lifelike Animated Agents, pages 64-69, August 2002. [ bib | .pdf ]

Shin-ichi Kawamoto, Hiroshi Shimodaira, et al. Design of Software Toolkit for Anthropomorphic Spoken Dialog Agent Software with Customization-oriented Features. Information Processing Society of Japan (IPSJ) Journal, 43(7):2249-2263, July 2002. (in Japanese). [ bib ]

Jun Rokui, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Speaker Normalization Using Linear Transformation of Vocal Tract Length Based on Maximum Likelihood Estimation. Information Processing Society of Japan (IPSJ), 43(7):2030-2037, July 2002. (in Japanese). [ bib ]

Hiroshi Shimodaira, Nobuyoshi Sakai, Mitsuru Nakai, and Shigeki Sagayama. Jacobian Joint Adaptation to Noise, Channel and Vocal Tract Length. In Proc. ICASSP2002, pages 197-200, May 2002. [ bib | .pdf ]

A new Jacobian approach that linearly decomposes the composite of additive noise, multiplicative noise (channel transfer function) and the speaker's vocal tract length, and adapts the acoustic model parameters simultaneously to these factors, is proposed in this paper. Because these factors degrade the observed speech features non-linearly, existing approaches fail to adapt the acoustic models adequately. Approximating the non-linear operation by a linear model makes it possible to estimate the factors by least squares and to adapt the acoustic model parameters with a small amount of speech data. Speech recognition experiments on the ATR isolated-word database demonstrate a significant reduction in error rates, which supports the effectiveness of the proposed scheme.
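
As a rough illustration of the linearisation described here, the sketch below applies a first-order (Jacobian) update for a change in additive noise only, working in the log filterbank domain for clarity; a real system applies the same update in the cepstral domain via the DCT and includes channel and vocal-tract-length terms. All names are illustrative assumptions, not the paper's notation.

    import numpy as np

    def jacobian_noise_adapt(model_logspec, noise_old_logspec, noise_new_logspec):
        """First-order (Jacobian) update of a noisy-speech model mean to a new
        additive-noise estimate, in the log filterbank domain.

        model_logspec:     model mean, log of (speech + old noise) energies
        noise_old_logspec: log energies of the noise used for compensation
        noise_new_logspec: log energies of the noise in the new environment"""
        noisy = np.exp(model_logspec)          # S + N_old in the linear domain
        noise_old = np.exp(noise_old_logspec)  # N_old in the linear domain
        jac = noise_old / noisy                # d log(S+N) / d log N = N / (S+N)
        delta = noise_new_logspec - noise_old_logspec
        return model_logspec + jac * delta     # linearised adaptation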

Yoshinori Matsushita, Shin-ichi Kawamoto, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. A Head-Behavior Synchronization Model with Utterance for Anthropomorphic Spoken-Dialog Agent. In Technical Report of IEICE, HIS2001, March 2002. (in Japanese). [ bib ]

A novel method for synthesizing the head motion of an anthropomorphic spoken dialog agent in synchrony with its utterance is proposed. Although much effort has been devoted to synchronizing lip motion with utterances, very little research exists on such head-motion control. A neural network is employed to learn the relationship between the acoustic features of the utterance and the head motion measured by a motion-capturing system. The proposed method makes it possible to automatically generate facial animation that moves synchronously with any given utterance. A subjective evaluation of the method's performance is also reported.
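
As a generic illustration of the learning setup described here (not the paper's actual network, features or targets), the sketch below fits a small neural network that maps per-frame acoustic features to head-pose angles of the kind recorded with motion capture.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Placeholder data: per-frame acoustic features (e.g. energy / F0-related
    # measures) and head-pose angles (pitch, yaw, roll) from motion capture.
    rng = np.random.default_rng(0)
    X_acoustic = rng.normal(size=(1000, 13))
    Y_headpose = rng.normal(size=(1000, 3))

    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500)
    net.fit(X_acoustic, Y_headpose)                  # learn acoustics -> head motion

    Y_predicted = net.predict(rng.normal(size=(10, 13)))  # head motion for new audio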

Tomoshi Otsuki, Naoki Saitou, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Musical Rhythm Recognition Using Hidden Markov Model. Information Processing Society of Japan (IPSJ) Journal, 43(2), February 2002. (in Japanese). [ bib ]

H. P. Graf, E. Cosatto, V. Strom, and F. J. Huang. Visual prosody: Facial movements accompanying speech. In Proc Fifth Int. Conf. Automatic Face and Gesture Recognition, pages 397-401, 2002. [ bib | .ps | .pdf ]

As we articulate speech, we usually move the head and exhibit various facial expressions. This visual aspect of speech aids understanding and helps communicate additional information, such as the speaker's mood. In this paper we quantitatively analyze head and facial movements that accompany speech and investigate how they relate to the text's prosodic structure. We recorded several hours of speech and measured the locations of the speakers' main facial features as well as their head poses. The text was evaluated with a prosody prediction tool, identifying phrase boundaries and pitch accents. Characteristic of most speakers are simple motion patterns that are repeatedly applied in synchrony with the main prosodic events. Direction and strength of head movements vary widely from one speaker to another, yet their timing is typically well synchronized with the spoken text. Understanding quantitatively the correlations between head movements and spoken text is important for synthesizing photo-realistic talking heads. Talking heads appear much more engaging when they exhibit realistic motion patterns.

K. Richmond. Estimating Articulatory Parameters from the Acoustic Speech Signal. PhD thesis, The Centre for Speech Technology Research, Edinburgh University, 2002. [ bib | .ps ]

A successful method for inferring articulation from the acoustic speech signal would find many applications: low bit-rate speech coding, visual representation of speech, and the possibility of improved automatic speech recognition to name but a few. It is unsurprising, therefore, that researchers have been investigating the acoustic-to-articulatory inversion mapping for several decades now. A great variety of approaches and models have been applied to the problem. Unfortunately, the overwhelming majority of these attempts have faced difficulties in satisfactorily assessing performance in terms of genuine human articulation. However, technologies such as electromagnetic articulography (EMA) mean that measurement of human articulation during speech has become increasingly accessible. Crucially, a large corpus of acoustic-articulatory data during phonetically-diverse, continuous speech has recently been recorded at Queen Margaret College, Edinburgh. One of the primary motivations of this thesis is to exploit the availability of this remarkable resource. Among the data-driven models which have been employed in previous studies, the feedforward multilayer perceptron (MLP) in particular has been used several times with promising results. Researchers have cited advantages in terms of memory requirement and execution speed as a significant factor motivating their use. Furthermore, the MLP is well known as a universal function approximator; an MLP of suitable form can in theory represent any arbitrary mapping function. Therefore, using an MLP in conjunction with the relatively large quantities of acoustic-articulatory data arguably represents a promising and useful first research step for the current thesis, and a significant part of this thesis is occupied with doing this. Having demonstrated an MLP which performs well enough to provide a reasonable baseline, we go on to critically evaluate the suitability of the MLP for the inversion mapping. The aim is to find ways to improve modelling accuracy further. Considering what model of the target articulatory domain is provided in the MLP is key in this respect. It has been shown that the outputs of an MLP trained with the sum-of-squares error function approximate the mean of the target data points conditioned on the input vector. In many situations, this is an appropriate and sufficient solution. In other cases, however, this conditional mean is an inconveniently limiting model of data in the target domain, particularly for ill-posed problems where the mapping may be multi-valued. Substantial evidence exists which shows that multiple articulatory configurations are able to produce the same acoustic signal. This means that a system intended to map from a point in acoustic space can be faced with multiple candidate articulatory configurations. Therefore, despite the impressive ability of the MLP to model mapping functions, it may prove inadequate in certain respects for performing the acoustic-to-articulatory inversion mapping. Mixture density networks (MDN) provide a principled method to model arbitrary probability density functions over the target domain, conditioned on the input vector. In theory, therefore, the MDN offers a superior model of the target domain compared to the MLP. We hypothesise that this advantage will prove beneficial in the case of the acoustic-to-articulatory inversion mapping. 
Accordingly, this thesis aims to test this hypothesis and directly compare the performance of MDN with MLP on exactly the same acoustic-to-articulatory inversion task.
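
To make the distinction concrete: an MLP trained with a sum-of-squares error yields only the conditional mean of the articulatory targets, whereas an MDN outputs the parameters of a full conditional density over them. The sketch below evaluates such a density as a Gaussian mixture whose weights, means and (spherical) variances are assumed to be network outputs for one acoustic input; the form and names are illustrative, not the thesis's exact parameterisation.

    import numpy as np

    def mdn_conditional_density(y, pi, mu, sigma):
        """Conditional density p(y | x) as modelled by a mixture density network:
        a Gaussian mixture whose parameters (pi, mu, sigma) are produced by the
        network for a given acoustic input x.

        pi:    (M,)   mixture weights (softmax outputs, summing to 1)
        mu:    (M, D) component means over the articulatory dimensions
        sigma: (M,)   spherical component standard deviations
        y:     (D,)   candidate articulatory configuration"""
        D = y.shape[0]
        density = 0.0
        for m in range(len(pi)):
            diff = y - mu[m]
            norm = (2.0 * np.pi * sigma[m] ** 2) ** (-D / 2.0)
            density += pi[m] * norm * np.exp(-diff @ diff / (2.0 * sigma[m] ** 2))
        return density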

Jesper Salomon, Simon King, and Miles Osborne. Framewise phone classification using support vector machines. In Proceedings International Conference on Spoken Language Processing, Denver, 2002. [ bib | .ps | .pdf ]

We describe the use of Support Vector Machines for phonetic classification on the TIMIT corpus. Unlike previous work, in which entire phonemes are classified, our system operates in a framewise manner and is intended for use as the front-end of a hybrid system similar to ABBOT. We therefore avoid the problems of classifying variable-length vectors. Our frame-level phone classification accuracy on the complete TIMIT test set is competitive with other results from the literature. In addition, we address the serious problem of scaling Support Vector Machines by using the Kernel Fisher Discriminant.
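
The sketch below shows generic framewise classification with an off-the-shelf SVM, assuming one fixed-length feature vector and one phone label per frame; it is not the paper's system, which used its own implementation and the Kernel Fisher Discriminant to cope with the size of the training set.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(500, 39))      # placeholder per-frame features
    y_train = rng.integers(0, 10, size=500)   # placeholder phone labels

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X_train, y_train)

    X_test = rng.normal(size=(20, 39))
    frame_labels = clf.predict(X_test)        # one phone label per frame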

M. Wester, J.M. Kessens, and H. Strik. Goal-directed ASR in a multimedia indexing and searching environment (MUMIS). In Proc. ICSLP, pages 1993-1996, Denver, 2002. [ bib | .pdf ]

This paper describes the contribution of automatic speech recognition (ASR) within the framework of MUMIS (Multimedia Indexing and Searching Environment). The domain is football commentaries. The initial results of carrying out ASR on Dutch and English football commentaries are presented. We found that overall word error rates are high, but application specific words are recognized reasonably well. The difficulty of the ASR task is greatly increased by the high levels of noise present in the material.

O. Goubanova. Forms of introduction in map task dialogues: Case of L2 Russian speakers. In Proc. ICSLP 2002, Denver, USA, 2002. [ bib ]

A. J. Robinson, G. D. Cook, D. P. W. Ellis, E. Fosler-Lussier, S. J. Renals, and D. A. G. Williams. Connectionist speech recognition of broadcast news. Speech Communication, 37:27-45, 2002. [ bib | .ps.gz | .pdf ]

This paper describes connectionist techniques for recognition of Broadcast News. The fundamental difference between connectionist systems and more conventional mixture-of-Gaussian systems is that connectionist models directly estimate posterior probabilities as opposed to likelihoods. Access to posterior probabilities has enabled us to develop a number of novel approaches to confidence estimation, pronunciation modelling and search. In addition we have investigated a new feature extraction technique based on the modulation-filtered spectrogram, and methods for combining multiple information sources. We have incorporated all of these techniques into a system for the transcription of Broadcast News, and we present results on the 1998 DARPA Hub-4E Broadcast News evaluation data.
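
The key difference mentioned here is that a connectionist acoustic model estimates phone posteriors P(q|x) directly; in a hybrid HMM system these are commonly converted to scaled likelihoods by dividing by the phone priors before decoding. A minimal sketch of that conversion (variable names are assumptions):

    import numpy as np

    def posteriors_to_scaled_likelihoods(posteriors, priors, floor=1e-8):
        """Convert per-frame phone posteriors P(q|x), as output by a connectionist
        acoustic model, into scaled likelihoods p(x|q) / p(x) by dividing by the
        phone priors P(q).  Returned in the log domain for use by an HMM decoder.

        posteriors: (frames, phones) network outputs
        priors:     (phones,) phone priors estimated from the training data"""
        scaled = np.maximum(posteriors, floor) / np.maximum(priors, floor)
        return np.log(scaled)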

Helen Wright-Hastie, Massimo Poesio, and Stephen Isard. Automatically predicting dialogue structure using prosodic features. Speech Communication, 36(1-2):63-79, 2002. [ bib ]

S. J. Cox, M. Lincoln, J. Tryggvason, M. Nakisa, M. Wells, M. Tutt, and S. Abbott. TESSA, a system to aid communication with deaf people. In ASSETS 2002, Fifth International ACM SIGCAPH Conference on Assistive Technologies, pages 205-212, Edinburgh, Scotland, 2002. [ bib | .pdf ]

TESSA is an experimental system that aims to aid transactions between a deaf person and a clerk in a Post Office by translating the clerk's speech to sign language. A speech recogniser recognises speech from the clerk, and the system then synthesizes the appropriate sequence of signs in British Sign Language (BSL) using a specially developed avatar. By using a phrase-lookup approach to language translation, which is appropriate for the highly constrained discourse in a Post Office, we were able to build a working system that we could evaluate. We summarise the results of this evaluation (undertaken by deaf users and Post Office clerks), and discuss how the findings from the evaluation are being used in the development of an improved system.

O. Pietquin and S. Renals. ASR system modeling for automatic evaluation and optimization of dialogue systems. In Proc IEEE ICASSP, pages 46-49, 2002. [ bib | .pdf ]

Though the field of spoken dialogue systems has developed quickly in the last decade, rapid design of dialogue strategies remains difficult. Several approaches to the problem of automatic strategy learning have been proposed, and the use of Reinforcement Learning introduced by Levin and Pieraccini is becoming part of the state of the art in this area. However, the quality of the strategy learned by the system depends on the definition of the optimization criterion and on the accuracy of the environment model. In this paper, we propose to bring a model of an ASR system into the simulated environment in order to enhance the learned strategy. To do so, we introduce recognition error rates and confidence levels produced by ASR systems into the optimization criterion.
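
As a rough, hypothetical illustration of folding ASR behaviour into the optimization criterion, the sketch below defines a per-dialogue reward that penalises simulated recognition errors and rewards confident recognitions alongside task success and dialogue length; the weights and functional form are assumptions, not the paper's actual criterion.

    def dialogue_reward(task_completed, n_turns, sim_error_rate, mean_confidence,
                        w_success=20.0, w_turn=1.0, w_error=5.0, w_conf=2.0):
        """Hypothetical optimization criterion for one simulated dialogue,
        combining task success, dialogue length and simulated ASR behaviour."""
        reward = w_success if task_completed else 0.0
        reward -= w_turn * n_turns              # prefer short dialogues
        reward -= w_error * sim_error_rate      # penalise simulated ASR errors
        reward += w_conf * mean_confidence      # reward confident recognitions
        return reward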

V. Wan and S. Renals. Evaluation of kernel methods for speaker verification and identification. In Proc IEEE ICASSP, pages 669-672, 2002. [ bib | .pdf ]

Support vector machines are evaluated on speaker verification and speaker identification tasks. We compare the polynomial kernel, the Fisher kernel, a likelihood ratio kernel and the pair hidden Markov model kernel with baseline systems based on a discriminative polynomial classifier and generative Gaussian mixture model classifiers. Simulations were carried out on the YOHO database and some promising results were obtained.

Sasha Calhoun. Using prosody in ASR: the segmentation of broadcast radio news. Master's thesis, University of Edinburgh, 2002. [ bib | .pdf ]

This study explores how prosodic information can be used in Automatic Speech Recognition (ASR). A system was built which automatically identifies topic boundaries in a corpus of broadcast radio news. We evaluate the effectiveness of different types of features, including textual, durational, F0, Tilt and ToBI features in that system. These features were suggested by a review of the literature on how topic structure is indicated by humans and recognised by both humans and machines from both a linguistic and natural language processing standpoint. In particular, we investigate whether acoustic cues to prosodic information can be used directly to indicate topic structure, or whether it is better to derive discourse structure from intonational events, such as ToBI events, in a manner suggested by Steedman's (2000) theory, among others. It was found that the global properties of an utterance (mean and maximum F0) and textual features (based on Hearst's (1997) lexical scores and cue phrases) were effective in recognising topic boundaries on their own whereas all other features investigated were not. Performance using Tilt and ToBI features was disappointing, although this could have been because of inaccuracies in estimating these parameters. We suggest that different acoustic cues to prosody are more effective in recognising discourse information at certain levels of discourse structure than others. The identification of higher level structure is informed by the properties of lower level structure. Although the findings of this study were not conclusive on this issue, we propose that prosody in ASR and synthesis should be represented in terms of the intonational events relevant to each level of discourse structure. Further, at the level of topic structure, a taxonomy of events is needed to describe the global F0 properties of each utterance that makes up that structure.

V. Strom. From text to speech without ToBI. In Proc. ICSLP, Denver, 2002. [ bib | .ps | .pdf ]

A new method for predicting prosodic parameters, i.e. phone durations and F0 targets, from preprocessed text is presented. The prosody model comprises a set of CARTs, which are learned from a large database of labeled speech. This database need not be annotated with Tone and Break Indices (ToBI labels). Instead, a simpler symbolic prosodic description is created by a bootstrapping method. The method was applied to one Spanish and two German speakers. For the German voices, two listening tests showed a significant preference for the new method over a more traditional approach to prosody prediction based on hand-crafted rules.
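
As a generic illustration of CART-based prosody prediction (not the features or trees used in the paper), the sketch below trains a regression tree to predict phone durations from a few symbolic features.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Placeholder data: each row encodes symbolic features for one phone
    # (e.g. phone class, stress flag, position in syllable, distance to the
    # next phrase boundary); the target is the phone duration in milliseconds.
    rng = np.random.default_rng(0)
    X_train = rng.integers(0, 5, size=(300, 4)).astype(float)
    dur_train = 60.0 + 10.0 * X_train[:, 1] + rng.normal(scale=5.0, size=300)

    cart = DecisionTreeRegressor(max_depth=6, min_samples_leaf=10)
    cart.fit(X_train, dur_train)

    predicted_durations = cart.predict(rng.integers(0, 5, size=(5, 4)).astype(float))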

Fiona Couper. Switching linear dynamical models for automatic speech recognition. Master's thesis, University of Edinburgh, 2002. [ bib | .pdf ]

The field of speech recognition research has been dominated by the Hidden Markov Model (HMM) in recent years. The HMM has known weaknesses, such as the strong “independence assumption” which presumes observations to be uncorrelated. New types of statistical modelling are now being investigated to overcome the weaknesses of HMMs. One such model is the Linear Dynamical Model (LDM), whose properties are more appropriate to speech. Modelling phone segments with LDMs gives fairly good classification and recognition scores, and this report explores possible extensions to a system using such models. Training only one model per phone cannot fully model variation that exists in speech, and perhaps training more than one model for some segments will improve accuracy scores. This is investigated here, and four methods for building two models instead of one for any phone are presented. Three of the methods produce significantly increased classification accuracy scores, compared to a set of single models.
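
For readers unfamiliar with the model, a linear dynamical model of a phone segment has the state-space form sketched below: a hidden state evolving linearly with Gaussian noise, and a noisy linear observation per acoustic frame. The matrices and dimensions here are generic placeholders, not those of the system described.

    import numpy as np

    def simulate_ldm_segment(A, H, Q, R, x0, n_frames, seed=0):
        """Minimal linear dynamical model of a phone segment:
            x_t = A x_{t-1} + w_t,  w_t ~ N(0, Q)   (hidden state)
            y_t = H x_t     + v_t,  v_t ~ N(0, R)   (observed acoustic frame)
        Returns the simulated observation sequence."""
        rng = np.random.default_rng(seed)
        x = x0
        frames = []
        for _ in range(n_frames):
            x = A @ x + rng.multivariate_normal(np.zeros(Q.shape[0]), Q)
            y = H @ x + rng.multivariate_normal(np.zeros(R.shape[0]), R)
            frames.append(y)
        return np.array(frames)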

Mirjam Wester. Pronunciation Variation Modeling for Dutch Automatic Speech Recognition. PhD thesis, University of Nijmegen, 2002. [ bib | .pdf ]

This thesis consists of an introductory review to pronunciation variation modeling, followed by four papers in which the PhD research is described.

C. Mayo, A. Turk, and J. Watson. Development of cue weighting strategies in children's speech perception. In Proceedings of TIPS: Temporal Integration in the Perception of Speech, Aix-en-Provence, 2002. [ bib ]

Juergen Schroeter, Alistair Conkie, Ann Syrdal, Mark Beutnagel, Matthias Jilka, Volker Strom, Yeon-Jun Kim, Hong-Goo Kang, and David Kapilow. A perspective on the next challenges for TTS. In IEEE 2002 Workshop in Speech Synthesis, pages 11-13, Santa Monica, CA, 2002. [ bib | .ps | .pdf ]

The quality of speech synthesis has come a long way since Homer Dudley's “Vocoder” in 1939. In fact, with the widespread use of unit-selection synthesizers, the naturalness of the synthesized speech is now high enough to pass the Turing test for short utterances, such as prompts. Therefore, it seems valid to ask the question “what are the next challenges for TTS research?” This paper tries to identify unresolved issues, the solution of which would greatly enhance the state of the art in TTS.