Hisashi Kawai, Tomoki Toda, Junichi Yamagishi, Toshio Hirai, Jinfu Ni, Nobuyuki Nishizawa, Minoru Tsuzaki, and Keiichi Tokuda. Ximera: a concatenative speech synthesis system with large scale corpora. IEICE Trans. Information and Systems, J89-D-II(12):2688-2698, December 2006. [ bib ]

Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira. Reinforcement learning of dialogue strategies with hierarchical abstract machines. In Proc. IEEE/ACL Workshop on Spoken Language Technology (SLT), December 2006. [ bib | .pdf ]

In this paper we propose partially specified dialogue strategies for dialogue strategy optimization, where part of the strategy is specified deterministically and the rest optimized with Reinforcement Learning (RL). To do this we apply RL with Hierarchical Abstract Machines (HAMs). We also propose to build simulated users using HAMs, incorporating a combination of hierarchical deterministic and probabilistic behaviour. We performed experiments using a single-goal flight booking dialogue system, and compare two dialogue strategies (deterministic and optimized) using three types of simulated user (novice, experienced and expert). Our results show that HAMs are promising for both dialogue optimization and simulation, and provide evidence that indeed partially specified dialogue strategies can outperform deterministic ones (on average 4.7 fewer system turns) with faster learning than the traditional RL framework.

Chie Shimodaira, Hiroshi Shimodaira, and Susumu Kunifuji. A Divergent-Style Learning Support Tool for English Learners Using a Thesaurus Diagram. In Proc. KES2006, Bournemouth, United Kingdom, October 2006. [ bib | .pdf ]

This paper proposes an English learning support tool which provides users with divergent information to find the right words and expressions. In contrast to a number of software tools for English translation and composition, the proposed tool is designed to give users not only the right answer to the user's query but also a lot of words and examples which are relevant to the query. Based on the lexical information provided by the lexical database, WordNet, the proposed tool provides users with a thesaurus diagram, in which synonym sets and relation links are presented in multiple windows to help users to choose adequate words and understand similarities and differences between words. Subjective experiments are carried out to evaluate the system.

Junko Tokuno, Mitsuru Nakai, Hiroshi Shimodaira, Shigeki Sagayama, and Masaki Nakagawa. On-line Handwritten Character Recognition Selectively employing Hierarchical Spatial Relationships among Subpatterns. In Proc. IWFHR-10, La Baule, France, October 2006. [ bib ]

This paper proposes an on-line handwritten character pattern recognition method that examines spatial relationships among subpatterns which are components of a character pattern. Conventional methods evaluating spatial relationships among subpatterns have not considered characteristics of deformed handwritings and evaluate all the spatial relationships equally. However, the deformations of spatial features are different within a character pattern. In our approach, we assume that the distortions of spatial features are dependent on the hierarchy of character patterns so that we selectively evaluate hierarchical spatial relationships of subpatterns by employing a Bayesian network as a post-processor of our sub-stroke based HMM recognition system. Experiments on on-line handwritten Kanji character recognition with a lexicon of 1,016 elementary characters revealed that the approach we propose improves the recognition accuracy for different types of deformations.

Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira. Learning multi-goal dialogue strategies using reinforcement learning with reduced state-action spaces. In Proc. Interspeech, September 2006. [ bib | .pdf ]

Learning dialogue strategies using the reinforcement learning framework is problematic due to its expensive computational cost. In this paper we propose an algorithm that reduces a state-action space to one which includes only valid state-actions. We performed experiments on full and reduced spaces using three systems (with 5, 9 and 20 slots) in the travel domain using a simulated environment. The task was to learn multi-goal dialogue strategies optimizing single and multiple confirmations. Average results using strategies learnt on reduced spaces reveal the following benefits against full spaces: 1) less computer memory (94% reduction), 2) faster learning (93% faster convergence) and 3) better performance (8.4% fewer time steps and 7.7% higher reward).

Sue Fitt and Korin Richmond. Redundancy and productivity in the speech technology lexicon - can we do better? In Proc. Interspeech 2006, September 2006. [ bib | .pdf ]

Current lexica for speech technology typically contain much redundancy, while omitting useful information. A comparison with lexica in other media and for other purposes is instructive, as it highlights some features we may borrow for text-to-speech and speech recognition lexica. We describe some aspects of the new lexicon we are producing, Combilex, whose structure and implementation are specifically designed to reduce redundancy and improve the representation of productive elements of English. Most importantly, many English words are predictable derivations of baseforms, or compounds. Storing the lexicon as a combination of baseforms and derivational rules speeds up lexicon development, and improves coverage and maintainability.

Le Zhang and Steve Renals. Phone recognition analysis for trajectory HMM. In Proc. Interspeech 2006, Pittsburgh, USA, September 2006. [ bib | .pdf ]

The trajectory HMM has been shown to be useful for model-based speech synthesis where a smoothed trajectory is generated using temporal constraints imposed by dynamic features. To evaluate the performance of such a model on an ASR task, we present a trajectory decoder based on tree search with delayed path merging. Experiments on a speaker-dependent phone recognition task using the MOCHA-TIMIT database show that the MLE-trained trajectory model, while retaining the attractive properties of a proper generative model, tends to favour over-smoothed trajectories among competing hypotheses, and does not perform better than a conventional HMM. We use this to build an argument that models giving a better fit on training data may suffer a reduction of discrimination by being too faithful to training data. This partially explains why alternative acoustic models that try to explicitly model temporal constraints do not achieve significant improvements in ASR.

Jithendra Vepa and Simon King. Subjective evaluation of join cost and smoothing methods for unit selection speech synthesis. IEEE Transactions on Speech and Audio Processing, 14(5):1763-1771, September 2006. [ bib | .pdf ]

In unit selection-based concatenative speech synthesis, join cost (also known as concatenation cost), which measures how well two units can be joined together, is one of the main criteria for selecting appropriate units from the inventory. Usually, some form of local parameter smoothing is also needed to disguise the remaining discontinuities. This paper presents a subjective evaluation of three join cost functions and three smoothing methods. We describe the design and performance of a listening test. The three join cost functions were taken from our previous study, where we proposed join cost functions derived from spectral distances, which have good correlations with perceptual scores obtained for a range of concatenation discontinuities. This evaluation allows us to further validate their ability to predict concatenation discontinuities. The units for synthesis stimuli are obtained from a state-of-the-art unit selection text-to-speech system: rVoice from Rhetorical Systems Ltd. In this paper, we report listeners' preferences for each join cost in combination with each smoothing method.

J. Frankel and S. King. Observation process adaptation for linear dynamic models. Speech Communication, 48(9):1192-1199, September 2006. [ bib | .ps | .pdf ]

This work introduces two methods for adapting the observation process parameters of linear dynamic models (LDM) or other linear-Gaussian models. The first method uses the expectation-maximization (EM) algorithm to estimate transforms for location and covariance parameters, and the second uses a generalized EM (GEM) approach which reduces computation in making updates from O(p^6) to O(p^3), where p is the feature dimension. We present the results of speaker adaptation on TIMIT phone classification and recognition experiments with relative error reductions of up to 6%. Importantly, we find minimal differences in the results from EM and GEM. We therefore propose that the GEM approach be applied to adaptation of hidden Markov models which use non-diagonal covariances. We provide the necessary update equations.

R. Clark, K. Richmond, V. Strom, and S. King. Multisyn voices for the Blizzard Challenge 2006. In Proc. Blizzard Challenge Workshop (Interspeech Satellite), Pittsburgh, USA, September 2006. (http://festvox.org/blizzard/blizzard2006.html). [ bib | .pdf ]

This paper describes the process of building unit selection voices for the Festival Multisyn engine using the ATR dataset provided for the Blizzard Challenge 2006. We begin by discussing recent improvements that we have made to the Multisyn voice building process, prompted by our participation in the Blizzard Challenge 2006. We then go on to discuss our interpretation of the results observed. Finally, we conclude with some comments and suggestions for the formulation of future Blizzard Challenges.

Partha Lal. A comparison of singing evaluation algorithms. In Proc. Interspeech 2006, September 2006. [ bib | .pdf ]

This paper describes a system that compares user renditions of short sung clips with the original version of those clips. The F0 of both recordings was estimated and then Viterbi-aligned with each other. The total difference in pitch after alignment was used as a distance metric and transformed into a rating out of ten, to indicate to the user how close he or she was to the original singer. An existing corpus of sung speech was used for initial design and optimisation of the system. We then collected further development and evaluation corpora - these recordings were judged for closeness to an original recording by two human judges. The rankings assigned by those judges were used to design and optimise the system. The design was then implemented and deployed as part of a telephone-based entertainment application.
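
The alignment-and-scoring idea in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: dynamic time warping stands in for the Viterbi alignment, pitch values are assumed to be in semitones, and the function names and the `scale` parameter of the rating mapping are hypothetical.

```python
import math


def dtw_pitch_distance(ref, user):
    """Total absolute pitch difference along the cheapest monotonic alignment.

    DTW is used here as a stand-in for the Viterbi alignment in the abstract.
    """
    n, m = len(ref), len(user)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - user[j - 1])
            # Extend the cheapest of: stay on user frame, stay on ref frame, advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def rating_out_of_ten(distance, n_frames, scale=2.0):
    """Map mean per-frame distance to a 0..10 rating (hypothetical linear mapping)."""
    mean = distance / max(n_frames, 1)
    return max(0.0, 10.0 - scale * mean)


reference = [60.0, 62.0, 64.0]          # original singer, semitones
held_note = [60.0, 60.0, 62.0, 64.0]    # same melody, first note held longer
one_sharp = [61.0, 63.0, 65.0]          # same melody, one semitone sharp

print(rating_out_of_ten(dtw_pitch_distance(reference, held_note), len(reference)))
print(rating_out_of_ten(dtw_pitch_distance(reference, one_sharp), len(reference)))
```

Because the alignment absorbs timing differences, the held note costs nothing, while the consistently sharp rendition is penalised in proportion to its pitch offset.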

Robert A. J. Clark and Simon King. Joint prosodic and segmental unit selection speech synthesis. In Proc. Interspeech 2006, Pittsburgh, USA, September 2006. [ bib | .ps | .pdf ]

We describe a unit selection technique for text-to-speech synthesis which jointly searches the space of possible diphone sequences and the space of possible prosodic unit sequences in order to produce synthetic speech with more natural prosody. We demonstrate that this search, although currently computationally expensive, can achieve improved intonation compared to a baseline in which only the space of possible diphone sequences is searched. We discuss ways in which the search could be made sufficiently efficient for use in a real-time system.

K. Richmond. A trajectory mixture density network for the acoustic-articulatory inversion mapping. In Proc. Interspeech, Pittsburgh, USA, September 2006. [ bib | .pdf ]

This paper proposes a trajectory model which is based on a mixture density network trained with target features augmented with dynamic features together with an algorithm for estimating maximum likelihood trajectories which respects constraints between the static and derived dynamic features. This model was evaluated on an inversion mapping task. We found the introduction of the trajectory model successfully reduced root mean square error by up to 7.5%, as well as increasing correlation scores.

G. Murray and S. Renals. Dialogue act compression via pitch contour preservation. In Proceedings of the 9th International Conference on Spoken Language Processing, Pittsburgh, USA, September 2006. [ bib | .pdf ]

This paper explores the usefulness of prosody in automatically compressing dialogue acts from meeting speech. Specifically, this work attempts to compress utterances by preserving the pitch contour of the original whole utterance. Two methods of doing this are described in detail and are evaluated subjectively using human annotators and objectively using edit distance with a human-authored gold-standard. Both metrics show that such a prosodic approach is much better than the random baseline approach and significantly better than a simple text compression method.
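
The objective evaluation mentioned in this abstract compares a compressed utterance with a human-authored gold standard by edit distance. A minimal sketch of a word-level Levenshtein distance follows; the tokenisation by whitespace, the unit costs, the function name, and the example sentences are all assumptions for illustration, not details taken from the paper.

```python
def word_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance with unit insert/delete/substitute costs."""
    hyp, ref = hypothesis.split(), reference.split()
    # prev[j] holds the distance between the hyp words seen so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitute = prev[j - 1] + (hyp_word != ref_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitute))
        prev = curr
    return prev[-1]


gold = "we meet monday"                      # hypothetical gold-standard compression
candidate = "we should meet on monday"       # hypothetical system output
print(word_edit_distance(candidate, gold))
```

A lower distance means the automatic compression is closer to what the human annotator produced.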

G. Murray, S. Renals, J. Moore, and J. Carletta. Incorporating speaker and discourse features into speech summarization. In Proceedings of the Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics Meeting (HLT-NAACL) 2006, New York City, USA, June 2006. [ bib | .pdf ]

The research presented herein explores the usefulness of incorporating speaker and discourse features in an automatic speech summarization system applied to meeting recordings from the ICSI Meetings corpus. By analyzing speaker activity, turn-taking and discourse cues, it is hypothesized that a system can outperform solely text-based methods inherited from the field of text summarization. The summarization methods are described, two evaluation methods are applied and compared, and the results clearly show that utilizing such features is advantageous and efficient. Even simple methods relying on discourse cues and speaker activity can outperform text summarization approaches.

B. Hachey, G. Murray, and D. Reitter. Dimensionality reduction aids term co-occurrence based multi-document summarization. In Proceedings of ACL Summarization Workshop 2006, Sydney, Australia, June 2006. [ bib | .pdf ]

A key task in an extraction system for query-oriented multi-document summarisation, necessary for computing relevance and redundancy, is modelling text semantics. In the Embra system, we use a representation derived from the singular value decomposition of a term co-occurrence matrix. We present methods to show the reliability of performance improvements. We find that Embra performs better with dimensionality reduction.

G. Murray, S. Renals, and M. Taboada. Prosodic correlates of rhetorical relations. In Proceedings of HLT/NAACL ACTS Workshop, 2006, New York City, USA, June 2006. [ bib | .pdf ]

This paper investigates the usefulness of prosodic features in classifying rhetorical relations between utterances in meeting recordings. Five rhetorical relations of contrast, elaboration, summary, question and cause are explored. Three training methods - supervised, unsupervised, and combined - are compared, and classification is carried out using support vector machines. The results of this pilot study are encouraging but mixed, with pairwise classification achieving an average of 68% accuracy in discerning between relation pairs using only prosodic features, but multi-class classification performing only slightly better than chance.

A. Janin, A. Stolcke, X. Anguera, K. Boakye, Ö. Çetin, J. Frankel, and J. Zheng. The ICSI-SRI spring 2006 meeting recognition system. In Proc. MLMI, Washington DC., May 2006. [ bib | .ps | .pdf ]

We describe the development of the ICSI-SRI speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2006 Meeting Rich Transcription (RT-06S) evaluation, highlighting improvements made since last year, including improvements to the delay-and-sum algorithm, the nearfield segmenter, language models, posterior-based features, HMM adaptation methods, and adapting to a small amount of new lecture data. Results are reported on RT-05S and RT-06S meeting data. Compared to the RT-05S conference system, we achieved an overall improvement of 4% relative in the MDM and SDM conditions, and 11% relative in the IHM condition. On lecture data, we achieved an overall improvement of 8% relative in the SDM condition, 12% on MDM, 14% on ADM, and 15% on IHM.

Peter Bell, Tina Burrows, and Paul Taylor. Adaptation of prosodic phrasing models. In Proc. Speech Prosody 2006, Dresden, Germany, May 2006. [ bib | .pdf ]

There is considerable variation in the prosodic phrasing of speech between different speakers and speech styles. Due to the time and cost of obtaining large quantities of data to train a model for every variation, it is desirable to develop models that can be adapted to new conditions with a limited amount of training data. We describe a technique for adapting HMM-based phrase boundary prediction models which alters a statistic distribution of prosodic phrase lengths. The adapted models show improved prediction performance across different speakers and types of spoken material.

M. Al-Hames, A. Dielmann, D. Gatica-Perez, S. Reiter, S. Renals, G. Rigoll, and D. Zhang. Multimodal integration for meeting group action segmentation and recognition. In S. Renals and S. Bengio, editors, Proc. Multimodal Interaction and Related Machine Learning Algorithms Workshop (MLMI-05), pages 52-63. Springer, 2006. [ bib ]

We address the problem of segmentation and recognition of sequences of multimodal human interactions in meetings. These interactions can be seen as a rough structure of a meeting, and can be used either as input for a meeting browser or as a first step towards a higher semantic analysis of the meeting. A common lexicon of multimodal group meeting actions, a shared meeting data set, and a common evaluation procedure enable us to compare the different approaches. We compare three different multimodal feature sets and four modelling infrastructures: a higher semantic feature approach, multi-layer HMMs, a multistream DBN, as well as a multi-stream mixed-state DBN for disturbed data.

Steve Renals, Samy Bengio, and Jonathan Fiscus, editors. Machine learning for multimodal interaction (Proceedings of MLMI '06), volume 4299 of Lecture Notes in Computer Science. Springer-Verlag, 2006. [ bib ]

T. Hain, L. Burget, J. Dines, G. Garau, M. Karafiat, M. Lincoln, J. Vepa, and V. Wan. The AMI meeting transcription system: Progress and performance. In Proceedings of the Rich Transcription 2006 Spring Meeting Recognition Evaluation, 2006. [ bib | .pdf ]

We present the AMI 2006 system for the transcription of speech in meetings. The system was jointly developed by multiple sites on the basis of the 2005 system for participation in the NIST RT'05 evaluations. The paper describes major developments such as improvements in automatic segmentation, cross-domain model adaptation, inclusion of MLP based features, improvements in decoding, language modelling and vocal tract length normalisation, the use of a new decoder, and a new system architecture. This is followed by a comprehensive description of the final system and its performance in the NIST RT'06s evaluations. In comparison to the previous year, word error rate results on the individual headset microphone task were reduced by 20% relative.

Simon King. Handling variation in speech and language processing. In Keith Brown, editor, Encyclopedia of Language and Linguistics. Elsevier, 2nd edition, 2006. [ bib ]

Sasha Calhoun. Information Structure and the Prosodic Structure of English: a Probabilistic Relationship. PhD thesis, University of Edinburgh, 2006. [ bib ]

This thesis looks at how information structure is signalled prosodically in English. It has been standardly held that information structure is primarily signalled by the distribution of pitch accents within syntax structure, as well as intonation event type. Rather, it is argued that previous work has underestimated the importance, and richness, of metrical prosodic structure and its role in signalling information structure. A new approach is proposed: to view information structure as a strong constraint on the mapping of words onto metrical prosodic structure. Focal elements (kontrast) align with nuclear prominence, while accents on other words are not usually directly 'meaningful'. Information units (theme/rheme) try to align with prosodic phrases. This mapping is probabilistic, so it is also influenced by lexical and syntactic effects, as well as rhythmical constraints and other features including emphasis. Qualitative and quantitative analysis is presented in support of these claims using the NXT Switchboard corpus which has been annotated with substantial new layers of semantic and prosodic features.

Simon King. Language variation in speech technologies. In Keith Brown, editor, Encyclopedia of Language and Linguistics. Elsevier, 2nd edition, 2006. [ bib ]

P. Hsueh, J. Moore, and S. Renals. Automatic segmentation of multiparty dialogue. In Proc. EACL06, 2006. [ bib | .pdf ]

In this paper, we investigate the problem of automatically predicting segment boundaries in spoken multiparty dialogue. We extend prior work in two ways. We first apply approaches that have been proposed for predicting top-level topic shifts to the problem of identifying subtopic boundaries. We then explore the impact on performance of using ASR output as opposed to human transcription. Examination of the effect of features shows that predicting top-level and predicting subtopic boundaries are two distinct tasks: (1) for predicting subtopic boundaries, the lexical cohesion-based approach alone can achieve competitive results, (2) for predicting top-level boundaries, the machine learning approach that combines lexical-cohesion and conversational features performs best, and (3) conversational cues, such as cue phrases and overlapping speech, are better indicators for the top-level prediction task. We also find that the transcription errors inevitable in ASR output have a negative impact on models that combine lexical-cohesion and conversational features, but do not change the general preference of approach for the two tasks.

Volker Strom, Robert Clark, and Simon King. Expressive prosody for unit-selection speech synthesis. In Proc. Interspeech, Pittsburgh, 2006. [ bib | .ps | .pdf ]

Current unit selection speech synthesis voices cannot produce emphasis or interrogative contours because of a lack of the necessary prosodic variation in the recorded speech database. A method of recording script design is proposed which addresses this shortcoming. Appropriate components were added to the target cost function of the Festival Multisyn engine, and a perceptual evaluation showed a clear preference over the baseline system.

Marc Al-Hames, Thomas Hain, Jan Cernocky, Sascha Schreiber, Mannes Poel, Ronald Mueller, Sebastien Marcel, David van Leeuwen, Jean-Marc Odobez, Sileye Ba, Hervé Bourlard, Fabien Cardinaux, Daniel Gatica-Perez, Adam Janin, Petr Motlicek, Stephan Reiter, Steve Renals, Jeroen van Rest, Rutger Rienks, Gerhard Rigoll, Kevin Smith, Andrew Thean, and Pavel Zemcik. Audio-video processing in meetings: Seven questions and current AMI answers. In S. Renals, S. Bengio, and J. G. Fiscus, editors, Machine Learning for Multimodal Interaction (Proc. MLMI '06), volume 4299 of Lecture Notes in Computer Science, pages 24-35. Springer, 2006. [ bib ]

Steve Renals and Samy Bengio, editors. Machine learning for multimodal interaction (Proceedings of MLMI '05), volume 3869 of Lecture Notes in Computer Science. Springer-Verlag, 2006. [ bib ]