The Centre for Speech Technology Research, The University of Edinburgh

Publications by Hiroshi Shimodaira

[1] Atef Ben Youssef, Hiroshi Shimodaira, and David Braude. Speech driven talking head from estimated articulatory features. In Proc. ICASSP, pages 4606-4610, Florence, Italy, May 2014. [ bib | .pdf ]
In this paper, we present a talking head in which the lips and head motion are controlled using articulatory movements estimated from speech. A phone-size HMM-based inversion mapping is employed and trained in a semi-supervised fashion. The advantage of the use of articulatory features is that they can drive the lip motions and they have a close link with head movements. Speech inversion normally requires training data recorded with an electromagnetic articulograph (EMA), which restricts the naturalness of head movements. The present study considers a more realistic recording condition where the training data for the target speaker are recorded with a usual motion capture system rather than EMA. Different temporal clustering techniques are investigated for HMM-based mapping as well as a GMM-based frame-wise mapping as a baseline system. Objective and subjective experiments show that the synthesised motions are more natural using an HMM system than a GMM one, and estimated EMA features outperform prosodic features.

[2] Atef Ben Youssef, Hiroshi Shimodaira, and David A. Braude. Head motion analysis and synthesis over different tasks. In Proc. Intelligent Virtual Agents, pages 285-294. Springer, September 2013. [ bib | .pdf ]
It is known that subjects vary in their head movements. This paper presents an analysis of this variety over different tasks and speakers and its impact on head motion synthesis. Measured head and articulatory movements acquired by an ElectroMagnetic Articulograph (EMA), synchronously recorded with audio, were used. A data set of speech of 12 people recorded on different tasks confirms that head motion varies over tasks and speakers. Experimental results confirmed that the proposed models were capable of learning and synthesising task-dependent head motions from speech. Subjective evaluation of synthesised head motion using task models shows that models trained on the matched task are better than those trained on a mismatched one, and that free speech data provide models that predict motion preferred by the participants compared to read speech data.

[3] Atef Ben Youssef, Hiroshi Shimodaira, and David A. Braude. Articulatory features for speech-driven head motion synthesis. In Proc. Interspeech, pages 2758-2762, Lyon, France, August 2013. [ bib | .pdf ]
This study investigates the use of articulatory features for speech-driven head motion synthesis as opposed to prosody features such as F0 and energy which have been mainly used in the literature. In the proposed approach, multi-stream HMMs are trained jointly on the synchronous streams of speech and head motion data. Articulatory features can be regarded as an intermediate parametrisation of speech that is expected to have a close link with head movement. Measured head and articulatory movements acquired by EMA were synchronously recorded with speech. Measured articulatory data were compared to those predicted from speech using an HMM-based inversion mapping system trained in a semi-supervised fashion. Canonical correlation analysis (CCA) on a data set of free speech of 12 people shows that the articulatory features are more correlated with head rotation than prosodic and/or cepstral speech features. It is also shown that the synthesised head motion using articulatory features gives higher correlations with the original head motion than when only prosodic features are used.

[4] David A. Braude, Hiroshi Shimodaira, and Atef Ben Youssef. Template-warping based speech driven head motion synthesis. In Proc. Interspeech, pages 2763-2767, Lyon, France, August 2013. [ bib | .pdf ]
We propose a method for synthesising head motion from speech using a combination of an Input-Output Markov model (IOMM) and Gaussian mixture models trained in a supervised manner. A key difference of this approach compared to others is to model the head motion in each angle as a series of templates of motion rather than trying to recover a frame-wise function. The templates were chosen to reflect natural patterns in the head motion, and states for the IOMM were chosen based on statistics of the templates. This reduces the search space for the trajectories and prevents impossible motions such as discontinuities. For synthesis our system warps the templates to account for the acoustic features and the other angles’ warping parameters. We show our system is capable of recovering the statistics of the motion that were chosen for the states. Our system was then compared to a baseline that used a frame-wise mapping that is based on previously published work. A subjective preference test that includes multiple speakers showed participants have a preference for the segment-based approach. Both of these systems were trained on storytelling free speech.

[5] David Adam Braude, Hiroshi Shimodaira, and Atef Ben Youssef. Template-warping based speech driven head motion synthesis. In Interspeech, pages 2763-2767, 2013. [ bib | .pdf ]
Keywords: Head motion synthesis, GMMs, IOMM
[6] David A. Braude, Hiroshi Shimodaira, and Atef Ben Youssef. The University of Edinburgh head-motion and audio storytelling (UoE-HaS) dataset. In Proc. Intelligent Virtual Agents, pages 466-467. Springer, 2013. [ bib | .pdf ]
In this paper we announce the release of a large dataset of storytelling monologue with motion capture for the head and body. Initial tests on the dataset indicate that head motion is more dependent on the speaker than the style of speech.

[7] Michael A. Berger, Gregor Hofer, and Hiroshi Shimodaira. Carnival - combining speech technology and computer animation. IEEE Computer Graphics and Applications, 31:80-89, 2011. [ bib | DOI ]
[8] Michael Berger, Gregor Hofer, and Hiroshi Shimodaira. Carnival: a modular framework for automated facial animation. Poster at SIGGRAPH 2010, 2010. Bronze award winner, ACM Student Research Competition. [ bib | .pdf ]
[9] Michal Dziemianko, Gregor Hofer, and Hiroshi Shimodaira. HMM-based automatic eye-blink synthesis from speech. In Proc. Interspeech, pages 1799-1802, Brighton, UK, September 2009. [ bib | .pdf ]
In this paper we present a novel technique to automatically synthesise eye blinking from a speech signal. Animating the eyes of a talking head is important as they are a major focus of attention during interaction. The developed system predicts eye blinks from the speech signal and generates animation trajectories automatically employing a "Trajectory Hidden Markov Model". The evaluation of the recognition performance showed that the timing of blinking can be predicted from speech with an F-score value upwards of 52%, which is well above chance. Additionally, a preliminary perceptual evaluation was conducted that confirmed that adding eye blinking significantly improves the perception of the character. Finally it showed that the speech synchronised synthesised blinks outperform random blinking in naturalness ratings.

[10] Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira. Evaluation of a hierarchical reinforcement learning spoken dialogue system. Computer Speech and Language, 24(2):395-429, 2009. [ bib | DOI | .pdf ]
We describe an evaluation of spoken dialogue strategies designed using hierarchical reinforcement learning agents. The dialogue strategies were learnt in a simulated environment and tested in a laboratory setting with 32 users. These dialogues were used to evaluate three types of machine dialogue behaviour: hand-coded, fully-learnt and semi-learnt. These experiments also served to evaluate the realism of simulated dialogues using two proposed metrics contrasted with ‘Precision-Recall’. The learnt dialogue behaviours used the Semi-Markov Decision Process (SMDP) model, and we report the first evaluation of this model in a realistic conversational environment. Experimental results in the travel planning domain provide evidence to support the following claims: (a) hierarchical semi-learnt dialogue agents are a better alternative (with higher overall performance) than deterministic or fully-learnt behaviour; (b) spoken dialogue strategies learnt with highly coherent user behaviour and conservative recognition error rates (keyword error rate of 20%) can outperform a reasonable hand-coded strategy; and (c) hierarchical reinforcement learning dialogue agents are feasible and promising for the (semi) automatic design of optimized dialogue behaviours in larger-scale systems.

[11] Gregor Hofer, Junichi Yamagishi, and Hiroshi Shimodaira. Speech-driven lip motion generation with a trajectory HMM. In Proc. Interspeech 2008, pages 2314-2317, Brisbane, Australia, September 2008. [ bib | .pdf ]
Automatic speech animation remains a challenging problem that can be described as finding the optimal sequence of animation parameter configurations given some speech. In this paper we present a novel technique to automatically synthesise lip motion trajectories from a speech signal. The developed system predicts lip motion units from the speech signal and generates animation trajectories automatically employing a "Trajectory Hidden Markov Model". Using the MLE criterion, its parameter generation algorithm produces the optimal smooth motion trajectories that are used to drive control points on the lips directly. Additionally, experiments were carried out to find a suitable model unit that produces the most accurate results. Finally, a perceptual evaluation was conducted that showed that the developed motion units perform better than phonemes.

[12] Gregor Hofer and Hiroshi Shimodaira. Automatic head motion prediction from speech data. In Proc. Interspeech 2007, Antwerp, Belgium, August 2007. [ bib | .pdf ]
In this paper we present a novel approach to generate a sequence of head motion units given some speech. The modelling approach is based on the notion that head motion can be divided into a number of short homogeneous units that can each be modelled individually. The system is based on Hidden Markov Models (HMM), which are trained on motion units and act as a sequence generator. They can be evaluated by an accuracy measure. A database of motion capture data was collected and manually annotated for head motion and is used to train the models. It was found that the model is good at distinguishing high activity regions from regions with less activity with accuracies around 75 percent. Furthermore the model is able to distinguish different head motion patterns based on speech features somewhat reliably, with accuracies reaching almost 70 percent.

[13] Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira. Hierarchical dialogue optimization using semi-Markov decision processes. In Proc. Interspeech, August 2007. [ bib | .pdf ]
This paper addresses the problem of dialogue optimization on large search spaces. For such a purpose, in this paper we propose to learn dialogue strategies using multiple Semi-Markov Decision Processes and hierarchical reinforcement learning. This approach factorizes state variables and actions in order to learn a hierarchy of policies. Our experiments are based on a simulated flight booking dialogue system and compare flat versus hierarchical reinforcement learning. Experimental results show that the proposed approach produced a dramatic search space reduction (99.36%), and converged four orders of magnitude faster than flat reinforcement learning with a very small loss in optimality (on average 0.3 system turns). Results also report that the learnt policies outperformed a hand-crafted one under three different conditions of ASR confidence levels. This approach is appealing to dialogue optimization due to faster learning, reusable subsolutions, and scalability to larger problems.

[14] Gregor Hofer, Hiroshi Shimodaira, and Junichi Yamagishi. Speech-driven head motion synthesis based on a trajectory model. Poster at SIGGRAPH 2007, 2007. [ bib | .pdf ]
[15] Gregor Hofer, Hiroshi Shimodaira, and Junichi Yamagishi. Lip motion synthesis using a context dependent trajectory hidden Markov model. Poster at SCA 2007, 2007. [ bib | .pdf ]
[16] Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira. Reinforcement learning of dialogue strategies with hierarchical abstract machines. In Proc. IEEE/ACL Workshop on Spoken Language Technology (SLT), December 2006. [ bib | .pdf ]
In this paper we propose partially specified dialogue strategies for dialogue strategy optimization, where part of the strategy is specified deterministically and the rest optimized with Reinforcement Learning (RL). To do this we apply RL with Hierarchical Abstract Machines (HAMs). We also propose to build simulated users using HAMs, incorporating a combination of hierarchical deterministic and probabilistic behaviour. We performed experiments using a single-goal flight booking dialogue system, and compare two dialogue strategies (deterministic and optimized) using three types of simulated user (novice, experienced and expert). Our results show that HAMs are promising for both dialogue optimization and simulation, and provide evidence that indeed partially specified dialogue strategies can outperform deterministic ones (on average 4.7 fewer system turns) with faster learning than the traditional RL framework.

[17] Chie Shimodaira, Hiroshi Shimodaira, and Susumu Kunifuji. A Divergent-Style Learning Support Tool for English Learners Using a Thesaurus Diagram. In Proc. KES2006, Bournemouth, United Kingdom, October 2006. [ bib | .pdf ]
This paper proposes an English learning support tool which provides users with divergent information to find the right words and expressions. In contrast to a number of software tools for English translation and composition, the proposed tool is designed to give users not only the right answer to the user's query but also a lot of words and examples which are relevant to the query. Based on the lexical information provided by the lexical database, WordNet, the proposed tool provides users with a thesaurus diagram, in which synonym sets and relation links are presented in multiple windows to help users to choose adequate words and understand similarities and differences between words. Subjective experiments are carried out to evaluate the system.

[18] Junko Tokuno, Mitsuru Nakai, Hiroshi Shimodaira, Shigeki Sagayama, and Masaki Nakagawa. On-line Handwritten Character Recognition Selectively employing Hierarchical Spatial Relationships among Subpatterns. In Proc. IWFHR-10, La Baule, France, October 2006. [ bib ]
This paper proposes an on-line handwritten character pattern recognition method that examines spatial relationships among subpatterns which are components of a character pattern. Conventional methods evaluating spatial relationships among subpatterns have not considered characteristics of deformed handwritings and evaluate all the spatial relationships equally. However, the deformations of spatial features are different within a character pattern. In our approach, we assume that the distortions of spatial features are dependent on the hierarchy of character patterns so that we selectively evaluate hierarchical spatial relationships of subpatterns by employing a Bayesian network as a post-processor of our sub-stroke based HMM recognition system. Experiments on on-line handwritten Kanji character recognition with a lexicon of 1,016 elementary characters revealed that the approach we propose improves the recognition accuracy for different types of deformations.

[19] Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira. Learning multi-goal dialogue strategies using reinforcement learning with reduced state-action spaces. In Proc. Interspeech, September 2006. [ bib | .pdf ]
Learning dialogue strategies using the reinforcement learning framework is problematic due to its expensive computational cost. In this paper we propose an algorithm that reduces a state-action space to one which includes only valid state-actions. We performed experiments on full and reduced spaces using three systems (with 5, 9 and 20 slots) in the travel domain using a simulated environment. The task was to learn multi-goal dialogue strategies optimizing single and multiple confirmations. Average results using strategies learnt on reduced spaces reveal the following benefits against full spaces: 1) less computer memory (94% reduction), 2) faster learning (93% faster convergence) and 3) better performance (8.4% fewer time steps and 7.7% higher reward).

[20] Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira. Human-computer dialogue simulation using hidden Markov models. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), November 2005. [ bib | .pdf ]
This paper presents a probabilistic method to simulate task-oriented human-computer dialogues at the intention level, which may be used to improve or to evaluate the performance of spoken dialogue systems. Our method uses a network of Hidden Markov Models (HMMs) to predict system and user intentions, where a “language model” predicts sequences of goals and the component HMMs predict sequences of intentions. We compare standard HMMs, Input HMMs and Input-Output HMMs in an effort to better predict sequences of intentions. In addition, we propose a dialogue similarity measure to evaluate the realism of the simulated dialogues. We performed experiments using the DARPA Communicator corpora and report results with three different metrics: dialogue length, dialogue similarity and precision-recall.

[21] Mitsuru Nakai, Shigeki Sagayama, and Hiroshi Shimodaira. On-line Handwriting Recognition Based on Sub-stroke HMM. Trans. IEICE D-II, J88-D2(8), August 2005. (in press) (in Japanese). [ bib ]
This paper describes context-dependent sub-stroke HMMs for on-line handwritten character recognition. As there are so many characters in Japanese, modeling each character by an HMM leads to an infeasible character-recognition system requiring a huge amount of memory and enormous computation time. The sub-stroke HMM approach has overcome these problems by minimizing the modeling unit. However, one drawback of this approach is that the recognition accuracy deteriorates for scribbled characters. In this paper, we show that context-dependent sub-stroke modeling, which depends on how the sub-stroke connects to the adjacent sub-strokes, is effective in achieving robust recognition of low quality characters.

[22] Junko Tokuno, Nobuhito Inami, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Context-dependent Sub-stroke Model for HMM-based On-line Handwriting Recognition. Trans. IEICE D-II, J88-D2(8), August 2005. (in press) (in Japanese). [ bib ]
A new method is proposed for on-line Kanji handwriting recognition. The method employs sub-stroke HMMs as minimum units to constitute Kanji characters and utilizes the direction of pen motion. The present approach has the following advantages over the conventional methods that employ character HMMs. 1) Much smaller memory requirement for dictionary and models. 2) Fast recognition by employing efficient sub-stroke network search. 3) Capability of recognizing characters not included in the training data if defined as a sequence of sub-strokes in the dictionary. In experiments, we have achieved a correct recognition rate of above 96% using the JAIST-IIPL database, which includes 1,016 educational Kanji characters.

[23] Hiroshi Shimodaira, Keisuke Uematsu, Shin'ichi Kawamoto, Gregor Hofer, and Mitsuru Nakai. Analysis and Synthesis of Head Motion for Lifelike Conversational Agents. In Proc. MLMI2005, July 2005. [ bib | .pdf ]
[24] Shin-ichi Kawamoto, Hiroshi Shimodaira, Shigeki Sagayama, et al. Galatea: Open-Source Software for Developing Anthropomorphic Spoken Dialog Agents. Life-Like Characters. Tools, Affective Functions, and Applications. Helmut Prendinger et al. (Eds.) Springer, pages 187-212, November 2003. [ bib | .pdf ]
Galatea is a software toolkit to develop a human-like spoken dialog agent. In order to easily integrate the modules of different characteristics including speech recognizer, speech synthesizer, facial-image synthesizer and dialog controller, each module is modeled as a virtual machine having a simple common interface and connected to each other through a broker (communication manager). Galatea employs model-based speech and facial-image synthesizers whose model parameters are adapted easily to those for an existing person if his/her training data is given. The software toolkit that runs on both UNIX/Linux and Windows operating systems will be publicly available in the middle of 2003.

[25] Hiroshi Shimodaira, Takashi Sudo, Mitsuru Nakai, and Shigeki Sagayama. On-line Overlaid-Handwriting Recognition Based on Substroke HMMs. In ICDAR'03, pages 1043-1047, August 2003. [ bib | .pdf ]
This paper proposes a novel handwriting recognition interface for wearable computing where users write characters continuously without pauses on a small single writing box. Since characters are written on the same writing area, they are overlaid with each other. Therefore the task is regarded as a special case of the continuous character recognition problem. In contrast to the conventional continuous character recognition problem, location information of strokes does not help very much in the proposed framework. To tackle the problem, substroke based hidden Markov models (HMMs) and a stochastic bigram language model are employed. Preliminary experiments were carried out on a dataset of 578 handwriting sequences with a character bigram consisting of 1,016 Japanese educational Kanji and 71 Hiragana characters. The proposed method demonstrated promising performance with 69.2% of handwriting sequences being correctly recognized when different stroke order was permitted, and the rate was improved up to 88.0% when characters were written with fixed stroke order.

[26] Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Generation of Hierarchical Dictionary for Stroke-order Free Kanji Handwriting Recognition Based on Substroke HMM. In Proc. ICDAR2003, pages 514-518, August 2003. [ bib | .pdf ]
This paper describes a method of generating a Kanji hierarchical structured dictionary for stroke-number and stroke-order free handwriting recognition based on sub-stroke HMM. In stroke-based methods, a large number of stroke-order variations can be easily expressed by just adding different stroke sequences to the dictionary and it is not necessary to train new reference patterns. The hierarchical structured dictionary has an advantage that thousands of stroke-order variations of Kanji characters can be produced using a small number of stroke-order rules defining Kanji parts. Moreover, the recognition speed is fast since common sequences are shared in a substroke network, even if the total number of stroke-order combinations becomes enormous practically. In experiments, 300 different stroke-order rules of Kanji parts were statistically chosen by using 60 writers' handwritings of 1,016 educational Kanji characters. By adding these new stroke-order rules to the dictionary, about 9,000 variations of different stroke-orders were generated for 2,965 JIS 1st level Kanji characters. As a result, we successfully improved the recognition accuracy from 82.6% to 90.2% for stroke-order free handwritings.

[27] Shigeki Matsuda, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Speech Recognition Using Asynchronous Transition HMM. IEICE Trans. D-II, J86-D-II(6):741-754, June 2003. (in Japanese). [ bib ]
We propose an asynchronous-transition HMM (AT-HMM) that is based on asynchronous transition structures among individual features of acoustic feature vector sequences. A conventional HMM represents vector sequences by using a chain of states, each of which has multi-dimensional vector distributions. Therefore, the conventional HMM assumes that individual features change synchronously. However, this assumption seems over-simplified for modeling the temporal behavior of acoustic features, since cepstrum and its time-derivative cannot synchronize with each other. In a speaker-dependent continuous phoneme recognition task, the AT-HMMs reduced errors by 10% to 40%. In a speaker-independent task, the performance of the AT-HMMs was comparable to conventional HMMs.

[28] Kanad Keeni, Kunio Goto, and Hiroshi Shimodaira. Automatic Filtering of Network Intrusion Detection System Alarms Using Multi-layer Feed-forward Neural Networks. In International Conference on Neural Information Processing (ICONIP2003), June 2003. [ bib ]
[29] Junko Tokuno, Naoto Akira, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Blind-handwriting Interface for Wearable Computing. In Proc. Human-Computer Interaction (HCI) International 2003, Volume 2, pages 303-307, June 2003. [ bib ]
This paper proposes a novel input interface that we call "blind handwriting" for wearable computing. The blind handwriting, which is a word similar to "blind typing" of keyboard, is a particular writing style where the user does not see the pen or the finger movement. Without visual feedback, written characters are distorted, as in the case when the user is blindfolded, and therefore existing on-line handwriting recognition systems fail to recognize them correctly. The sub-stroke based hidden Markov model approach is employed to tackle this problem. When the pen or touch pad is used as an input device, the proposed interface demonstrates a recognition rate of 83% on a test set of 61 people where each person wrote 1016 Japanese Kanji characters.

[30] Kanad Keeni, Kunio Goto, and Hiroshi Shimodaira. On fast learning of Multi-layer Feed-forward Neural Networks Using Back Propagation. In International Conference on Enterprise and Information Systems (ICEIS2003), pages 266-271, April 2003. [ bib ]
This study discusses the subject of training data selection for neural networks using back propagation. We have made only one assumption, that there is no overlap of training data belonging to different classes; in other words, the training data is linearly/semi-linearly separable. Training data is analyzed and the data that affect the learning process are selected based on the idea of Critical points. The proposed method is applied to a classification problem where the task is to recognize the characters A,C and B,D. The experimental results show that in the case of batch mode the proposed method takes almost 1/7 of the real and 1/10 of the user training time required for the conventional method. On the other hand, in the case of online mode the proposed method takes 1/3 of the training epochs, 1/9 of the real, 1/20 of the user and 1/3 of the system time required for the conventional method. The classification rates for training and testing data are the same as with the conventional method.

[31] Tu Bao Ho, Trong Dung Nguyen, Hiroshi Shimodaira, and Masayuki Kimura. A Knowledge Discovery System with Support for Model Selection and Visualization. Applied Intelligence, 19:125-141, 2003. [ bib ]
[32] Haruto Takeda, Naoki Saito, Tomoshi Otsuki, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Hidden Markov Model for Automatic Transcription of MIDI Signals. In 2002 International Workshop on Multimedia Signal Processing, December 2002. [ bib | .pdf ]
[33] Kanad Keeni and Hiroshi Shimodaira. On Selection of Training Data for Fast Learning of Neural Networks Using Back Propagation. In IASTED International Conference on Artificial Intelligence and Application (AIA2002), pages 474-478, September 2002. [ bib ]
[34] Junko Tokuno, Nobuhito Inami, Shigeki Matsuda, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Context-Dependent Substroke Model for HMM-based On-line Handwriting Recognition. In Proc. IWFHR-8, pages 78-83, August 2002. [ bib | .pdf ]
This paper describes an effective modeling technique in the on-line recognition for cursive Kanji handwritings and Hiragana handwritings. Our conventional recognition system based on substroke HMMs (hidden Markov models) employs straight-type substrokes as primary models and has achieved a high recognition rate in the recognition of careful Kanji handwritings. On the other hand, the recognition rate for cursive handwritings is comparatively low, since they consist of mainly curve-strokes. Therefore, we propose a technique of using multiple models for each substroke by considering the substroke context, which is a preceding substroke and a following substroke. In order to construct these context-dependent models efficiently, we use the SSS (Successive State Splitting) algorithm developed in speech recognition. Through the experiments, the recognition rate improved from 88% to 92% for cursive Kanji handwritings and from 90% to 98% for Hiragana handwritings.

[35] Mitsuru Nakai, Takashi Sudo, Hiroshi Shimodaira, and Shigeki Sagayama. Pen Pressure Features for Writer-Independent On-Line Handwriting Recognition Based on Substroke HMM. In Proc. ICPR2002, III, pages 220-223, August 2002. [ bib | .pdf ]
[36] Shin-ichi Kawamoto, Hiroshi Shimodaira, Tsuneo Nitta, Takuya Nishimoto, Satoshi Nakamura, Katsunobu Itou, Shigeo Morishima, Tatsuo Yotsukura, Atsuhiko Kai, Akinobu Lee, Yoichi Yamashita, Takao Kobayashi, Keiichi Tokuda, Keikichi Hirose, Nobuaki Minematsu, Atsushi Yamada, Yasuharu Den, Takehito Utsuro, and Shigeki Sagayama. Open-source software for developing anthropomorphic spoken dialog agent. In Proc. PRICAI-02, International Workshop on Lifelike Animated Agents, pages 64-69, August 2002. [ bib | .pdf ]
[37] Shin-ichi Kawamoto, Hiroshi Shimodaira, et al. Design of Software Toolkit for Anthropomorphic Spoken Dialog Agent Software with Customization-oriented Features. Information Processing Society of Japan (IPSJ) Journal, 43(7):2249-2263, July 2002. (in Japanese). [ bib ]
[38] Jun Rokui, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Speaker Normalization Using Linear Transformation of Vocal Tract Length Based on Maximum Likelihood Estimation. Information Processing Society of Japan (IPSJ), 43(7):2030-2037, July 2002. (in Japanese). [ bib ]

[39] Hiroshi Shimodaira, Nobuyoshi Sakai, Mitsuru Nakai, and Shigeki Sagayama. Jacobian Joint Adaptation to Noise, Channel and Vocal Tract Length. In Proc. ICASSP2002, pages 197-200, May 2002. [ bib | .pdf ]
A new Jacobian approach that linearly decomposes the composite of additive noise, multiplicative noise (channel transfer function) and speaker's vocal tract length, and adapts the acoustic model parameters simultaneously to these factors is proposed in this paper. Due to the fact that these factors non-linearly degrade the observed features for speech recognition, existing approaches fail to adapt the acoustic models adequately. Approximating the nonlinear operation by a linear model makes it possible to employ least-squares error estimation of the factors and to adapt the acoustic model parameters with a small number of speech samples. Speech recognition experiments on the ATR isolated word database demonstrate a significant reduction of error rates, which supports the effectiveness of the proposed scheme.

[40] Yoshinori Matsushita, Shin-ichi Kawamoto, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. A Head-Behavior Synchronization Model with Utterance for Anthropomorphic Spoken-Dialog Agent. In Technical Report of IEICE, HIS2001, March 2002. (in Japanese). [ bib ]
A novel method of synthesizing the head motion of an anthropomorphic spoken dialog agent in synchrony with its utterance is proposed. Although much effort has been devoted to synchronizing lip motion with the utterance, very little research exists on such head-motion control. A neural network is employed to learn the relationship between the acoustic features of the utterance and the head motion, which is measured by a motion-capturing system. The proposed method makes it possible to automatically generate facial animation that moves synchronously with any given utterance. Subjective evaluation of the performance of the method is reported as well.

[41] Tomoshi Otsuki, Naoki Saitou, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Musical Rhythm Recognition Using Hidden Markov Model. Information Processing Society of Japan (IPSJ) Journal, 43(2), February 2002. (in Japanese). [ bib ]
[42] Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, and Shigeki Sagayama. Dynamic Time-Alignment Kernel in Support Vector Machine. Advances in Neural Information Processing Systems 14, NIPS2001, 2:921-928, December 2001. [ bib | .pdf ]
A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating the idea of non-linear time alignment into the kernel function. Since the time-alignment operation on sequential patterns is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modification. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments on hand-segmented phoneme recognition. Preliminary experimental results show recognition performance comparable to that of hidden Markov models (HMMs).

[43] Mitsuru Nakai, Naoto Akira, Hiroshi Shimodaira, and Shigeki Sagayama. Substroke Approach to HMM-based On-line Kanji Handwriting Recognition. In Proc. ICDAR'01, pages 491-495, September 2001. [ bib | .pdf ]
A new method is proposed for on-line handwriting recognition of Kanji characters. The method employs substroke HMMs as the minimum units that constitute Japanese Kanji characters and utilizes the direction of pen motion. The main motivation is to fully utilize continuous speech recognition algorithms by relating sentence speech to Kanji characters, phonemes to substrokes, and grammar to Kanji structure. The proposed system consists of input feature analysis, substroke HMMs, a character structure dictionary and a decoder. The present approach has the following advantages over conventional methods that employ whole-character HMMs. 1) Much smaller memory requirement for the dictionary and models. 2) Fast recognition by employing an efficient substroke network search. 3) Capability of recognizing characters not included in the training data if they are defined as a sequence of substrokes in the dictionary. 4) Capability of recognizing characters written in various stroke orders by allowing multiple definitions per character in the dictionary. 5) Ease of adapting the HMMs to the user with a few sample characters.

[44] Shigeki Sagayama, Yutaka Kato, Mitsuru Nakai, and Hiroshi Shimodaira. Jacobian Approach to Joint Adaptation to Noise, Channel and Vocal Tract Length. In Proc. ISCA Workshop on Adaptation Methods (Sophia Antipolis, France), pages 117-120, August 2001. [ bib ]
[45] Shigeki Sagayama, Koichi Shinoda, Mitsuru Nakai, and Hiroshi Shimodaira. Analytic Methods for Acoustic Model Adaptation: A Review. In Proc. ISCA Workshop on Adaptation Methods (Sophia Antipolis, France), pages 67-76, August 2001. Invited Paper. [ bib ]
[46] Kanad Keeni, Kunio Goto, and Hiroshi Shimodaira. On Extraction of E-Mail Address from Fax Message for Automatic Delivery to Individual Recipient. In IASTED International Conference on Signal Processing, Pattern Recognition and Applications, July 2001. [ bib ]
[47] Katsuhisa Fujinaga, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Multiple-Regression Hidden Markov Model. In Proc. ICASSP 2001, May 2001. [ bib | .pdf ]
[48] Shigeki Matsuda, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Feature-dependent Allophone Clustering. In Proc. ICSLP2000, pages 413-416, October 2000. [ bib | .pdf ]
We propose a novel method for clustering allophones called Feature-Dependent Allophone Clustering (FD-AC) that determines feature-dependent HMM topology automatically. Existing methods for allophone clustering are based on parameter sharing between allophone models whose feature vector sequences behave similarly. However, the features of the vector sequences do not necessarily all share a common allophone clustering structure. The vector sequences can therefore be modeled better by allocating the optimal allophone clustering structure to each feature. In this paper, we propose Feature-Dependent Successive State Splitting (FD-SSS) as an implementation of FD-AC. In speaker-dependent continuous phoneme recognition experiments, HMMs created by FD-SSS reduced the error rates by about 10% compared with conventional HMMs that have a common allophone clustering structure for all the features.

[49] Hiroshi Shimodaira, Toshihiko Akae, Mitsuru Nakai, and Shigeki Sagayama. Jacobian Adaptation of HMM with Initial Model Selection for Noisy Speech Recognition. In Proc. ICSLP2000, pages 1003-1006, October 2000. [ bib | .pdf ]
An extension of Jacobian Adaptation (JA) of HMMs for degraded speech recognition is presented in which an appropriate set of initial models is selected from a number of initial-model sets designed for different noise environments. Based on the first-order Taylor series approximation in the acoustic feature domain, JA adapts the acoustic model parameters trained in the initial noise environment A to the new environment B much faster than PMC, which creates the acoustic models for the target environment from scratch. Despite this advantage over PMC, JA has a theoretical limitation: the change of acoustic parameters from environment A to B must be small for the linear approximation to hold. To extend the coverage of JA, the ideas of multiple sets of initial models and their automatic selection scheme are discussed. Speaker-dependent isolated-word recognition experiments are carried out to evaluate the proposed method.

[50] Shigeki Matsuda, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Asynchronous-Transition HMM. In Proc. ICASSP 2000 (Istanbul, Turkey), Vol. II, pages 1001-1004, June 2000. [ bib | .pdf ]
We propose a new class of hidden Markov model (HMM) called asynchronous-transition HMM (AT-HMM). In contrast to conventional HMMs, where hidden state transitions occur simultaneously for all features, the new class of HMM allows state transitions to be asynchronous between individual features in order to better model the asynchronous timing of acoustic feature changes. In this paper, we focus on a particular class of AT-HMM with sequential constraints, introducing the concept of “state tying across time”. To maximize the advantage of the new model, we also introduce a feature-wise state tying technique. Speaker-dependent speech recognition experiments demonstrated error-rate reductions of more than 30% and 50% in phoneme and isolated word recognition, respectively, compared with conventional HMMs.

[51] Jun Rokui and Hiroshi Shimodaira. Multistage Building Learning based on Misclassification Measure. In 9th International Conference on Artificial Neural Networks, Edinburgh, UK, September 1999. [ bib ]
[52] Kanad Keeni, Kenji Nakayama, and Hiroshi Shimodaira. A Training Scheme for Pattern Classification Using Multi-layer Feed-forward Neural Networks. In IEEE International Conference on Computational Intelligence and Multimedia Applications, pages 307-311, September 1999. [ bib ]
[53] Kanad Keeni, Kenji Nakayama, and Hiroshi Shimodaira. Estimation of Initial Weights and Hidden Units for Fast Learning of Multi-layer Neural Networks for Pattern Classification. In IEEE International Joint Conference on Neural Networks (IJCNN'99), July 1999. [ bib ]
[54] Hiroshi Shimodaira, Jun Rokui, and Mitsuru Nakai. Improving The Generalization Performance Of The MCE/GPD Learning. In ICSLP'98, Australia, December 1998. [ bib | .pdf ]
A novel method to prevent the over-fitting effect and improve the generalization performance of the Minimum Classification Error (MCE) / Generalized Probabilistic Descent (GPD) learning is proposed. The MCE/GPD method, which is one of the newest discriminative-learning approaches proposed by Katagiri and Juang in 1992, results in better recognition performance in various areas of pattern recognition than the maximum-likelihood (ML) based approach where a posteriori probabilities are estimated. Despite its superiority in recognition performance, it still suffers from the problem of over-fitting to the training samples, as do other learning algorithms. In the present study, a regularization technique is applied to the MCE method to overcome this problem. Feed-forward neural networks are employed as a recognition platform to evaluate the recognition performance of the proposed method. Recognition experiments are conducted on several kinds of datasets. The proposed method shows better generalization performance than the original one.

[55] Mitsuru Nakai and Hiroshi Shimodaira. The Use of F0 Reliability Function for Prosodic Command Analysis on F0 Contour Generation Model. In Proc. ICSLP'98, December 1998. [ bib | .pdf ]
[56] Kanad Keeni, Kenji Nakayama, and Hiroshi Shimodaira. Automatic Generation of Initial Weights and Target Outputs of Multi-layer Neural Networks and its Application to Pattern Classification. In International Conference on Neural Information Processing (ICONIP'98), pages 1622-1625, October 1998. [ bib ]
[57] Jun Rokui and Hiroshi Shimodaira. Modified Minimum Classification Error Learning and Its Application to Neural Networks. In ICONIP'98, Kitakyushu, Japan, October 1998. [ bib ]
[58] Eiji Iida, Hiroshi Shimodaira, Susumu Kunifuji, and Masayuki Kimura. A System to Perform Human Problem Solving. In The 5th International Conference on Soft Computing and Information / Intelligent Systems (IIZUKA'98), October 1998. [ bib ]
[59] Kanad Keeni, Kenji Nakayama, and Hiroshi Shimodaira. Automatic Generation of Initial Weights and Estimation of Hidden Units for Pattern Classification Using Neural Networks. In 14th International Conference on Pattern Recognition (ICPR'98), pages 1568-1571, August 1998. [ bib ]
[60] Eiji Iida, Susumu Kunifuji, Hiroshi Shimodaira, and Masayuki Kimura. A Scale-Down Solution of N^2-1 Puzzle. Trans. IEICE(D-I), J81-D-I(6):604-614, June 1998. (in Japanese). [ bib ]
[61] Kanad Keeni, Hiroshi Shimodaira, Kenji Nakayama, and Kazunori Kotani. On Parameter Initialization of Multi-layer Feed-forward Neural Networks for Pattern Recognition. In International Conference on Computational Linguistics, Speech and Document Processing (ICCLSDP-'98), Calcutta, India, pages D8-12, February 1998. [ bib ]
[62] Hiroshi Shimodaira, Jun Rokui, and Mitsuru Nakai. Modified Minimum Classification Error Learning and Its Application to Neural Networks. In 2nd International Workshop on Statistical Techniques in Pattern Recognition (SPR'98), Sydney, Australia, 1998. [ bib | .pdf ]
A novel method to improve the generalization performance of the Minimum Classification Error (MCE) / Generalized Probabilistic Descent (GPD) learning is proposed. The MCE/GPD learning proposed by Juang and Katagiri in 1992 results in better recognition performance than the maximum-likelihood (ML) based learning in various areas of pattern recognition. Despite its superiority in recognition performance, it still suffers, like other learning algorithms, from the problem of “over-fitting” to the training samples. In the present study, a regularization technique is applied to the MCE learning to overcome this problem. Feed-forward neural networks are employed as a recognition platform to evaluate the recognition performance of the proposed method. Recognition experiments are conducted on several kinds of data sets.

[63] Mitsuru Nakai, Harald Singer, Yoshinori Sagisaka, and Hiroshi Shimodaira. Accent Phrase Segmentation Based on F0 Templates Using a Superpositional Prosodic Model. Trans. IEICE (D-II), J80-D-II(10):2605-2614, October 1997. (in Japanese). [ bib ]
[64] Hiroshi Shimodaira, Mitsuru Nakai, and Akihiro Kumata. Restoration of Pitch Pattern of Speech Based on a Pitch Generation Model. In Proc. EuroSpeech'97, pages 512-524, September 1997. [ bib | .pdf ]
In this paper, a model-based approach for restoring a continuous fundamental frequency (F0) contour from the noisy output of an F0 extractor is investigated. In contrast to conventional pitch trackers based on numerical curve-fitting, the proposed method estimates the continuous F0 pattern with a quantitative pitch generation model of the kind often used for synthesizing F0 contours from prosodic event commands. An inverse filtering technique is introduced for obtaining the initial candidates of the prosodic commands. In order to find the optimal command sequence among these candidates efficiently, a beam-search algorithm and an N-best technique are employed. Preliminary experiments for a male speaker of the ATR B-set database showed promising results both in the quality of the restored pattern and in the estimation of the prosodic events.

[65] Mitsuru Nakai and Hiroshi Shimodaira. On Representation of Fundamental Frequency of Speech for Prosody Analysis Using Reliability Function. In Proc. EuroSpeech'97, pages 243-246, September 1997. [ bib | .pdf ]
[66] Kanad Keeni, Hiroshi Shimodaira, and Kenji Nakayama. On Distributed Representation of Output Layer for Recognizing Japanese Kana Characters Using Neural Networks. In Proceedings of the 4th International Conference on Document Analysis and Recognition, ICDAR'97, pages 600-603, July 1997. Ulm, Germany. [ bib ]
[67] Tu Bao Ho, Nguyen Trong Dung, Hiroshi Shimodaira, and Masayuki Kimura. An Interactive-Graphic Environment for Discovering and Using Conceptual Knowledge. In 7th European-Japanese Conference on Information Modelling and Knowledge Bases, pages 327-343, May 1997. [ bib ]
[68] Kanad Keeni and Hiroshi Shimodaira. On Representation of Output Layer for Recognizing Japanese Kana Characters Using Neural Networks. In Proc. the 17th International Conference on Computer Processing of Oriental Languages, pages 305-308, April 1997. Baptist University, Kowloon Tong, Hong Kong. [ bib ]
[69] Mitsuru Nakai, Harald Singer, Yoshinori Sagisaka, and Hiroshi Shimodaira. Accent Phrase Segmentation by F0 Clustering Using Superpositional Modeling, pages 343-360. January 1997. [ bib ]
[70] Sukeyasu Kanno and Hiroshi Shimodaira. Voiced Sound Detection under Nonstationary and Heavy Noisy Environment Using the Prediction Error of Low-Frequency Spectrum. Trans. IEICE(D-II), J80-D-II(1):26-35, January 1997. (in Japanese). [ bib ]
[71] Kanad Keeni, Hiroshi Shimodaira, Tetsuro Nishino, and Yasuo Tan. Recognition of Devanagari Characters Using Neural Networks. IEICE, E79-D(5):523-528, May 1996. [ bib ]
[72] Paul A. Taylor, Hiroshi Shimodaira, Stephen Isard, Simon King, and Jacqueline Kowtko. Using prosodic information to constrain language models for spoken dialogue. In Proc. ICSLP '96, Philadelphia, 1996. [ bib | .ps | .pdf ]
We present work intended to improve speech recognition performance for computer dialogue by taking into account the way that dialogue context and intonational tune interact to limit the possibilities for what an utterance might be. We report here on the extra constraint, expressed in terms of entropy, achieved in a bigram language model by using separate submodels for different sorts of dialogue acts and by trying to predict which submodel to apply through analysis of the intonation of the sentence being recognised.

[73] Hisao Koba, Hiroshi Shimodaira, and Masayuki Kimura. Intelligent Automatic Document Transcription System for Braille: To Improve Accessibility to Printed Matter for the Visually Impaired. In HIC International'95, July 1995. [ bib ]
[74] and Hiroshi Shimodaira. HI Design Based on the Costs of Human Information-processing Model. In HIC International'95, July 1995. [ bib ]
[75] Mitsuru Nakai, Harald Singer, Yoshinori Sagisaka, and Hiroshi Shimodaira. Automatic Prosodic Segmentation by F0 Clustering Using Superpositional Modeling. In Proc. ICASSP-95, PR08.6, pages 624-627, May 1995. [ bib | .pdf ]
[76] Mitsuru Nakai and Hiroshi Shimodaira. Accent Phrase Segmentation by Finding N-best Sequences of Pitch Pattern Templates. In Proc. ICSLP94, 8.10, pages 347-350, September 1994. [ bib | .pdf ]
[77] Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Prosodic Phrase Segmentation Based on Pitch-Pattern Clustering. Electronics and Communications in Japan, Part 3, 77(6):80-91, June 1994. (in Japanese). [ bib ]
[78] Hiroshi Shimodaira and Mitsuru Nakai. Prosodic phrase segmentation by pitch pattern clustering. In Proc. ICASSP-94, 76.5, vol.II, pages 185-188, March 1994. [ bib | .pdf ]
This paper proposes a novel method for detecting the optimal sequence of prosodic phrases from continuous speech based on a data-driven approach. The pitch pattern of the input speech is divided into prosodic segments that minimize the overall distortion with respect to pitch pattern templates of accent phrases, using the One Pass search algorithm. The pitch pattern templates are designed by clustering a large number of training samples of accent phrases. On the ATR continuous speech database uttered by 10 speakers, the rate of correct segmentation was at most 91.7% when the training and test data came from speakers of the same sex, and 88.6% for the opposite sex.

[79] Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Prosodic phrase segmentation based on pitch-pattern clustering. Trans. IEICE (A), J77-A(2):206-214, February 1994. (in Japanese). [ bib ]
[80] Hiroshi Shimodaira and Mitsuru Nakai. Accent phrase segmentation using transition probabilities between pitch pattern templates. In Proc. EuroSpeech'93, pages 1767-1770, September 1993. [ bib | .ps.gz ]
This paper proposes a novel method for segmenting continuous speech into accent phrases by using a prosodic feature, the 'pitch pattern'. The pitch pattern extracted from input speech signals is divided into accent segments automatically by using the One-Stage DP algorithm, in which reference templates representing various types of accent patterns, together with the connectivity between them, are used to find the optimum sequence of accent segments. When the reference templates are built from a large amount of training data, the LBG clustering algorithm is used to represent typical accent patterns by a small number of templates. Evaluation tests were carried out using the ATR continuous speech database of a male speaker. Experimental results showed that more than 91% of phrase boundaries were correctly detected.

[81] Hiroshi Shimodaira and Mitsuru Nakai. Robust pitch detection by narrow band spectrum analysis. In Proc. ICSLP-92, pages 1597-1600, October 1992. [ bib | .pdf ]
This paper proposes a new technique for detecting pitch patterns, useful for automatic speech recognition, based on a narrow band spectrum analysis. The motivation for this approach is that humans perceive some kind of pitch in whispers, where no fundamental frequencies can be observed, while most pitch determination algorithms (PDAs) fail to detect such perceptual pitch. The narrow band spectrum analysis enables us to find pitch structure distributed locally in the frequency domain. The technique is incorporated into PDAs by applying it to the lag-window-based PDA. Experimental results show that pitch detection performance could be improved by 4% for voiced sounds and 8% for voiceless sounds.