The Centre for Speech Technology Research, The University of Edinburgh

Publications by Gregor Hofer

[1] Theresa Wilson and Gregor Hofer. Using linguistic and vocal expressiveness in social role recognition. In Proc. Int. Conf. on Intelligent User Interfaces, IUI2011, Palo Alto, USA, 2011. ACM. [ bib | .pdf ]
In this paper, we investigate two types of expressiveness, linguistic and vocal, and whether they are useful for recognising the social roles of participants in meetings. Our experiments show that combining expressiveness features with speech activity does improve social role recognition over speech activity features alone.
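
For illustration only (not code from the paper, and with invented feature names and data), the combination of the two feature sets amounts to concatenating them before training a classifier; a minimal sketch using a hypothetical logistic-regression setup:

    # Hypothetical sketch of the feature-combination idea; the features
    # and labels are random stand-ins, not data from the paper.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    activity = rng.random((100, 2))        # e.g. talk time, turn count
    expressiveness = rng.random((100, 2))  # e.g. subjective-word rate, pitch range
    roles = rng.integers(0, 2, 100)        # binary social role label

    # Combining the two feature sets is just concatenation before training.
    combined = np.hstack([activity, expressiveness])
    clf = LogisticRegression().fit(combined, roles)
    print(clf.score(combined, roles))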

[2] Michael A. Berger, Gregor Hofer, and Hiroshi Shimodaira. Carnival - combining speech technology and computer animation. IEEE Computer Graphics and Applications, 31:80-89, 2011. [ bib | DOI ]
[3] Gregor Hofer and Korin Richmond. Comparison of HMM and TMDN methods for lip synchronisation. In Proc. Interspeech, pages 454-457, Makuhari, Japan, September 2010. [ bib | .pdf ]
This paper presents a comparison between a hidden Markov model (HMM) based method and a novel artificial neural network (ANN) based method for lip synchronisation. Both model types were trained on motion tracking data, and a perceptual evaluation was carried out comparing the output of the models, both to each other and to the original tracked data. It was found that the ANN-based method was judged significantly better than the HMM-based method. Furthermore, the original data was not judged significantly better than the output of the ANN method.

Keywords: hidden Markov model (HMM), mixture density network, lip synchronisation, inversion mapping
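
The TMDN in the title is a trajectory extension of the mixture density network named in the keywords. As a rough illustration of the underlying MDN idea, assuming one-dimensional targets and raw network outputs laid out as mixture logits, means, and log standard deviations (none of this is the paper's code), the mixture negative log-likelihood can be computed as:

    # Minimal numpy sketch of a mixture density network output layer,
    # assuming 1-D targets and a Gaussian mixture; not the paper's TMDN.
    import numpy as np

    def mdn_nll(raw, target):
        # raw holds 3*M values: mixture logits, means, log standard deviations.
        m = raw.size // 3
        logits, means, log_sigma = raw[:m], raw[m:2*m], raw[2*m:]
        log_pi = logits - np.logaddexp.reduce(logits)   # log softmax weights
        sigma = np.exp(log_sigma)
        log_comp = (-0.5 * np.log(2 * np.pi) - log_sigma
                    - 0.5 * ((target - means) / sigma) ** 2)
        return -np.logaddexp.reduce(log_pi + log_comp)  # -log sum_k pi_k N_k

    # Example: 2 components, target articulator position 0.3 (invented values)
    print(mdn_nll(np.array([0.0, 0.0, 0.2, 0.5, -1.0, -1.0]), 0.3))
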
[4] Michael Berger, Gregor Hofer, and Hiroshi Shimodaira. Carnival: a modular framework for automated facial animation. Poster at SIGGRAPH 2010, 2010. Bronze award winner, ACM Student Research Competition. [ bib | .pdf ]
[5] Michael Pucher, Friedrich Neubarth, Volker Strom, Sylvia Moosmüller, Gregor Hofer, Christian Kranzler, Gudrun Schuchmann, and Dietmar Schabus. Resources for speech synthesis of Viennese varieties. In Proc. Int. Conf. on Language Resources and Evaluation, LREC'10, Malta, 2010. European Language Resources Association (ELRA). [ bib | .ps | .pdf ]
This paper describes our work on developing corpora of three Viennese varieties for unit selection speech synthesis. The synthetic voices, implemented with Multisyn, the open-domain unit selection engine of Festival, will also be released within Festival. The paper focuses on two questions: how we selected appropriate speakers, and how we obtained the text sources needed for recording these non-standard varieties. Regarding the first question, working with a ‘prototypical’ professional speaker turned out to be far preferable to striving for authenticity. In addition, we briefly outline the differences between the Austrian standard and its dialectal varieties, and how we solved the technical problems that arise from these differences. In particular, the specific set of phones applicable to each variety had to be determined under various constraints. Since such a set serves no descriptive purpose but directly influences synthesis quality, the careful design of this (in most cases reduced) phone set was an important task.
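
As a toy illustration of the constraint-based phone set design (the inventory, counts, and threshold below are all invented), a reduced set can be obtained by pruning candidate phones that lack sufficient coverage in the recorded material:

    # Invented toy example of pruning a candidate phone inventory for a
    # variety by a minimum corpus-frequency constraint.
    from collections import Counter

    candidate_phones = {"a", "aa", "e", "ee", "o", "oe"}  # invented inventory
    corpus = ["a", "a", "e", "o", "o", "oe"]              # invented transcription

    counts = Counter(p for p in corpus if p in candidate_phones)
    min_count = 2   # constraint: enough material to select units from
    phone_set = sorted(p for p, n in counts.items() if n >= min_count)
    print(phone_set)   # ['a', 'o']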

[6] Gregor Hofer, Korin Richmond, and Michael Berger. Lip synchronization by acoustic inversion. Poster at SIGGRAPH 2010, 2010. [ bib | .pdf ]
[7] Michal Dziemianko, Gregor Hofer, and Hiroshi Shimodaira. HMM-based automatic eye-blink synthesis from speech. In Proc. Interspeech, pages 1799-1802, Brighton, UK, September 2009. [ bib | .pdf ]
In this paper we present a novel technique to automatically synthesise eye blinking from a speech signal. Animating the eyes of a talking head is important as they are a major focus of attention during interaction. The developed system predicts eye blinks from the speech signal and generates animation trajectories automatically employing a "Trajectory Hidden Markov Model". The evaluation of the recognition performance showed that the timing of blinking can be predicted from speech with an F-score upwards of 52%, which is well above chance. Additionally, a preliminary perceptual evaluation was conducted, which confirmed that adding eye blinking significantly improves the perception of the character. Finally, it showed that the speech-synchronised synthesised blinks outperform random blinking in naturalness ratings.
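
For concreteness, the F-score quoted above is the usual harmonic mean of precision and recall; a small sketch over invented per-frame blink labels (not the paper's evaluation code):

    # F1 over binary per-frame blink labels (1 = blinking); the two
    # sequences below are invented for illustration.
    def f_score(predicted, reference):
        tp = sum(p and r for p, r in zip(predicted, reference))
        fp = sum(p and not r for p, r in zip(predicted, reference))
        fn = sum(not p and r for p, r in zip(predicted, reference))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)

    print(f_score([0, 1, 1, 0, 0, 1], [0, 1, 0, 0, 1, 1]))  # 0.666...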

[8] Gregor Hofer, Junichi Yamagishi, and Hiroshi Shimodaira. Speech-driven lip motion generation with a trajectory HMM. In Proc. Interspeech 2008, pages 2314-2317, Brisbane, Australia, September 2008. [ bib | .pdf ]
Automatic speech animation remains a challenging problem that can be described as finding the optimal sequence of animation parameter configurations given some speech. In this paper we present a novel technique to automatically synthesise lip motion trajectories from a speech signal. The developed system predicts lip motion units from the speech signal and generates animation trajectories automatically employing a "Trajectory Hidden Markov Model". Using the MLE criterion, its parameter generation algorithm produces the optimal smooth motion trajectories that are used to drive control points on the lips directly. Additionally, experiments were carried out to find a suitable model unit that produces the most accurate results. Finally, a perceptual evaluation was conducted, which showed that the developed motion units perform better than phonemes.
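
The parameter generation step of a trajectory HMM follows the standard maximum-likelihood parameter generation algorithm: with a window matrix W stacking static and delta coefficients, state-sequence means mu, and diagonal inverse covariance V^-1, the smooth trajectory is c = (W' V^-1 W)^-1 W' V^-1 mu. A minimal numpy sketch with a simplified forward-difference delta window and invented statistics (not code from the paper):

    # Numpy sketch of maximum-likelihood parameter generation (MLPG):
    # solve for the static trajectory given static/delta means and variances.
    import numpy as np

    def mlpg(mu, var, T):
        I = np.eye(T)
        D = -np.eye(T) + np.eye(T, k=1)   # forward-difference delta window
        D[-1, :] = 0                      # no delta defined at the last frame
        W = np.vstack([I, D])             # stacks statics over deltas, (2T, T)
        P = np.diag(1.0 / var)            # diagonal inverse covariance V^-1
        A = W.T @ P @ W                   # normal equations: A c = b
        b = W.T @ P @ mu
        return np.linalg.solve(A, b)

    T = 5
    mu = np.concatenate([[0.0, 1.0, 1.0, 1.0, 0.0],  # static means (invented)
                         np.zeros(T)])               # delta means favour smoothness
    var = np.concatenate([np.full(T, 1.0),           # static variances
                          np.full(T, 0.1)])          # small delta variances
    print(mlpg(mu, var, T))                          # a smoothed trajectory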

[9] Gregor Hofer and Hiroshi Shimodaira. Automatic head motion prediction from speech data. In Proc. Interspeech 2007, Antwerp, Belgium, August 2007. [ bib | .pdf ]
In this paper we present a novel approach to generating a sequence of head motion units given some speech. The modelling approach is based on the notion that head motion can be divided into a number of short homogeneous units that can each be modelled individually. The system is based on Hidden Markov Models (HMMs), which are trained on motion units, act as a sequence generator, and can be evaluated by an accuracy measure. A database of motion capture data was collected and manually annotated for head motion, and is used to train the models. It was found that the model is good at distinguishing high-activity regions from regions with less activity, with accuracies around 75 percent. Furthermore, the model is able to distinguish different head motion patterns from speech features somewhat reliably, with accuracies reaching almost 70 percent.
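
A hedged sketch of this kind of classification scheme, assuming one HMM per motion-unit class scored against a speech segment (the classes, features, and data below are invented, and the hmmlearn library stands in for whatever toolkit was actually used):

    # One GaussianHMM per hypothetical motion-unit class; a segment is
    # labelled by the model with the highest log-likelihood. Random data.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    rng = np.random.default_rng(0)
    train = {
        "high_activity": rng.normal(1.0, 1.0, size=(200, 2)),
        "low_activity": rng.normal(-1.0, 1.0, size=(200, 2)),
    }
    models = {c: GaussianHMM(n_components=3, random_state=0).fit(X)
              for c, X in train.items()}

    segment = rng.normal(1.0, 1.0, size=(40, 2))   # unseen speech segment
    label = max(models, key=lambda c: models[c].score(segment))
    print(label)   # most likely "high_activity"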

[10] Gregor Hofer, Hiroshi Shimodaira, and Junichi Yamagishi. Speech-driven head motion synthesis based on a trajectory model. Poster at SIGGRAPH 2007, 2007. [ bib | .pdf ]
[11] Gregor Hofer, Hiroshi Shimodaira, and Junichi Yamagishi. Lip motion synthesis using a context dependent trajectory hidden Markov model. Poster at SCA 2007, 2007. [ bib | .pdf ]
[12] Gregor Hofer, Korin Richmond, and Robert Clark. Informed blending of databases for emotional speech synthesis. In Proc. Interspeech, September 2005. [ bib | .ps | .pdf ]
The goal of this project was to build a unit selection voice that could portray emotions with varying intensities. A suitable definition of an emotion was developed, along with a descriptive framework that supported the work carried out. A single speaker was recorded portraying happy and angry speaking styles; a neutral database was also recorded. A target cost function was implemented that chose units according to emotion mark-up in the database. The Dictionary of Affect supported the emotional target cost function by providing an emotion rating for words in the target utterance. If a word was particularly 'emotional', units from that emotion were favoured. In addition, the intensity could be varied, which biased selection towards a greater number of emotional units. A perceptual evaluation was carried out, and subjects were able to reliably recognise emotions with varying numbers of emotional units present in the target utterance.
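
As a toy illustration of such an emotion-aware target cost (the ratings, threshold, and intensity scheme below are invented, not the paper's implementation), matching units incur no penalty and the intensity setting lowers the bar for a word to count as 'emotional':

    # Invented sketch of an emotion-aware target cost component.
    def emotion_target_cost(unit_emotion, word_rating, intensity, threshold=0.5):
        # word_rating: affect score in [0, 1] for the target word, e.g. looked
        # up in a resource like the Dictionary of Affect (values invented).
        # Higher intensity lowers the effective threshold, so more words count
        # as emotional and more emotional units are favoured.
        wants_emotion = word_rating >= threshold * (1.0 - intensity)
        target = "happy" if wants_emotion else "neutral"
        return 0.0 if unit_emotion == target else 1.0

    # A strongly 'happy' word at high intensity favours happy units:
    print(emotion_target_cost("happy", word_rating=0.8, intensity=0.9))    # 0.0
    print(emotion_target_cost("neutral", word_rating=0.8, intensity=0.9))  # 1.0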

[13] Hiroshi Shimodaira, Keisuke Uematsu, Shin'ichi Kawamoto, Gregor Hofer, and Mitsuru Nakai. Analysis and Synthesis of Head Motion for Lifelike Conversational Agents. In Proc. MLMI2005, July 2005. [ bib | .pdf ]