|
[1]
|
Atef Ben Youssef, Hiroshi Shimodaira, and David Braude.
Speech driven talking head from estimated articulatory features.
In Proc. ICASSP, pages 4606-4610, Florence, Italy, May 2014.
[ bib |
.pdf ]
In this paper, we present a talking head in which the lips and head motion are controlled using articulatory movements estimated from speech. A phone-size HMM-based inversion mapping is employed and trained in a semi-supervised fashion. The advantage of using articulatory features is that they can drive the lip motions and they have a close link with head movements. Speech inversion normally requires training data recorded with an electromagnetic articulograph (EMA), which restricts the naturalness of head movements. The present study considers a more realistic recording condition in which the training data for the target speaker are recorded with a standard motion capture system rather than EMA. Different temporal clustering techniques are investigated for the HMM-based mapping, along with a GMM-based frame-wise mapping as a baseline system. Objective and subjective experiments show that the synthesised motions are more natural with an HMM system than with a GMM one, and that estimated EMA features outperform prosodic features.
|
|
[2]
|
Atef Ben Youssef, Hiroshi Shimodaira, and David A. Braude.
Head motion analysis and synthesis over different tasks.
In Proc. Intelligent Virtual Agents, pages 285-294. Springer,
September 2013.
[ bib |
.pdf ]
It is known that subjects vary in their head movements. This paper presents an analysis of this variation across different tasks and speakers and of its impact on head motion synthesis. Head and articulatory movements acquired by an ElectroMagnetic Articulograph (EMA), synchronously recorded with audio, were used. A data set of speech from 12 people recorded on different tasks confirms that head motion varies across tasks and speakers. Experimental results confirmed that the proposed models were capable of learning and synthesising task-dependent head motions from speech. Subjective evaluation of head motion synthesised with task-specific models shows that models trained on the matched task outperform those trained on a mismatched one, and that free-speech data yield models whose predicted motion participants prefer over models trained on read speech.
|
|
[3]
|
Atef Ben Youssef, Hiroshi Shimodaira, and David A. Braude.
Articulatory features for speech-driven head motion synthesis.
In Proc. Interspeech, pages 2758-2762, Lyon, France, August
2013.
[ bib |
.pdf ]
This study investigates the use of articulatory features for speech-driven head motion synthesis, as opposed to the prosodic features such as F0 and energy that have mainly been used in the literature. In the proposed approach, multi-stream HMMs are trained jointly on synchronous streams of speech and head motion data. Articulatory features can be regarded as an intermediate parametrisation of speech that is expected to have a close link with head movement. Head and articulatory movements acquired by EMA were synchronously recorded with speech. The measured articulatory data were compared to those predicted from speech using an HMM-based inversion mapping system trained in a semi-supervised fashion. Canonical correlation analysis (CCA) on a data set of free speech from 12 people shows that the articulatory features are more correlated with head rotation than prosodic and/or cepstral speech features. It is also shown that head motion synthesised using articulatory features gives higher correlations with the original head motion than when only prosodic features are used.
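As an illustration of the correlation analysis described in this abstract, here is a minimal sketch of canonical correlation analysis between speech-derived features and head rotation using scikit-learn; the arrays, dimensions, and random data are placeholder assumptions, not the paper's corpus.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    T = 5000                                     # number of synchronous frames
    articulatory = rng.standard_normal((T, 12))  # e.g. EMA coil coordinates
    head_rotation = rng.standard_normal((T, 3))  # pitch, yaw, roll angles

    cca = CCA(n_components=3)
    U, V = cca.fit_transform(articulatory, head_rotation)

    # Canonical correlations: correlation of each pair of projected components.
    corrs = [np.corrcoef(U[:, k], V[:, k])[0, 1] for k in range(3)]
    print("canonical correlations:", np.round(corrs, 3))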
|
|
[4]
|
David A. Braude, Hiroshi Shimodaira, and Atef Ben Youssef.
Template-warping based speech driven head motion synthesis.
In Proc. Interspeech, pages 2763-2767, Lyon, France, August
2013.
[ bib |
.pdf ]
We propose a method for synthesising head motion from speech using a combination of an Input-Output Markov model (IOMM) and Gaussian mixture models trained in a supervised manner. A key difference of this approach compared to others is that the head motion in each angle is modelled as a series of motion templates rather than by trying to recover a frame-wise function. The templates were chosen to reflect natural patterns in the head motion, and the states of the IOMM were chosen based on statistics of the templates. This reduces the search space for the trajectories and prevents impossible motions such as discontinuities. For synthesis, our system warps the templates to account for the acoustic features and the other angles' warping parameters. We show that our system is capable of recovering the statistics of the motion that were chosen for the states. Our system was then compared to a baseline using a frame-wise mapping based on previously published work. A subjective preference test including multiple speakers showed that participants prefer the segment-based approach. Both systems were trained on storytelling free speech.
Keywords: Head motion synthesis, GMMs, IOMM
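To make the template-warping idea concrete, the following is a rough sketch of warping a single-angle head-motion template in time and amplitude; the parametrisation (target length, amplitude scale, offset) is a simplification assumed for illustration, not the paper's exact model.

    import numpy as np

    def warp_template(template, target_frames, amplitude_scale, offset=0.0):
        """Resample `template` to `target_frames` samples and rescale it."""
        src = np.linspace(0.0, 1.0, num=len(template))
        dst = np.linspace(0.0, 1.0, num=target_frames)
        return amplitude_scale * np.interp(dst, src, template) + offset

    # Example: a nod-like template stretched to 90 frames and damped to 60%.
    template = np.sin(np.linspace(0.0, np.pi, 50))   # illustrative motion shape
    segment = warp_template(template, target_frames=90, amplitude_scale=0.6)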
|
|
[5]
|
David A. Braude, Hiroshi Shimodaira, and Atef Ben Youssef.
The University of Edinburgh head-motion and audio storytelling
(UoE-HaS) dataset.
In Proc. Intelligent Virtual Agents, pages 466-467. Springer,
2013.
[ bib |
.pdf ]
In this paper we announce the release of a large dataset of storytelling monologue with motion capture for the head and body. Initial tests on the dataset indicate that head motion is more dependent on the speaker than on the style of speech.
|
|
[6]
|
Thomas Hueber, Atef Ben Youssef, Gérard Bailly, Pierre Badin, and Frédéric
Elisei.
Cross-speaker acoustic-to-articulatory inversion using phone-based
trajectory HMM for pronunciation training.
In Proc. Interspeech, Portland, Oregon, USA, 2012.
[ bib |
.pdf ]
The article presents a statistical mapping approach for cross-speaker acoustic-to-articulatory inversion. The goal is to estimate the most likely articulatory trajectories of a reference speaker from the speech audio signal of another speaker. This approach is developed in the framework of our visual articulatory feedback system for computer-assisted pronunciation training (CAPT) applications. The proposed technique is based on the joint modeling of articulatory and acoustic features, for each phonetic class, using full-covariance trajectory HMMs. The acoustic-to-articulatory inversion is achieved in two steps: 1) finding the most likely HMM state sequence from the acoustic observations; 2) inferring the articulatory trajectories from both the decoded state sequence and the acoustic observations. The problem of speaker adaptation is addressed using a voice conversion approach based on trajectory GMM.
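A hedged sketch of the two-step idea (decode a state sequence from acoustics, then infer articulation from the decoded states), using hmmlearn and per-state articulatory means as a crude stand-in for full trajectory-HMM inference; the data, dimensions, and single unsupervised HMM are illustrative assumptions.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    rng = np.random.default_rng(0)
    T, d_ac, d_art, n_states = 2000, 13, 12, 8
    acoustic = rng.standard_normal((T, d_ac))       # e.g. MFCC + energy frames
    articulatory = rng.standard_normal((T, d_art))  # e.g. EMA coil coordinates

    # Training: an HMM over the acoustic stream, plus per-state articulatory
    # targets computed from the resulting state alignment.
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag",
                      n_iter=20, random_state=0)
    hmm.fit(acoustic)
    align = hmm.predict(acoustic)
    targets = np.stack([articulatory[align == s].mean(axis=0)
                        if np.any(align == s) else articulatory.mean(axis=0)
                        for s in range(n_states)])

    # Step 1: Viterbi-decode the state sequence from acoustic observations
    # (re-using the training frames here as a stand-in for another speaker).
    states = hmm.predict(acoustic)
    # Step 2: map the decoded states to articulatory trajectories.
    estimated_articulation = targets[states]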
|
|
[7]
|
Gérard Bailly, Pierre Badin, Lionel Revéret, and Atef Ben Youssef.
Sensorimotor characteristics of speech production.
Cambridge University Press, 2012.
[ bib |
DOI ]
|
|
[8]
|
Atef Ben Youssef.
Control of talking heads by acoustic-to-articulatory inversion
for language learning and rehabilitation.
PhD thesis, Grenoble University, October 2011.
[ bib |
.pdf ]
This thesis presents a visual articulatory feedback system in which the visible and non-visible articulators of a talking head are controlled by inversion from a speaker's voice. Our approach to this inversion problem is based on statistical models built on acoustic and articulatory data recorded from a French speaker by means of an electromagnetic articulograph. A first system combines acoustic speech recognition and articulatory speech synthesis techniques based on hidden Markov models (HMMs). A second system uses Gaussian mixture models (GMMs) to estimate the articulatory trajectories directly from the speech sound. In order to generalise the single-speaker system to a multi-speaker system, we have implemented a speaker adaptation method based on maximum likelihood linear regression (MLLR), which we have assessed by means of a reference articulatory recognition system. Finally, we present a complete visual articulatory feedback demonstrator.
Keywords: visual articulatory feedback; acoustic-to-articulatory speech inversion mapping; ElectroMagnetic Articulography (EMA); hidden Markov models (HMMs); Gaussian mixture models (GMMs); speaker adaptation; face-to-tongue mapping
|
|
[9]
|
Atef Ben Youssef, Thomas Hueber, Pierre Badin, and Gérard Bailly.
Toward a multi-speaker visual articulatory feedback system.
In Proc. Interspeech, pages 589-592, Florence, Italy, August
2011.
[ bib |
.pdf ]
In this paper, we present recent developments on the HMM-based acoustic-to-articulatory inversion approach that we develop for a "visual articulatory feedback" system. In this approach, multi-stream phoneme HMMs are trained jointly on synchronous streams of acoustic and articulatory data, acquired by electromagnetic articulography (EMA). Acoustic-to-articulatory inversion is achieved in two steps. Phonetic and state decoding is first performed. Then articulatory trajectories are inferred from the decoded phone and state sequence using the maximum-likelihood parameter generation algorithm (MLPG). We introduce here a new procedure for the re-estimation of the HMM parameters, based on the Minimum Generation Error criterion (MGE). We also investigate the use of model adaptation techniques based on maximum likelihood linear regression (MLLR), as a first step toward a multi-speaker visual articulatory feedback system.
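For the trajectory inference step, here is a compact sketch of the MLPG idea for a single one-dimensional feature: given per-frame static and delta means and variances taken from a decoded state sequence, solve the weighted least-squares problem for the smooth static trajectory. The [-0.5, 0, 0.5] delta window and the illustrative statistics are assumptions.

    import numpy as np

    def mlpg_1d(mu_static, mu_delta, var_static, var_delta):
        """Maximum-likelihood static trajectory from static + delta statistics."""
        T = len(mu_static)
        W_delta = np.zeros((T, T))
        for t in range(1, T - 1):                 # simple [-0.5, 0, 0.5] window
            W_delta[t, t - 1], W_delta[t, t + 1] = -0.5, 0.5
        W = np.vstack([np.eye(T), W_delta])       # stacked static and delta rows
        mu = np.concatenate([mu_static, mu_delta])
        prec = np.concatenate([1.0 / var_static, 1.0 / var_delta])
        A = W.T @ (prec[:, None] * W)             # W' Sigma^-1 W
        b = W.T @ (prec * mu)                     # W' Sigma^-1 mu
        return np.linalg.solve(A, b)

    # Illustrative per-frame statistics for a 100-frame segment.
    T = 100
    c = mlpg_1d(mu_static=np.sin(np.linspace(0, 3, T)), mu_delta=np.zeros(T),
                var_static=np.full(T, 0.1), var_delta=np.full(T, 0.01))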
|
|
[10]
|
Atef Ben Youssef, Thomas Hueber, Pierre Badin, Gérard Bailly, and Frédéric
Elisei.
Toward a speaker-independent visual articulatory feedback system.
In Proc. 9th International Seminar on Speech Production (ISSP9),
Montreal, Canada, 2011.
[ bib |
.pdf ]
|
|
[11]
|
Thomas Hueber, Pierre Badin, Gérard Bailly, Atef Ben Youssef, Frédéric
Elisei, Bruce Denby, and Gérard Chollet.
Statistical mapping between articulatory and acoustic data.
Application to silent speech interface and visual articulatory feedback.
In Proceedings of the 1st International Workshop on
Performative Speech and Singing Synthesis (p3s), Vancouver, Canada, 2011.
[ bib |
.pdf ]
This paper reviews some theoretical and practical aspects of different statistical mapping techniques used to model the relationships between the articulatory gestures and the resulting speech sound. These techniques are based on the joint modeling of articulatory and acoustic data using Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM). These methods are implemented in two systems: (1) the silent speech interface developed at SIGMA and LTCI laboratories which converts tongue and lip motions, captured during silent articulation by ultrasound and video imaging, into audible speech, and (2) the visual articulatory feedback system, developed at GIPSA-lab, which automatically animates, from the speech sound, a 3D orofacial clone displaying all articulators (including the tongue). These mapping techniques are also discussed in terms of real-time implementation.
Keywords: statistical mapping; silent speech; ultrasound; visual articulatory feedback; talking head; HMM; GMM
|
|
[12]
|
Atef Ben Youssef, Pierre Badin, and Gérard Bailly.
Can tongue be recovered from face? The answer of data-driven
statistical models.
In Proc. Interspeech, pages 2002-2005, Makuhari, Japan,
September 2010.
[ bib |
.pdf ]
This study revisits the face-to-tongue articulatory inversion problem in speech. We compare the Multi Linear Regression method (MLR) with two more sophisticated methods based on Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs), using the same French corpus of articulatory data acquired by ElectroMagnetic Articulography (EMA). GMMs give better overall results than HMMs, while MLR does poorly. GMMs and HMMs maintain the original phonetic class distribution, though with some centralisation effects, which are much stronger with MLR. A detailed analysis shows that, while the jaw/lips/tongue-tip synergy helps recover front high vowels and coronal consonants, the velars are not recovered at all. It is therefore not possible to recover the tongue reliably from the face.
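A minimal sketch of the MLR baseline compared in this abstract: predict tongue coordinates from face/jaw/lip coordinates with a multilinear regression and report the RMS error; the arrays, dimensions, and train/test split are placeholder assumptions.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    T = 3000
    face = rng.standard_normal((T, 6))    # e.g. jaw + lip coil coordinates (mm)
    tongue = rng.standard_normal((T, 6))  # e.g. tongue tip/mid/back coils (mm)

    split = int(0.8 * T)
    mlr = LinearRegression().fit(face[:split], tongue[:split])
    pred = mlr.predict(face[split:])

    rmse = np.sqrt(np.mean((pred - tongue[split:]) ** 2))
    print(f"RMS error: {rmse:.2f} mm")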
|
|
[13]
|
Atef Ben Youssef, Pierre Badin, Gérard Bailly, and Viet-Anh Tran.
Méthodes basées sur les HMMs et les GMMs pour l'inversion
acoustico-articulatoire en parole [HMM- and GMM-based methods for
acoustic-to-articulatory inversion in speech].
In Proc. JEP, pages 249-252, Mons, Belgium, May 2010.
[ bib |
.pdf ]
Two speech inversion methods are implemented and compared. In the first, multistream Hidden Markov Models (HMMs) of phonemes are jointly trained from synchronous streams of articulatory data acquired by EMA and speech spectral parameters; an acoustic recognition system uses the acoustic part of the HMMs to deliver a phoneme chain and the state durations; this information is then used by a trajectory formation procedure based on the articulatory part of the HMMs to resynthesise the articulatory data. In the second, Gaussian Mixture Models (GMMs) are trained on these streams to associate directly articulatory frames with acoustic frames in context. Over a corpus of 17 minutes uttered by a French speaker, the RMS error was 1.66 mm with the HMMs and 2.25 mm with the GMMs.
|
|
[14]
|
Atef Ben Youssef, Pierre Badin, and Gérard Bailly.
Acoustic-to-articulatory inversion in speech based on statistical
models.
In Proc. AVSP 2010, pages 160-165, Hakone, Kanagawa, Japan,
2010.
[ bib |
.pdf ]
Two speech inversion methods are implemented and compared. In the first, multistream Hidden Markov Models (HMMs) of phonemes are jointly trained from synchronous streams of articulatory data acquired by EMA and speech spectral parameters; an acoustic recognition system uses the acoustic part of the HMMs to deliver a phoneme chain and the state durations; this information is then used by a trajectory formation procedure based on the articulatory part of the HMMs to resynthesise the articulatory movements. In the second, Gaussian Mixture Models (GMMs) are trained on these streams to directly associate articulatory frames with acoustic frames in context, using Maximum Likelihood Estimation. Over a corpus of 17 minutes uttered by a French speaker, the RMS error was 1.62 mm with the HMMs and 2.25 mm with the GMMs.
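A hedged sketch of the second (frame-wise GMM) method: fit a joint GMM on stacked [acoustic, articulatory] frames and estimate each articulatory frame as the conditional expectation given the acoustic frame; the shapes, number of mixtures, and random data are illustrative assumptions.

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    T, d_ac, d_art, K = 2000, 13, 12, 8
    acoustic = rng.standard_normal((T, d_ac))
    articulatory = rng.standard_normal((T, d_art))

    gmm = GaussianMixture(n_components=K, covariance_type="full", random_state=0)
    gmm.fit(np.hstack([acoustic, articulatory]))

    def gmm_map(x):
        """Conditional expectation E[articulatory | acoustic = x]."""
        post, cond = np.empty(K), np.empty((K, d_art))
        for k in range(K):
            mu, S = gmm.means_[k], gmm.covariances_[k]
            mu_a, mu_y = mu[:d_ac], mu[d_ac:]
            S_aa, S_ya = S[:d_ac, :d_ac], S[d_ac:, :d_ac]
            post[k] = gmm.weights_[k] * multivariate_normal.pdf(x, mu_a, S_aa)
            cond[k] = mu_y + S_ya @ np.linalg.solve(S_aa, x - mu_a)
        return (post / post.sum()) @ cond

    estimate = gmm_map(acoustic[0])   # articulatory estimate for one frame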
|
|
[15]
|
Pierre Badin, Atef Ben Youssef, Gérard Bailly, Frédéric Elisei, and Thomas
Hueber.
Visual articulatory feedback for phonetic correction in second
language learning.
In Workshop on Second Language Studies: Acquisition, Learning,
Education and Technology, Tokyo, Japan, 2010.
[ bib |
.pdf ]
Orofacial clones can display speech articulation in an augmented mode, i.e. display all major speech articulators, including those usually hidden such as the tongue or the velum. Besides, a number of studies tend to show that the visual articulatory feedback provided by ElectroPalatoGraphy or ultrasound echography is useful for speech therapy. This paper describes the latest developments in acoustic-to-articulatory inversion, based on statistical models, to drive orofacial clones from the speech sound. It suggests that this technology could provide more elaborate feedback than previously available, and that it would be useful in the domain of Computer Aided Pronunciation Training.
|
|
[16]
|
Atef Ben Youssef, Pierre Badin, Gérard Bailly, and Panikos Heracleous.
Acoustic-to-articulatory inversion using speech recognition and
trajectory formation based on phoneme hidden Markov models.
In Proc. Interspeech, pages 2255-2258, Brighton, UK, September
2009.
[ bib |
.pdf ]
In order to recover the movements of usually hidden articulators such as the tongue or the velum, we have developed a data-based speech inversion method. HMMs are trained, in a multistream framework, from two synchronous streams: articulatory movements measured by EMA, and MFCC + energy from the speech signal. A speech recognition procedure based on the acoustic part of the HMMs delivers the chain of phonemes together with their durations; this information is subsequently used by a trajectory formation procedure based on the articulatory part of the HMMs to synthesise the articulatory movements. The RMS reconstruction error ranged between 1.1 and 2 mm.
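As a rough illustration of preparing the two synchronous streams, the following sketch extracts MFCC + energy with librosa and stacks them with an EMA stream frame by frame; the placeholder signal, sampling rate, hop size, and EMA array are assumptions, and real data would need careful resampling of EMA to the acoustic frame rate.

    import numpy as np
    import librosa

    rng = np.random.default_rng(0)
    sr = 16000
    audio = rng.standard_normal(3 * sr).astype(np.float32)  # placeholder signal
    ema = rng.standard_normal((301, 12))                     # placeholder EMA frames

    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=12, hop_length=160)  # 10 ms hop
    energy = librosa.feature.rms(y=audio, hop_length=160)
    acoustic = np.vstack([mfcc, energy]).T       # (frames, 13): MFCC + energy

    n = min(len(acoustic), len(ema))             # crude length alignment
    joint = np.hstack([acoustic[:n], ema[:n]])   # joint observation vectors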
|
|
[17]
|
Atef Ben Youssef, Viet-Anh Tran, Pierre Badin, and Gérard Bailly.
HMMs and GMMs based methods in acoustic-to-articulatory speech
inversion.
In Proc. RJCP, pages 186-192, Avignon, France, 2009.
[ bib |
.pdf ]
|
|
[18]
|
Laurent Besacier, Atef Ben Youssef, and Hervé Blanchon.
The LIG Arabic/English speech translation system at IWSLT08.
In International Workshop on Spoken Language Translation (IWSLT)
2008, pages 58-62, Hawaii, USA, 2008.
[ bib |
.pdf ]
This paper describes the system submitted by the LIG laboratory to the IWSLT08 speech translation evaluation. The LIG participated for the second time in the Arabic-to-English speech translation task. For translation, we used a conventional statistical phrase-based system developed with the Moses open-source decoder. We describe chronologically the improvements made since last year, starting from the IWSLT 2007 system and continuing with the improvements made for our 2008 submission. Then, we discuss in section 5 some post-evaluation experiments made very recently, as well as some ongoing work on Arabic/English speech-to-text translation. This year, the systems were ranked according to the (BLEU+METEOR)/2 score of the primary ASR output run submissions. The LIG was ranked 5th out of 10 based on this rule.
|