The state of the art in speech technology has benefitted from advances in machine learning and signal processing, but takes little account of recent findings from speech science. Conversely, speech science research is frequently not well informed by recent advances in speech technology. EdSST is an interdisciplinary research training programme that aims to close the gap between speech science and technology, focussing on a number of overlapping research areas, each of which includes components from both speech science and speech technology:
Articulatory instrumentation and modelling
Recording and modelling articulatory data is a central activity in EdSST. Recording of articulatory data will take place at the multichannel articulatory recording facility at QMUC. This facility includes several modalities:
- Electromagnetic articulography (EMA) to track the dynamic movements of the lips, soft palate, tongue and jaw during speech production;
- Electropalatography (EPG) to track the contact between the tongue and the palate during speech production;
- The VICON movement analysis system for optical tracking of lip and jaw movement;
- Laryngograph measurements of vocal fold activity;
- Ultrasound analysis of tongue movement during speech.
Multichannel recordings of speech articulation provide a basis for modelling and visualizing speech production at the articulatory level. This has several technological implications. It is of considerable interest to develop improved articulatory representations of acoustic signals (eg using statistical machine learning, such as latent variable methods), and to investigate the use of articulatory constraints to improve current statistical models for automatic speech recognition and speech synthesis. Such an approach also offers the possibility of more precise diagnosis and assessment of speech disorders and more efficient therapy (eg by providing visual feedback to talkers with cleft palate).
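As a minimal sketch of the kind of latent variable method mentioned above, the following recovers a low-dimensional representation of synthetic EMA-like channel data using PCA; the data, channel counts and dimensions are invented for the example and do not correspond to any real recording at QMUC.

```python
import numpy as np

# Hypothetical illustration: recover a low-dimensional latent representation
# of EMA coil trajectories via PCA (a simple linear latent variable method).
rng = np.random.default_rng(0)

# Fake "EMA" data: 500 frames x 12 channels (6 coils, x/y), driven by
# 3 underlying articulatory gestures plus measurement noise.
latent = rng.standard_normal((500, 3))
mixing = rng.standard_normal((3, 12))
ema = latent @ mixing + 0.05 * rng.standard_normal((500, 12))

# PCA via SVD of the mean-centred data.
centred = ema - ema.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)

# Proportion of variance captured by the first 3 principal components.
explained = (s[:3] ** 2).sum() / (s ** 2).sum()
print(f"variance explained by 3 latent dims: {explained:.3f}")

# Project each frame into the 3-dimensional latent space.
codes = centred @ Vt[:3].T
print(codes.shape)  # (500, 3)
```

In practice the mapping between acoustics and articulation is nonlinear and many-to-one, which is why richer statistical latent variable models are of interest; this linear example only illustrates the dimensionality-reduction idea.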
Speech synthesis
Our speech synthesis research is built around the open source Festival speech synthesis system. Current research includes model-based approaches to unit selection, expressive speech synthesis and prosodic modelling; perceptually rigorous approaches to speech synthesis evaluation; articulatory constraints to improve speech synthesis (exploiting the articulatory modelling work discussed above); and statistical models for speech synthesis based on the trajectory HMM technique.
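The core idea of unit selection can be sketched as a Viterbi search that picks, for each target position, the candidate unit minimising a target cost plus a join cost. This is only an illustrative sketch, not Festival's actual implementation; the unit names, features and cost functions below are invented for the example.

```python
# Illustrative unit selection: Viterbi search over candidate units,
# minimising target cost (fit to the target spec) + join cost (smoothness
# between consecutive units).

def select_units(targets, candidates, target_cost, join_cost):
    """targets: list of target specs; candidates[i]: units for position i."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j at position i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Toy usage: targets are desired durations; units are (name, duration, pitch).
targets = [100, 120, 90]
candidates = [
    [("a1", 95, 110), ("a2", 140, 100)],
    [("b1", 118, 112), ("b2", 80, 150)],
    [("c1", 92, 111), ("c2", 60, 90)],
]
tc = lambda t, u: abs(t - u[1])
jc = lambda p, u: abs(p[2] - u[2])  # penalise pitch discontinuities at joins
print([u[0] for u in select_units(targets, candidates, tc, jc)])
```

Real systems use much richer target and join costs (spectral, prosodic and linguistic features) and databases of many thousands of units, but the search structure is the same.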
Recently, QMUC and CSTR have embarked on a joint project concerning the synthesis of normal speech with disfluencies in order to make synthetic voices sound more spontaneous and natural. QMUC also works on the creation of synthetic voices that people with hearing loss can understand easily, while ensuring that they are pleasant to listen to.
Speech recognition
Much of our work in speech recognition is based around the construction of automatic speech recognition systems and the development of novel acoustic and language models. A particular focus is the recognition of multiparty conversational speech, which is characterized by overlap between talkers as well as the usual phenomena of spontaneous speech, such as disfluency. Since we are concerned with processing speech in realistic environments, we are increasingly concerned with capturing speech using distant (tabletop) microphones and microphone arrays, rather than the close-talking microphones usually employed in speech recognition. The research problems posed by such environments include detecting and separating overlapping speech, model-based beamforming algorithms for microphone arrays, and the development of language models for multiparty speech. This work is linked to research underway in the FP6 IST Integrated Project AMI, coordinated by CSTR.
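To illustrate the principle behind microphone-array processing, the following is a minimal delay-and-sum beamformer on synthetic data: aligning the channels on the source reinforces the signal while averaging down uncorrelated noise. The model-based methods investigated in AMI are far more sophisticated; the signals, delays and noise levels here are invented for the example.

```python
import numpy as np

# Minimal delay-and-sum beamformer sketch on synthetic data.
rng = np.random.default_rng(1)

n = 16000
source = rng.standard_normal(n)
delays = [0, 3, 7, 12]  # assumed known per-microphone delays (in samples)

# Simulate 4 noisy microphone channels: delayed source + independent noise.
mics = [np.roll(source, d) + 0.5 * rng.standard_normal(n) for d in delays]

# Delay-and-sum: undo each delay, then average the aligned channels.
aligned = [np.roll(m, -d) for m, d in zip(mics, delays)]
beamformed = np.mean(aligned, axis=0)

def snr(est):
    """SNR (dB) of an estimate relative to the clean source."""
    noise = est - source
    return 10 * np.log10(source.var() / noise.var())

print(f"single mic SNR: {snr(aligned[0]):.1f} dB")
print(f"beamformed SNR: {snr(beamformed):.1f} dB")
```

With four microphones and independent noise, averaging should buy roughly 10·log10(4) ≈ 6 dB; in real rooms, reverberation, unknown delays and correlated noise make the problem much harder, motivating the model-based approaches above.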
We propose to develop acoustic models for speech recognition that are informed by results and data from speech science. This is in contrast to state-of-the-art approaches which use sophisticated machine learning techniques, but take little account of knowledge about human speech production. We are particularly interested in using articulatory constraints for both sub-word models and pronunciation models, and in the explicit modelling of speech as a set of multiple asynchronous streams of data. The models that we are exploring include dynamic Bayesian networks, trajectory models and switching state space models. A current strand of research is concerned with precisely characterizing the relationship between these models. A unique aspect of the current collaboration is the possibility of using multichannel articulatory data as auxiliary variables to inform speech recognition. This strand of work builds on the EPSRC-funded MOCHA project.
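A toy sketch of the multiple-streams idea: each state scores an acoustic observation and an auxiliary articulatory observation separately, and the per-stream log-likelihoods are combined log-linearly with stream weights. The state definitions, Gaussians and weights below are invented for illustration and are not the models under development.

```python
import math

def gauss_loglik(x, mean, var):
    """Log-likelihood of x under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def state_score(acoustic_obs, artic_obs, state, w_acoustic=0.7, w_artic=0.3):
    """Weighted log-linear combination of per-stream state likelihoods."""
    la = gauss_loglik(acoustic_obs, *state["acoustic"])
    lr = gauss_loglik(artic_obs, *state["articulatory"])
    return w_acoustic * la + w_artic * lr

# Two toy states with similar acoustics; the articulatory stream (e.g. a
# lip-aperture measurement) disambiguates them.
states = {
    "b": {"acoustic": (1.0, 1.0), "articulatory": (0.0, 0.5)},  # lips closed
    "d": {"acoustic": (1.1, 1.0), "articulatory": (2.0, 0.5)},  # tongue tip
}
acoustic_obs, artic_obs = 1.05, 0.1  # acoustics ambiguous, lips closed
scores = {s: state_score(acoustic_obs, artic_obs, p) for s, p in states.items()}
print(max(scores, key=scores.get))  # articulatory evidence favours "b"
```

Dynamic Bayesian networks generalise this by letting the streams evolve asynchronously with their own hidden state, rather than forcing frame-level synchrony as this sketch does.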
Other areas to be explored include the adaptation of acoustic models to changes specific to ageing voices and children's voices, and to the increased variability of dysarthric voices.
Human-computer dialogue systems
The main aim of our joint dialogue research is to develop strategies for tailoring spoken interactions between humans and computers to the needs and abilities of the human user. In particular, we are interested in accommodating limitations of hearing, memory and cognitive processes, such as those found in the ageing population. This research ties in with our work on inclusive design. We are also pursuing research into the development of automatic dialogue optimisation strategies. Results from the recognition and synthesis work will feed directly into this work.
A second, more recent area of dialogue research concerns the development of lifelike conversational agents. Current research in this area, concerned with agents that display human-like facial animation and gesture (as well as speech), has centred on making the agents seem natural. This involves controlling mouth/lip movement ("lip-synchronisation" to speech) and eye movement. Building on this, we are now concerned with controlling the movement of the agent's head, including facial expression. This is much less studied, even though head movement often plays a more important role in naturalness and intelligibility than the mouth and eyes.
Inclusive design
Our goal is to use principles of inclusive design in order to create speech technology solutions that can be used by people with a wide range of cognitive, speaking, and hearing abilities and that are designed with the close involvement of a representative sample of end users. There is a great potential demand for inclusive speech technology applications. This demand comes from several sources, most notably the growing market for solutions that allow older and disabled persons to be cared for in their own homes.
Examples of inclusive design are spoken dialogue systems that automatically accommodate users with varying degrees of memory loss, speech synthesis systems that can be adjusted to different degrees of hearing impairment, and speech recognition systems that adapt easily to users with a range of articulatory and cognitive impairments due to stroke or traumatic brain injury. As these examples demonstrate, the theme of inclusive design is closely interlinked with our foci on speech recognition, speech synthesis, and spoken dialogue systems.
Augmentative and alternative communication
The state-of-the-art speech technology described above will be used to create communication aids for users with speech impairments, ranging from younger users with cerebral palsy or motor neurone disease to older users with Parkinson's disease. Many of the communication aids available today are based on dated technology, used mainly because it is fast, convenient, and has a small footprint. For example, many voices for communication systems are still based on formant synthesis, a technique that produces artificial-sounding speech, whereas modern unit selection systems allow the generation of far more natural voices. Unit selection techniques could also allow users' own voices to be used in communication systems, an important option when a person is going to lose their voice to a progressive disorder.