The Centre for Speech Technology Research, The University of Edinburgh

12 Jun 2001

Vepa Jithendra & John Niekrasz


Unit Selection Synthesis: Better Concatenation costs using LSFs and MCA    (Vepa Jithendra)

This talk focuses on the use of different acoustic features to compute concatenation (join) costs in unit-selection based concatenative speech synthesis. We compare two methods for join cost computation:

  1. Line Spectral Frequencies (LSFs), which are derived from an all-pole model of speech.
  2. Formant frequencies and bandwidths obtained from Multiple Centroid Analysis (MCA) of the speech power spectrum.

We present our results in the form of pairs of speech files synthesised using each of the above features to compute the join costs. The talk concludes with some pointers to future work.
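The join-cost computation described above can be sketched in code. The snippet below is a minimal illustration, not the actual implementation from the talk: it assumes LSF (or formant) feature vectors have already been extracted for the frames on either side of a candidate join, and scores the join as a weighted Euclidean distance between them. The function name and the choice of distance are illustrative assumptions.

```python
import numpy as np

def join_cost(feat_left, feat_right, weights=None):
    """Concatenation (join) cost between the last frame of the left
    unit and the first frame of the right unit.

    feat_left, feat_right: 1-D feature vectors at the join point
    (e.g. LSFs, or MCA formant frequencies and bandwidths).
    weights: optional per-dimension weights; defaults to all ones.

    Returns the weighted Euclidean distance between the two vectors;
    a lower cost indicates a smoother-sounding join.
    """
    a = np.asarray(feat_left, dtype=float)
    b = np.asarray(feat_right, dtype=float)
    w = np.ones_like(a) if weights is None else np.asarray(weights, dtype=float)
    diff = a - b
    return float(np.sqrt(np.sum(w * diff ** 2)))
```

In a unit-selection search, a cost like this would be evaluated for every candidate pair of adjacent units and combined with a target cost before running the Viterbi search over the unit lattice.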

Applying Speech Technology to Singing with the Singing TIMIT corpus    (John Niekrasz)

It is often forgotten in voice research that speech represents only a part of the wide range of sounds produced by the human voice. My research as a PhD student at CSTR aims to broaden this traditional vision of voice research to include another major set of vocal sounds: singing.

Many of the important initial discoveries made in voice science, such as the source-filter model, are now the foundation for research being done today in all fields of speech science. However, with the ever expanding power of computers, the increasing accessibility of speech corpora, and the existence of marketable applications, speech technology research today is dominated by massive data-driven algorithms which aim to extract detailed information from large collections of recorded speech. While this approach is clearly useful, particularly for research toward the improvement of large-vocabulary speech recognition systems, the incomplete representation in the data of all possible vocal sounds, including linguistic sounds, prevents this research from fully modeling the vocal mechanism in abstract terms.

With the above in mind, I have set out to create the Singing TIMIT corpus: a set of labeled English singing intended for data-driven voice research. Closely modeled on the existing and widely used TIMIT corpus, Singing TIMIT contains the same phonetically compact set of sentences as TIMIT, with a similar labeling scheme. However, it also strives to control and broaden the scope of other interesting variables such as fundamental frequency, syllable duration, and loudness, all of which are explicitly controlled to a degree in singing. The expansion of some of these important variables beyond their normal ranges in speech, the ability to control them through written music, and the inclusion of the full linguistic richness of the original TIMIT corpus could lead to data-driven analyses that capture more meaningful, abstract features of the voice. I hope to initiate such research in the subsequent years of my PhD, and then, with such a corpus available, explore research specific to singing technology, such as singing transcription and synthesis.


<owner-pworkshop@ling.ed.ac.uk>