The Centre for Speech Technology Research, The University of Edinburgh

Publications by M. Sam Ribeiro

[1] Manuel Sam Ribeiro, Junichi Yamagishi, and Robert A. J. Clark. A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis. In Proc. Interspeech, Dresden, Germany, September 2015. [ bib | .pdf ]
The Continuous Wavelet Transform (CWT) has been recently proposed to model f0 in the context of speech synthesis. It was shown that systems using signal decomposition with the CWT tend to outperform systems that model the signal directly. The f0 signal is typically decomposed into various scales of differing frequency. In these experiments, we reconstruct f0 with selected frequencies and ask native listeners to judge the naturalness of synthesized utterances with respect to natural speech. Results indicate that HMM-generated f0 is comparable to the CWT low frequencies, suggesting it mostly generates utterances with neutral intonation. Middle frequencies achieve very high levels of naturalness, while very high frequencies are mostly noise.

[2] Manuel Sam Ribeiro and Robert A. J. Clark. A multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Brisbane, Australia, April 2015. [ bib | .pdf ]
We propose a representation of f0 using the Continuous Wavelet Transform (CWT) and the Discrete Cosine Transform (DCT). The CWT decomposes the signal into various scales of selected frequencies, while the DCT compactly represents complex contours as a weighted sum of cosine functions. The proposed approach has the advantage of combining signal decomposition and higher-level representations, thus modeling low frequencies at higher levels and high frequencies at lower levels. Objective results indicate that this representation improves f0 prediction over traditional short-term approaches. Subjective results show that improvements are seen over the typical MSD-HMM and are comparable to the recently proposed CWT-HMM, while using fewer parameters. These results are discussed and future lines of research are proposed.
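
As a rough illustration of the DCT idea mentioned in the abstract above (a contour represented compactly as a weighted sum of cosine functions), the following is a minimal sketch and not the authors' implementation. The toy rise-fall contour, its length of 50 frames, the Hz values, and the choice of keeping 5 coefficients are all assumptions made for the example; the function names dct2, idct2, and rmse are hypothetical helpers.

```python
import math

def dct2(x):
    """Orthonormal DCT-II: weights of the cosine basis functions for x."""
    n = len(x)
    coeffs = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs

def idct2(coeffs, n):
    """Rebuild n samples from the first len(coeffs) DCT-II coefficients."""
    samples = []
    for t in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        for k in range(1, len(coeffs)):
            s += coeffs[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
        samples.append(s)
    return samples

def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))

# Assumed toy contour: a smooth rise-fall in Hz over 50 frames.
n_frames = 50
contour = [120.0 + 30.0 * math.sin(math.pi * t / (n_frames - 1)) for t in range(n_frames)]

coeffs = dct2(contour)

# Keep only the first few coefficients as a compact representation.
n_keep = 5  # assumed value, chosen for illustration only
approx = idct2(coeffs[:n_keep], n_frames)

print("RMSE with", n_keep, "coefficients:", rmse(contour, approx))
```

Keeping more coefficients can only lower the reconstruction error, since the basis is orthonormal; how many are needed for a given linguistic unit is a modeling choice that this sketch does not address.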

[3] Philip N. Garner, Rob Clark, Jean-Philippe Goldman, Pierre-Edouard Honnet, Maria Ivanova, Alexandros Lazaridis, Hui Liang, Beat Pfister, Manuel Sam Ribeiro, Eric Wehrli, et al. Translation and prosody in Swiss languages. In Nouveaux cahiers de linguistique française, 31. 3rd Swiss Workshop on Prosody, Geneva, Switzerland, September 2014. [ bib | .pdf ]
The SIWIS project aims to investigate spoken language translation, where both the speaker characteristics and prosody are translated. This means the translation carries not only spoken content, but also speaker identification, emotion and intent. We describe the background of the project, and present some initial approaches and results. These include the design and collection of a Swiss bilingual database that both enables research in Swiss-accented speech processing and facilitates reliable evaluation.