This paper proposes a novel data-driven method for detecting the optimal sequence of prosodic phrases in continuous speech. Using the One Pass search algorithm, the pitch pattern of the input speech is divided into the prosodic segments that minimize the overall distortion with respect to pitch pattern templates of accent phrases. The templates are designed by clustering a large number of training samples of accent phrases. On the ATR continuous speech database, uttered by 10 speakers, the rate of correct segmentation reached a maximum of 91.7% when the training and test data came from speakers of the same sex, and 88.6% when they came from speakers of the opposite sex.
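The idea of choosing segment boundaries that minimize total template distortion can be illustrated with a small sketch. This is not the One Pass algorithm described in the abstract: it swaps in a simpler segmental dynamic program that stretches each candidate segment linearly to the template length. The function names, the squared-error distortion, and the `min_len` and `max_len` duration limits are all assumptions made for illustration.

```python
import numpy as np

def resample(segment, length):
    """Linearly interpolate a 1-D segment to a fixed number of points."""
    src = np.linspace(0.0, 1.0, num=len(segment))
    dst = np.linspace(0.0, 1.0, num=length)
    return np.interp(dst, src, segment)

def segment_cost(segment, templates):
    """Return (distortion, template index) for the closest template.

    Assumption: distortion is the mean squared error after linear time
    stretching, weighted by the segment duration in frames.
    """
    warped = resample(segment, templates.shape[1])
    errors = np.mean((templates - warped) ** 2, axis=1)
    best = int(np.argmin(errors))
    return float(errors[best]) * len(segment), best

def segment_contour(pitch, templates, min_len=5, max_len=40):
    """Split a pitch contour into segments with minimum total distortion."""
    pitch = np.asarray(pitch, dtype=float)
    templates = np.asarray(templates, dtype=float)
    n = len(pitch)
    total = np.full(n + 1, np.inf)  # total[t]: best cost of covering pitch[:t]
    total[0] = 0.0
    back = [None] * (n + 1)         # back[t]: (segment start, template index)
    for end in range(min_len, n + 1):
        for start in range(max(0, end - max_len), end - min_len + 1):
            if not np.isfinite(total[start]):
                continue
            cost, label = segment_cost(pitch[start:end], templates)
            if total[start] + cost < total[end]:
                total[end] = total[start] + cost
                back[end] = (start, label)
    if back[n] is None:
        raise ValueError("no segmentation satisfies the duration limits")
    # Trace the best boundaries back from the end of the contour.
    segments, end = [], n
    while end > 0:
        start, label = back[end]
        segments.append((start, end, label))
        end = start
    return segments[::-1], float(total[n])
```

Each returned tuple is `(start, end, template_index)` with `end` exclusive. A frame-synchronous search with nonlinear time alignment, as in a true One Pass search, would replace the linear stretching used here.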

We are concerned with integrating connectionist networks into a hidden Markov model (HMM) speech recognition system. This is achieved through a statistical interpretation of connectionist networks as probability estimators. We review the basis of HMM speech recognition and point out the possible benefits of incorporating connectionist networks. We discuss the issues involved in constructing a connectionist HMM recognition system, including the choice of connectionist probability estimator. We describe the performance of such a system, using a multi-layer perceptron probability estimator, evaluated on the speaker-independent DARPA Resource Management database. In conclusion, we show that a connectionist component improves a state-of-the-art HMM system.
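The abstract does not say how the network outputs enter the HMM. One common arrangement, assumed here rather than taken from the text, treats the outputs as state posteriors P(q|x) and divides them by the state priors P(q), which by Bayes' rule gives the likelihood p(x|q) up to a factor p(x) shared by all states. The sketch below shows that conversion followed by a plain Viterbi decode; the function names and the `floor` parameter are hypothetical.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """Convert per-frame state posteriors into scaled log-likelihoods.

    log p(x|q) - log p(x) = log P(q|x) - log P(q); the p(x) term is the
    same for every state, so it does not change the best path.
    """
    posteriors = np.maximum(np.asarray(posteriors, dtype=float), floor)
    priors = np.maximum(np.asarray(priors, dtype=float), floor)
    return np.log(posteriors) - np.log(priors)

def viterbi(log_obs, log_trans, log_init):
    """Most likely state path given log observation scores.

    log_obs has shape (frames, states); log_trans[i, j] scores i -> j.
    """
    n_frames, n_states = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans      # scores[i, j]
        back[t] = np.argmax(scores, axis=0)      # best predecessor of each j
        delta = scores[back[t], np.arange(n_states)] + log_obs[t]
    path = [int(np.argmax(delta))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

In use, the posteriors would come from the multi-layer perceptron and the priors from the relative state frequencies in the training alignment; both are supplied by the caller in this sketch.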

A frequent phenomenon in information-seeking spoken dialogs is the short elliptical utterance whose mood (declarative or interrogative) can only be distinguished by intonation. The main acoustic evidence is conveyed by the fundamental frequency (F0) contour. Many algorithms for F0 determination have been reported in the literature. A common problem for them is posed by irregularities of speech known as laryngealizations. This article describes an approach based on neural network techniques for the improved determination of fundamental frequency. First, an improved version of our neural network algorithm for reconstruction of the voice source signal (glottis signal) is presented. Second, the reconstructed voice source signal is used as input to another neural network distinguishing the three classes 'voiceless', 'voiced-non-laryngealized', and 'voiced-laryngealized'. Third, the results are used to improve an existing F0 algorithm. Results of this approach are presented and discussed in the context of the application in a spoken dialog system.