Modelling global information with Latent Semantic Analysis

Aurélien Giraud

When humans transcribe speech, for instance when subtitling broadcast news, they use the preceding and following words to work out which word was said. One reason why automatic systems are still not as good as humans at this task is that humans are better at predicting words.

Machines can use syntactic and semantic information, but they usually see only one or two words to the left and right of the word being predicted; this is why such models are called local Language Models. Consequently, automatic speech recognisers that rely on them do not, and cannot, use topic information. One proposed solution is to use the statistical method called Latent Semantic Analysis to build a global Language Model. This report investigates, by implementing the necessary computer programs, how much work, and what kind of work, is needed to go from a corpus to this global model of the language. It also presents and discusses the results of tests performed on the Language Model obtained from the Latent Semantic Analysis.
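
To make the path from a corpus to a global model more concrete, here is a minimal sketch of Latent Semantic Analysis on a toy corpus. It is an illustration under stated assumptions, not the implementation described in this report: the five toy documents, the function names, the use of raw counts instead of a weighted word-document matrix, and the power-iteration routine used to approximate the truncated singular value decomposition are all choices made for this example. The history is mapped into the latent space by the usual fold-in of a pseudo-document, and its closeness to a word is measured by a cosine.

```python
import math
import random

def term_document_matrix(docs):
    """Return (vocab, W) where W[i][j] counts word i in document j (raw counts, an assumption)."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    row = {w: i for i, w in enumerate(vocab)}
    W = [[0.0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for w in doc.split():
            W[row[w]][j] += 1.0
    return vocab, W

def truncated_svd(W, k, iters=200, seed=0):
    """Approximate the k leading singular triplets of W by power iteration
    with deflation on W^T W. k must not exceed the rank of W."""
    m, n = len(W), len(W[0])
    G = [[sum(W[i][a] * W[i][b] for i in range(m)) for b in range(n)] for a in range(n)]
    rng = random.Random(seed)
    U, S = [], []
    for _ in range(k):
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            v = [sum(G[a][b] * v[b] for b in range(n)) for a in range(n)]
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v]
        lam = sum(v[a] * G[a][b] * v[b] for a in range(n) for b in range(n))
        sigma = math.sqrt(lam)
        # Left singular vector: u = W v / sigma (one entry per word).
        U.append([sum(W[i][b] * v[b] for b in range(n)) / sigma for i in range(m)])
        S.append(sigma)
        # Deflate so that the next pass finds the next singular direction.
        G = [[G[a][b] - lam * v[a] * v[b] for b in range(n)] for a in range(n)]
    return U, S

def word_vector(U, S, i):
    """Word i in the latent space, scaled by the square roots of the singular values."""
    return [U[r][i] * math.sqrt(S[r]) for r in range(len(S))]

def history_vector(U, S, vocab, history):
    """Fold the history in as a pseudo-document (d^T U S^-1), with the same scaling."""
    row = {w: i for i, w in enumerate(vocab)}
    counts = [0.0] * len(vocab)
    for w in history.split():
        if w in row:  # out-of-vocabulary words are ignored
            counts[row[w]] += 1.0
    return [sum(counts[i] * U[r][i] for i in range(len(vocab))) / math.sqrt(S[r])
            for r in range(len(S))]

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 0.0 if na == 0.0 or nb == 0.0 else sum(x * y for x, y in zip(a, b)) / (na * nb)

def closeness(word, history, vocab, U, S):
    """Cosine between a word and a (possibly long) history in the latent space."""
    return cosine(word_vector(U, S, vocab.index(word)), history_vector(U, S, vocab, history))

# Toy corpus: three "finance" documents and two "sport" documents.
docs = [
    "stocks market shares bank",
    "bank market trading stocks",
    "shares trading market stocks",
    "match goal team player",
    "team player goal league",
]
vocab, W = term_document_matrix(docs)
U, S = truncated_svd(W, k=2)
for candidate in ("stocks", "goal"):
    print(candidate, closeness(candidate, "market shares", vocab, U, S))
```

Unlike a local model, the history here can be arbitrarily long: every word seen so far contributes to the pseudo-document, so a candidate word is scored against the topic of the whole context rather than against its one or two neighbours. Turning such a closeness score into a probability, and combining it with a local model, requires further steps that this sketch does not cover.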