WWW pages of 3rd European Master School on Language and Speech

Lexical Resources for Statistical Machine Translation

Peter Dirix
(KU Leuven)

This presentation describes part of the METIS project, an EU-sponsored project on statistical machine translation (2002-2003). The project is conducted by the Institute for Language and Speech Processing in Athens and the Centre for Computational Linguistics of the KU Leuven, in co-operation with the universities of Antwerp and Tilburg.

Machine translation (MT) has been the Holy Grail of computational linguistics since the field's beginnings in the fifties. However, after fifty years of research and the application of a wide variety of techniques, the results are still not satisfactory. The first MT systems translated word by word. Later MT systems generally took a rule-based approach. These rule-based systems reached a bottleneck long ago, mainly because of the sheer volume of lexical, syntactic and semantic information that has to be encoded and manipulated. Since the eighties, new approaches have been tried, usually based on statistical and pattern-matching techniques. The performance of statistical machine translation (SMT) systems is comparable to that of commercial rule-based systems, but the gap in terms of commercial applications remains considerable.

Currently, all SMT approaches are based on very large parallel corpora or bitexts. Bitexts contain original and translated texts, usually aligned at the sentence level. A statistical model is trained on these bitexts and then used to translate new sentences. One big disadvantage of this approach is that almost no large bitexts are available, even for widespread languages such as French and English. On the other hand, more and more reasonably large monolingual corpora are becoming available for a wide range of languages. The novelty of the METIS project is that it eliminates the use of bitexts altogether. It aims to investigate the possibility of developing a reasonable SMT system without relying on bilingual corpora. Instead, the proposed system will rely on large corpora of monolingual texts, together with some standard linguistic technology. The translation will be achieved by using a set of language-specific resources (taggers and lemmatizers for the individual languages) and a set of bilingual resources (bilingual lexica and tag-mapping rules). This also means that only a minimal effort is required to add a new language pair: the only resources needed besides the monolingual corpora are a bilingual lexicon and tag-mapping rules.
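To make the interplay of these resources concrete, the following is a minimal sketch in Python and not the METIS implementation. It assumes the source sentence has already been tagged and lemmatized, and it uses bigram counts over a target-language corpus as a stand-in for whatever corpus matching the project actually performs. All names, tags, lexicon entries and corpus sentences are hypothetical toy data.

```python
from collections import Counter
from itertools import product

# Hypothetical toy resources; none of these entries come from the METIS project.
LEXICON = {  # bilingual lexicon: Dutch lemma -> candidate English lemmas
    "de": ["the"],
    "bank": ["bank", "couch"],
    "lenen": ["lend", "borrow"],
    "geld": ["money"],
}
TAG_MAP = {"LID": "DET", "N": "NOUN", "WW": "VERB"}  # tag-mapping rules (assumed tag names)
TARGET_CORPUS = [  # lemmatized monolingual English corpus
    ["the", "bank", "lend", "money"],
    ["the", "bank", "lend", "money", "to", "customer"],
    ["he", "sit", "on", "the", "couch"],
]


def bigram_counts(corpus):
    """Count adjacent lemma pairs in the monolingual target corpus."""
    counts = Counter()
    for sentence in corpus:
        counts.update(zip(sentence, sentence[1:]))
    return counts


def translate(tagged_source, lexicon, tag_map, counts):
    """Pick the candidate lemma sequence best attested in the target corpus.

    tagged_source is a list of (lemma, tag) pairs, i.e. the output of a
    source-language tagger and lemmatizer. Unknown lemmas and tags are
    passed through unchanged.
    """
    options = [lexicon.get(lemma, [lemma]) for lemma, _ in tagged_source]
    target_tags = [tag_map.get(tag, tag) for _, tag in tagged_source]

    def score(candidate):
        # Sum of corpus frequencies of the candidate's adjacent lemma pairs.
        return sum(counts[pair] for pair in zip(candidate, candidate[1:]))

    # Exhaustive search over all lexicon combinations; fine for a toy example only.
    best = max(product(*options), key=score)
    return list(zip(best, target_tags))


source = [("de", "LID"), ("bank", "N"), ("lenen", "WW"), ("geld", "N")]
result = translate(source, LEXICON, TAG_MAP, bigram_counts(TARGET_CORPUS))
print(result)
# [('the', 'DET'), ('bank', 'NOUN'), ('lend', 'VERB'), ('money', 'NOUN')]
```

In this sketch no bitext is involved at any point: the lexicon proposes candidates, the tag map carries the grammatical category across, and the monolingual corpus alone decides between "bank" and "couch" or "lend" and "borrow". Adding a language pair would mean supplying a new lexicon, tag map and target corpus, which mirrors the claim above.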

The presentation therefore describes the assessment of the lexical resources needed for the METIS project: monolingual corpora for English and Dutch, a Dutch-English bilingual lexicon, and tag-mapping rules from Dutch to English.