This project extends work done for the Computational Natural Language Learning 2001 (CoNLL-2001) shared task. The goal at CoNLL-2001 was to identify clause boundaries in text. Clauses are used to predict phrasing in text-to-speech and to infer text alignment for machine translation. In the work described here, new performance bounds are estimated, and clauses are classified as "main" or "subordinate" in addition to having their boundaries recognised.
A lower bound is estimated following Leffa (1998). This new symbolic baseline allows the machine learning results to be placed in the context of the field as a whole, rather than compared only with other machine learning systems. Upper bounds and performance goals are estimated from human performance and from state-of-the-art full parsers. Knowing whether a clause is main or subordinate provides additional information to the NLP tasks mentioned above, namely text-to-speech phrasing and text alignment for machine translation. We can imagine, for example, that this information could prove quite useful for text alignment.