Creating a Multidisciplinary Monolingual Text Corpus for Persian OR Modeling Persian Language Syntax and Morphology in LingBench IDE Parser

Peyman Nojoumian

LingBench IDE™ [1] language models are based mainly on the morpho-syntactic features of the natural language concerned. An enriched lexicon featuring subcategories, word frequencies, and relative word probabilities gives the model a simple but extremely efficient and fast parsing ability. Furthermore, the models compacted by the application can be used in the form of high-speed embedded SDKs [2] in a variety of natural language engineering applications, such as Text-To-Speech, Automatic Speech Recognition, Information Extraction, Dialogue Systems, Machine Translation, and Grammar and Spell Checkers. During the three-month internship, a language model was developed for Persian with LingBench IDE™. This prototype model is able to parse Persian sentences and words with a very high precision rate.
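
To make the idea of an enriched lexicon concrete, the following is a minimal sketch of how entries carrying a subcategory and a word frequency could yield relative word probabilities. It is not LingBench's actual lexicon format, which is not described here: the names LexiconEntry and relative_probabilities, the category labels, and all counts are hypothetical and chosen only for illustration. The transliterated Persian word "dar" is used because it can be either a preposition ("in") or a noun ("door").

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LexiconEntry:
    word: str         # transliterated word form
    subcategory: str  # hypothetical category label, e.g. "PREP" or "NOUN"
    frequency: int    # raw count; the numbers below are invented


def relative_probabilities(entries):
    """Return P(subcategory | word) for every entry, from raw frequencies."""
    totals = defaultdict(int)
    for entry in entries:
        totals[entry.word] += entry.frequency
    return {
        (entry.word, entry.subcategory): entry.frequency / totals[entry.word]
        for entry in entries
    }


# Toy lexicon: counts are assumptions, not corpus measurements.
entries = [
    LexiconEntry("dar", "PREP", 90),    # "in"
    LexiconEntry("dar", "NOUN", 10),    # "door"
    LexiconEntry("ketab", "NOUN", 25),  # "book"
]

probs = relative_probabilities(entries)

# Pick the most probable reading of an ambiguous word.
best = max(("PREP", "NOUN"), key=lambda cat: probs[("dar", cat)])
```

A parser could consult such probabilities to try the most likely reading of an ambiguous word first, which is one plausible way a frequency-enriched lexicon speeds up parsing.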


[1] “Integrated Development Environment”, a program user interface that allows the user to create and modify, save and reopen a set of data (for instance software source files) that belong together (because they interact), AND to ‘compile’ the data AND see where errors occur during compilation, and to activate the data (in the case of software, this means starting the program). (De Brabander, 2003)

[2] Software Development Kit, a package including a software library module and documentation, intended to allow software developers to link their programs to software modules so that they can use the functionalities from within their application. (De Brabander, 2003)