An Extensive Review of Text Classification and Categorization

Jenwei Liu

This project is an extensive review of text classification and categorization. Automatic text categorization is an important task that can help people find information in huge online resources. Classification, or categorization, is the task of assigning objects from a universe to two or more classes or categories. Text classification research and practice have exploded in the past decade. Text classification tasks include Text Categorization (TC), Information Retrieval (IR), Clustering, and Text Filtering. By text classification we mean both the automated assignment of textual data to groups or classes (often referred to as categorization) and the use of automated techniques for discovering such classes (often referred to as clustering). Applications of text classification include indexing documents or Web pages by controlled vocabulary, construction of vertical portals and specialized information feeds, information security, help desk automation, content filtering, selective alerting, text mining, automated authorship attribution, and many others. The research side of text classification has been widely published in conferences and journals in IR, NLP, machine learning, and other fields. Within text categorization, I also cover Latent Semantic Analysis (LSA), a statistical/mathematical technique based on word usage that permits comparisons of semantic similarity between pieces of textual information.
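To illustrate the LSA idea mentioned above, the following is a minimal sketch rather than the method of any particular work reviewed here. It assumes a toy four-document corpus, raw term counts with no weighting, a reduced rank of k = 2, and that NumPy is available; the function names are hypothetical and chosen only for this example.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Return a (terms x documents) matrix of raw term counts and the vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.lower().split():
            matrix[index[word], j] += 1.0
    return matrix, vocab


def lsa_document_vectors(matrix, k):
    """Project documents into a k-dimensional space via truncated SVD."""
    # matrix = U @ diag(s) @ Vt, with singular values s in descending order.
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular values; each row of the result is one document.
    return (np.diag(s[:k]) @ vt[:k, :]).T


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy corpus (an assumption for illustration): two "pet" and two "finance" documents.
docs = [
    "cat dog pet",
    "dog pet animal",
    "stock market trade",
    "market trade price",
]

term_doc, vocab = build_term_document_matrix(docs)
doc_vectors = lsa_document_vectors(term_doc, k=2)

same_topic = cosine(doc_vectors[0], doc_vectors[1])
cross_topic = cosine(doc_vectors[0], doc_vectors[2])
raw_same_topic = cosine(term_doc[:, 0], term_doc[:, 1])

print("LSA similarity, same topic:", round(same_topic, 3))
print("LSA similarity, different topics:", round(cross_topic, 3))
print("Raw count similarity, same topic:", round(raw_same_topic, 3))
```

On this toy corpus, the two pet documents are closer in the reduced space than under raw term counts, while documents from different topics remain dissimilar. Real corpora would normally call for term weighting (for example tf-idf) and a much larger k, both of which are left out of this sketch.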

Text categorization presents unique challenges due to the large number of attributes present in the data set, the large number of training samples, attribute dependency, and the multi-modality of categorization. Many techniques and algorithms for this subject have been devised and proposed in the literature. This project focuses on the state of the art with respect to algorithms and their results. The goal is to show which method performs best under which conditions, so that each method can be applied accordingly.