Cost-Sensitive Feature Extraction and Selection in Genre Classification

Title: Cost-Sensitive Feature Extraction and Selection in Genre Classification
Publication Type: Journal Article
Year of Publication: 2009
Authors: Levering, Ryan, and Michal Cutler
Journal: Journal for Language Technology and Computational Linguistics
Volume: 24
Pagination: 57–72
Keywords: automation, classification, digital, genre, information science, web
Abstract

Automatic genre classification of Web pages is currently young compared to other Web classification tasks. Corpora are just starting to be collected
and organized in a systematic way, feature extraction techniques are inconsistent
and not well detailed, genres are constantly in dispute, and novel
applications have not been implemented. This paper attempts to review
and make progress in the area of feature extraction, an area that we believe
can benefit all Web page classification, and genre classification in particular.
We first present a framework for the extraction of various Web-specific
feature groups from distinct data models based on a tree of potential
models and the transformations that create them. Then we introduce the
concept of cost-sensitivity to this tree and provide an algorithm for performing
wrapper-based feature selection on this tree. Finally, we apply the
cost-sensitive feature selection algorithm on two genre corpora and analyze
the performance of the classification results.