Abstract | Automatic genre classification of Web pages is still young compared to other Web classification tasks. Corpora are just starting to be collected
and organized in a systematic way, feature extraction techniques are inconsistent
and not well detailed, genres are constantly in dispute, and novel
applications have not been implemented. This paper reviews and attempts
to advance feature extraction, an area that we believe
can benefit all Web page classification, and genre classification in particular.
We first present a framework for the extraction of various Web-specific
feature groups from distinct data models, based on a tree of potential
models and the transformations that create them. Then we introduce the
concept of cost-sensitivity to this tree and provide an algorithm for performing
wrapper-based feature selection on this tree. Finally, we apply the
cost-sensitive feature selection algorithm to two genre corpora and analyze
the resulting classification performance.
|