Classifying XML Documents by Using Genre Features

Title: Classifying XML Documents by Using Genre Features
Publication Type: Conference Paper
Year of Publication: 2007
Authors: Clark, Malcolm, and Stuart Watt
Conference Name: Proceedings of the 18th International Conference on Database and Expert Systems Applications
Publisher: IEEE Computer Society
Place Published: Washington, DC, USA
ISBN Number: 0-7695-2932-1
Abstract

The categorization of documents is traditionally
topic-based. This paper presents a complementary
analysis of research and experiments on genre to show
that encouraging results can be obtained by using
genre structure (form) features. We conducted an
experiment to assess the effectiveness of using
extensible mark-up language (XML) tag information,
and part-of-speech (P-O-S) features, for the
classification of genres, testing the hypothesis that if a
focus on genre can lead to high precision on normal
textual documents, then good results can be achieved
using XML tag information in addition to P-O-S
information. An experiment was carried out on a
subsection of the Initiative for the Evaluation of XML
(INEX) 1.4 collection. The features were extracted and
documents were classified using machine learning
algorithms, which yielded encouraging results for
logistic regression and neural networks. We propose
that utilizing these features and training a classifier
may benefit retrieval for most world wide web (WWW)
technologies such as XML and extensible hypertext
markup language (XHTML).
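
As a rough illustration of the kind of XML tag information the abstract mentions, the sketch below counts how often each element tag occurs in a document and normalizes the counts into relative frequencies. It is only an assumption about what such form features could look like; the record does not specify the paper's actual feature set, and the function name `tag_frequency_features` and the sample tag names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_frequency_features(xml_text):
    """Return the relative frequency of each element tag in an XML document."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    counts = Counter(elem.tag for elem in root.iter())
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


# Hypothetical sample document; the tag names are illustrative only.
SAMPLE = "<article><title>T</title><sec><p>a</p><p>b</p></sec></article>"
features = tag_frequency_features(SAMPLE)
```

A feature dictionary of this kind could then be combined with part-of-speech counts and passed to any standard classifier, such as logistic regression.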

URL: http://dx.doi.org/10.1109/DEXA.2007.48
DOI: 10.1109/DEXA.2007.48