Towards Automatic Web Genre Identification

Publication Type: Conference Paper
Year of Publication: 2002
Authors: Rehm, Georg
Conference Name: 35th Annual Hawaii International Conference on System Sciences
Date Published: 2002
Keywords: automatic detection, classification, corpus, genre, personal homepage, web

We argue for a systematic analysis of one particular, well-structured
domain (academic Web pages) with regard to a special class of digital
genres: Web genres. For this purpose, we have developed a database-driven
system that will ultimately consist of more than 3 000 000 HTML documents,
written in German, which are the empirical basis for our research.
We introduce the notion of a Web genre type, which constitutes the basic
framework for a certain Web genre, and the notions of compulsory and
optional Web genre modules. These act as building blocks which go together
to make up the structure characterised by the Web genre type and,
furthermore, operate as modifiers for the default assignment involved.
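
As a rough illustration of how a Web genre type with compulsory and optional
modules might be represented, the following minimal Python sketch may help.
It is not the system described in this paper: the class WebGenreType, its
methods, and all module names (such as "contact_information") are
hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WebGenreType:
    """Hypothetical model of a Web genre type and its modules."""
    name: str
    compulsory: frozenset[str]
    optional: frozenset[str] = field(default_factory=frozenset)

    def missing(self, found: set[str]) -> set[str]:
        """Compulsory modules that were not found in a document."""
        return set(self.compulsory - found)

    def unexpected(self, found: set[str]) -> set[str]:
        """Found modules that are neither compulsory nor optional."""
        return set(found - self.compulsory - self.optional)

    def matches(self, found: set[str]) -> bool:
        """A document fits the type if no compulsory module is missing."""
        return not self.missing(found)


# Module names below are invented for illustration only.
homepage = WebGenreType(
    name="Academic's Personal Homepage",
    compulsory=frozenset({"name", "contact_information"}),
    optional=frozenset({"publication_list", "research_interests", "photo"}),
)

document_modules = {"name", "contact_information", "publication_list"}
fits = homepage.matches(document_modules)
gaps = homepage.missing({"name", "photo"})
```

In this sketch a document is assigned to a genre type when every compulsory
module is present; optional modules are allowed but not required. How modules
are actually detected in HTML, and how they modify a default assignment, is
left open here.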
The analysis of a 200-document sample illustrates our notion of a Web genre
hierarchy, into which Web genre types and modules are embedded. The
analysis of four different documents of the Web genre Academic’s Personal
Homepage illustrates not only our approach but also our long-term goal
of automatically extracting the contents of Web genre modules in order
to build up structured XML documents of groups of unstructured HTML

