Improving a Page Classifier with Anchor Extraction and Link Analysis

Cohen, William W.

Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)

Authors

William W. Cohen

Abstract

Most text categorization systems use simple models of documents and document collections. In this paper we describe a technique that improves a simple web page classifier’s performance on pages from a new, unseen web site, by exploiting link structure within a site as well as page structure within hub pages. On real-world test cases, this technique significantly and substantially improves the accuracy of a bag-of-words classifier, reducing error rate by about half, on average. The system uses a variant of co-training to exploit unlabeled data from a new site. Pages are labeled using the base classifier; the results are used by a restricted wrapper-learner to propose potential “main-category anchor wrappers”; and finally, these wrappers are used as features by a third learner to find a categorization of the site that implies a simple hub structure, but which also largely agrees with the original bag-of-words classifier.
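
The three stages named in the abstract can be illustrated with a toy sketch. It is not the paper's implementation: the keyword-based `base_classify`, the representation of a wrapper as a (hub URL, anchor path) pair, the F1-based `agreement` score, and all page data are assumptions made for illustration, and the third learner is replaced here by simply selecting the single wrapper that best agrees with the base labels.

```python
from collections import defaultdict


def base_classify(text, keywords):
    """Stand-in bag-of-words classifier: positive if any keyword occurs."""
    return bool(set(text.lower().split()) & keywords)


def propose_wrappers(hub_links):
    """Group link targets by (hub, anchor path); each group is a candidate wrapper."""
    wrappers = defaultdict(set)
    for hub, path, target in hub_links:
        wrappers[(hub, path)].add(target)
    return dict(wrappers)


def agreement(extracted, positives):
    """F1 between a wrapper's extracted pages and the base classifier's positives."""
    hits = len(extracted & positives)
    if hits == 0:
        return 0.0
    precision = hits / len(extracted)
    recall = hits / len(positives)
    return 2 * precision * recall / (precision + recall)


def relabel_site(pages, hub_links, keywords):
    """Relabel a site using the wrapper that best agrees with the base labels.

    Assumption: one best wrapper stands in for the paper's third learner.
    """
    positives = {url for url, text in pages.items() if base_classify(text, keywords)}
    wrappers = propose_wrappers(hub_links)
    if not wrappers or not positives:
        return positives
    best = max(wrappers, key=lambda w: agreement(wrappers[w], positives))
    if agreement(wrappers[best], positives) == 0.0:
        return positives  # no wrapper agrees at all; keep the base labels
    return wrappers[best] & set(pages)


# Hypothetical site: the target category is "faculty home page".
pages = {
    "/people/ann": "ann is a professor of biology",
    "/people/bob": "bob enjoys hiking",      # base classifier misses this page
    "/people/cy": "cy professor of physics",
    "/news/1": "new professor hired",        # base classifier false positive
    "/about": "about our department",
}
hub_links = [
    ("/people", "ul/li/a", "/people/ann"),
    ("/people", "ul/li/a", "/people/bob"),
    ("/people", "ul/li/a", "/people/cy"),
    ("/people", "div/a", "/about"),
    ("/index", "div/a", "/news/1"),
    ("/index", "div/a", "/about"),
]

final_labels = relabel_site(pages, hub_links, {"professor"})
print(sorted(final_labels))
```

In this toy site the base classifier labels `/people/ann`, `/people/cy`, and `/news/1` as positive. The list-item anchors on the `/people` hub agree best with those labels, so the relabeling recovers the missed page `/people/bob` and drops the stray `/news/1`, which is the kind of correction the abstract attributes to exploiting hub structure.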
