A Semi-Supervised Document Clustering Technique for Information Organization

Han Joon Kim, Sang Goo Lee

Research output: Contribution to conference (Paper, peer-reviewed)

27 Scopus citations

Abstract

This paper discusses a new type of semi-supervised document clustering that uses partial supervision to partition a large set of documents. Most clustering methods organize documents into groups based only on similarity measures. Unfortunately, traditional approaches to document clustering are often unable to correctly discern structural details hidden within the document corpus, because their algorithms depend heavily on the documents themselves and their similarity to one another. In this paper, we attempt to isolate more semantically coherent clusters by employing the domain-specific knowledge provided by a document analyst. By using external human knowledge to guide the clustering mechanism with some flexibility when creating the clusters, clustering efficiency can be considerably enhanced. As a basic clustering strategy, we use a variant of complete-linkage agglomerative hierarchical clustering, and develop the concepts (or seeds) of requested clusters by exploiting user-relevance feedback. Although the proposed method is slow when applied to large document collections, it yields higher-quality clusters. Through experiments using the Reuters-21578 corpus, we show that the proposed method outperforms unsupervised clustering methods.
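
The abstract names complete-linkage agglomerative hierarchical clustering guided by user-supplied seeds but does not spell out the variant used. The following is only a minimal sketch of the general idea, not the authors' algorithm: it assumes documents are already term-weight vectors, that the analyst's feedback arrives as disjoint groups of document indices (a hypothetical input format), and that cosine distance is the dissimilarity measure. The names `cosine_distance`, `complete_linkage`, and `seeded_complete_linkage` are illustrative.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; treat an all-zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def complete_linkage(c1, c2, vectors):
    """Complete linkage: cluster distance is the distance of the farthest pair."""
    return max(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)


def seeded_complete_linkage(vectors, seeds, k):
    """Merge clusters until k remain, starting from analyst-provided seed groups.

    Assumption: `seeds` is a list of disjoint lists of document indices that
    the analyst judged to belong together. Unseeded documents start as
    singleton clusters.
    """
    seeded = {i for group in seeds for i in group}
    clusters = [list(group) for group in seeds]
    clusters += [[i] for i in range(len(vectors)) if i not in seeded]

    while len(clusters) > k:
        best = None
        # Naive scan over all cluster pairs for the closest one.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = complete_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

In this sketch the seeds act as pre-formed clusters that later merges build on, which is one simple way to let human judgment steer an otherwise unsupervised procedure. The naive implementation rescans every cluster pair before each merge, so its running time grows quickly with the number of documents.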

Original language: English
Pages: 30-37
Number of pages: 8
State: Published - 2000
Event: 9th International Conference on Information and Knowledge Management (CIKM 2000) - McLean, VA, United States
Duration: 6 Nov 2000 to 11 Nov 2000

Conference

Conference: 9th International Conference on Information and Knowledge Management (CIKM 2000)
Country/Territory: United States
City: McLean, VA
Period: 6/11/00 to 11/11/00

Keywords

  • Agglomerative Hierarchical Clustering
  • Document Clustering
  • Fuzzy Information Retrieval
  • Information Organization
  • Relevance Feedback
