Abstract
This paper discusses a new type of semi-supervised document clustering that uses partial supervision to partition a large set of documents. Most clustering methods organize documents into groups based only on similarity measures. Unfortunately, traditional approaches to document clustering often fail to discern structural details hidden within the document corpus, because their algorithms depend heavily on the documents themselves and their similarity to each other. In this paper, we attempt to isolate more semantically coherent clusters by employing the domain-specific knowledge provided by a document analyst. By using external human knowledge to guide the clustering mechanism, with some flexibility when creating the clusters, clustering efficiency can be considerably enhanced. As a basic clustering strategy, we use a variant of complete-linkage agglomerative hierarchical clustering, and develop the concepts (or seeds) of requested clusters by exploiting user-relevance feedback. Although the proposed method is slow when applied to large document collections, it yields higher-quality clusters. Through experiments using the Reuters-21578 corpus, we show that the proposed method outperforms unsupervised clustering.
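The general idea of seeding complete-linkage agglomerative clustering with analyst feedback can be illustrated with a minimal sketch. It is not the paper's variant: the function names, the seed format (lists of document indices the analyst judged to belong together), the whitespace tokenization, and the use of cosine similarity on raw term counts are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def complete_link(c1, c2, vecs):
    """Complete linkage: similarity of the least similar cross-cluster pair."""
    return min(cosine(vecs[i], vecs[j]) for i in c1 for j in c2)


def seeded_complete_linkage(docs, seeds, k):
    """Cluster docs into k groups, starting from analyst-provided seeds.

    docs  : list of strings
    seeds : disjoint lists of document indices (hypothetical feedback format)
    k     : number of clusters to stop at
    """
    vecs = [Counter(d.lower().split()) for d in docs]
    seeded = {i for group in seeds for i in group}
    # Each seed group starts as one cluster; all other documents are singletons.
    clusters = [frozenset(g) for g in seeds]
    clusters += [frozenset([i]) for i in range(len(docs)) if i not in seeded]
    while len(clusters) > max(k, 1):
        # Merge the pair of clusters with the highest complete-linkage similarity.
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: complete_link(pair[0], pair[1], vecs))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


docs = [
    "oil prices rise crude oil",
    "crude oil exports fall",
    "wheat harvest grain",
    "grain wheat exports",
    "oil grain",
]
result = seeded_complete_linkage(docs, seeds=[[0, 1], [2, 3]], k=2)
```

Recomputing the linkage for every pair at every step makes this sketch slow on large collections, which is in line with the cost the abstract attributes to the approach.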
Original language | English |
---|---|
Pages | 30-37 |
Number of pages | 8 |
State | Published - 2000 |
Event | 9th International Conference on Information and Knowledge Management (CIKM 2000), McLean, VA, United States. Duration: 6 Nov 2000 → 11 Nov 2000 |
Conference
Conference | 9th International Conference on Information and Knowledge Management (CIKM 2000) |
---|---|
Country/Territory | United States |
City | McLean, VA |
Period | 6 Nov 2000 → 11 Nov 2000 |
Keywords
- Agglomerative Hierarchical Clustering
- Document Clustering
- Fuzzy Information Retrieval
- Information Organization
- Relevance Feedback