CISUC

Incremental and Hierarchical Document Clustering

Authors

Abstract

In the last few decades, the importance, accessibility and number of digital documents have been increasing exponentially. It is therefore crucial to find tools that automatically organize these huge collections of documents.
Document clustering is an active research domain concerned with the automatic organization of documents into clusters. There are many different document clustering algorithms, with different motivations and approaches. Most algorithms receive a set of documents and process them as a whole. However, in today's online environment it is important that a system can receive and process documents continuously.
This thesis aims to develop a new hierarchical document clustering algorithm with a fully incremental and unsupervised approach. It is hierarchical because it produces a tree of clusters that facilitates browsing. By incremental we mean that not all documents need to be present at the beginning: each document is processed as soon as it becomes available, and the clusters are continuously updated.
The algorithm will be based on two well-known conceptual clustering algorithms (COBWEB/CLASSIT) that have seldom been applied to text. Also, as many well-established techniques in document clustering are not suitable for incremental systems, one of the main challenges of this research is to adapt these mechanisms (document representation, feature selection, evaluation) to incremental document clustering.

Keywords

clustering, conceptual clustering, document clustering, hierarchical clustering, incremental clustering, text clustering, TF-IDF, vector space model

Subject

PhD Thesis Proposal

TechReport Number

n/a
