CISUC

Incremental and Hierarchical Document Clustering

Authors

Abstract

Over the past few decades, the volume of text data has grown exponentially, and automatic tools for organizing these huge document collections are becoming increasingly important. Document clustering organizes documents into clusters automatically. Most clustering algorithms process a document collection as a whole; however, it is also important to be able to process documents dynamically, as they arrive. This research aims to develop an incremental hierarchical document clustering algorithm in which each document is processed as soon as it becomes available. The algorithm is based on two well-known data clustering algorithms, COBWEB and CLASSIT, which create hierarchies of probabilistic concepts and have seldom been applied to text data. The main contribution of this research is a new framework for incremental document clustering, based on extended versions of these algorithms combined with a set of traditional techniques modified to work in incremental environments.
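
As a rough illustration of what processing each document as soon as it is available can mean for a probabilistic concept, the sketch below keeps a running mean and variance per term, in the spirit of the numeric attribute summaries that CLASSIT stores at each node. It is not the framework described in the abstract: the class name `ConceptNode`, the term-weight dictionaries, and the use of Welford's update are assumptions made for illustration, and the operators that place a document in the hierarchy (category utility, merge, split) are omitted.

```python
import math


class ConceptNode:
    """Hypothetical concept summary: running mean and variance per term."""

    def __init__(self):
        self.count = 0   # number of documents absorbed so far
        self.mean = {}   # term -> running mean of its weight
        self.m2 = {}     # term -> running sum of squared deviations

    def add(self, doc):
        """Absorb one document (term -> weight) without revisiting earlier ones."""
        self.count += 1
        # A term absent from a document counts as weight 0.0, so a term seen
        # for the first time starts from mean 0.0 and deviation sum 0.0.
        for term in set(self.mean) | set(doc):
            x = doc.get(term, 0.0)
            old_mean = self.mean.get(term, 0.0)
            delta = x - old_mean
            new_mean = old_mean + delta / self.count
            self.mean[term] = new_mean
            # Welford's update: numerically stable and single pass.
            self.m2[term] = self.m2.get(term, 0.0) + delta * (x - new_mean)

    def std(self, term):
        """Population standard deviation of a term's weight in this concept."""
        if self.count == 0:
            return 0.0
        return math.sqrt(self.m2.get(term, 0.0) / self.count)
```

For example, adding `{"cluster": 1.0}` and then `{"cluster": 3.0, "text": 2.0}` updates the summary in place after each call, and no earlier document needs to be stored or reread.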

Keywords

conceptual clustering, dimensionality reduction, document clustering, hierarchical clustering, incremental clustering, vector space model

Subject

Document Clustering

Conference

16th Portuguese Conference on Artificial Intelligence (EPIA 2013), September 2013
