A new information theory based algorithm for clustering categorical data

Do Si Truong, Lam Thanh Hien, Nguyen Thanh Tung
  • Do Si Truong, Lac Hong University, 10 Huynh Van Nghe Street, Buu Long Ward, Bien Hoa City, Dong Nai Province, Viet Nam
  • Lam Thanh Hien, Lac Hong University, 10 Huynh Van Nghe Street, Buu Long Ward, Bien Hoa City, Dong Nai Province, Viet Nam
  • Nguyen Thanh Tung, Lac Hong University, 10 Huynh Van Nghe Street, Buu Long Ward, Bien Hoa City, Dong Nai Province, Viet Nam




Keywords: data mining, clustering, categorical data, information system, normalized variation of information.


Clustering is an important technique in data mining and machine learning. Given a set of objects, the main goal of clustering is to group the objects into clusters such that objects within a cluster are highly similar to one another, while objects in different clusters are highly dissimilar. In recent years, the problem of clustering categorical data has attracted much attention from the data mining research community, and several rough-set-based algorithms have been proposed for it. These algorithms make important contributions: some of them can handle uncertainty during the clustering process, while others allow users to obtain stable results. However, they often suffer from low accuracy and high computational complexity. In this paper, we review two baseline algorithms for categorical data, namely Min-Min Roughness (MMR) and Mean Gain Ratio (MGR), and propose a new algorithm called Minimum Mean Normalized Variation of Information (MMNVI). The MMNVI algorithm uses the mean normalized variation of information of one attribute with respect to the others to find the best clustering attribute, and the entropy of the equivalence classes generated by the selected clustering attribute to perform a binary split of the dataset. Experimental results on real datasets from the UCI repository indicate that the MMNVI algorithm can be used successfully for clustering categorical data: its clustering results are better than or equivalent to those of the baseline algorithms.
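
To make the attribute-selection step concrete, the following minimal Python sketch picks the attribute with the smallest mean normalized variation of information. It assumes one common normalization, NVI(X, Y) = (2H(X, Y) - H(X) - H(Y)) / H(X, Y), which may differ from the exact definition used in the paper. The function names and the column-dictionary data layout are hypothetical and chosen for illustration only.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of the partition induced by a list of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def nvi(x, y):
    """Normalized variation of information between two attribute columns.

    Assumed form: (2*H(X,Y) - H(X) - H(Y)) / H(X,Y), which lies in [0, 1].
    """
    h_xy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    if h_xy == 0:
        return 0.0  # both attributes are constant, so the partitions coincide
    return (2 * h_xy - entropy(x) - entropy(y)) / h_xy


def mean_nvi(data, attr):
    """Mean NVI between attribute `attr` and every other attribute."""
    others = [a for a in data if a != attr]
    return sum(nvi(data[attr], data[o]) for o in others) / len(others)


def best_clustering_attribute(data):
    """Attribute with the minimum mean NVI (ties go to the first one found)."""
    return min(data, key=lambda a: mean_nvi(data, a))


# Toy dataset: each key is an attribute, each list holds one value per object.
toy = {
    "color": ["red", "red", "blue", "blue"],
    "shape": ["square", "square", "circle", "circle"],
    "size": ["large", "small", "large", "small"],
}
chosen = best_clustering_attribute(toy)
```

In this toy dataset, `color` and `shape` induce the same partition of the objects, so their NVI is zero, whereas `size` is independent of both; the selected attribute is therefore one of the first two.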


