A NEW INFORMATION THEORY BASED ALGORITHM FOR CLUSTERING CATEGORICAL DATA
Keywords: Data mining, Clustering, Categorical data, Information system, Normalized Variation of Information
Clustering is an important technique in data mining and machine learning. Given a set of objects, the main goal of clustering is to group them into clusters such that objects within a cluster are highly similar to one another, while objects in different clusters are highly dissimilar. In recent years, the problem of clustering categorical data has attracted much attention from the data mining research community, and several rough-set-based algorithms have been proposed. These algorithms make important contributions: some can handle uncertainty during the clustering process, while others allow users to obtain stable results. However, they often suffer from low accuracy and high computational complexity. In this paper, we review two baseline algorithms for categorical data, namely Min-Min Roughness (MMR) and Mean Gain Ratio (MGR), and propose a new algorithm called Minimum Mean Normalized Variation of Information (MMNVI). The MMNVI algorithm uses the Mean Normalized Variation of Information of one attribute with respect to another to find the best clustering attribute, and the entropy of the equivalence classes generated by the selected clustering attribute to perform a binary split of the dataset. Experimental results on real datasets from UCI indicate that the MMNVI algorithm can be used successfully for clustering categorical data, producing clustering results better than or equivalent to those of the baseline algorithms.
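To make the attribute-selection idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common definition NVI(X, Y) = (H(X|Y) + H(Y|X)) / H(X, Y), which may differ in detail from the one used in the paper. The function names (`entropy`, `nvi`, `mean_nvi`, `best_clustering_attribute`) and the toy table are hypothetical, and the entropy-based binary splitting step is not shown.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def nvi(x, y):
    """Normalized Variation of Information between two attribute columns.

    Assumed definition: VI(X, Y) / H(X, Y), where
    VI(X, Y) = H(X|Y) + H(Y|X) = 2*H(X, Y) - H(X) - H(Y).
    """
    h_xy = entropy(list(zip(x, y)))
    if h_xy == 0:
        return 0.0  # both attributes are constant, so they carry the same (empty) partition
    return (2 * h_xy - entropy(x) - entropy(y)) / h_xy


def mean_nvi(table, attr):
    """Average NVI of `attr` against every other attribute in the table."""
    others = [a for a in table if a != attr]
    return sum(nvi(table[attr], table[a]) for a in others) / len(others)


def best_clustering_attribute(table):
    """Pick the attribute with the minimum mean NVI (first one wins on ties)."""
    return min(table, key=lambda a: mean_nvi(table, a))


# Hypothetical toy dataset: each key is an attribute, each list a column.
table = {
    "color": ["red", "red", "blue", "blue"],
    "shape": ["square", "square", "circle", "circle"],
    "size": ["small", "large", "small", "large"],
}
best = best_clustering_attribute(table)
```

In this toy table, `color` and `shape` induce the same partition of the objects, so their NVI is 0, while `size` is independent of both and has NVI 1 against each. An attribute with a low mean NVI shares most of its information with the other attributes, which is the intuition behind choosing it as the clustering attribute.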
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is done so on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work is our own, has not been published before in its current or a substantially similar form, is not under consideration for another publication, does not contain any unlawful statements, and does not infringe any existing copyright.
4. We also warrant that we have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials, including tables, diagrams or photographs, not owned by me/us.