A NEW INFORMATION THEORY BASED ALGORITHM FOR CLUSTERING CATEGORICAL DATA
DOI: https://doi.org/10.15625/1813-9663/18568
Keywords: Data mining, Clustering, Categorical data, Information system, Normalized Variation of Information
Abstract
Clustering is an important technique in data mining and machine learning. Given a set of objects, the goal of clustering is to group objects into clusters such that objects within a cluster are highly similar to one another, while objects in different clusters are highly dissimilar. In recent years, the problem of clustering categorical data has attracted much attention from the data mining research community, and several rough-set-based algorithms for clustering categorical data have been proposed. These algorithms make important contributions to the problem: some of them can handle uncertainty during the clustering process, while others allow users to obtain stable results. However, they also have limitations, such as low accuracy and high computational complexity. In this paper, we review two baseline algorithms for clustering categorical data, namely Min-Min Roughness (MMR) and Mean Gain Ratio (MGR), and propose a new algorithm called Minimum Mean Normalized Variation of Information (MMNVI). MMNVI uses the mean Normalized Variation of Information of one attribute with respect to the others to find the best clustering attribute, and the entropy of the equivalence classes generated by the selected clustering attribute to binary-split the dataset. Experimental results on real datasets from the UCI repository indicate that MMNVI can be used successfully for clustering categorical data: it produces clustering results that are better than or equivalent to those of the baseline algorithms.
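To make the attribute-selection step concrete, the following is a minimal Python sketch, assuming NVI(a, b) = (H(a|b) + H(b|a)) / H(a, b), a common normalization of the Variation of Information; the paper's exact definition and the subsequent entropy-based binary-splitting step are not reproduced here, and all identifiers and the toy data are hypothetical.

from collections import Counter
from math import log2

def entropy(column):
    # Shannon entropy (in bits) of a list of categorical values.
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())

def joint_entropy(col_x, col_y):
    # Joint entropy H(X, Y) of two columns aligned row by row.
    return entropy(list(zip(col_x, col_y)))

def nvi(col_x, col_y):
    # Assumed normalization: NVI = (H(X|Y) + H(Y|X)) / H(X, Y)
    #                            = (2*H(X, Y) - H(X) - H(Y)) / H(X, Y).
    h_xy = joint_entropy(col_x, col_y)
    if h_xy == 0:  # both attributes are constant
        return 0.0
    return (2 * h_xy - entropy(col_x) - entropy(col_y)) / h_xy

def select_clustering_attribute(table):
    # Pick the attribute whose mean NVI with respect to all other attributes
    # is minimal, mirroring the "Minimum Mean NVI" selection described above.
    attrs = list(table)
    def mean_nvi(a):
        others = [b for b in attrs if b != a]
        return sum(nvi(table[a], table[b]) for b in others) / len(others)
    return min(attrs, key=mean_nvi)

# Toy usage with a hypothetical categorical table (attribute -> column of values).
data = {
    "color": ["red", "red", "blue", "blue", "green", "green"],
    "shape": ["round", "round", "square", "square", "round", "square"],
    "size":  ["S", "S", "L", "L", "S", "L"],
}
print(select_clustering_attribute(data))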
References
M.M. Baroud, S.Z.M. Hashim, J.U. Ahsan, A. Zainal, “Positive region: An enhancement of partitioning attribute based rough set for categorical data,” Periodicals of Engineering and Natural Sciences, vol. 8, no. 4, pp. 2424–2439, December 2020. Doi: http://dx.doi.org/10.21533/pen.v8i4.1745
V. Ganti, J. Gehrke, R. Ramakrishnan, “CACTUS–clustering categorical data using summaries,”
in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 73–83.
D. Gibson, J. Kleinberg, P. Raghavan, “Clustering categorical data: An approach based on
dynamical systems,” Very Large Data Bases J., vol. 8, no. 3–4, 2000, pp. 222–236.
S. Guha, R. Rastogi, K. Shim, “ROCK: A robust clustering algorithm for categorical attributes,”
in Proceedings of the 15th ICDE, 1999, pp. 512–521.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann Publishers, 2012.
M. Halkidi, Y. Batistakis, M. Vazirgiannis, “On clustering validation techniques,” Journal of
Intelligent Information Systems, vol. 17, pp. 107-145, 2001.
W.A. Hassanein, “Clustering algorithms for categorical data using concepts of significance and
dependence of attributes,” European Scientific Journal, vol. 10, no. 3, pp. 381–400, 2014.
W. Hassanein and A. Elmelegy, “An algorithm for selecting clustering attribute using significance
of attributes,” International Journal of Database Theory & Application, vol. 6, no. 5, pp. –66, 2013.
Z. Huang, “Extensions to the k-means algorithm for clustering large data sets with categorical
values,” Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283–304, 1998.
Z. Huang, M. K. Ng, “A fuzzy k-modes algorithm for clustering categorical data,” IEEE Trans.
Fuzzy Syst., vol. 7, no. 4, pp. 446–452, 1999.
T. Herawan, “Rough set approach for categorical data clustering,” A thesis submitted in fulfillment of the requirements for the award of the degree of Doctor of Philosophy, 2010.
T. Herawan, “Rough clustering for cancer data sets,” International Journal of Modern
Physics: Conference Series, vol. 09, pp. 240-258, 2012.
T. Herawan, I.T.R. Yanto, and M.M. Deris, “Rough set approach for categorical data clustering,” in D. Slezak et al. (Eds.): DTA 2009, CCIS 64, Springer-Verlag Berlin Heidelberg, 2009, pp. 179–. https://doi.org/10.1007/978-3-642-10583-8_21
T. Herawan, M. M. Deris, J. H. Abawajy, “A rough set approach for selecting clustering attribute,” Knowledge-Based Systems, vol. 23, pp. 220–231, 2010.
T. Herawan, W.M.W. Mohd, A. Noraziah, “Applying variable precision rough set for clustering diabetics data set,” International Journal of Multimedia and Ubiquitous Engineering, vol. , no. 1, pp. 219–230, 2014.
D. Ienco, R.G. Pensa, R. Meo, “From context to distance: Learning dissimilarity for categorical
data clustering,” ACM Transactions on Knowledge Discovery from Data, vol. 6, no. 1. https://doi.org/10.1145/2133360.2133361
A.K. Jain, M.N. Murty, P.J. Flynn, “Data clustering: A review,” ACM Computing Surveys,
vol. 31, no. 3, pp. 264–323, 1999.
Dr. Jyot, “Clustering categorical data using rough sets: A review,” International Journal of Advanced Research in IT and Engineering, vol. 2, no. 12, pp. 30–37, 2013.
D. Kim, K. Lee, D. Lee, “Fuzzy clustering of categorical data using fuzzy centroids,” Pattern
Recognition Letters, vol. 25, no. 1, pp. 1263–1271, 2004.
G. Khandelwal and R. Sharma, “A simple yet fast clustering approach for categorical data,”
International Journal of Computer Applications, vol. 120, no. 17, pp. 25–30, 2015.
P. Kumar and B. Tripathy, “MMeR: An algorithm for clustering heterogeneous data using rough set theory,” International Journal of Rapid Manufacturing, vol. 1, no. 2, pp. 189–207.
J. McCaffrey, Data Clustering Using Entropy Minimization. 2018.
L.J. Mazlack, A. He, Y. Zhu, and S. Coppock, “A rough set approach in choosing clustering
attributes,” Proceedings of the ISCA 13th International Conference (CAINE 2000), 2000,
pp. 1–6.
S. Mesakar, M.S. Chaudhari, “Review paper on data clustering of categorical data,” International Journal of Engineering Research & Technology, vol. 1, no. 10, December 2012.
I.-K. Park and G.-S. Choi, “Rough set approach for clustering categorical data using information-theoretic dependency measure,” Information Systems, vol. 4, pp. 289–295, 2015.
D. Parmar, T. Wu, and J. Blackhurst, “MMR: An algorithm for clustering categorical data using
rough set theory,” Data and Knowledge Engineering, vol. 63, pp. 879–893, 2007.
Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic
Publishers, Dordrecht, 1991.
H. Qin, X. Ma, T. Herawan, and J.M. Zain, “MGR: An information theory based hierarchical divisive clustering algorithm for categorical data,” Knowledge-Based Systems, vol. 67, pp. 401–, 2014.
F.M. Reza, An Introduction to Information Theory, Dover Publications, New York, 1994.
A. Skowron and S. Dutta, “Rough sets: Past, present, and future,” Natural Computing, vol. , no. 4, pp. 855–876, 2018.
G.K. Singh and S. Mandal, “Cluster analysis using rough set theory,” Journal of Informatics
and Mathematical Sciences, vol. 9, no. 3, pp. 509–520, 2017.
B. Tripathy and A. Ghosh, “SDR: An algorithm for clustering categorical data using rough set theory,” in Recent Advances in Intelligent Computational Systems, IEEE, 2011, pp. –872.
B.K. Tripathy, A. Goyal, R. Chowdhury, and P.A. Sourav, “MMeMeR: An algorithm for clustering heterogeneous data using rough set theory,” I.J. Intelligent Systems and Applications,
vol. 8, pp. 25-33, 2017.
A. Frank, “UCI Machine Learning Repository,” http://archive.ics.uci.edu/ml/
P.C. Xuyen, D.S. Truong, N.T. Tung, “An information-theoretic metric based method for selecting clustering attribute,” in Proceedings of 9th National Conference on Fundamental and
Applied Information Technology, 2016, pp. 31-40.
J. Uddin, R. Ghazali, and M.M. Deris, “An empirical analysis of rough set categorical clustering
techniques,” PLOS ONE, vol. 12, no. 1, 2017.
J. Uddin, R. Ghazali, J.H. Abawajy, H. Shah, N.A. Husaini, and A. Zeb, “Rough set based
information theoretic approach for clustering uncertain categorical data,” PLOS ONE, May 13, 2022. https://doi.org/10.1371/journal.pone.0265190
W. Wei, J. Liang, X. Guo, P. Song, and Y. Sun, “Hierarchical division clustering framework for
categorical data,” Neurocomputing, vol. 341, pp. 118–134, 2019.
Y.Y. Yao, “Information-Theoretic measures for knowledge discovery and data mining,” in
Karmeshu (eds) Entropy Measures, Maximum Entropy Principle and Emerging Applications. Studies in Fuzziness and Soft Computing, vol 119. Springer, Berlin, Heidelberg,
https://doi.org/10.1007/978-3-540-36212-8_6
Y. Zhao, R and Data Mining: Examples and Case Studies. Published by Elsevier, December
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is done so on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work is the result of our own research, has not been published before in its current or a substantially similar form, is not under consideration for another publication, does not contain any unlawful statements and does not infringe any existing copyright.
4. We also warrant that We have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials including tables, diagrams or photographs not owned by me/us.