A NEW INFORMATION THEORY BASED ALGORITHM FOR CLUSTERING CATEGORICAL DATA
DOI: https://doi.org/10.15625/1813-9663/18568
Keywords: Data mining, Clustering, Categorical data, Information system, Normalized Variation of Information
Abstract
Clustering is an important technique in data mining and machine learning. Given a set of objects, the goal of clustering is to group objects into clusters such that objects within a cluster are highly similar to one another, while objects in different clusters are highly dissimilar. In recent years, the problem of clustering categorical data has attracted much attention from the data mining research community, and several rough-set-based algorithms for clustering categorical data have been proposed. These algorithms make important contributions to the problem: some of them can handle uncertainty during the clustering process, while others allow users to obtain stable results. However, they also have limitations, such as low accuracy and high computational complexity. In this paper, we review two baseline algorithms for clustering categorical data, namely Min-Min Roughness (MMR) and Mean Gain Ratio (MGR), and propose a new algorithm called Minimum Mean Normalized Variation of Information (MMNVI). MMNVI uses the mean Normalized Variation of Information of one attribute with respect to the others to find the best clustering attribute, and the entropy of the equivalence classes generated by the selected clustering attribute to binary-split the dataset. Experimental results on real datasets from the UCI repository indicate that MMNVI can be used successfully for clustering categorical data: it produces clustering results that are better than or equivalent to those of the baseline algorithms.
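To make the attribute-selection step concrete, the following is a minimal Python sketch, assuming NVI(a, b) = (H(a|b) + H(b|a)) / H(a, b), a common normalization of the Variation of Information; the paper's exact definition and the subsequent entropy-based binary-splitting step are not reproduced here, and all identifiers and the toy data are hypothetical.

from collections import Counter
from math import log2

def entropy(column):
    # Shannon entropy (in bits) of a list of categorical values.
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())

def joint_entropy(col_x, col_y):
    # Joint entropy H(X, Y) of two columns aligned row by row.
    return entropy(list(zip(col_x, col_y)))

def nvi(col_x, col_y):
    # Assumed normalization: NVI = (H(X|Y) + H(Y|X)) / H(X, Y)
    #                            = (2*H(X, Y) - H(X) - H(Y)) / H(X, Y).
    h_xy = joint_entropy(col_x, col_y)
    if h_xy == 0:  # both attributes are constant
        return 0.0
    return (2 * h_xy - entropy(col_x) - entropy(col_y)) / h_xy

def select_clustering_attribute(table):
    # Pick the attribute whose mean NVI with respect to all other attributes
    # is minimal, mirroring the "Minimum Mean NVI" selection described above.
    attrs = list(table)
    def mean_nvi(a):
        others = [b for b in attrs if b != a]
        return sum(nvi(table[a], table[b]) for b in others) / len(others)
    return min(attrs, key=mean_nvi)

# Toy usage with a hypothetical categorical table (attribute -> column of values).
data = {
    "color": ["red", "red", "blue", "blue", "green", "green"],
    "shape": ["round", "round", "square", "square", "round", "square"],
    "size":  ["S", "S", "L", "L", "S", "L"],
}
print(select_clustering_attribute(data))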
References
M.M. Baroud, S.Z.M. Hashim, J.U. Ahsan, A. Zainal, “Positive region: An enhancement of partitioning attribute based rough set for categorical data,” Periodicals of Engineering and Natural Sciences, vol. 8, no. 4, pp. 2424–2439, December 2020. Doi: http://dx.doi.org/10.21533/pen.v8i4.1745
V. Ganti, J. Gehrke, R. Ramakrishnan, “CACTUS–clustering categorical data using summaries,”
in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 73–83.
D. Gibson, J. Kleinberg, P. Raghavan, “Clustering categorical data: An approach based on
dynamical systems,” Very Large Data Bases J., vol. 8, no. 3–4, 2000, pp. 222–236.
S. Guha, R. Rastogi, K. Shim, “ROCK: A robust clustering algorithm for categorical attributes,”
in Proceedings of the 15th ICDE, 1999, pp. 512–521.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann Publishers, 2012.
M. Halkidi, Y. Batistakis, M. Vazirgiannis, “On clustering validation techniques,” Journal of
Intelligent Information Systems, vol. 17, pp. 107-145, 2001.
W.A. Hassanein, “Clustering algorithms for categorical data using concepts of significance and
dependence of attributes,” European Scientific Journal, vol. 10, no. 3, pp. 381–400, 2014.
W. Hassanein and A. Elmelegy, “An algorithm for selecting clustering attribute using significance
of attributes,” International Journal of Database Theory & Application, vol. 6, no. 5, pp. –66, 2013.
Z. Huang, “Extensions to the k-means algorithm for clustering large data sets with categorical
values,” Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283–304, 1998.
Z. Huang, M. K. Ng, “A fuzzy k-modes algorithm for clustering categorical data,” IEEE Trans.
Fuzzy Syst., vol. 7, no. 4, pp. 446–452, 1999.
T. Herawan, “Rough set approach for categorical data clustering,” A thesis submitted in fulfillment of the requirements for the award of the degree of Doctor of Philosophy, 2010.
T. Herawan, “Rough clustering for cancer data sets,” International Journal of Modern
Physics: Conference Series, vol. 09, pp. 240-258, 2012.
T. Herawan, I.T.R. Yanto, and M.M. Deris, “Rough set approach for categorical data clustering,” in D. Slezak et al. (Eds.): DTA 2009, CCIS 64, Springer-Verlag Berlin Heidelberg, 2009, pp. 179–. https://doi.org/10.1007/978-3-642-10583-8_21
T. Herawan, M. M. Deris, J. H. Abawajy, “A rough set approach for selecting clustering attribute,” Knowledge-Based Systems, vol. 23, pp. 220–231, 2010.
T. Herawan, W.M.W. Mohd, A. Noraziah, “Applying variable precision rough set for clustering diabetics data set,” International Journal of Multimedia and Ubiquitous Engineering, vol. , no. 1, pp. 219–230, 2014.
D. Ienco, R.G. Pensa, R. Meo, “From context to distance: Learning dissimilarity for categorical
data clustering,” ACM Transactions on Knowledge Discovery from Data, vol. 6, no. 1. https://doi.org/10.1145/2133360.2133361
A.K. Jain, M.N. Murty, P.J. Flynn, “Data clustering: A review,” ACM Computing Surveys,
vol. 31, no. 3, pp. 264–323, 1999.
Dr. Jyot, “Clustering categorical data using rough sets: A review,” International Journal of Advanced Research in IT and Engineering, vol. 2, no. 12, pp. 30–37, 2013.
D. Kim, K. Lee, D. Lee, “Fuzzy clustering of categorical data using fuzzy centroids,” Pattern
Recognition Letters, vol. 25, no. 1, pp. 1263–1271, 2004.
G. Khandelwal and R. Sharma, “A simple yet fast clustering approach for categorical data,”
International Journal of Computer Applications, vol. 120, no. 17, pp. 25–30, 2015.
P. Kumar and B. Tripathy, “MMeR: An algorithm for clustering heterogeneous data using rough set theory,” International Journal of Rapid Manufacturing, vol. 1, no. 2, pp. 189–207.
J. McCaffrey, Data Clustering Using Entropy Minimization. 2018.
L.J. Mazlack, A. He, Y. Zhu, and S. Coppock, “A rough set approach in choosing clustering
attributes,” Proceedings of the ISCA 13th International Conference (CAINE 2000), 2000,
pp. 1–6.
S. Mesakar, M.S. Chaudhari, “Review paper on data clustering of categorical data,” International Journal of Engineering Research & Technology, vol. 1, no. 10, December 2012.
I.-K. Park and G.-S. Choi, “Rough set approach for clustering categorical data using information-theoretic dependency measure,” Information Systems, vol. 4, pp. 289–295, 2015.
D. Parmar, T. Wu, and J. Blackhurst, “MMR: An algorithm for clustering categorical data using
rough set theory,” Data and Knowledge Engineering, vol. 63, pp. 879–893, 2007.
Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic
Publishers, Dordrecht, 1991.
H. Qin, X. Ma, T. Herawan, and J.M. Zain, “MGR: An information theory based hierarchical divisive clustering algorithm for categorical data,” Knowledge-Based Systems, vol. 67, pp. 401–, 2014.
F.M. Reza, An Introduction to Information Theory, Dover Publications, New York, 1994.
A. Skowron and S. Dutta, “Rough sets: Past, present, and future,” Natural Computing, vol. , no. 4, pp. 855–876, 2018.
G.K. Singh and S. Mandal, “Cluster analysis using rough set theory,” Journal of Informatics
and Mathematical Sciences, vol. 9, no. 3, pp. 509–520, 2017.
B. Tripathy and A. Ghosh, “SDR: An algorithm for clustering categorical data using rough set theory,” in Recent Advances in Intelligent Computational Systems, IEEE, 2011, pp. –872.
B.K. Tripathy, A. Goyal, R. Chowdhury, and P.A. Sourav, “MMeMeR: An algorithm for clustering heterogeneous data using rough set theory,” I.J. Intelligent Systems and Applications,
vol. 8, pp. 25-33, 2017.
A. Frank, “UCI Machine Learning Repository,” http://archive.ics.uci.edu/ml/
P.C. Xuyen, D.S. Truong, N.T. Tung, “An information-theoretic metric based method for selecting clustering attribute,” in Proceedings of 9th National Conference on Fundamental and
Applied Information Technology, 2016, pp. 31-40.
J. Uddin, R. Ghazali, and M.M. Deris, “An empirical analysis of rough set categorical clustering
techniques,” PLOS ONE, vol. 12, no. 1, 2017.
J. Uddin, R. Ghazali, J.H. Abawajy, H. Shah, N.A. Husaini, and A. Zeb, “Rough set based
information theoretic approach for clustering uncertain categorical data,” PLOS ONE, May 13, 2022. https://doi.org/10.1371/journal.pone.0265190
W. Wei, J. Liang, X. Guo, P. Song, and Y. Sun, “Hierarchical division clustering framework for
categorical data,” Neurocomputing, vol. 341, pp. 118–134, 2019.
Y.Y. Yao, “Information-Theoretic measures for knowledge discovery and data mining,” in
Karmeshu (eds) Entropy Measures, Maximum Entropy Principle and Emerging Applications. Studies in Fuzziness and Soft Computing, vol 119. Springer, Berlin, Heidelberg,
https://doi.org/10.1007/978-3-540-36212-8_6
Y. Zhao, R and Data Mining: Examples and Case Studies. Published by Elsevier, December
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is done so on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work is the result of our own research, has not been published before in its current or a substantially similar form, is not under consideration for another publication, does not contain any unlawful statements and does not infringe any existing copyright.
4. We also warrant that We have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials including tables, diagrams or photographs not owned by me/us.