Similarity algorithms for fuzzy join computation in big data processing environment

Anh Cang Phan; Thuong Cang Phan

doi:10.15625/1813-9663/17589

Author affiliations

Authors

Anh Cang Phan Vinh Long University of Technology and Education, 73 Nguyen Hue, Vinh Long City, Vinh Long Province, Viet Nam
Thuong Cang Phan Can Tho University, 3/2 Street, Ninh Kieu District, Can Tho City, Viet Nam

DOI:

https://doi.org/10.15625/1813-9663/17589

Keywords:

Fuzzy joins, similarity algorithms, set-similarity joins, big data processing, spark.

Abstract

Big data processing is attracting the interest of many researchers to process large-scale datasets and extract useful information for supporting and providing decisions. One of the biggest challenges is the problem of querying large datasets. It becomes even more complicated with similarity queries instead of exact match queries. A fuzzy join operation is a typical operation frequently used in similarity queries and big data analysis. Currently, there is very little research on this issue, thus it poses significant barriers to the efforts of improving query operations on big data efficiently. As a result, this study overviews the similarity algorithms for fuzzy joins, in which the data at the join key attributes may have slight differences within a fuzzy threshold.
We analyze six similarity algorithms including Hamming, Levenshtein, LCS, Jaccard, Jaro, and Jaro - Winkler, to show the difference between these algorithms through the three criteria: output enrichment, false positives/negatives, and the processing time of the algorithms. Experiments of fuzzy joins algorithms are implemented in the Spark environment, a popular big data processing platform. The algorithms are divided into two groups for evaluation: group 1 (Hamming, Levenshtein, and LCS) and group 2 (Jaccard, Jaro, and Jaro - Winkler). For the former, Levenshtein has an advantage over the other two algorithms in terms of output enrichment, high accuracy in the result set (false positives/negatives), and acceptable processing time. In the letter, Jaccard is considered the worst algorithm considering all three criteria mean while Jaro - Winkler algorithm has more output richness and higher accuracy in the result set. The overview of the similarity algorithms in this study will help users to choose the most suitable algorithm for their problems.

Metrics

PDF views

283

References

F. N. Afrati, A. D. Sarma, D. Menestrina, A. Parameswaran, and J. D. Ullman, “Fuzzy joins

using mapreduce,” in 2012 IEEE 28th International Conference on Data Engineering. IEEE,

, pp. 498–509.

F. Ahmad, S. Lee, M. Thottethodi, and T. Vijaykumar, “Puma: Purdue mapreduce benchmarks

suite,” Electrical and Computer Engineering, Purdue University, Tech. Rep., 2012.

R. Baraglia, G. D. F. Morales, and C. Lucchese, “Document similarity self-join with mapreduce,”

in 2010 IEEE International Conference on data mining. IEEE, 2010, pp. 731–736.

Z. Chen, Y. Wang, V. Narasayya, and S. Chaudhuri, “Customizable and scalable fuzzy join for

big data,” Proceedings of the VLDB Endowment, vol. 12, no. 12, pp. 2106–2117, 2019. DOI: https://doi.org/10.14778/3352063.3352128

J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008. DOI: https://doi.org/10.1145/1327452.1327492

R. W. Hamming, “Error detecting and error correcting codes,” The Bell system technical journal,

vol. 29, no. 2, pp. 147–160, 1950. DOI: https://doi.org/10.1002/j.1538-7305.1950.tb00463.x

P. Jaccard, “Distribution de la flore alpine dans le bassin des dranses et dans quelques r´egions

voisines,” Bull Soc Vaudoise Sci Nat, vol. 37, pp. 241–272, 1901.

M. A. Jaro, “Advances in record-linkage methodology as applied to matching the 1985 census of

tampa, florida,” Journal of the American Statistical Association, vol. 84, no. 406, pp. 414–420,

B. Kimmett, V. Srinivasan, and A. Thomo, “Fuzzy joins in mapreduce: an experimental study,”

Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1514–1517, 2015. DOI: https://doi.org/10.14778/2824032.2824049

B. Kimmett, A. Thomo, and V. Srinivasan, “Fuzzy joins in mapreduce: Edit and jaccard distance,” in 2016 7th International Conference on Information, Intelligence, Systems & Applications (IISA). IEEE, 2016, pp. 1–6. DOI: https://doi.org/10.1109/IISA.2016.7785408

V. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” in

Soviet physics Doklady, vol. 10, no. 8. Soviet Union, 1966, pp. 707–710. DOI: https://doi.org/10.1070/PU1966v008n05ABEH003031

D. Mount, “Cmsc 251: Algorithms1 spring 1998,” University Lectures, 1998.

A. Sgarro, “A fuzzy hamming distance,” Bulletin math´ematique de la Soci´et´e des Sciences

Math´ematiques de la R´epublique Socialiste de Roumanie, pp. 137–144, 1977.

T. Thi-To-Quyen, P. Thuong-Cang, A. Laurent, and L. d’Orazio, “Optimization for large-scale

fuzzy joins using fuzzy filters in mapreduce,” in 2020 IEEE International Conference on Fuzzy

Systems (FUZZ-IEEE). IEEE, 2020, pp. 1–8.

R. Vernica, M. J. Carey, and C. Li, “Efficient parallel set-similarity joins using mapreduce,” in

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010,

pp. 495–506.

W. E. Winkler, “String comparator metrics and enhanced decision rules in the fellegi-sunter

model of record linkage.” ERIC, Institute of Education Sciences of the United States Department

of Education, Tech. Rep., 1990.

M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing

with working sets,” in 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud

, 2010.

Similarity algorithms for fuzzy join computation in big data processing environment

Authors

DOI:

Keywords:

Abstract

Metrics

References

Downloads

Published

How to Cite

Issue

Section

License