SIMILARITY ALGORITHMS FOR FUZZY JOIN COMPUTATION IN BIG DATA PROCESSING ENVIRONMENT
Author affiliations
DOI:
https://doi.org/10.15625/1813-9663/17589Keywords:
Fuzzy joins, similarity algorithms, set-similarity joins, big data processing, SparkAbstract
Big data processing is attracting the interest of many researchers to process large-scale datasets and extract useful information for supporting and providing decisions. One of the biggest challenges is the problem of querying large datasets. It becomes even more complicated with similarity queries instead of exact match queries. A fuzzy join operation is a typical operation frequently used in similarity queries and big data analysis. Currently, there is very little research on this issue, thus it poses significant barriers to the efforts of improving query operations on big data efficiently. As a result, this study overviews the similarity algorithms for fuzzy joins, in which the data at the join key attributes may have slight differences within a fuzzy threshold.
We analyze six similarity algorithms including Hamming, Levenshtein, LCS, Jaccard, Jaro, and Jaro - Winkler, to show the difference between these algorithms through the three criteria: output enrichment, false positives/negatives, and the processing time of the algorithms. Experiments of fuzzy joins algorithms are implemented in the Spark environment, a popular big data processing platform. The algorithms are divided into two groups for evaluation: group 1 (Hamming, Levenshtein, and LCS) and group 2 (Jaccard, Jaro, and Jaro - Winkler). For the former, Levenshtein has an advantage over the other two algorithms in terms of output enrichment, high accuracy in the result set (false positives/negatives), and acceptable processing time. In the letter, Jaccard is considered the worst algorithm considering all three criteria mean while Jaro - Winkler algorithm has more output richness and higher accuracy in the result set. The overview of the similarity algorithms in this study will help users to choose the most suitable algorithm for their problems.
Metrics
References
F. N. Afrati, A. D. Sarma, D. Menestrina, A. Parameswaran, and J. D. Ullman, “Fuzzy joins
using mapreduce,” in 2012 IEEE 28th International Conference on Data Engineering. IEEE,
, pp. 498–509.
F. Ahmad, S. Lee, M. Thottethodi, and T. Vijaykumar, “Puma: Purdue mapreduce benchmarks
suite,” Electrical and Computer Engineering, Purdue University, Tech. Rep., 2012.
R. Baraglia, G. D. F. Morales, and C. Lucchese, “Document similarity self-join with mapreduce,”
in 2010 IEEE International Conference on data mining. IEEE, 2010, pp. 731–736.
Z. Chen, Y. Wang, V. Narasayya, and S. Chaudhuri, “Customizable and scalable fuzzy join for
big data,” Proceedings of the VLDB Endowment, vol. 12, no. 12, pp. 2106–2117, 2019. DOI: https://doi.org/10.14778/3352063.3352128
J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008. DOI: https://doi.org/10.1145/1327452.1327492
R. W. Hamming, “Error detecting and error correcting codes,” The Bell system technical journal,
vol. 29, no. 2, pp. 147–160, 1950. DOI: https://doi.org/10.1002/j.1538-7305.1950.tb00463.x
P. Jaccard, “Distribution de la flore alpine dans le bassin des dranses et dans quelques r´egions
voisines,” Bull Soc Vaudoise Sci Nat, vol. 37, pp. 241–272, 1901.
M. A. Jaro, “Advances in record-linkage methodology as applied to matching the 1985 census of
tampa, florida,” Journal of the American Statistical Association, vol. 84, no. 406, pp. 414–420,
B. Kimmett, V. Srinivasan, and A. Thomo, “Fuzzy joins in mapreduce: an experimental study,”
Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1514–1517, 2015. DOI: https://doi.org/10.14778/2824032.2824049
B. Kimmett, A. Thomo, and V. Srinivasan, “Fuzzy joins in mapreduce: Edit and jaccard distance,” in 2016 7th International Conference on Information, Intelligence, Systems & Applications (IISA). IEEE, 2016, pp. 1–6. DOI: https://doi.org/10.1109/IISA.2016.7785408
V. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” in
Soviet physics Doklady, vol. 10, no. 8. Soviet Union, 1966, pp. 707–710.
D. Mount, “Cmsc 251: Algorithms1 spring 1998,” University Lectures, 1998.
A. Sgarro, “A fuzzy hamming distance,” Bulletin math´ematique de la Soci´et´e des Sciences
Math´ematiques de la R´epublique Socialiste de Roumanie, pp. 137–144, 1977.
T. Thi-To-Quyen, P. Thuong-Cang, A. Laurent, and L. d’Orazio, “Optimization for large-scale
fuzzy joins using fuzzy filters in mapreduce,” in 2020 IEEE International Conference on Fuzzy
Systems (FUZZ-IEEE). IEEE, 2020, pp. 1–8.
R. Vernica, M. J. Carey, and C. Li, “Efficient parallel set-similarity joins using mapreduce,” in
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010,
pp. 495–506.
W. E. Winkler, “String comparator metrics and enhanced decision rules in the fellegi-sunter
model of record linkage.” ERIC, Institute of Education Sciences of the United States Department
of Education, Tech. Rep., 1990.
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing
with working sets,” in 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud
, 2010.
Downloads
Published
How to Cite
Issue
Section
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.2. This assignment of copyright to the Journal of Computer Science and Cybernetics is done so on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work is our results and has not been published before in its current or a substantially similar form and is not under consideration for another publication, does not contain any unlawful statements and does not infringe any existing copyright.
4. We also warrant that We have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials including tables, diagrams or photographs not owned by me/us.