RECURSIVE JOIN PROCESSING IN BIG DATA ENVIRONMENT

Authors

  • Anh-Cang Phan
  • Thanh-Ngoan Trieu
  • Thuong-Cang Phan

DOI:

https://doi.org/10.15625/1813-9663/37/2/15889

Keywords:

Apache Spark, big data, recursive join, optimize three-way join

Abstract

In the era of information explosion, big data is receiving increased attention because of its important implications for the growth, profitability, and survival of modern organizations. However, it also poses many challenges in how data is processed and queried over time. The join is one of the most common operations in data queries. In particular, a recursive join is a type of join used to query hierarchical data, but it is considerably more complex and costly. Evaluating a recursive join in MapReduce requires several iterations, each consisting of two tasks: a join task and an incremental computation task. These tasks are expensive and degrade query performance on large datasets because they generate a large amount of intermediate data that is transmitted over the network. In this study, we therefore propose a simple but efficient approach to recursive joins on big data that halves the number of required iterations in the Spark environment. This improvement significantly reduces the number of required tasks as well as the amount of intermediate data generated and transferred over the network. Our experimental results show that the improved recursive join is more efficient and faster than the traditional one on large-scale datasets.
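
The iterative evaluation described in the abstract can be illustrated with a minimal sketch in plain Python (not Spark or MapReduce). Each round runs a join task followed by an incremental computation task that keeps only tuples not seen before. The function names and the `hops_per_round` parameter are hypothetical, introduced only for this illustration. Performing two join hops per round is one assumed way to roughly halve the number of rounds; it is not presented as the authors' algorithm.

```python
from collections import defaultdict


def _index_by_source(edges):
    """Build a hash index: source node -> set of target nodes."""
    index = defaultdict(set)
    for src, dst in edges:
        index[src].add(dst)
    return index


def _join(delta, index):
    """Join task: extend every (x, y) in delta by one edge (y, z)."""
    return {(x, z) for x, y in delta for z in index.get(y, ())}


def recursive_join(edges, hops_per_round=1):
    """Semi-naive transitive closure; returns (closure, number_of_rounds).

    hops_per_round is a hypothetical knob: 1 joins delta with the edge
    relation once per round, 2 joins it twice (delta-edges-edges).
    """
    index = _index_by_source(edges)
    closure = set(edges)
    delta = set(edges)
    rounds = 0
    while delta:
        rounds += 1
        derived = set()
        frontier = delta
        for _ in range(hops_per_round):
            frontier = _join(frontier, index)
            derived |= frontier
        # Incremental computation task: keep only tuples not seen before.
        delta = derived - closure
        closure |= delta
    return closure, rounds


# Hierarchical data modelled as a chain 1 -> 2 -> ... -> 9.
chain = [(i, i + 1) for i in range(1, 9)]
closure_one, rounds_one = recursive_join(chain, hops_per_round=1)
closure_two, rounds_two = recursive_join(chain, hops_per_round=2)
```

On this chain both variants produce the same closure, while the two-hop variant needs fewer rounds. In a distributed setting each round corresponds to a batch of tasks and shuffled intermediate data, which is why the round count matters.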

Published

2021-05-31

Section

Computer Science