An improved indexing method for querying big XML files

Dinh Duc Luong; Vuong Quang Phuong; Hoang Do Thanh Tung

doi:10.15625/1813-9663/19018

Author affiliations

Authors

Dinh Duc Luong, Food Industrial College, 426 Nguyen Tat Thanh Street, Tan Dan Ward, Viet Tri City, Phu Tho Province, Viet Nam
Vuong Quang Phuong, Institute of Information Technology, Vietnam Academy of Science and Technology, 18 Hoang Quoc Viet Street, Cau Giay District, Ha Noi, Viet Nam
Hoang Do Thanh Tung, Institute of Information Technology, Vietnam Academy of Science and Technology, 18 Hoang Quoc Viet Street, Cau Giay District, Ha Noi, Viet Nam


Keywords:

Big data, indexing, analysis of XML, bio-XML files, XML query processing.

Abstract

The exponential growth of bioinformatics in the healthcare domain has revolutionized our understanding of DNA, proteins, and other biomolecular entities. This progress has generated an overwhelming volume of data, necessitating big data technologies for efficient storage and indexing. While big data technologies such as Hadoop offer substantial support for storing big XML files, the challenges of index data size and XPath query performance persist. To enhance the efficiency of XPath queries and address the data size problem, this paper proposes a novel approach derived from the spatial indexing methods of the R-tree family. The proposed method modifies the structure of leaf nodes in the indexing tree to preserve XML sibling connections. New algorithms are then introduced for constructing the new tree structure and for processing sibling queries more efficiently. Experimental results demonstrate the superior efficiency of sibling XPath queries with reduced data sizes for indexing, while other XPath queries exhibit notable performance improvements. This research contributes to the development of more effective indexing methods for managing and querying large XML datasets in bioinformatics applications, ultimately advancing biomedical research and healthcare initiatives.
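
The idea of leaf entries that preserve sibling connections can be illustrated with a minimal Python sketch. It is not the authors' data structure or algorithm: the names LeafEntry, build_entries and following_siblings are hypothetical, the preorder/postorder labels are an assumed node encoding, and the R-tree layers above the leaf entries are omitted. It only shows why a stored sibling link lets a following-sibling XPath step be answered by following pointers rather than searching the index.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LeafEntry:
    """Hypothetical leaf entry: one XML element plus a link to its next sibling."""
    tag: str
    pre: int    # preorder rank (assumed labeling, not taken from the paper)
    post: int   # postorder rank (assumed labeling, not taken from the paper)
    next_sibling: Optional["LeafEntry"] = None


def build_entries(root: ET.Element) -> List[LeafEntry]:
    """Label every element and chain the children of each element together."""
    entries: List[LeafEntry] = []
    counter = {"pre": 0, "post": 0}

    def visit(elem: ET.Element) -> LeafEntry:
        entry = LeafEntry(elem.tag, counter["pre"], -1)
        counter["pre"] += 1
        entries.append(entry)          # entries end up in document order
        previous = None
        for child in elem:
            current = visit(child)
            if previous is not None:
                previous.next_sibling = current   # preserve the sibling connection
            previous = current
        entry.post = counter["post"]
        counter["post"] += 1
        return entry

    visit(root)
    return entries


def following_siblings(entry: LeafEntry) -> List[LeafEntry]:
    """Answer a following-sibling step by walking the stored links."""
    result = []
    current = entry.next_sibling
    while current is not None:
        result.append(current)
        current = current.next_sibling
    return result


# Toy document standing in for a bio-XML record (element names are made up).
doc = ET.fromstring("<entry><name/><gene/><sequence/></entry>")
entries = build_entries(doc)
name_entry = entries[1]  # the <name> element
print([e.tag for e in following_siblings(name_entry)])  # ['gene', 'sequence']
```

In this sketch the cost of a sibling step is proportional to the number of siblings returned, independent of the overall document size; whether the proposed method achieves the same bound is not stated in the abstract.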
