AN IMPROVED INDEXING METHOD FOR QUERYING BIG XML FILES
Author affiliations
DOI:
https://doi.org/10.15625/1813-9663/19018Keywords:
Big data, Indexing, Analysis of XML , Bio-XML files, XML Query processing.Abstract
The exponential growth of bioinformatics in the healthcare domain has revolutionized our understanding of DNA, proteins, and other biomolecular entities. This remarkable progress has generated an overwhelming volume of data, necessitating big data technologies for efficient storage and indexing. While big data technologies like Hadoop offer substantial support for big XML file storage, the challenges of indexing data sizes and XPath query performance persist. To enhance the efficiency of XPath queries and address the data size problem, a novel approach that is derived from the spatial indexing method of the R-tre family. The proposed method is to modify the structure of leaf nodes in the indexing tree to preserve XML-sibling connections. Then, new algorithms for constructing the new tree structure and processing sibling queries better are introduced. Experimental results demonstrate the superior efficiency of sibling XPath queries with reduced data sizes for indexing, while other XPath queries exhibit notable performance improvements. This research contributes to the development of more effective indexing methods for managing and querying large XML datasets in bioinformatics applications, ultimately advancing biomedical research and healthcare initiatives.
Metrics
References
Norah Saleh Alghamdi, Wenny Rahayu and Eric Pardede, “Semantic-based Structural and Content indexing for the efficient retrieval of queries over large XML data repositories”, Journal of Future Generation Computer Systems, vol. 37, pp. 212-231, July. 2014. DOI: https://doi.org/10.1016/j.future.2014.02.010
Guttman, “R-Trees: A dynamic index structure for spatial searching”, Proceedings of SIGMOD (Boston, Massachusetts), vol. 14, issue. 2, pp. 47–57, June. 1984. DOI: https://doi.org/10.1145/971697.602266
Tolani, P.M. and J.R. Haritsa, “XGRIND: A Query-friendly XML Compressor”, IEEE 18th international conference on Data Engineering (IEEE), pp. 225-234, August. 2002.
Min, J.K., M.J. Park and C.W. Chung, “XPRESS: a queriable compression for XML data”, Proceedings of the 2003 ACM SIGMOD international conference on Management of data (ACM, San Diego, California), pp. 122-133, June. 2003. DOI: https://doi.org/10.1145/872757.872775
Cheng, J. and W. Ng, “XQZip: Querying Compressed XML using Structural Indexing”, International Conference on Extending Data Base Technology (EDBT, 2004), pp. 219-236. DOI: https://doi.org/10.1007/978-3-540-24741-8_14
Arion, A., A. Bonifati, I. Manolescu and A. Pugliese, “XQueC: A query-conscious compressed XML database”, ACM Trans. Internet Technol, vol. 7, issue 2, pp. 1-35, May. 2007. DOI: https://doi.org/10.1145/1239971.1239974
Arroyuelo, D., F. Claude, S. Maneth, V. M¨Akinen, G. Navarro, K. Nguyen, J. Sir En and N. V Alim Aki, “Fast In-Memory XPath Search using Compressed Indexes”, Software: Practice and Experience (Wiley), vol45, issue. 3, pp. 399-434, March. 2015. DOI: https://doi.org/10.1002/spe.2227
Qian, B., H. Wang, J. Li, H. Gao, Z. Bao, Y. Gao, Y. Gu, L. Guo, Y. Li, J. Lu, Z. Ren, C. Wang and X. Zhang, “Path-Based XML Stream Compression with XPath Query Support Web-Age”, Information Management (Springer Berlin, Heidelberg), pp. 329-339, 2012. DOI: https://doi.org/10.1007/978-3-642-33050-6_32
P. Diet, “Maintaining order in a linked list”, Proceedings of the Fourteenth Annual ACM Symposium on Theory of Computing (ACM), pp. 122–127, 1984.
Q. Li, B. Moon, et al, “Indexing and querying XML data for regular path expressions”, Proceedings of the International Conference on Very Large Data Bases, pp. 361–370, September. 2001.
H. Jiang, H. Lu, W. Wang and B. C. Ooi, “XR-Tree: Indexing XML Data for Efficient Structural Joins”, Proc. the 19th International Conference on Data Engineering (ICDE), pp. 253-263, March. 2003.
Yaokai Feng and Akifumi Makinouchi, “A New Structure for Accelerating XPath Location Steps”, IAENG International Journal of Computer Science, pp. 49-60, 2006. DOI: https://doi.org/10.1007/11775300_5
T. Grust, M. V. Keulen, and J. Teubner, “Accelerating Xpath Evaluation in Any RDBMS”, ACM Transactions on Database Systems, vol.29, no. 1, pp. 91-131, 2005. DOI: https://doi.org/10.1145/974750.974754
Haw S and Lee C, “Data Storage Practices and Query Processing in XML Databases: A Survey”, International Journal of Knowledge-Based Systems, vol. 24, issue 8, pp. 1317-1340, 2011. DOI: https://doi.org/10.1016/j.knosys.2011.06.006
Baxevanis, A.D. and Ouellette, B.F.F., “Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins”, third edition (Wiley), ISBN 0-471-478784, 2005.
B. Salzberg and V.J. Tsotras, “A Comparison of Access Methods for Time-Evolving Data”, ACM Computing Surveys, vol. 31, issue 2, pp. 158-221, 1999. DOI: https://doi.org/10.1145/319806.319816
G. Li, J. Feng, J. Wang and L. Zhou, “Effective keyword search for valuable lcas over XML documents”, Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management (ACM), pp. 31–40, November. 2007. DOI: https://doi.org/10.1145/1321440.1321447
Z. Liu and Y. Cher, “Reasoning and identifying relevant matches: for XML keyword search”, Proc. VLDB Endowment 1 (1), pp. 921–932, August. 2008. DOI: https://doi.org/10.14778/1453856.1453956
Z. Bao, T. Ling, B. Chen and J. Lu, “Effective XML keyword search with relevance oriented ranking”, IEEE 25th International Conference on Data Engineering, (IEEE), pp. 517–528, April. 2009. DOI: https://doi.org/10.1109/ICDE.2009.16
J. Tatemura, “XML stream processing: stack-based algorithms”, L. Changqing, L. Tok (Eds.), Advanced Applications and Structures in XML Processing: Label Streams, Semantics Utilization and Data Query Technologies (IGI Global), pp. 184–226, January. 2010. DOI: https://doi.org/10.4018/978-1-61520-727-5.ch009
S. Chen, H.-G. Li, J. Tatemura, W.-P. Hsiung, D. Agrawal and K.S. Candan, “Twig2stack: bottom-up processing of generalized-tree-pattern queries over XML documents”, Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06, VLDB Endowment, pp. 283-294, January. 2006.
J. Lu, T.W. Ling, Z. Bao and C. Wang, “Extended XML tree pattern matching: theories and algorithms”, IEEE Trans. Knowl. Data Eng, vol. 23, no. 3, pp. 402–416, August. 2010. DOI: https://doi.org/10.1109/TKDE.2010.126
N.S. Alghamdi, W. Rahayu and E. Pardede, “Semantic-based construction of content and structure XML index”, The 24th Australasian Database Conference (ADC), ADC’13, Adelaide, vol. 137, pp. 61-70, January. 2013.
N.S. Alghamdi, W. Rahayu and E. Pardede, “Object-based semantic partitioning for XML twig query optimization”, Proceedings of the 2013 IEEE International Conference on Advanced Information Networking and Applications, AINA ’13 (IEEE Computer Society, Barcelona, Spain), pp. 61–70, June. 2013. DOI: https://doi.org/10.1109/AINA.2013.74
S. Haw and C. Lee, “Stack-based pattern matching algorithm for XML query processing”, J. Digit. Inf. Manage, vol 5 (3), pp. 167-175, June. 2007.
Z. Liu and Y. Chen, “Identifying meaningful return information for XML keyword search”, Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (ACM), pp. 329–340, June. 2007. DOI: https://doi.org/10.1145/1247480.1247518
E. Jiao, T. Ling and C.-Y. Chan, “Pathstack: a holistic path join algorithm for path query with not-predicates on XML data”, Database Systems for Advanced Applications (Springer), vol. 3453, pp. 113–124, 2005. DOI: https://doi.org/10.1007/11408079_12
P. A. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger and J. Teubner, “MonetDB/XQuery: a fast XQuery processor powered by a relational engine”, SIGMOD, pp. 479–490, June. 2006. DOI: https://doi.org/10.1145/1142473.1142527
M. Kay, “Ten reasons why Saxon XQuery is fast”, IEEE Data Eng, Bull 31(4), pp. 65–74, January. 2008.
G. Navarro and V. M¨akinen, “Compressed full-text indexes”, ACM Comp. Surv, vol. 39, issue. 1 2-es, April. 2007. DOI: https://doi.org/10.1145/1216370.1216372
H.-L. Chan, W.-K. Hon, T.-W. Lam and K. Sadakane, “Compressed indexes for dynamic text collections”, ACM TALG, vol. 3, issue. 2, pp. 21-es, May. 2007. DOI: https://doi.org/10.1145/1240233.1240244
V. M¨akinen and G. Navarro, “Dynamic entropy-compressed sequences and full-text indexes”, ACM TALG, vol.4, issue. 3, pp. 306-317, July. 2008. DOI: https://doi.org/10.1145/1367064.1367072
K. Sadakane and G. Navarro, “Fully-functional static and dynamic succinct trees”, ACM Transactions on Algorithms, vol. 10, no. 16, pp. 1-39, May. 2014. DOI: https://doi.org/10.1145/2601073
P. Ferragina, G. Manzini, V. M¨akinen and G. Navarro, “Compressed representations of sequences and full-text indexes”, ACM Transactions on Algorithms, vol. 3, issue. 2, pp. 20-es, May 2007. DOI: https://doi.org/10.1145/1240233.1240243
T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong and S. M. Yiu, “Compressed indexing and local alignment of DNA”, Bioinformatics, vol. 24, issue. 6, pp. 791–797, January. 2008. DOI: https://doi.org/10.1093/bioinformatics/btn032
B. Langmead, C. Trapnell, M. Pop and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short dna sequences to the human genome”, Genome Biology, vol. 10, April. 2009. DOI: https://doi.org/10.1186/gb-2009-10-3-r25
H. Li and R. Durbin, “Fast and accurate short read alignment with burrows-wheeler transform”, Bioinformatics, vol. 25, issue. 14, pp. 1754-1760, July. 2009. DOI: https://doi.org/10.1093/bioinformatics/btp324
J. Sir´en, “Compressed suffix arrays for massive data”, SPIDE, pp. 63–74, August. 2009. DOI: https://doi.org/10.1007/978-3-642-03784-9_7
H. Bj ¨orklund, W. Gelade, M. Marquardt and W. Martens, “Incremental XPath evaluation”, ICDT, pp. 162–173, October. 2009. DOI: https://doi.org/10.1145/1514894.1514915
S. Bog, et al., “XQuery 1.0: An XML Query Language (Second Edition)”, W3C Recommendation, 2010.
T. Bray, et al., “Extensible Markup Language (XML) 1.0 (Fifth Edition)”, W3C Recommendation, 2008.
Downloads
Published
How to Cite
Issue
Section
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.2. This assignment of copyright to the Journal of Computer Science and Cybernetics is done so on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work is our results and has not been published before in its current or a substantially similar form and is not under consideration for another publication, does not contain any unlawful statements and does not infringe any existing copyright.
4. We also warrant that We have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials including tables, diagrams or photographs not owned by me/us.