Automatic main text extraction from web pages

Phan Thi Ha; Ha Hai Nam

doi:10.15625/0866-708X/51/1/9557

Author affiliations

Authors

Phan Thi Ha, Posts and Telecommunications Institute of Technology
Ha Hai Nam, Posts and Telecommunications Institute of Technology

Keywords:

HTML, BTE, body text extraction, main content text

Abstract

This paper presents a novel method for extracting body text from web pages for use in building text corpora. The method extends the body text extraction (BTE) algorithm proposed by Aidan Finn [1] with several enhancements. Experimental results on a set of websites show that the proposed method significantly improves the performance of body text extraction without a decrease in accuracy compared with the original algorithm.
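For orientation, the baseline BTE idea can be illustrated with a minimal sketch. It shows only the general principle commonly attributed to the original algorithm (treat the page as a sequence of tag and word tokens, then pick the contiguous span that keeps the most words while leaving the most tags outside); it is not the enhanced method proposed in this paper. The function names `tokenize` and `extract_body_text`, the regular expression, and the linear-time maximum-subarray formulation are assumptions made for this example.

```python
import re

# Assumption: a tag is anything between '<' and '>', a word is any other
# run of non-whitespace characters. Scripts, styles and comments are not
# treated specially in this sketch.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Split an HTML string into tag tokens and word tokens."""
    return TOKEN_RE.findall(html)


def extract_body_text(html):
    """Return the words of the token span with the best words-minus-tags score.

    Scoring words as +1 and tags as -1 and taking the maximum-sum
    contiguous span is equivalent to maximising
    (tags before the span) + (words inside it) + (tags after it),
    because the total number of tags is fixed.
    """
    tokens = tokenize(html)
    best_sum, best_start, best_end = 0, 0, 0  # empty span by default
    cur_sum, cur_start = 0, 0
    for k, tok in enumerate(tokens):
        score = -1 if tok.startswith("<") else 1
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the span here.
            cur_sum, cur_start = score, k
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, k + 1
    words = [t for t in tokens[best_start:best_end] if not t.startswith("<")]
    return " ".join(words)


if __name__ == "__main__":
    page = (
        "<html><head><title>T</title></head><body>"
        "<div><a href='x'>Home</a></div>"
        "<p>This is the main body text of the page.</p>"
        "<div><a>About</a></div></body></html>"
    )
    print(extract_body_text(page))
```

On a page whose navigation is dense in tags and whose article is dense in words, the selected span tends to cover the article and skip the navigation, which is the behaviour the extraction relies on.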

Published

09-04-2017

How to Cite

[1] P. Thi Ha and H. Hai Nam, “Automatic main text extraction from web pages”, Vietnam J. Sci. Technol., vol. 51, no. 1, pp. 11–18, Apr. 2017.

Issue

Vol. 51 No. 1 (2013)

Section

Articles

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Vietnam Journal of Sciences and Technology (VJST) is an open access, peer-reviewed journal. All publications are free for everyone to read and download. In addition, articles are published under the terms of the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA) License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and the ShareAlike terms are followed.

Copyright on any research article published in VJST is retained by the respective author(s), without restrictions. Authors grant VAST Journals System a license to publish the article and identify itself as the original publisher. By giving VJST permission, via the VJST journal portal or another channel, to publish their research work in VJST, the author(s) agree to all the terms and conditions of the https://creativecommons.org/licenses/by-sa/4.0/ license and to the terms and conditions set by VJST.

Authors are responsible for securing all necessary copyright permissions for the use of third-party materials in their manuscripts.