Automatic main text extraction from web pages

Phan Thi Ha, Ha Hai Nam

Authors

  • Phan Thi Ha, Posts and Telecommunications Institute of Technology
  • Ha Hai Nam, Posts and Telecommunications Institute of Technology

DOI:

https://doi.org/10.15625/0866-708X/51/1/9557

Keywords:

HTML, BTE, body text extraction, main content text

Abstract

This paper presents a novel method for extracting body text from web pages for use in building text corpora. The method extends the body text extraction (BTE) algorithm proposed by Aidan Finn [1] with several enhancements. Experimental results on a set of websites show that the proposed method significantly improves the performance of body text extraction, with no decrease in accuracy, compared to the original algorithm.
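
For readers unfamiliar with BTE, the idea behind the original algorithm, as it is commonly described, can be sketched roughly as follows. The page is split into tag tokens and word tokens, and the contiguous span that maximizes the number of tags outside it plus the number of words inside it is taken as the body text. This is a naive quadratic illustration of that baseline, not the extended method proposed in this paper. The tokenizer regex and the names `tokenize` and `extract_body_text` are assumptions made for this sketch.

```python
import re
from itertools import accumulate

# Assumed tokenizer: a token is either a whole tag or a run of non-space text.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Return a list of (token, is_tag) pairs for an HTML string."""
    return [(tok, tok.startswith("<")) for tok in TOKEN_RE.findall(html)]


def extract_body_text(html):
    """Naive BTE-style extraction: pick the span [i, j] of tokens that
    maximizes (tags outside the span) + (word tokens inside the span)."""
    tokens = tokenize(html)
    n = len(tokens)
    if n == 0:
        return ""

    # prefix_tags[k] = number of tag tokens among tokens[:k]
    prefix_tags = [0] + list(accumulate(1 if is_tag else 0 for _, is_tag in tokens))
    total_tags = prefix_tags[n]

    best_score, best_i, best_j = -1, 0, -1
    for i in range(n):
        for j in range(i, n):
            tags_inside = prefix_tags[j + 1] - prefix_tags[i]
            words_inside = (j - i + 1) - tags_inside
            score = (total_tags - tags_inside) + words_inside
            if score > best_score:
                best_score, best_i, best_j = score, i, j

    # Keep only the word tokens of the winning span.
    return " ".join(tok for tok, is_tag in tokens[best_i:best_j + 1] if not is_tag)


if __name__ == "__main__":
    page = (
        "<html><body><div><a>Home</a><a>About</a></div>"
        "<p>This is the main article text with many words in it</p>"
        "<div><a>Contact</a></div></body></html>"
    )
    print(extract_body_text(page))
```

The exhaustive search over all (i, j) pairs is kept here for clarity only; it is quadratic in the number of tokens.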

Published

09-04-2017

How to Cite

[1]
P. Thi Ha and H. Hai Nam, “Automatic main text extraction from web pages”, Vietnam J. Sci. Technol., vol. 51, no. 1, pp. 11–18, Apr. 2017.

Section

Articles