Automatic main text extraction from web pages

Phan Thi Ha, Ha Hai Nam
Authors

  • Phan Thi Ha, Posts and Telecommunications Institute of Technology
  • Ha Hai Nam, Posts and Telecommunications Institute of Technology

DOI:

https://doi.org/10.15625/0866-708X/51/1/9557

Keywords:

HTML, BTE, body text extraction, main content text

Abstract

This paper presents a novel method for extracting body text from web pages for use in building text corpora. The method extends the body text extraction (BTE) algorithm proposed by Aidan Finn [1] with several enhancements. Experimental results on a set of websites show that the proposed method significantly improves the performance of body text extraction, with no decrease in accuracy compared to the original algorithm.
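
For readers unfamiliar with BTE, the following is a minimal sketch of the baseline idea as it is commonly described, not the enhanced method of this paper. It assumes the page is split into tag tokens and word tokens, and that the body text is the span [i, j] maximizing the number of tags outside the span plus the number of words inside it. The regular expression, the function names and the sample page are illustrative assumptions.

```python
import re

# Assumption: a crude tokenizer; anything in <...> is a tag, the rest are words.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Split HTML into (token, is_tag) pairs."""
    return [(tok, tok.startswith("<")) for tok in TOKEN_RE.findall(html)]


def extract_body_text(html):
    """Return the words in the span [i, j] that maximizes
    tags before i + words in [i, j] + tags after j."""
    tokens = tokenize(html)
    n = len(tokens)

    # tags_upto[k] = number of tag tokens among tokens[0:k]
    tags_upto = [0]
    for _, is_tag in tokens:
        tags_upto.append(tags_upto[-1] + (1 if is_tag else 0))

    # Up to a constant, the score splits into a(i) + b(j) with
    # a(i) = 2 * tags_upto[i] - i and b(j) = (j + 1) - 2 * tags_upto[j + 1],
    # so a single left-to-right scan finds the best pair with i <= j.
    best_score, best_span = None, (0, -1)
    best_a, best_i = None, 0
    for j in range(n):
        a = 2 * tags_upto[j] - j
        if best_a is None or a > best_a:
            best_a, best_i = a, j
        score = best_a + (j + 1) - 2 * tags_upto[j + 1]
        if best_score is None or score > best_score:
            best_score, best_span = score, (best_i, j)

    i, j = best_span
    return " ".join(tok for tok, is_tag in tokens[i:j + 1] if not is_tag)


# Hypothetical sample page: navigation, a short article, and a footer.
page = (
    "<html><head><title>T</title></head><body>"
    "<div><a>nav</a></div>"
    "<p>one two three four five six</p>"
    "<div><a>foot</a></div>"
    "</body></html>"
)
print(extract_body_text(page))  # prints: one two three four five six
```

The linear scan is a design choice of this sketch; a direct search over all (i, j) pairs gives the same span but takes quadratic time in the number of tokens.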

Published

09-04-2017

How to Cite

[1]
P. Thi Ha and H. Hai Nam, “Automatic main text extraction from web pages”, Vietnam J. Sci. Technol., vol. 51, no. 1, pp. 11–18, Apr. 2017.

Section

Articles