
A HYBRID PARAGRAPH-LEVEL PAGE SEGMENTATION

Ha Dai Ton, Nguyen Duc Dung

Abstract


Automatic transformation of paper documents into electronic form requires geometric document layout analysis as the first stage. However, variations in character font size, text-line spacing, and layout structure make it difficult to design a general-purpose method. Page segmentation algorithms usually segment text blocks using global separation objects or local relations among connected components, such as distance and orientation, but typically do not consider information other than the local component's size. As a result, they cannot separate blocks that are very close to each other, including text of different font sizes and paragraphs in the same column. To overcome this limitation, we propose using both separation objects at the whole-page level and context analysis at the text-line level to segment document images into paragraphs. The introduced hybrid paragraph-level page segmentation (HP2S) algorithm can handle difficult cases that purely top-down or purely bottom-up approaches cannot separate. Experimental results on the ICDAR2009 competition test set and the UW-III dataset show that our algorithm improves performance significantly compared with state-of-the-art algorithms.
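To make the idea of text-line context concrete, the following is a minimal sketch, not the HP2S algorithm itself, of how consecutive text lines in one column might be grouped into paragraphs. It assumes text lines have already been detected as bounding boxes, and it uses only two cues named in the abstract: a change in font size (approximated here by line height) and unusually wide line spacing. The `TextLine` class, the `split_into_paragraphs` function, and the thresholds `size_ratio` and `gap_ratio` are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class TextLine:
    """Bounding box of one detected text line (hypothetical type).

    Coordinates are in pixels, with y increasing downward.
    """
    left: float
    top: float
    right: float
    bottom: float

    @property
    def height(self) -> float:
        # Line height is used as a rough proxy for font size.
        return self.bottom - self.top


def split_into_paragraphs(lines: List[TextLine],
                          size_ratio: float = 1.3,
                          gap_ratio: float = 1.5) -> List[List[TextLine]]:
    """Group the text lines of a single column into paragraphs.

    A new paragraph starts when adjacent lines differ strongly in
    height, or when the vertical gap between them is much larger than
    the typical gap in the column. Both thresholds are assumed values.
    """
    if not lines:
        return []
    lines = sorted(lines, key=lambda ln: ln.top)
    # Vertical white space between each pair of consecutive lines.
    gaps = [max(0.0, b.top - a.bottom) for a, b in zip(lines, lines[1:])]
    typical_gap = median(gaps) if gaps else 0.0

    paragraphs = [[lines[0]]]
    for prev, cur, gap in zip(lines, lines[1:], gaps):
        taller = max(prev.height, cur.height)
        shorter = min(prev.height, cur.height)
        size_change = shorter > 0 and taller / shorter > size_ratio
        wide_gap = typical_gap > 0 and gap > gap_ratio * typical_gap
        if size_change or wide_gap:
            paragraphs.append([cur])      # start a new paragraph
        else:
            paragraphs[-1].append(cur)    # continue the current one
    return paragraphs
```

In a hybrid scheme of the kind the abstract describes, a step like this would run inside each region already delimited by page-level separation objects, so that closely spaced blocks in the same column can still be told apart.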

Keywords


Page segmentation, text-lines, homogenous regions, separation objects, paragraphs, evaluation result.

References


Breuel, T.M.: Two geometric algorithms for layout analysis. In Document Analysis Systems, Princeton, NJ, pp. 188-199, Aug 2002.

Chen, M., Ding, X.Q.: Unified HMM-based Layout Analysis Framework and Algorithm, SCI CHINA Ser F, 46(6), Dec. 2003, pp 401-408.

Kise, K., Sato, A. and Iwata, M.: Segmentation of page images using the area Voronoi diagram. Computer Vision and Image Understanding, vol. 70, no. 3, pp. 370-382, June 1998.

Mao, S. and Kanungo, T.: Software architecture of PSET: a page segmentation evaluation toolkit. International Journal on Document Analysis and Recognition, vol. 4, no. 3, pp. 205-217, July 2001.

O'Gorman, L.: The document spectrum for page layout analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 15, no. 11, pp. 1162-1173, Nov. 1993.

Pavlidis, T. and Zhou, J.: Page Segmentation and Classification, CVGIP: Graphical Models and Image Processing, 54(6), November 1992, pp 484-496.

Phillips, I.: User's Reference Manual, CD-ROM, UW-III Document Image Database-III, July 1996.

Wong, K.Y., Casey, R.G. and Wahl, F.M.: Document analysis system. IBM Journal of Research and Development, vol. 26, no. 6, pp. 647-656, 1982.

Wahl, F.M., Wong, K.Y. and Casey, R.G.: Block segmentation and text extraction in mixed text/image documents, Computer Graphics and Image Processing, 20, 1982, pp 375-390.

Antonacopoulos, A., Pletschacher, S., Bridson, D. and Papadopoulos, C.: ICDAR 2009: Page segmentation competition. in Proc. 10th Intl. Conf. on Document Analysis and Recognition, University of Salford, Manchester, United Kingdom, July 2009.

Antonacopoulos, A., Bridson, D., Papadopoulos, C. and Pletschacher, S.: A Realistic Dataset for Performance Evaluation of Document Layout Analysis. in Proc. 10th Intl. Conf. on Document Analysis and Recognition, University of Salford, Manchester, United Kingdom, July 2009.

Antonacopoulos, A., Clausner, C., Papadopoulos, C., and Pletschacher, S.: ICDAR2013 Competition on Historical Newspaper Layout Analysis, Proc. 13th ICDAR, pp 1454-1458, 2013.

G. Louloudis, B. Gatos, I. Pratikakis, K. Halatis, A block-based Hough transform mapping for text line detection in handwritten document, in: Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition, 2006, pp. 515-520.

Smith, R.: Hybrid page layout analysis via tab-stop detection. In Proc. Int. Conf. on Document Analysis and Recognition, pages 241-245, Barcelona, Spain, July 2009.

Chowdhury, S.P., Mandal, S., Das, A.K. and Chanda, B.: Segmentation of Text and Graphics from Document Images, Proc. of the 9th Int. Conf. on Document Analysis and Recognition, IEEE, Curitiba, Brazil, Sep 2007, pp 619-623.




Journal of Computer Science and Cybernetics ISSN: 1813-9663

Published by Vietnam Academy of Science and Technology