Ha Dai Ton, Nguyen Duc Dung


Automatic transformation of paper documents into electronic form requires geometric document layout analysis as a first stage. However, variations in character font size, text-line spacing, and layout structure make it difficult to design a general-purpose method. Page segmentation algorithms usually segment text blocks using global separation objects, or local relations among connected components such as distance and orientation, but typically do not consider information other than the size of the local components. As a result, they cannot separate blocks that are very close to each other, including text of different font sizes and paragraphs in the same column. To overcome this limitation, we propose to use both separation objects at the whole-page level and context analysis at the text-line level to segment document images into paragraphs. The resulting hybrid paragraph-level page segmentation (HP2S) algorithm can handle difficult cases that purely top-down and purely bottom-up approaches cannot separate. Experimental results on the ICDAR2009 competition test set and the UW-III dataset show that our algorithm improves performance significantly compared to state-of-the-art algorithms.


Page segmentation, text-lines, homogeneous regions, separation objects, paragraphs, evaluation result.


