Saturday, October 6, 2007

Research and Development of OCR Systems on Bangla Script

Research Papers

1. A. K. Roy and B. Chatterjee, "Design of a Nearest Neighbor Classifier for Bengali Character Recognition", J. IETE, vol. 30, 1984.

2. U. Pal and B. B. Chaudhuri, "OCR in Bangla: An Indo-Bangladeshi Language", Proc. of 12th Int. Conf. on Pattern Recognition, IEEE Computer Society Press, pp. 269-274, 1994.

3. B. B. Chaudhuri and U. Pal, "Computer Recognition of Printed Bangla Script", Int. Journal of Systems Science, vol. 26, pp. 2107-2123, 1995.

4. B. B. Chaudhuri and U. Pal, "OCR Error Detection and Correction of an Inflectional Indian Language Script", Proc. of ICPR, 1996.

5. B. B. Chaudhuri and U. Pal, "Automatic Separation of Words in Multi-Lingual Multi-Script Indian Documents", IEEE Trans., 1997.

6. U. Pal, "On the Development of an Optical Character Recognition (OCR) System for Printed Bangla Script", Ph.D. Thesis, Indian Statistical Institute, 1997.

7. U. Pal and B. B. Chaudhuri, "Printed Devnagari Script OCR System", Vivek, vol. 10, pp. 12-24, 1997.

8. B. B. Chaudhuri and U. Pal, "An OCR System to Read Two Indian Language Scripts: Bangla and Devnagari (Hindi)", Proc. Fourth Int. Conf. on Document Analysis and Recognition, IEEE Computer Society Press, pp. 1011-1016, 1997.

9. B. B. Chaudhuri and U. Pal, "Skew Angle Detection of Digitized Indian Script Documents", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, pp. 182-186, 1997.

10. B. B. Chaudhuri and U. Pal, "A Complete Printed Bangla OCR System", Pattern Recognition, vol. 31, pp. 531-549, 1998.

11. B. B. Chaudhuri and U. Pal, "A Complete Bangla OCR System", Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, 1998.

12. U. Pal and B. B. Chaudhuri, "Script Line Separation from Indian Multi-Script Documents", Proc. of the Fifth International Conference on Document Analysis and Recognition (ICDAR '99), Bangalore, India, 1999.

13. U. Pal and B. B. Chaudhuri, "Automatic Separation of Machine-Printed and Hand-Written Text Lines", Proc. of the Fifth International Conference on Document Analysis and Recognition (ICDAR '99), Bangalore, India, 1999.

14. U. Pal and B. B. Chaudhuri, "Automatic Identification of English, Chinese, Arabic, Devnagari and Bangla Script Line", Proc. Sixth Int. Conf. on Document Analysis and Recognition, IEEE Computer Society Press, pp. 790-794, 2001.

15. U. Pal, M. Mitra and B. B. Chaudhuri, "Multi-Skew Detection of Indian Script Documents", IEEE Trans., 2001.

16. Veena Bansal and R. M. K. Sinha, "A Devanagari OCR and a Brief Overview of OCR Research for Indian Scripts", Proc. of STRANS01, IIT Kanpur, 2001.

17. A. Mandal and B. B. Chaudhuri, "Page Layout Analyzer for Multilingual Indian Documents", Proc. of the Language Engineering Conference (LEC 2002), IEEE CS Press, 2002.

18. Ahmed Asif Chowdhury, Ejaj Ahmed, Shameem Ahmed, Shohrab Hossain and Chowdhury Mofizur Rahman, "Optical Character Recognition of Bangla Characters Using Neural Network: A Better Approach", 2nd International Conference on Electrical Engineering (ICEE 2002), Khulna, Bangladesh, 2002.

19. Utpal Garain and Bidyut B. Chaudhuri, "Segmentation of Touching Characters in Printed Devnagari and Bangla Scripts Using Fuzzy Multifactorial Analysis", IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 32, no. 4, November 2002.

20. Jalal Uddin Mahmud, Mohammed Feroz Raihan and Chowdhury Mofizur Rahman, "A Complete OCR System for Continuous Bangla Characters", IEEE TENCON 2003: Proc. of the Conference on Convergent Technologies for the Asia Pacific, 2003.

21. A. M. Shoeb Shatil and Mumit Khan, "Minimally Segmenting High Performance Bangla OCR Using Kohonen Network", Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006.

22. S. M. Murtoza Habib, Nawsher Ahmed Noor and Mumit Khan, "Skew Correction of Bangla Script Using Radon Transform", Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006.

23. Md. Abul Hasnat, S. M. Murtoza Habib and Mumit Khan, "Segmentation Free Bangla OCR Using HMM: Training and Recognition", Proc. of 1st International Conference on Digital Communications and Computer Applications (DCCA 2007), Irbid, Jordan, 2007.

Implementations

BOCRA [2006]: http://bocra.sourceforge.net/doc/

Apona-pathak [2006]: http://www.apona-bd.com/apona-pathak/bangla-ocr-apona-pathak.html

BanglaOCR [2007]: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=215908

Sunday, September 9, 2007

:-- Optical Character Recognition --:


Name:: BanglaOCR

Summary::

This project aims to develop an Optical Character Recognizer that can recognize Bangla script. The entire OCR research and development task is divided into five parts: preprocessing, feature extraction, training, recognition and post-processing. We experimented with several techniques for each part and chose the appropriate methods for our implementation. We used the Hidden Markov Model (HMM) technique for pattern training and classification, and the Hidden Markov Model Toolkit (HTK) to implement the training and classification tasks.


Details::

BanglaOCR is the Optical Character Recognizer for Bangla Script. It takes scanned images of a printed page or document as input and converts them into editable Unicode text. BanglaOCR allows users to train the data set from any document and observe the recognition performance.

BanglaOCR consists of several independent parts, listed below:

  • Preprocessing
  • Feature Extraction
  • Training
  • Recognition
  • Post-processing


The preprocessing task involves image acquisition, binarization, noise elimination, skew correction, line and word separation, and character segmentation. In this step we segment only as far as minimally segmented characters, rather than isolating every individual character.

The feature extraction task involves extracting meaningful characteristics from a minimally segmented character. First we divide the character image into several frames of a fixed frame length. Then we perform a Discrete Cosine Transform (DCT) over each frame. The number of frames and the DCT coefficients of each frame together form the feature set for each character.
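As a sketch of this step, a minimal illustration follows, assuming vertical frames over a binary character image and a 1-D DCT of each frame's row-wise ink profile; frame_dct_features, frame_len and n_coeffs are illustrative names and defaults, not the project's actual code or values.

import numpy as np
from scipy.fft import dct

def frame_dct_features(char_img, frame_len=8, n_coeffs=8):
    # char_img: 2-D binary numpy array, 1 = ink, for one minimally
    # segmented character. Returns one feature vector per frame.
    feats = []
    for x in range(0, char_img.shape[1], frame_len):
        frame = char_img[:, x:x + frame_len].astype(float)
        profile = frame.sum(axis=1)           # ink count per row in the frame
        coeffs = dct(profile, norm="ortho")   # 1-D DCT over the frame profile
        feats.append(coeffs[:n_coeffs])       # keep the leading coefficients
    return np.array(feats)                    # shape: (n_frames, n_coeffs)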

Training is performed on the features calculated for each minimally segmented character image. We create a separate HMM for each such character. Model creation involves dynamically choosing a prototype HMM and building a model from the prototype and the extracted feature data; this is performed automatically by invoking the HInit tool of the HTK toolkit.

The recognition process invokes the recognition tool HVite of the HTK toolkit. HVite uses the feature file of the segmented character image, in which the features are written in a specified format; the word network describing the allowable word sequences, built from the task grammar; the dictionary defining each character or word; the list of all HMMs; and the master macro file (.mmf) in which the description of each HMM is written. All of these files are constructed in the format the HTK toolkit expects. After recognition completes, the model names are read from the output master label file (.mlf) and the Unicode character associated with each recognized model is written to the output file.
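The post names the actual HTK tools but not their command lines, so the following is only a sketch of how the HVite step might be driven from a script; the file names (models.mmf, wdnet, dict, hmmlist, char.feat, recout.mlf) are placeholders, not the project's actual layout.

import subprocess

cmd = [
    "HVite",
    "-H", "models.mmf",   # master macro file holding the HMM definitions
    "-w", "wdnet",        # word network built from the task grammar
    "-i", "recout.mlf",   # recognition output (master label file)
    "dict",               # dictionary defining each character or word
    "hmmlist",            # list of all HMM names
    "char.feat",          # feature file of the segmented character image
]
subprocess.run(cmd, check=True)
# recout.mlf now holds the recognized model names, which are then mapped
# to their Unicode characters and written to the output file.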

The final task BanglaOCR performs is post-processing. This involves a suggestion-based spell checker capable of identifying erroneous words and producing suggestions. It takes the recognizer's output file as input and produces an output file that marks the erroneous words and provides up to a fixed number of suggestions for each.
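The post does not say how the suggestions are generated; a generic edit-distance ranking is one plausible realization, sketched below (the lexicon and the cut-off of five suggestions are assumptions).

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def suggest(word, lexicon, max_suggestions=5):
    # Rank lexicon entries by their distance to the misrecognized word.
    return sorted(lexicon, key=lambda w: edit_distance(word, w))[:max_suggestions]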

The long-term goal of BanglaOCR is to develop a production-quality multilingual OCR system capable of digitizing a wide range of Bangla document images. This will help archive documents from all spheres and prevent the damage and loss of valuable documents and books.


Team::

Status::

  • The first version of open source BanglaOCR has been released under the GNU General Public License (GPL), version 2 or later.
  • Research on the different parts of BanglaOCR continues, to enhance performance and usability.


Research Scope::

  • Research on feature extraction for Bengali characters.
  • Research on proper segmentation of Bengali characters from any type of document image.
  • Research on preprocessing of historical Bangla document images.
  • Research on Bangla handwriting recognition.
  • Research on training and recognition using different techniques.
  • Research on multilingual OCR.

Development Scope::

  • Implement the existing versions in other programming languages.

Timeline:: 2007 – 2009.

Character Recognition


Character recognition techniques associate a symbolic identity with the image of a character. Character recognition is commonly referred to as optical character recognition (OCR), as it deals with the recognition of optically processed characters. The modern version of OCR appeared in the middle of the 1940s with the development of digital computers. OCR machines have been commercially available since the middle of the 1950s. Today OCR systems are available both as hardware devices and as software packages, and a few thousand systems are sold every week.

In a typical OCR system, input characters are digitized by an optical scanner. Each character is then located and segmented, and the resulting character image is fed into a preprocessor for noise reduction and normalization. Certain characteristics are then extracted from the character for classification. Feature extraction is critical, and many different techniques exist, each having its strengths and weaknesses. After classification, the identified characters are grouped to reconstruct the original symbol strings, and context may then be applied to detect and correct errors.

Optical character recognition has many different practical applications. The main areas where OCR has been of importance are text entry (office automation), data entry (banking environment) and process automation (mail sorting).

The present state of the art in OCR has moved from primitive schemes for limited character sets to more sophisticated techniques for omnifont and handprint recognition. The main problems in OCR usually lie in the segmentation of degraded symbols which are joined or fragmented. Generally, the accuracy of an OCR system is directly dependent upon the quality of the input document. Three figures are used in rating OCR systems: correct classification rate, rejection rate and error rate. Performance should be rated by the system's error rate, as these errors go undetected by the system and must be located manually for correction.

In spite of the great number of algorithms that have been developed for character recognition, the problem is not yet solved satisfactorily, especially when there are no strict limitations on the handwriting or the quality of print. As yet, no recognition algorithm can compete with humans in quality. However, since an OCR machine can read much faster, it remains attractive.

In the future, the area of recognition of constrained print is expected to shrink, and the emphasis will shift to the recognition of unconstrained writing, such as omnifont text and handwriting. This is a challenge which requires improved recognition techniques. The potential for OCR algorithms seems to lie in combining different methods and in using techniques that can exploit context to a much larger extent than current methodologies. In addition, documents intended chiefly for machine reading may instead be exchanged electronically or printed in a more computer-readable form, for instance as barcodes.

The applications for future OCR-systems lie in the recognition of documents where control over the production process is impossible. This may be material where the recipient is cut off from an electronic version and has no control of the production process or older material which at production time could not be generated electronically. This means that future OCR-systems intended for reading printed text must be omnifont.

Another important area for OCR is the recognition of manually produced documents.

Within postal applications for instance, OCR must focus on reading of addresses on mail produced by people without access to computer technology. Already, it is not unusual for companies etc., with access to computer technology to mark mail with barcodes. The relative importance of handwritten text recognition is therefore expected to increase.

[Source]: Line Eikvil, "Optical Character Recognition", available at: citeseer.ist.psu.edu/142042.html.

Tuesday, July 31, 2007

Related Works: Page Layout analysis for Bangla Document Image


A. Ray Chaudhuri, A. K. Mandal and B. B. Chaudhuri in [1] present a page layout analyzer for multilingual Indian documents, more specifically for Brahmi-script-based documents. The headline (matra in Bangla, sirorekha in Hindi) is identified as the main feature of these documents. They apply a bottom-up approach to page layout analysis. Text regions are extracted as rectangular blocks containing moderately sized connected components that satisfy homogeneity or compatibility in terms of the headline and other (textual) features particular to the scripts considered. The degree of homogeneity is calculated from a compatibility criterion over selected features. Textual regions or blocks are generated from the optimal bounding boxes (BBs) of all connected components.

In their work, the likelihoods of a headline in a word and of a vertical line in a word are first calculated from statistical analysis of a large set of documents. The primary features are extracted from the BBs, viz. corners, headline (position and thickness), the vertical line in the middle zone of the UMD, the lengthwise distribution of the UMD, and the average number of object pixels per BB, i.e. the density of the components. From these primary features some secondary features are calculated: the average pixel density of each component, the average vertical span of M within a block, the average horizontal gap among closest pairs of BBs, the inter-headline distances between vertically adjacent BBs, the inter-block distance (both horizontal and vertical), and the ratio of the number of components between adjacent blocks. The compatibility criterion for any two scalar quantities L1 and L2 is defined as:

Comp (L1, L2) = ½(|1 – L1/L2| + |1 – L2/L1|)
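Comp is 0 when the two quantities are equal and grows as they diverge, so a small threshold on it expresses "nearly equal". A tiny worked example:

def comp(l1, l2):
    # Compatibility of two positive scalar features; 0 means identical.
    return 0.5 * (abs(1 - l1 / l2) + abs(1 - l2 / l1))

print(comp(10, 12))  # ~0.183: nearly compatible
print(comp(10, 40))  # 1.875: incompatible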

The entire process is completed in three modules. The first module constructs blocks using the primary features; the constructed blocks and the remaining BBs are labeled into seven categories (C1 - C7). The second module handles the case where one block (Fin) resides completely or partially within another (Fout). The third module performs block merging, which applies only to those blocks not yet labeled as text (but which actually are text, e.g. title text).

S. Khedekar, V. Ramanaprasad, S. Setlur and V. Govindaraju in [2] present a top-down, projection-profile based algorithm to separate text blocks from image blocks in a Devanagari document. They use a distinctive feature, the Shirorekha (header line), to analyze the pattern. The horizontal profile corresponding to a text block possesses a certain regularity in frequency and orientation and shows spatial cohesion; their algorithm uses these features to identify text blocks.

The algorithm first generates the horizontal histogram of the entire image. When the horizontal histogram of the document image is plotted, the rows where the Shirorekha is present have the maximum number of black pixels. The patterns (profiles) formed by text blocks are characterized by ranges of thick black peaks separated by white gaps. Any graphics or images, by contrast, have a relatively uniform pattern with no prominent peaks. Another main feature of a text block is that its histogram possesses a certain frequency and orientation and shows spatial cohesion, i.e. adjacent lines are of similar height. Since text lines appear adjacent to each other, adjacent patterns of identical shape and size distribution are sought. If such a profile is found, that portion of the document must contain Devanagari text; any deviation from this characteristic must be an image, with or without surrounding text.
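A minimal sketch of this projection-profile idea, assuming a binary image with 1 = black; the choice of half the maximum row sum as the peak threshold is an assumption for illustration.

import numpy as np

def headline_rows(binary_img, peak_ratio=0.5):
    profile = binary_img.sum(axis=1)             # horizontal projection
    threshold = peak_ratio * profile.max()       # assumed peak threshold
    return np.flatnonzero(profile >= threshold)  # candidate Shirorekha rows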

Reference:

[1]. A. Ray Chaudhuri, A. K. Mandal, B. B. Chaudhuri, “Page Layout Analyzer for Multilingual Indian Documents”, Proc. of the Language Engineering Conference (LEC’02), 2002.

[2]. S. Khedekar, V. Ramanaprasad, S. Setlur, V. Govindaraju, “Text -image separation in Devanagari documents", Proc. of Seventh International Conference on Document Analysis and Recognition, 2003.

Wednesday, July 4, 2007

Planning for the Development of a Database of Document Images for Research on Bangla Optical Character Recognition


First we need to complete a basic study of the existing, established databases of document images used for research. The most widely used database for research on document image analysis and recognition is the University of Washington database, whose creation started around 1990. Papers related to this database were published in 1993 at the Second International Conference on Document Analysis and Recognition. We therefore need to know the details of the techniques used to create this database; some related papers are listed in the study material section below.

Next we have to look at existing databases of Bangla document images and their availability. I have found that a similar task has been done at the "Resource Centre for Indian Language Technology Solutions Bangla", with a limited variety of documents taken from Bangla novels. I tried to get access to that resource, but unfortunately failed. So we need to build a database with a large variety of documents and distribute it openly.

The following are the basics of the UW databases.

University of Washington English Document Image Database

A series (UW-I, UW-II, UW-III) of document image databases produced by the Intelligent Systems Laboratory at the University of Washington, Seattle, Washington, USA.

The most widely used database for current research and performance evaluation is the University of Washington III (UW-III) database. It consists of 1600 English document images with bounding boxes for 24177 homogeneous page segments or blocks, which are manually labeled into different classes depending on their contents, making the data very suitable for evaluating a block classification system. The documents in the UW-III dataset are categorized by degradation type as follows:

1. Direct scans of original English journals

2. Scans of first generation English journal photocopies

3. Scans of second or later generation English journal photocopies

Many of the documents in the dataset are duplicated, sometimes differing only in the degradation applied to them. This type of collection is useful when evaluating a page segmentation algorithm, to see how well the algorithm performs when photocopy degradation is applied to a document.

UW – I

Contains 1147 document page images from English scientific and technical journals, comprising:

  • Binary images scanned from 1st and other generation photocopies
  • Binary and grayscale images scanned directly from technical journals
  • Synthetic noise-free images generated from LaTeX files
  • Document images from UNLV ISRI database
  • All document images zoned and tagged
  • Software for OCR performance evaluation
  • Software for simulation of photocopy degradation
  • Text ground truth generated by two independent data-entry operators, followed by three independent verifications

Each document page has associated with it

  • Text ground truth data for each text zone
  • Bounding box information for each zone on the page
  • Coarse level attributes for each document page
  • Finer level attributes (such as font size, alignment etc.) for each zone
  • Qualitative information on the condition of each page
  • AT&T Bell Labs degraded character images database

UW – II

Contains the following:

  • 624 English journal document pages
    • 43 Complete articles
    • 63 MEMO pages
  • 477 Japanese journal document pages
  • Illuminator software produced by RAF
  • Corrected data files for the known errors in UW-I

Each document page has associated with it

  • Text ground truth data for each text zone
  • Bounding box information for each zone on the page
  • Coarse level attributes for each document page
  • Finer level attributes (such as font size, alignment etc.) for each zone
  • Qualitative information on the condition of each page

Illuminator is an editor for building document image understanding test and training sets, for easy correction of OCR errors, and for reverse-encoding the essential information and attributes of a document image page. It was developed by RAF Technology, Inc., under the auspices of ARPA.

UW – III

Contains the following:

  • 25 pages of chemical formulae with ground truth in XFIG
  • 25 pages of mathematical formulae with ground truth in XFIG and LaTeX
  • 40 pages of line drawings with ground truth in AUTOCAD and IGES
  • 44 TIFF images of engineering drawings, ground-truthed in AUTOCAD
  • 33 TIFF images of English text generated from LaTeX, with character-level ground truth
  • 33 printed and scanned pages containing English text, with character-level ground truth
  • 979 pages from UW-I in DAFS format, corrected for skew, with bounding boxes for each word
  • 623 pages from UW-II in DAFS format, corrected for skew, with word bounding boxes on 607 of these pages
  • Software for:

- Generating ground-truth for synthetic documents;

- Synthetically degrading document images;

- Registering ideal document images to real document images;

- Transferring character ground-truth between a pair of registered images;

- Estimating document degradation model parameters;

- Validating degradation models;

- Formatting Arabic, Hindi, and music documents in LaTeX;

- Testing hypotheses about parameters from a multivariate Gaussian population

- TIFF libraries (public domain software from ftp.sgi.com)

Required Study Materials:

Research Papers:

[1]

I. T. Phillips, S. Chen, J. Ha, and R. M. Haralick, "English document database design and implementation methodology", Proc. of the Second Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, April 1993, pp. 65-104.

[2]

Guyon, I., Haralick, R. M., Hull, J. J., and Phillips, I. T. (1997). Data sets for OCR and document image understanding research. In Bunke, H. and Wang, P., editors, Handbook of character recognition and document image analysis, pages 779–799. World Scientific, Singapore.

[3]

Phillips, Ihsin T.; Ha, Jaekyu; Chen, Su; Haralick, Robert M., "Implementation methodology and error analysis for the University of Washington English Document Image Database-I", Proc. SPIE vol. 2103, pp. 155-173, 22nd AIPR Workshop: Interdisciplinary Computer Vision: Applications and Changing Needs.

Abstract: Document database preparation is a very time-consuming job and usually requires the involvement of many people. Any database is prone to having errors however carefully it was constructed. To assure the high quality of the document image database, a carefully planned implementation methodology is absolutely needed. In this paper, an implementation methodology that we employed to produce the UW English Document Image Database I is described. The paper also discusses how to estimate the distribution of errors contained in a database based on a double-data entry/double verification procedure.

[4]

Bhattacharya, U.; Chaudhuri, B. B., "Databases for research on recognition of handwritten characters of Indian scripts", Proc. Eighth International Conference on Document Analysis and Recognition (ICDAR 2005), vol. 2, pp. 789-793.

Abstract: Three image databases of handwritten isolated numerals of three different Indian scripts, namely Devnagari, Bangla and Oriya, are described in this paper. Grayscale images of 22556 Devnagari numerals written by 1049 persons, 12938 Bangla numerals written by 556 persons and 5970 Oriya numerals written by 356 persons form the respective databases. These images were scanned from three different kinds of handwritten documents: postal mail, job application forms, and another set of forms specially designed by the collectors for this purpose. The only restriction imposed on the writers is to write each numeral within a rectangular box. These databases are free from the usual limitations: they were not developed in laboratory environments, nor are they non-uniformly distributed over the different classes. Also, for comparison purposes, each database has been properly divided into training and test sets.

Web Links:

[1] http://www.isical.ac.in/~ujjwal/download/database.html [Information about the Bangla Character Recognition database.]

[2] http://www.science.uva.nl/research/dlia/datasets/uwash1.html [UW English Document Image Database-I]

[3] http://documents.cfar.umd.edu/resources/database/UWII.html [UW-II English/Japanese Document Image Database]

[4] http://documents.cfar.umd.edu/resources/database/3UWCdRom.html [UW-III English/Technical Document Image Database]

[5] http://www.nist.gov/srd/ [NIST Scientific and Technical Database]

[6] http://www.cfar.umd.edu/~kia/ocr-faq.html#SECTION00022000000000000000 [OCR Frequently Asked Questions.]

[7] http://www.isical.ac.in/~rc_bangla/products.html#ocr [Resource Centre for Indian Language Technology Solutions Bangla]

Feature based approach for Page Layout Extraction


This post is essentially a summary of the paper "Document Image Zone Classification: A Simple High-Performance Approach" by Daniel Keysers, Faisal Shafait and Thomas M. Breuel.

Introduction

The paper addresses this shortcoming by comparing a large set of commonly used features for block classification, and includes in the comparison three features that are known to yield good performance in content-based image retrieval (CBIR) and are applicable to binary images (Deselaers et al., 2004). Interestingly, the single feature with the best performance turns out to be the Tamura texture histogram, which belongs to this latter class.


Related Work and Contribution

Inglis and Witten (1995) used seven features based on connected components and run lengths, and applied various machine learning techniques to the problem. In the survey by Okun et al. (1999), the predominant feature type is based on connected components and run-length statistics. Other features used include the cross-correlation between scan-lines, vertical projection profiles, wavelet coefficients, learned masks, and the black pixel distribution. The most common classifier used is a neural network.

The widespread use of features based on connected components and run-length statistics, combined with the simplicity of implementing such features, led the authors to use these feature types in their experiments as well, comparing them to features used in content-based image retrieval. The CBIR features are based on the open source image retrieval system FIRE (Deselaers et al., 2004). The analysis for zone classification is restricted to those features that are promising for binary images.

The most recent and detailed overview of the progress in document zone classification and a very accurate system is presented in (Wang et al., 2006). The authors use a decision tree classifier and model contextual dependencies for some zones. In our work we do not model zone context, although it is likely that a context model (which can be integrated in a similar way as presented by Wang et al.) would help the overall classification performance.

We expand on the work presented in (Wang et al., 2006) in the following ways:

• We include a detailed feature comparison including a comparison with commonly used CBIR features. It turns out that the single best feature is the Tamura texture histogram which was not previously used for zone classification.

• We present results both for a simple nearest neighbor classifier and for a very fast linear classifier based on logistic regression and the maximum entropy criterion.

• We introduce a new class of blocks containing speckles that has not been labeled in the UW-III database. This typical class of noise is important to detect during the layout analysis especially for images of photocopied documents.

• We present results for the part of the UW-III database without using duplicates and achieve a similar error rate of 1.5%.

• We introduce the use of histograms for the measurements of connected components and run lengths and show that this leads to a performance increase.


Feature Extractions

The following features are extracted from each block. Features 1-3 are chosen based on their performance in CBIR (Deselaers et al., 2004), feature 4 was expected to help distinguish between the types 'drawing' and 'text', and features 5-9 were chosen based on their common use in block classification (Okun et al., 1999; Wang et al., 2006). Due to space limitations the interested reader is referred to the references for implementation details.

1. Tamura texture features histogram (TTFH)

2. Relational invariant feature histograms (RIFH)

3. Down-scaled images of size 32×32 (DSI)

4. The fill ratio, i.e. the ratio of the number of black pixels in a horizontally smeared (Wong et al., 1982) image to the area of the image (FR)

5. Run-length histograms of black and white pixels along the horizontal, vertical, main diagonal, and side diagonal directions; each histogram uses eight bins with power-of-2 spacing, i.e. counting runs of length ≤ 1, 3, 7, 15, 31, 63, 127 and ≥ 128 (RL{B,W}{X,Y,M,S}H); a sketch of this binning appears after this list

6. The vector formed by the total number, mean, and variance of the runs of black and white pixels along the horizontal, vertical, main diagonal, and side diagonal directions as used in (Wang et al., 2006) (RL{B,W}{X,Y,M,S}V)

7. Histograms (as in 5) of the widths and heights of connected components (CCXH, CCYH)

8. The joint distribution of the widths and heights of connected components as a 2-dimensional 64-bin histogram (CCXYH)

9. The histogram of the distances between a connected component and its nearest neighbor component (CCNNH)
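As an illustration of feature 5, here is a minimal sketch of a run-length histogram with the power-of-2 binning, shown for one direction (horizontal) and one color (black); normalizing the histogram is an assumption.

import numpy as np

def run_lengths(row, value=1):
    # Lengths of maximal runs of `value` in a 1-D binary array.
    runs, count = [], 0
    for px in row:
        if px == value:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

def runlength_histogram(block, value=1):
    # Eight bins with power-of-2 edges: <=1, <=3, <=7, ..., <=127, >=128.
    edges = [1, 3, 7, 15, 31, 63, 127]
    hist = np.zeros(8)
    for row in block:
        for r in run_lengths(row, value):
            hist[np.searchsorted(edges, r)] += 1
    return hist / max(hist.sum(), 1)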

Fig: Examples of document image block types distinguished in our approach

Classifications

To evaluate the various features, a simple nearest neighbor classifier is used: a test sample is assigned to the class of the closest training sample. The distance measures used are the Jensen-Shannon divergence for histograms and the Euclidean distance for all other features (Deselaers et al., 2004). If different feature sets are combined, the overall distance is calculated as the weighted sum of the individual normalized distances, with weights proportional to the inverse of the error rate of each feature.
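A minimal sketch of this evaluation scheme, assuming each sample is a dict mapping feature names to normalized histograms; the per-feature distance normalization mentioned above is glossed over here.

import numpy as np

def jensen_shannon(p, q):
    # Jensen-Shannon divergence between two normalized histograms.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        nz = a > 0
        return float(np.sum(a[nz] * np.log2(a[nz] / b[nz])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def combined_distance(x, y, error_rates):
    # Weighted sum of per-feature distances; weights are proportional
    # to the inverse error rate of each individual feature.
    w = {k: 1.0 / e for k, e in error_rates.items()}
    total = sum(w.values())
    return sum(w[k] / total * jensen_shannon(x[k], y[k]) for k in x)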


Source: Daniel Keysers, Faisal Shafait, Thomas M. Breuel, "Document Image Zone Classification: A Simple High-Performance Approach", VISAPP 2007, pp. 44-51.

Sunday, July 1, 2007

Enhanced Constrained Run-Length Algorithm


Introduction

This enhanced CRLA is a solution to the limitations of CRLA: it can be applied successfully to non-Manhattan page layouts (irregularly shaped graphics embedded in text), and it can also extract text surrounded by a box. Neither case can be processed by the original CRLA.

Preprocess

The proposed new version of CRLA is performed on a label image, which is derived from the input document image and has the same size. To generate the label image, a connected component labeling algorithm locates the foreground components in the document image and calculates their sizes. The label image is then generated by assigning each pixel a label according to the size of the foreground component it belongs to. Three labels are defined in the present system, based on two threshold values (thVal1 and thVal2), as follows:

(1) If the height of the foreground component is less than thVal1, its pixels in the label image are assigned the label of “1”.

(2) If the height of the foreground component is between thVal1 and thVal2, its pixels in the label image are assigned the label of “2”.

(3) If the height of the foreground component is larger than thVal2, its pixels in the label image are assigned the label of "3".

The height is used instead of the width when checking the component size because the text on the document is assumed to be horizontally aligned. As the characters in a text line may be connected in the scanned document image, checking the width could lead to misjudging the component size.
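A minimal sketch of the label-image generation using scipy's connected-component labeling; treating the thVal1 bound as strict and the thVal2 bound as inclusive is an assumption, since the boundary cases are left open above.

import numpy as np
from scipy import ndimage

def make_label_image(binary_img, th1, th2):
    # binary_img: 2-D array, 1 = foreground pixel.
    comps, n = ndimage.label(binary_img)          # connected components
    label_img = np.zeros(binary_img.shape, np.uint8)
    for i, sl in enumerate(ndimage.find_objects(comps), start=1):
        height = sl[0].stop - sl[0].start         # component height
        lab = 1 if height < th1 else (2 if height <= th2 else 3)
        label_img[sl][comps[sl] == i] = lab
    return label_img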

After the label image is produced, the new CRLA is executed on it in the following manner. Consider, for example, the label string below:

110001110002003330000110000000111.

By setting a constraint C = 5 on the run length of zeros that are surrounded by label 1 on both sides, if the length of a run of consecutive zeros is not longer than C, those zeros are replaced by ones. Under this operation, the preceding string is converted into

111111110002003330000110000000111.

This operation is referred to as selective-CRLA{1}, where "selective" means it is applied only to those zeros surrounded by the specific label(s) given in the braces. The selective CRLA is performed with a two-pass processing scheme on the label image to separate text from graphics and halftone-image components.
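The smearing rule can be written compactly; the sketch below reproduces the example above (representing a label-image row as a string is purely for illustration).

import re

def selective_crla(labels, allowed, C):
    # Zeros bounded on both sides by a label in `allowed`, in runs no
    # longer than C, are smeared to '1'.
    out = list(labels)
    for m in re.finditer(r"0+", labels):
        i, j = m.span()
        if (j - i <= C and i > 0 and j < len(labels)
                and labels[i - 1] in allowed and labels[j] in allowed):
            out[i:j] = "1" * (j - i)
    return "".join(out)

print(selective_crla("110001110002003330000110000000111", {"1"}, 5))
# -> 111111110002003330000110000000111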

Process

First Pass

The process of the first pass uses the following steps to accomplish page segmentation.

1. Apply selective-CRLA{1} row by row to the label image using constraint Chor-1.

2. Apply selective-CRLA{1} column by column to the label image using a constraint Cver-1.

3. Combine the images yielded by steps 1 and 2 using a logical AND operation.

4. Apply an additional selective- CRLA{1} row by row to the image yielded by step 3 using a constraint Csm-1.

The purposes of steps 1 and 2 are to link the foreground components horizontally and vertically. Step 3 cuts the joined foreground areas according to the space between columns and text lines. Finally, step 4 removes the small gaps that may exist between characters in a text line. The parameters Chor-1, Cver-1 and Csm-1 are chosen according to an analysis of documents (see the earlier post on determining run-length smoothing values for details).

After dividing the document image into homogeneous regions, these regions must be further classified as text or non-text according to their features. A number of statistical measurements have been proposed for distinguishing between text and non-text regions in binary document images. Two of them are adopted in the present system: (i) the mean length of horizontal black runs and (ii) the white-black transition count per unit width. These two features are calculated by scanning a region from left to right, row by row. As the scanning proceeds, the horizontal black run length is accumulated in a counter, BRL, and the white-black transition count in another counter, TC. After the whole region is scanned, the two features are computed as follows. Mean length of horizontal black runs:

MBRL = BRL/TC -- (1)

White-black transition count per unit width

MTC = TC/W -- (2)

where W is the width (in pixels) of the region under consideration. If MBRL_min <= MBRL <= MBRL_max and MTC_min <= MTC <= MTC_max, the region is determined to be text. The values of MBRL_min, MBRL_max, MTC_min and MTC_max in the present system are determined from tests performed on several document images.
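A direct transcription of this scan into code, assuming the region is a binary numpy array with 1 = black:

def text_region_features(region):
    # Scan row by row, accumulating the black run length (BRL) and the
    # white-to-black transition count (TC), then apply eqs. (1) and (2).
    BRL = TC = 0
    for row in region:
        prev = 0
        for px in row:
            if px == 1:
                BRL += 1
                if prev == 0:
                    TC += 1
            prev = px
    W = region.shape[1]
    MBRL = BRL / TC if TC else 0.0   # eq. (1)
    MTC = TC / W                     # eq. (2)
    return MBRL, MTC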


The text regions extracted by the first pass are removed from the label image and the corresponding areas are filled with zeros, i.e. white.

Second Pass

The algorithm used in the second pass is similar to that used in the first pass, except for some changes in parameter settings. Specifically, the following procedure is applied to the label image output from the first pass.

1. Apply selective-CRLA{1, 2} row by row to the label image using a constraint Chor-2.

2. Apply selective-CRLA{1, 2} column by column to the label image using a constraint Cver-2.

3. Combine the images yielded by steps 1 and 2 using a logical AND operation.

4. Apply an additional selective- CRLA{1, 2} row by row to the image yielded by step 3 using a constraint Csm-2.


Again, the values of the parameters Chor-2, Cver-2 and Csm-2 are chosen according to an analysis of documents (see the earlier post on determining run-length smoothing values for details).


To identify the text regions, the features and rules used in the first pass are employed again, but with new values of MBRL_min, MBRL_max, MTC_min and MTC_max, also determined from tests performed on several document images.

Discussion

The strength of the selective CRLA lies in its use of global component information (i.e. component size), which makes the run-length smearing process more precise than the original CRLA. In the first pass, the selective CRLA aims to extract body text, which usually has a smaller character size, and it can prevent erroneous linkage of text and non-text regions. In the second pass, the pixel labels and parameter settings are relaxed in order to join large headline characters and text lines that have wide inter-character spacing.

Source:

Hung-Ming Sun, 2006, "Enhanced Constrained Run-Length Algorithm for Complex Layout Document Processing", International Journal of Applied Science and Engineering 2006.4, 3: 297-309

Constrained Run-Length Algorithm (CRLA)


The Constrained Run-Length Algorithm (CRLA), also known as the Run-Length Smoothing/Smearing Algorithm (RLSA), is a well-known technique for segmenting a document image into homogeneous regions. The algorithm is very efficient for partitioning documents with Manhattan layouts (i.e. where the text/graphics/halftone-image regions are separable by horizontal and vertical line segments, for example two-column text with rectangle-aligned graphics and halftone images), but it is not suited to complex layout pages, e.g. irregular graphics embedded in a text paragraph. Its main drawback is that it uses only local information during the smearing stage, which may lead to erroneous linkage of text and graphics.

The previous two posts briefly describe the Constrained Run-Length Algorithm (CRLA) or Run Length Smoothing Algorithm (RLSA) and its related issue.


Source :

[1] Hung-Ming Sun, 2006, "Enhanced Constrained Run-Length Algorithm for Complex Layout Document Processing", International Journal of Applied Science and Engineering 2006.4, 3: 297-309

Wednesday, June 27, 2007

Determination of Run-Length Smoothing Values For Document Segmentation


Let's start with a review of the Run Length Smoothing Algorithm (RLSA). The RLSA usually has three main stages. First, RLSA is applied to the original image horizontally, row by row, using hsv, and then vertically, column by column, using vsv. After horizontal and vertical smoothing we have two bit-maps, which are then combined by a logical AND operation to produce a new smoothed image. This image has small gaps that interrupt blocks of text lines, so an additional horizontal smoothing operation is performed using a new suitable smoothing value, ahsv.

The smoothing values used in run-length based segmentation methods vary widely depending on the document structure (both physical and logical layout) and the text types in the document. More specifically, they depend on character size and font, which may in turn vary depending on whether a document uses a single font or multiple fonts. The smoothing methods use heuristic techniques to determine the values of the smoothing variables.

According to the work of Wong et al. (1982), the smoothing values hsv and vsv depend on "the number of pixels covering the length of long words, whereas ahsv should be set to cover a few character widths". We have found that almost all papers that use or discuss the RLSA technique apply their own ad hoc values as smoothing values; they do not propose an analytical method for calculating them.

This post describes an algorithm for the automated calculation of proper horizontal and vertical smoothing values. Specifically, the horizontal smoothing value (hsv) and the vertical smoothing value (vsv) are calculated from the mean character length (mcl) and the mean text line distance (mtld) of the document. The values of mcl and mtld are determined using the distributions of horizontal and vertical run lengths.

The proposed method is based on pre-estimating the mean character length and the mean text line distance of a document. To do this, a technique is developed that uses the horizontal and vertical distributions of white and black runs. The authors consider that a proper value for hsv is twice the mcl, while in the vertical direction the proper value of vsv is the mtld of the document. That is:

hsv = 2 mcl ( 1 )

vsv = mtld (2)

To obtain proper values of mcl and mtld, not only the white runs but also the black runs are examined in the vertical and horizontal directions. Specifically, three histograms are calculated initially, giving the distributions of black runs in the horizontal and vertical directions and of white runs in the horizontal direction.

The global maximum of the histogram of horizontal black runs (gmhbr) is formed mainly by parts of characters, so it carries information about the mean character length. The mcl is observed to lie in a region obtained by multiplying the gmhbr by two suitable coefficients m1 and m2. That is:

mcl ∈ [int(m1 · gmhbr), intU(m2 · gmhbr)] --- (3)

where int(x) is the integer part of x, and

intU(x) = x if x is an integer, and intU(x) = int(x) + 1 otherwise --- (4)

On the other hand, the mean character height (mch) is known to be close to the mcl. The mch corresponds to a local or global maximum in the histogram of vertical black runs, and this maximum is accepted to lie in the region defined by (3).

Therefore, the proper values of the coefficients m1 and m2 must be determined. To do this the authors examined a large number of documents, after which the following estimates are made (a sketch of the full estimation follows this list):
  • In a readable document the mtld is always greater than int(0.8 mcl); this is the lower limit for mtld.
  • The upper limit of mtld can be large, and it is taken to be 80. So the mtld is determined as the position of the global histogram maximum in the range [int(0.8 mcl), 80].
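A sketch of the whole estimation under stated assumptions: m1 and m2 are placeholders for the coefficients the authors fix from their document study, and, since the post does not say which histogram the mtld maximum is taken from, vertical white runs (inter-line gaps) are used here as one plausible choice.

import numpy as np

def run_length_hist(img, axis, value, max_len=200):
    # Histogram of run lengths of `value` along rows (axis=1) or columns (axis=0).
    hist = np.zeros(max_len + 1, dtype=int)
    for line in (img if axis == 1 else img.T):
        count = 0
        for px in list(line) + [1 - value]:   # sentinel closes the final run
            if px == value:
                count += 1
            else:
                if 0 < count <= max_len:
                    hist[count] += 1
                count = 0
    return hist

def estimate_smoothing_values(img, m1=0.5, m2=1.5):
    gmhbr = 1 + int(np.argmax(run_length_hist(img, axis=1, value=1)[1:]))
    v_black = run_length_hist(img, axis=0, value=1)
    lo, hi = int(m1 * gmhbr), int(np.ceil(m2 * gmhbr))
    mcl = lo + int(np.argmax(v_black[lo:hi + 1]))   # maximum inside region (3)
    v_white = run_length_hist(img, axis=0, value=0)
    lo2 = int(0.8 * mcl)
    mtld = lo2 + int(np.argmax(v_white[lo2:81]))    # maximum in [int(0.8 mcl), 80]
    return 2 * mcl, mtld                            # hsv = 2*mcl, vsv = mtld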
Source: N. Papamarkos, J. Tzortzakis and B. Gatos, "Determination of Run-Length Smoothing Values for Document Segmentation", Third IEEE Int. Conf. on Electronics, Circuits, and Systems (ICECS '96), vol. 2, pp. 684-687.

Tuesday, June 26, 2007

Run Length Smoothing Algorithm (RLSA)


The Run Length Smoothing Algorithm (RLSA) is a method that can be used for block segmentation and text discrimination. The method, developed for the Document Analysis System, consists of two steps. First, a segmentation procedure subdivides the area of a document into regions (blocks), each of which should contain only one type of data (text, graphics, halftone image, etc.). Next, some basic features of these blocks are calculated.

The basic RLSA is applied to a binary sequence in which white pixels are represented by 0’s and black pixels by 1’s. The algorithm transforms a binary sequence x into an output sequence y according to the following rules:

1. 0’s in x are changed to 1’s in y if the number of adjacent 0’s is less than or equal to a predefined limit C.
2. 1’s in x are unchanged in y.

For example, with C = 4 the sequence x is mapped into y as follows:

x : 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
y : 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1

When applied to pattern arrays, the RLSA has the effect of linking together neighboring black areas that are separated by less than C pixels. With an appropriate choice of C, the linked areas will be regions of a common data type.

The RLSA is applied row by row as well as column by column to a document, yielding two distinct bit-maps. Because the spacing of document components tends to differ horizontally and vertically, different values of C are used for row and column processing. The two bit-maps are then combined in a logical AND operation, and additional horizontal smoothing with the RLSA produces the final segmentation result.
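Putting the pieces together, a minimal sketch of this pipeline, assuming binary numpy arrays with 1 = black; note that, matching the worked example above, a short 0-run is smeared even at the start or end of a line.

import numpy as np

def rlsa_1d(bits, C):
    # Rule 1: 0-runs of length <= C become 1s; rule 2: 1s are unchanged.
    out = bits.copy()
    i, n = 0, len(bits)
    while i < n:
        if bits[i] == 0:
            j = i
            while j < n and bits[j] == 0:
                j += 1
            if j - i <= C:
                out[i:j] = 1
            i = j
        else:
            i += 1
    return out

def rlsa(img, c_row, c_col, c_extra):
    # Row-wise and column-wise smearing, logical AND, then the extra
    # horizontal smoothing pass.
    horiz = np.array([rlsa_1d(r, c_row) for r in img])
    vert = np.array([rlsa_1d(c, c_col) for c in img.T]).T
    return np.array([rlsa_1d(r, c_extra) for r in (horiz & vert)])

x = np.array([0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0])
print(rlsa_1d(x, 4))  # reproduces the y sequence of the example above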

Figure 1: (a) Block segmentation example of a mixed text/image document (the original digitized document). (b), (c) Results of applying the RLSA in the horizontal and vertical directions. (d) Final result of block segmentation. (e) Blocks considered to be text data (class 1).

The blocks shown in Fig. 1(d) must next be located and classified according to content. To provide identification for subsequent processing, a unique label is assigned to each block. Simultaneously with block labeling, the following measurements are taken:
  • Total number of black pixels in a segmented block (BC).
  • Minimum x-y coordinates of a block and its x, y lengths (Xmin, Ymin, delX, delY).
  • Total number of black pixels in original data from the block (DC).
  • Horizontal white-black transitions of original data (TC).
These measurements are stored in a table and are used to calculate the following features:
  1. The height of each block segment: H = delY.
  2. The eccentricity of the rectangle surrounding the block: E = delX/delY.
  3. The ratio of the number of block pixels to the area of the surrounding rectangle: S = BC/(delX * delY). If S is close to one, the block segment is approximately rectangular.
  4. The mean horizontal length of the black runs of the original data from each block: Rm = DC/TC.
The mean value of block height, Hm, and the block mean black pixel run length, Rm, for the text cluster may vary for different types of documents, depending on character size and font. Furthermore, the text cluster's standard deviations sigma(Hm) and sigma(Rm) may also vary depending on whether a document is in a single font or in multiple fonts and character sizes. To permit self-adjustment of the decision boundaries for text discrimination, estimates are calculated for the mean values Hm and Rm from blocks in a tightly defined text region of the R-H plane. Additional heuristic rules are applied to confirm that each such block is likely to be text before it is included in the cluster. The members of the cluster are then used to estimate new bounds on the features to detect additional text blocks. Finally, a variable, linear, separable classification scheme assigns blocks to the following four classes:

Class 1 (Text):
R < C_21 Rm and H < C_22 Hm.

Class 2 (Horizontal solid black lines):
R > C_21 Rm and H < C_22 Hm.

Class 3 (Graphic and halftone images):
E > 1/C_23 and H > C_22 Hm.

Class 4 (Vertical solid black lines):
E < 1/C_23 and H > C_22 Hm.

Values have been assigned to the parameters based on several training documents. With C_21 = 3, C_22 = 3 and C_23 = 5, the outlined method has been tested on a number of test documents with satisfactory performance.
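The four rules translate directly into code; Hm and Rm here are the text-cluster estimates described above, and returning 0 for a block that matches no rule is an assumption.

def classify_block(H, E, R, Hm, Rm, C21=3, C22=3, C23=5):
    if R < C21 * Rm and H < C22 * Hm:
        return 1  # text
    if R > C21 * Rm and H < C22 * Hm:
        return 2  # horizontal solid black line
    if E > 1 / C23 and H > C22 * Hm:
        return 3  # graphic or halftone image
    if E < 1 / C23 and H > C22 * Hm:
        return 4  # vertical solid black line
    return 0      # unclassified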

There are certain limitations to the block segmentation and text discrimination method described so far. On some documents, text lines are linked together by the block segmentation algorithm due to small line-to-line spacing, and are thus assigned to class 3. A line of characters taller than 3 times Hm (with C_22 = 3), such as a title or heading in a document, is assigned to class 3. An isolated line of text printed vertically, e.g. a text description of a diagram along the vertical axis, may be classified as a number of small text blocks, or else as a class 3 block.

Consequently, in order to further classify the different data types within class 3, a second discrimination based on shape factors is used. The method uses measurements of border-to-border distance within an arbitrary pattern. These measurements can be used to calculate meaningful features such as the "line-shapeness" or "compactness" of objects within binary images.


Source:
K. Y. Wong, R. G. Casey and F. M. Wahl, "Document analysis system", IBM J. Res. Devel., vol. 26, no. 6, pp. 647-656, 1982.

Additional Reading:
[1] F. M. Wahl, K. Y. Wong, and R. G. Casey, "Block Segmentation and Text Extraction in Mixed Text/Image Documents", Research Report RJ 3356, IBM Research Laboratory, San Jose, CA, 1981.

[2] F. M. Wahl, “A New Distance Mapping and its Use for Shape Measurement on Binary Patterns,” Research Report RJ 3361, IBM Research Laboratory, San Jose, CA, 1982.