

The DjvuOCR v2.2 program. Brief instructions for users


DjvuOCR 2.2

Program for making DJVU files with embedded OCR text layer and search,
based on information extracted from internal files of FineReader 7.0/8.0 (*.FRF)

Brief instructions for users


Working title: DjvuOCR.exe

Version: 2.2 beta (previous was: 2.1)

Similar program: FRFGrab 1.12, console application.

Author: gencho

Platform: Windows 9x/ME/NT/2000/XP, Dialog-based API

Feedback: djvuocr [at] mail2world.com

FineReader version for testing:

- FineReader 7.0 PE Build 7.0.0.543 part# 3648

- FineReader 8.0 PE Build 8.0.0.684 part# 4571

Attention: This package does not contain FineReader. You must install your own copy of FineReader; a try&buy version is enough. See "Using FineReader in batch mode" at the end of this text.


NEW FEATURES !!!!!!!!!!


Text must be edited in FineReader before burning it into the DJVU file! When editing, it helps to keep some of the original characters (for example, spaces), so that the coordinates can be recovered as accurately as possible; this gives better text selection on screen when copying from the book.

For other corrections in DjvuOCR, see the Russian documentation, DjvuOCR-ru.txt. :-(

I have a new e-mail address, listed above; please report any problems to me.



Main modes of operation:

- "DjVu Decoder"
- "Batch mode OCR manager"
- "Manual mode OCR manager"
- "Burn existing OCR file in DjVu book"
- "Extract OCR layers"
- "Remove OCR layers"

Mode is selected by pressing the proper button.

All options selected by the user during a work session are automatically saved to the file FRFGrab.ini. The defaults are the ones I chose for my own convenience; they are the best choices in most cases.


"DjVu Decoder"

DjVu Decoder decodes a DJVU book into standard graphical formats such as TIF, BMP, etc. DjVu Decoder is also part of the "Batch mode OCR manager".

DJVU-books to be decoded should be added to the list box "DjVu File list" using the button "Add". All the books should be in one directory (folder)!

Then you set "Options for each file".

There are two ways of setting options:

1) for all books. You select some options and then press "Apply to all files";

2) for one book. Usually after having done 1), you select one book with the mouse and change its options.

All selected options will be applied to all newly added books.

"Save as default" will save these options as default values.

"Name output convention" chooses the method for naming the resulting files. For example tiff files for book #1 may be numbered aa_0001.tif, aa_0002.tif..., for book #2 as ab_0001.tif, ab_0002.tif etc.

"Output directory" is the folder where the output files will be created.

This folder should exist before you start processing!
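
As a rough illustration of the "Name output convention" described above, the following Python sketch builds file names of the form aa_0001.tif. It is not taken from DjvuOCR: the function names are hypothetical, and the assumption that prefixes continue alphabetically (aa_, ab_, ac_, ...) is only a guess based on the two examples given.

```python
from itertools import product
from string import ascii_lowercase


def book_prefixes():
    """Yield 'aa_', 'ab_', ..., 'zz_' in order.

    Assumption: the sequence simply continues alphabetically after the
    two documented prefixes aa_ and ab_.
    """
    for first, second in product(ascii_lowercase, repeat=2):
        yield f"{first}{second}_"


def page_file_name(prefix, page, ext="tif"):
    """Build a name such as 'aa_0001.tif' from a prefix and a page number."""
    return f"{prefix}{page:04d}.{ext}"


if __name__ == "__main__":
    prefixes = book_prefixes()
    first_book = next(prefixes)
    second_book = next(prefixes)
    print(page_file_name(first_book, 1))
    print(page_file_name(second_book, 2))
```

Zero-padded page numbers keep the files in page order when they are sorted by name.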

Once everything is set, press "Process". Books are decoded one by one. The process is not very fast, because Djvused.exe has to be run for each page.


POSSIBLE PROBLEM when using DjvuDecode:

Under WinXP sometimes (very rarely) DjvuDecode stops working, sits and does nothing, the number of processed pages does not change. This looks like a bug in DjvuDecode.exe. In fact, DjvuDecode exports a page and creates a file, but somehow never finishes doing that.

(note: there is a very similar problem with ddjvu under Linux! This problem becomes worse when the computer is under heavy CPU load.)

To work around this problem: note the name of the last bitmap file created in the "Output directory". Then run Task Manager and kill the DjvuDecoder process (not DjvuOCR!!!). DjvuOCR then resumes working. After the file is finished, you will need to edit the PROJECT FILE (see the section "Structure of a PROJECT FILE" below).


If a book has empty pages or damaged pages, DjvuOCR will skip these pages and add them to the list of ignored pages which you can see after the end of the processing.

After the end of the processing you will see the message "Save project file?" and you will be able to write all information about the decoded books and pages to a file. This file is the so-called PROJECT FILE, and it is very convenient to use this file later in the "Batch mode OCR manager" because this file contains all the necessary information about page numbering, missing pages, and so on. The PROJECT FILE can be saved also by pressing the button "Save as OCR project". This finishes decoding.

In order to load the resulting TIFF files into FineReader, I do this:

1) Open the folder with the TIFF files in Explorer. Run FineReader, make a new batch, and save it. In Explorer, select a group of files, but no more than 1167 (this is a limitation in FineReader: it cannot load more than about 1170 TIFF files at once). Then choose "Open image" in FineReader and add the group, up to the last file selected in Explorer. After this, I delete the selected set of files in Explorer and select the next group of files, and so on until the end of the set.

Recommendation: One FineReader project should not have more than about 3800-4000 files because otherwise the text recognition becomes slow.

Recommendation: FineReader 8 has a "Fast Recognition" option in Tools/Options/2.Read/Fast. It can be used when the images are of good quality, and it can increase the recognition speed significantly. On an AMD 2000 computer the speed can reach 30-60 pages per minute.

2) (version 2.1 and higher) The buttons "Create FR7 batch" and "Create FR8 batch" were added. Pressing one of them converts the contents of the folder given in the "Output Directory" field into a batch project for the corresponding version of FineReader. Press the button only after the folder has been filled with images. The images are sorted alphabetically and renamed to 0001.tif, 0002.tif and so on, and then the additional work files needed by FineReader are added. When such a project is loaded, FineReader shows the message "... _FRBatch.pac was not found" and then creates its own *.pac file. Select the recognition language, and then recognition can be started. FineReader 8 can read batches created by either button.

Note: This method is still being tested. It works well for black and white images but has not been used much with greyscale images. If problems appear when creating a batch, create the batch by hand as described in item 1) above, using the renamed files 0001.tif, 0002.tif, etc.

Note: the original image files should not be named 0001.tif, 0002.tif, etc. - these names will be used after renaming the files.
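
The renaming step performed by the "Create FR7 batch" and "Create FR8 batch" buttons can be pictured with the Python sketch below. This is only an illustration under assumptions, not the program's actual code: the function name is hypothetical, "alphabetically" is taken to mean Python's default string ordering, and the sketch only computes the old-to-new name pairs without touching any files. It also shows why the original files must not already be named 0001.tif, 0002.tif, and so on.

```python
def rename_plan(file_names, ext=".tif"):
    """Return (old_name, new_name) pairs: sorted input -> 0001.tif, 0002.tif, ...

    Raises ValueError if an input name is already one of the target names,
    because renaming would then overwrite or confuse files.
    """
    ordered = sorted(file_names)
    plan = [(old, f"{index:04d}{ext}") for index, old in enumerate(ordered, start=1)]
    targets = {new for _, new in plan}
    clashes = [old for old in ordered if old in targets]
    if clashes:
        raise ValueError(f"input files already use target names: {clashes}")
    return plan


if __name__ == "__main__":
    for old, new in rename_plan(["ab_0001.tif", "aa_0002.tif", "aa_0001.tif"]):
        print(old, "->", new)
```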


Structure of a PROJECT FILE:

This is a text file with extension ".dprj". It begins with "DECODER_PROJECT". After that for every book in this project there is the following group of commands. For example:

file=H:\_@djvu9\bee\Greiner W. Classical mechanics. Systems of particles and hamiltonian dynamics (Springer, 2003)(K)(400dpi)(T)(563s).djvu
lastpage=563
misspages=12,18,24,62,102,104,186,292,360,362,533,535,557
prefix=aa_
processed=550

- lastpage:     number of the last page in the djvu-file.

- misspages:  list of empty (or damaged) pages. The corresponding TIFF files are missing, but these pages must be taken into account for the proper numbering of the pages. If there are no such pages, this command is omitted;

- prefix:         prefix for naming of files, for example to prefix "aa_" corresponds file "aa_0001.tif" ...

- processed:  how many pages should be processed. This number equals lastpage minus the number of pages listed in misspages (in the example above, 563 - 13 = 550).
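
The structure described above can be read with a few lines of Python. The sketch below is a guess at a reasonable parser, not the program's own code: it assumes that every book group starts with a "file=" line, that each command is a single "key=value" line, and it uses a hypothetical file path. It also checks the relation between "processed", "lastpage" and "misspages".

```python
# Hypothetical sample; the path is invented, the numbers follow the example above.
SAMPLE = r"""DECODER_PROJECT
file=C:\books\example.djvu
lastpage=563
misspages=12,18,24,62,102,104,186,292,360,362,533,535,557
prefix=aa_
processed=550
"""


def parse_project(text):
    """Parse project text into a list of dicts, one per book."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or lines[0] != "DECODER_PROJECT":
        raise ValueError("missing DECODER_PROJECT header")
    books = []
    for line in lines[1:]:
        key, _, value = line.partition("=")
        if key == "file":
            # Assumption: a file= line opens a new book group.
            books.append({"file": value, "misspages": []})
        elif not books:
            raise ValueError(f"command before the first file= line: {line}")
        elif key == "misspages":
            books[-1]["misspages"] = [int(p) for p in value.split(",") if p.strip()]
        elif key in ("lastpage", "processed"):
            books[-1][key] = int(value)
        else:
            books[-1][key] = value
    return books


def is_consistent(book):
    """Check that processed == lastpage - number of missing pages."""
    return book["processed"] == book["lastpage"] - len(book["misspages"])


if __name__ == "__main__":
    for book in parse_project(SAMPLE):
        print(book["file"], "consistent:", is_consistent(book))
```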

Finally, if the problem described above (DjvuDecoder hanging under WinXP) has to be fixed, do it like this:

- open the project file in a text editor;
- find the group for the book that contains the page whose file name you noted;
- remove that page from the "misspages" list for this file (and remove the "misspages" command itself if the list becomes empty);
- increase "processed" pages by 1;
- save the file.
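
The manual fix above amounts to a small change in one book group. The Python sketch below shows the same steps on an in-memory record; it is only an illustration with hypothetical function names and a made-up book, and it assumes the order of commands shown in the example group.

```python
def restore_page(book, page):
    """Return a copy of the book record with `page` no longer marked missing."""
    if page not in book["misspages"]:
        raise ValueError(f"page {page} is not in the misspages list")
    fixed = dict(book)
    fixed["misspages"] = [p for p in book["misspages"] if p != page]
    fixed["processed"] = book["processed"] + 1
    return fixed


def format_group(book):
    """Write the record back as text; misspages is left out when empty."""
    lines = [f"file={book['file']}", f"lastpage={book['lastpage']}"]
    if book["misspages"]:
        lines.append("misspages=" + ",".join(str(p) for p in book["misspages"]))
    lines.append(f"prefix={book['prefix']}")
    lines.append(f"processed={book['processed']}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical book: 100 pages, page 12 was wrongly recorded as missing.
    book = {"file": r"C:\books\example.djvu", "lastpage": 100,
            "misspages": [12], "prefix": "aa_", "processed": 99}
    print(format_group(restore_page(book, 12)))
```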


Recommendation: If the pages are decoded only for recognition purposes, decoding the book in "black and white" mode very often gives good results (and the files are significantly smaller).

Recommendation: If the book is scanned in color or gray, it can be decoded for OCR as "black and white" not only with the "To bitonal" option alone, but also by choosing "Layer"=mask and "To bitonal" at the same time. This sometimes helps to reduce the unnecessary color background.


"Batch mode OCR manager"

Here you process a finished FR project and embed the resulting OCR text into DJVU files.

1) DJVU files can be specified in two ways:

- by loading a PROJECT FILE, which you made before using the "DjVu Decoder";

- by adding books manually using the "Add" button. Then you will have to select files manually and also possibly fill out the "Missing page list", i.e. the list of pages for which you don't have a page image file in FineReader's project; pages in the list are comma-separated.

"Scale": for every book, these 2 fields are used if the TIFF files were decoded at a dpi different from the original. For example, TIFF files at 600 dpi are extracted from a book at 150 dpi. This is necessary when the book was scanned at low quality and the recognition quality has to be improved. The "Scale" field should then contain the ratio of the original dpi to the decoded dpi, in this case 1/4. By choosing these numbers, the OCR text can be made to lie exactly over the graphic. To compensate for rounding errors, a ratio such as 100/403 may be more appropriate in this case.

The right ratio is not always obvious, so use trial and error: after processing the book, compare the position of the selected text with the original graphic, and repeat the processing with another ratio until you reach the desired result.

2) In the "FineReader Project Directory" field, specify the folder containing the FineReader project (it holds the work files created after recognition, with the ".frf" extension). Check immediately whether DjvuOCR recognizes the FineReader format by clicking the "Test Project" button. If there are problems, you will see a list of the problematic pages, which must be edited in FineReader. Some common problems and their solutions are described under "Using FineReader in batch mode" and "Known problems" below. Repeat "Test Project" until all problems are solved.

3) "Output OCR text Directory": This is the folder where the OCR text files will be created. Version 2.1 added the "Save produced OCR layers as TXT files" checkbox. If the OCR texts are not needed after processing, uncheck it. The work files are then created in the system "Temp" folder and are removed after the book is processed.

4) Select options:

- "Normal hyphenation":
This option *switches off* the new method of handling hyphenated words, which I designed, and which is now always used by default. The idea of the method is to avoid the problem when a hyphenated word is split into two parts, and cannot be found when performing search in DJVU files. For example:

"this function is int-"
"egrable on an interval..."

The word "integrable" cannot be found by searching, only the pieces of it, "int" and "egrable". The new method is to repeat the entire word in the OCR text. If you *do not* check "Normal hyphenation", the OCR text will be:

"this function is "
"integrable on an interval..."

Now the entire word "integrable" can be found by search. If you don't want this feature, use the option "Normal hyphenation".

(In version 2.1 the hyphenation problem is completely solved. When searching or copying the text, the hyphenated word will be whole, as in the example above, and the extra part is removed. That is why we do not recommend checking "Normal hyphenation".)

- "Ignore error checking": Check this option if "Test Project" reports problems in some pages and you want to ignore them. For such a page, the text is then taken only up to the place of the problem.

- "Direct UTF8 translation": enables direct decoding from FineReader UNICODE into UTF8. This option makes it possible to process recognition languages other than those set in the Windows "Regional Settings" of your computer and other than your localized version of Windows. I recommend keeping this option always checked.

- "Create HTML file": if you want to create an HTML file with the recognized text (will be in UTF8 encoding).

- "Burn DJVU books": If this option is selected, the processing result (OCR-Layer) is burned in the Djvu-book. If it is not selected, only OCR-file with the recognized text is created.
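
To make the hyphenation handling described under "Normal hyphenation" more concrete, here is a minimal Python sketch that reproduces the "integrable" example. It is not DjvuOCR's actual algorithm, which is not documented here: the function name is hypothetical, and the sketch naively treats every line-final "-" as a word break, which would also join words that really contain a hyphen.

```python
def merge_hyphenated(lines):
    """Move a word fragment ending in '-' to the start of the next line."""
    merged = []
    carry = ""
    for line in lines:
        line = carry + line
        carry = ""
        if line.endswith("-"):
            # The fragment is everything after the last space, e.g. "int-".
            fragment = line.rpartition(" ")[2]
            carry = fragment[:-1]                # drop the hyphen
            line = line[: len(line) - len(fragment)]
        merged.append(line)
    if carry and merged:
        # The last line ended with a hyphen: nothing to join it to, restore it.
        merged[-1] += carry + "-"
    return merged


if __name__ == "__main__":
    for text in merge_hyphenated(["this function is int-", "egrable on an interval..."]):
        print(repr(text))
```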

"Start page# in FineReader project": This field is the first page of the FR-project, which the processing starts from.

The "Process" button starts the book processing. For every book, a window appears showing the processing status.

Recommendation: After processing, check the beginning and the end of every book for correspondence between the selected text and the graphic. Loading a large number of pages into a FineReader project by hand may lead to mistakes, so that the FR project gets out of sync with the sequence of the book's pages.
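
The arithmetic behind the "Scale" fields can be checked with a short Python sketch. It assumes, without confirmation from the program itself, that the two fields are the numerator and denominator of the ratio and that OCR coordinates measured on the decoded image are simply multiplied by that ratio; the function names are hypothetical.

```python
from fractions import Fraction


def scale_ratio(original_dpi, decoded_dpi):
    """Ratio of the original dpi to the decoded dpi, reduced to lowest terms."""
    return Fraction(original_dpi, decoded_dpi)


def to_original(coordinate, ratio):
    """Map a coordinate on the decoded image back to the original page (assumed model)."""
    return round(coordinate * ratio)


if __name__ == "__main__":
    ratio = scale_ratio(150, 600)
    print("Scale fields:", ratio.numerator, "/", ratio.denominator)
    print("x=2400 on the 600 dpi image ->", to_original(2400, ratio))
    # A slightly different ratio, as suggested for compensating rounding errors.
    print("with 100/403:", to_original(2400, Fraction(100, 403)))
```

Comparing the results for 1/4 and 100/403 gives a feeling for how far a small change of the ratio moves the text near the far edge of a page.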


"Manual mode OCR manager"

This mode was created for more flexible processing. It can process part of a FineReader project. This mode is rarely used and little tested, and it may have serious errors. I recommend using mainly the "Batch mode OCR manager".

"FineReader Project directory": a folder where the OCR-layer file is created.

"Page interval in FR project": specifies which part of the FR project should be processed.

"Start page# in DJVU book": specifies the page in the Djvu file from which the OCR text should be placed.

"Burn DJVU book": if this option is selected, the OCR text is inserted straight into the Djvu file.

"Djvu file": the Djvu file into which the OCR text will be inserted, if the above option is selected.

All other options are the same as in "Batch mode OCR manager", except that the "Direct UTF8 translation" option is always selected.


"Burn existing OCR file in DjVu book"

In this mode an existing OCR layer is embedded into a Djvu book.

Specify the file name of the Djvu book and an existing OCR text file, and then click the "Process" button.

Recommendation: if the OCR file was obtained directly from Djvused.exe rather than from DjvuOCR, a problem may occur, because Djvused refers to pages by name and later does not find a page with such a name in the Djvu file.

(In the all2djvu package there is a Python script named djvuOCR-extract.py, which solves this problem.) It is better to use OCR text created by the "Extract OCR layers" operation described next.


"Extract OCR layers"

This mode makes it possible to extract the OCR text from a Djvu book in a format suitable for Djvused.

"Djvu File List": many Djvu books can be selected (with the "Add" button), even from different folders; entire folders, including their subfolders, can also be added (with the "Add folder" button).

"Output Directory": the place where the OCR layers extracted from the books will be saved. The extraction is done so that empty and damaged pages are ignored, and the result is completely compatible with Djvused.exe.

Processing starts when you press the "Process" button, and it is comparatively slow.


"Remove OCR Layer"

Removes OCR text from a group of books.

"Djvu File List": a list of the books to be processed.

It is possible to remove only a part of the OCR text using the "Remove from page"..."to page" controls.

Processing starts when you press the "Process" button.


Using FineReader in batch mode:


1) Run FineReader;

2) In menu "File/New batch" make a new batch. Choose recognition languages.

3) In menu "File/Save batch" (FineReader 7.0) specify a name for the batch. After recognition, a correspondingly-named directory will contain all the FRF files.

4) In menu "File/Open" add pages to the batch. You can add either a few pages at a time, or all the pages (but not more than 1167 pages at one time!) Be sure that the pages are added in the correct order of increasing page numbers.

5) Select all pages and press "Read"; recognition starts.

6) Wait. :)). With a Pentium IV/1600, it takes about 20-40 minutes to recognize 250 pages, on Athlon/1000 - about 35-50 minutes. Pages with mathematical equations are recognized about half as fast. However, some scans are recognized at 200 pages in 10 minutes.

7) Be sure to press the menu "File/Close batch", or else some files may not be written to disk.

8) Run FRFgrab or DjvuOCR, as shown above.
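
Because no more than 1167 pages can be added at one time, a large set of page images has to be loaded in several groups while keeping the page order. The Python sketch below shows one way to plan such groups. It is only an illustration: the function name is hypothetical, it assumes that sorting the file names gives the correct page order, and it does not talk to FineReader in any way.

```python
def loading_groups(file_names, limit=1167):
    """Split sorted file names into consecutive groups of at most `limit` files."""
    ordered = sorted(file_names)
    return [ordered[start:start + limit] for start in range(0, len(ordered), limit)]


if __name__ == "__main__":
    # Hypothetical set of 2500 page images.
    names = [f"aa_{page:04d}.tif" for page in range(1, 2501)]
    for number, group in enumerate(loading_groups(names), start=1):
        print(f"group {number}: {group[0]} .. {group[-1]} ({len(group)} files)")
```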


Attention! It is quite sufficient to use the try&buy version or demo version of FineReader, even if its free evaluation time is expired. This is because we do not need to export any text in any formats. The try&buy version works fine and creates correct FRF files; this is all we need.


Known problems:


Since the FRF format is undocumented (I need FineReader SDK, which I cannot afford), the format was guessed.

1) It seems that FRF files contain information about the FR version. The program works with the format version 2.10 (FineReader 7.x) and 2.15 (FineReader 8.0). Other versions are not supported.

2) If after automatic recognition by FineReader you edited something in the text by hand, the edited text will lose the information about page coordinates (x coordinate = 0). This, of course, will give incorrect rectangles on the screen when you perform marking & search in DJVU files. I do not recommend to edit anything after recognition! This problem is unsolved.

3) The program does not process pages containing tables whose cells contain pictures (and I can't fix this!) Such tables should be removed by hand in FR (change the parameters of the frame, so that there are no tables with pictures but only text) and recognize that page again.

4) If a page is scanned at an angle (skewed), FineReader tries to deskew it, and then the coordinates of selected text are slightly incorrect on the screen. This cannot be fixed. You should deskew all pages before running FineReader.

5) FineReader sometimes fails completely to recognize some page, but instead marks it as an "error" (the red icon). Such pages should be dealt with like this: with the mouse, mark a part of the text (excluding equations and graphs), and press Read (recognize just this page). If everything is OK, mark the text block, recognize, and so on. If something fails, look at the line of text where FR fails. Exclude this line from the block, or exclude some special symbols that present a problem for FR.

6) FineReader does not recognize pages containing a picture of a grid, or a large graph in logarithmic scale. Such a picture is treated as a large table, and sometimes FR generates a General Protection Fault and crashes. Such pages should be recognized in FR after selecting only the text blocks.

7) If some pages are recognized without errors but FRFGrab/DjvuOCR fails to test them (-t fails), this happens for these reasons:

- there are too many blocks on the page. You should then mark all blocks as one large "Text frame", and run the recognition again on that page.

- there is a very long paragraph, for example, a full-page list. Then you can make two blocks by hand, splitting the page in half, and repeat the recognition.

(This problem seems to have been corrected.)


Project history:


07.01.2007 - version 2.2, console version FRFGrab 1.12

See the Russian documentation, DjvuOCR-ru.txt, for details. :-(

30.08.2006 - version 2.1, console version FRFGrab 1.11
  • Support for FineReader 8 is added.
  • Corrected (end of August) essential problems connected with presence of colour images in the recognized page.
  • The word hyphenation problem is finally solved.
  • In DjvuOCR were added the "Extract..." and "Remove OCR Layers" functions.
  • New possibilities were added in "Djvu Decoder" and "Batch mode Ocr manager" modes.
07.04.2006 - version 2.0 final.

added support of UNICODE in FineReader 7, the next version should support FineReader 8.x

28.06.2004 - version 2.0 pre

Finished dialogs, automation of running the processes. There are still some unimplemented ideas, no documentation, thus it's a pre-version

20.06.2004 - version 1

basically all functionality ported from FRFGrab 1.09

04.06.2004 - version 0: beginning GUI
22.02.2004 - version 1.09 of FRFGrab - console version.

07.01.2007, Bulgaria, Sofia.

<gencho>  djvuocr [at] mail2world.com


Prepared by: monday2000.

March 9, 2007

E-Mail  (monday2000 [at] yandex.ru)
