The DjvuOCR v2.2 beta program. The FRFGrab utility

Program for extracting information from FineReader's internal files
(*.FRF) into an OCR layer file for the DJVU format.

Working title: FRFGrab.exe

Version: 1.12. This version has FineReader 8.x support

Author: gencho

Platform: Windows 9x/ME/NT/2000/XP, console application

Feedback: djvuocr [at] mail2world.com

FineReader versions used for testing:

- FineReader 7.0 PE Build 7.0.0.543 part# 3648

- FineReader 8.0 PE Build 8.0.0.684 part# 4571

Attention: This package does not contain FineReader. The user should install their copy of FineReader, at least a try&buy version.

Current project status:

05.03.2006 - tested on more than 4000000 scanned pages (more than 10000 books) with FineReader 7.0 (100% working)

07.01.2007 - tested on more than 90000 pages for FineReader 8.

Synopsis:

FRFGrab [options] <FRF file> [options]

Options:

-p <n> -- initial page number in DJVU file, starting from 1

-g -- use automatically generated page numbers: for example, the file 0017.FRF -> page #17

-h -- switches off the improved hyphenation method (see remark in "project history" below)

-i -- ignore all pages that give errors when "testing". The program will try to extract as much text as possible until
an error is encountered. Page numbers are preserved correctly.

Additional options (only for debugging purposes or for studying the FRF file format!)

-v -- verbose (printing messages about the entire process)

-l -- list of format groups

-d -- hex dump (entire file)

-t -- check whether everything is recognized

-q -- when using "-t", print only error messages

    <FRF file> -- name of one FR internal file (.FRF), or a wildcard. If the filename starts with '@', it names a file containing a list of
                           filenames. If a wildcard is used, the filenames are sorted. The only wildcard character allowed is "*"; a bare "*" is
                           equivalent to "*.frf".

This brief help message is printed when the program is run without arguments.
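
The "-g" option maps a filename such as 0017.FRF to page #17. The following Python sketch illustrates that mapping only; it is not FRFGrab's code. The function name page_number_from_name is hypothetical, and the rule that the whole name must consist of digits is an assumption (the source does not say how names such as file_page7.frf are handled).

```python
import re
from pathlib import PureWindowsPath


def page_number_from_name(filename):
    """Return the page number encoded in an FRF filename, or None.

    Assumption: only names made entirely of digits (0017.FRF) carry a
    page number; anything else is reported as None.
    """
    stem = PureWindowsPath(filename).stem  # "0017.FRF" -> "0017"
    if re.fullmatch(r"\d+", stem):
        return int(stem)  # leading zeros are dropped: "0017" -> 17
    return None


print(page_number_from_name("0017.FRF"))
print(page_number_from_name("file_page.frf"))
```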

Example list of filenames:

--- begin of file ----
#comment line #1
;comment line #2
0010.frf
0009.frf
0008.frf
file_page7.frf
0006.frf
0005.frf
file_page.frf
0003.frf
0002.frf
0001.frf
--- end of file ----
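
The list-file layout shown above can be read with a few lines of Python. This is a minimal sketch, not FRFGrab's own parser: the function name read_list_file is hypothetical, and skipping blank lines is an assumption. Lines starting with "#" or ";" are treated as comments, as in the example.

```python
def read_list_file(text):
    """Return the filenames from a list file, in the order given."""
    names = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip comments ("#" or ";") and, by assumption, blank lines.
        if not line or line.startswith(("#", ";")):
            continue
        names.append(line)
    return names


sample = """#comment line #1
;comment line #2
0010.frf
0009.frf
file_page7.frf
"""
print(read_list_file(sample))
```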

Examples of usage:

FRFGrab -t 0005.FRF

--- tests whether the FRF format was recognized. Prints OK or Error.

FRFGrab -g *.frf > output.txt

--- Processes all FRF files in the current directory. The page numbers are derived from the filenames.

--- This is the most typical example of using this program to process all pages of an entire book. It extracts OCR information from the FRF files and creates an OCR layer file in output.txt. This OCR layer can be embedded into a DJVU file with the command: djvused -f output.txt MyFile.djvu

FRFGrab -p 4 *.frf

--- Processes all FRF files in the current directory. Specifies that the DJVU file will have OCR text starting from page 4. The OCR information is printed to STDOUT.

FRFGrab -g 0005.FRF > output5.txt

--- extracts OCR information from the FRF file and creates an OCR layer file in output5.txt. This OCR layer can be embedded into a DJVU file by the command,

djvused -f output5.txt MyFile.djvu

--- Then MyFile.djvu will have page #5 with OCR.

FRFGrab -p 8 0005.FRF > output.txt

--- Same thing as before, except it is indicated that the file 0005.FRF actually contains OCR information about page #8 in the DJVU file into which output.txt will be eventually embedded.

FRFGrab -v 0005.FRF

--- prints information about the entire process of FRF parsing.

A normal working session:

FRFGrab -t -q *

and if there are no errors with FRFGrab,

FRFGrab -p 1 * > book.txt

and then:

djvused -f book.txt book.djvu

For automation, the program FRFGrab can be used in MSDOS .BAT files. The program returns ERRORLEVEL=0 after normal termination and ERRORLEVEL=1 or 2 on errors.
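
The working session above can also be scripted outside of .BAT files. The Python sketch below runs the same three commands and stops at the first non-zero exit code. It assumes that FRFGrab and djvused are on the PATH and that it is started in the directory containing the FRF files; the function name run_session and its "run" parameter (which allows a stand-in for subprocess.run) are hypothetical.

```python
import subprocess


def run_session(djvu_file, ocr_file, run=subprocess.run):
    """Test all FRF files, extract the OCR layer, embed it into the DJVU file.

    Returns 0 on success, or the first non-zero exit code encountered.
    """
    # Step 1: quiet test. "*" is passed literally; FRFGrab expands it itself.
    test = run(["FRFGrab", "-t", "-q", "*"])
    if test.returncode != 0:
        return test.returncode  # 1 or 2 on errors, per the text above

    # Step 2: extract the OCR text, redirecting STDOUT into the layer file.
    with open(ocr_file, "wb") as out:
        grab = run(["FRFGrab", "-p", "1", "*"], stdout=out)
    if grab.returncode != 0:
        return grab.returncode

    # Step 3: embed the OCR layer into the DJVU file.
    return run(["djvused", "-f", ocr_file, djvu_file]).returncode


# Example call (not executed here): run_session("book.djvu", "book.txt")
```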

Sample procedure when making new DJVU files:

1a) Scanning;

2a) Make a DJVU file using DjvuSolo or DjvuEditor;

or, if the DJVU file already exists (someone else gave it to you):

1b) Extract TIF pages from DJVU using DJVUDECODE.EXE;

2b) DJVU file already exists :)

3) Recognize the TIF files using FineReader in batch mode (see below),

4) Run FRFGrab and extract the OCR text from the FRF files,

5) Run DJVUSED to embed the OCR text into DJVU file, as shown above.

Using FineReader in batch mode:

1) Run FineReader;

2) In menu "File/New batch" make a new batch. Choose recognition languages.

3) In menu "File/Open" add pages to the batch. You can add either a few pages at a time, or all the pages (but not more than 1167 pages at one time!) Be sure that the pages are added in the correct order of increasing page numbers.

4) In menu "File/Save batch" (FineReader 7.0) specify a name for the batch. After recognition, a correspondingly-named directory will contain all the FRF files.

5) Select all pages and press "Read"; recognition starts.

6) Wait. :)) On a Pentium IV/1600, it takes about 20-40 minutes to recognize 250 pages; on an Athlon/1000, about 35-50 minutes. Pages with mathematical equations are recognized about half as fast. However, some scans are recognized at a rate of 200 pages in 10 minutes.

7) Be sure to press the menu "File/Close batch", or else some files may not be written to disk.

8) Run FRFgrab, as shown above.
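
Step 3 limits a single "File/Open" operation to 1167 pages and requires increasing page order. For a long book, the page files can be split into groups beforehand. The Python sketch below shows one way to do that; the function name batches is hypothetical, and it assumes zero-padded filenames so that alphabetical order equals page order.

```python
def batches(pages, limit=1167):
    """Split page filenames into sorted groups of at most `limit` files."""
    ordered = sorted(pages)  # assumes zero-padded names: 0001.tif, 0002.tif, ...
    return [ordered[i:i + limit] for i in range(0, len(ordered), limit)]


# Illustrative data: a 2500-page book.
pages = [f"{n:04d}.tif" for n in range(1, 2501)]
print([len(group) for group in batches(pages)])  # [1167, 1167, 166]
```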

Attention! It is quite sufficient to use the try&buy version or demo version of FineReader, even if its free evaluation time is expired. This is because we do not need to export any text in any formats. The try&buy version works fine and creates correct FRF files; this is all we need.

Known problems:

Since the FRF format is undocumented (I would need the FineReader SDK, which I cannot afford), the format had to be guessed.

1) It seems that FRF files contain information about the FR version. The program was written for format version 2.10 (FineReader 7.x); support for FineReader 8.x was added in version 1.11 (see the project history). Other versions are not supported.

2) If you edited something in the text by hand after automatic recognition by FineReader, the edited text loses its page coordinates (x coordinate = 0). This, of course, gives incorrect rectangles on the screen when you search in DJVU files. I do not recommend editing anything after recognition! This problem is unsolved.

3) The program does not process pages containing tables whose cells contain pictures (and I can't fix this!). Such tables should be removed by hand in FR (change the parameters of the frame so that there are no tables with pictures, only text), and the page should then be recognized again.

4) If a page is scanned at an angle (skewed), FineReader tries to deskew it, and then the coordinates of selected text are slightly incorrect on the screen. This cannot be fixed. You should deskew all pages before running FineReader.

5) FineReader sometimes fails completely to recognize a page and marks it as an "error" (the red icon). Deal with such pages like this: with the mouse, mark a part of the text (excluding equations and graphs) and press Read (recognize just this page). If everything is OK, mark another text block, recognize it, and so on. If something fails, look at the line of text where FR fails, and exclude this line from the block, or exclude the special symbols that present a problem for FR.

6) FineReader does not recognize pages containing a picture of a grid, or a large graph in logarithmic scale. Such a picture is treated as a large table, and sometimes FR generates a General Protection Fault and crashes. Such pages should be recognized in FR after selecting only the text blocks.

7) If some pages are recognized without errors but FRFGrab fails to test them (-t fails), this happens for one of these reasons:

- there are too many blocks on the page. You should then mark all blocks as one large "Text frame", and run the recognition again on that page.

- there is a very long paragraph, for example, a full-page list. Then you can make two blocks by hand, splitting the page in half, and repeat the recognition.
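
Problem 2 above (hand-edited text ending up with x coordinate = 0) can be checked for before embedding. The Python sketch below scans an OCR layer for word zones whose left x coordinate is 0. It assumes the layer uses the djvused hidden-text syntax (word xmin ymin xmax ymax "text") and that a legitimate word never starts at x = 0; the function name suspicious_words is hypothetical, and this check is not part of FRFGrab.

```python
import re

# Matches: (word xmin ymin xmax ymax "text"), allowing escaped quotes in text.
WORD = re.compile(r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"')


def suspicious_words(ocr_text):
    """Return the text of every word zone whose xmin is 0."""
    return [m.group(5) for m in WORD.finditer(ocr_text) if int(m.group(1)) == 0]


# Illustrative data, not real FRFGrab output.
sample = '(line 0 100 500 130 (word 0 100 0 130 "edited") (word 210 100 300 130 "text"))'
print(suspicious_words(sample))
```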

Project history:

07.01.2007 -	version 1.12 For details see DjvuOCR-ru.txt
30.08.2006 -	version 1.11 Added support for FineReader 8.x. The word hyphenation problem is finally solved: there are no extra words in copy/pasted text.
05.03.2006 -	version 1.10 Added support of UNICODE generated by FineReader. Now pages containing several languages can be processed, and, for any recognition language, it is not necessary to have a version of Windows localized for that language.
22.02.2004 -	version 1.09 Some problems solved on pages without text, only pictures. A new option is added: "-i" - If some pages have errors, ignore the errors, extracting as much text as possible.
16.02.2004 -	version 1.08 Solved problem when internal image name missing.
09.02.2004 -	version 1.07 Added a new option "-h". This option switches off the new method of handling hyphenated words, which I designed, and which is now always used by default. The idea of the method is to avoid the problem when a hyphenated word is split into two parts, and cannot be found when performing search in DJVU files. For example: "this function is int-" "egrable on an interval..." The word "integrable" cannot be found by searching, only the pieces of it, "int" and "egrable". The new method is to repeat the entire word in the OCR text, for example: "this function is int-" "integrable on an interval..." Now the entire word "integrable" can be found by search. This however somewhat disturbs copy/paste operations... If you don't want this feature, use the option "-h".
07.02.2004 -	version 1.06 Fixed some problems with pictures in text. Fully processed 131 books (more than 37900 pages).
31.01.2004 -	version 1.05 Fixed problem with bad word coordinates, coming from FineReader.
26.01.2004 -	version 1.04 Fixed problem in browser - line with coordinates {0,0,0,0}
25.01.2004 -	version 1.03. Fixed problem with pictures in text block. Fixed problem with big tables.
19.01.2004 -	made 4 books with OCR without any problems using FRFGrab.EXE! First release, version 1.0.
18.01.2004 -	version 0.3 Processes pages containing tables & pictures. 100% FineReader 7.x, 100% FineReader 5.x - solved problem with special characters while generating lisp-text.
08.01.2004 -	version 0.2 100% working on pages that have no tables & pictures
05.01.2004 -	version 0.1 processes text-only pages.
22.12.2003 -	initial implementation.
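
The hyphenation method described under version 1.07 can be illustrated in a few lines of Python. This is a sketch of the idea only, not the author's implementation: each line is taken to be a list of words, and the function name repeat_hyphenated_words is hypothetical. When a line ends with a fragment such as "int-", the first word of the next line is replaced by the whole word, so that a search for "integrable" succeeds.

```python
def repeat_hyphenated_words(lines):
    """Return a copy of `lines` in which each word split by a line-end
    hyphen is repeated in full at the start of the following line."""
    result = [list(words) for words in lines]
    for i in range(len(result) - 1):
        cur, nxt = result[i], result[i + 1]
        # A trailing "-" on the last word marks a hyphenated split.
        if cur and nxt and len(cur[-1]) > 1 and cur[-1].endswith("-"):
            nxt[0] = cur[-1][:-1] + nxt[0]  # "int" + "egrable"
    return result


lines = [["this", "function", "is", "int-"],
         ["egrable", "on", "an", "interval..."]]
for words in repeat_hyphenated_words(lines):
    print(" ".join(words))
```

The first line keeps its "int-" fragment, which is why copy/paste of the text shows the partial word followed by the full one.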

Notes:

If FRFgrab does not process a page, look at that page in FineReader. Sometimes FR makes strange tables. If you delete the tables and mark the entire text as a "text frame" and repeat the recognition, it often solves all the problems.

If DjVused.exe cannot process large files, you should split the files and work on pieces.

A simple but necessary check:

djvused -e output-txt MyFile.djv > MyFile.txt

If at the end of the file MyFile.txt you see some error messages, this is a problem! Let me know and I will try to solve problems with FRFgrab.

Everyone who would like to help, please send FRF files that cannot be processed by FRFgrab to me, gencho [at] yourwap.com, after notifying me first. I need only the .FRF file.

If DJVUSED.EXE cannot process the result of FRFgrab, please send that result also to me, in a ZIP/RAR archive.

And, of course, I need your comments and suggestions.

07.01.2007, Bulgaria, Sofia.

<gencho> djvuocr [at] mail2world.com

Prepared by: monday2000.

March 9, 2007

E-Mail (monday2000 [at] yandex.ru)