- Added the option '-j'
- Improved the processing of lines that end with a hyphenated word
I use the program dtSearch to make CDs with full-text search. Since dtSearch does not recognize DJVU files, I wrote a utility that converts the OCR layer file into an HTML file with the recognized text. This HTML file can be stored within a ZIP file together with the book (dtSearch can search inside ZIP files). In this way you can have a large DJVU collection with full-text search. When dtSearch finds something within a ZIP file, you open the corresponding DJVU file, which is easy to locate if you follow a suitable naming convention, for example:
myfile.djvu myfile.djvu.zip
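The naming convention above can be sketched in a few lines of Python. This is only an illustration, not part of cvthtml: the function names `pack_ocr_html` and `djvu_for_hit` are hypothetical, and it assumes the archive holds just the HTML file and sits next to the DJVU file.

```python
import zipfile
from pathlib import Path


def pack_ocr_html(djvu_path, html_path):
    """Store html_path in '<djvu name>.zip' next to the DJVU file.

    Hypothetical helper: returns the path of the archive it created.
    """
    djvu_path = Path(djvu_path)
    html_path = Path(html_path)
    # myfile.djvu -> myfile.djvu.zip (the convention from the text)
    zip_path = djvu_path.with_name(djvu_path.name + ".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Keep only the bare file name inside the archive.
        archive.write(html_path, arcname=html_path.name)
    return zip_path


def djvu_for_hit(zip_path):
    """Map a search hit in 'myfile.djvu.zip' back to 'myfile.djvu'."""
    # with_suffix("") drops only the final '.zip' suffix.
    return Path(zip_path).with_suffix("")
```

With this layout, a hit reported inside `myfile.djvu.zip` points directly at `myfile.djvu` in the same folder.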
cvthtml [-j] <in_file> <out_file>
-j - joins lines that appear to be parts
of one paragraph (i.e. removes the CR/LF at the end of lines that do not end with
a punctuation mark)
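The joining rule behind '-j' can be sketched as follows. This is not the cvthtml source, only a minimal Python illustration under stated assumptions: the set of punctuation marks in `PARAGRAPH_END` is a guess, the function name `join_paragraph_lines` is hypothetical, and rejoining a word split by a trailing hyphen is one possible way to handle hyphenated line ends, not necessarily the one the program uses.

```python
# Assumption: these marks count as "punctuation" that ends a paragraph line.
PARAGRAPH_END = ".!?:;"


def join_paragraph_lines(text):
    """Join lines that do not end with a punctuation mark.

    Hypothetical sketch: one output line per reconstructed paragraph.
    """
    paragraphs = []
    current = ""
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-"):
            # Assumption: a trailing hyphen marks a word split across lines.
            current = current[:-1] + line
        elif current:
            current = current + " " + line
        else:
            current = line
        if current[-1] in PARAGRAPH_END:
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)
```

For example, the three lines `This is a sen-`, `tence split` and `over lines.` become the single line `This is a sentence split over lines.`.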
in_file - a text file produced by FRFGrab.EXE or extracted from a DJVU file using the command
djvused -e output-txt Myfile.djvu > ocrfile.txt
Note: check the end of the file ocrfile.txt for any error messages from djvused.exe.
out_file - the resulting HTML file in UTF-8 encoding. This file can be viewed directly in a web browser.
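To show what "an HTML file in UTF-8 encoding" amounts to, here is a minimal Python sketch. It is not how cvthtml builds its output: the names `text_to_html` and `write_html` are hypothetical, and the one `<p>` per paragraph layout is an assumption.

```python
import html


def text_to_html(paragraphs, title="OCR text"):
    """Wrap plain-text paragraphs in a minimal HTML page (hypothetical layout)."""
    # Escape &, < and > so the recognized text cannot break the markup.
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n" + body + "\n</body>\n</html>\n"
    )


def write_html(path, paragraphs, title="OCR text"):
    """Write the page to disk, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        out.write(text_to_html(paragraphs, title))
```

Declaring the charset in the page itself lets a browser display non-Latin text correctly when the file is opened straight from disk.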
Author: gencho djvuocr [at] mail2world.com
Prepared by: monday2000.
March 9, 2007
E-Mail (monday2000 [at] yandex.ru)