Creating OCRed Text in DjVu
A report by PlanetDjVu, February 11, 2003
In October, 2000,
the DjVu file format was upgraded to 3.0, and the big new addition to the format was
searchable text. Searchable text in DjVu version 3.0 and above is mapped to the image, so
that when you search for a word or phrase, it will be highlighted in the image of the
page.
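The idea of text mapped to the image can be illustrated with a minimal sketch. It assumes a simplified, hypothetical representation in which each recognized word carries a bounding box in page-image pixels; the `Word` class, the `find_phrase` function, and the sample coordinates are illustrative only and do not reflect how the DjVu format actually encodes its text.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box layout: (xmin, ymin, xmax, ymax) in page-image pixels.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Word:
    """One OCRed word together with its location on the page image."""
    text: str
    box: Box


def find_phrase(words: List[Word], phrase: str) -> List[List[Box]]:
    """Return, for each match of `phrase`, the boxes a viewer would highlight."""
    targets = phrase.lower().split()
    if not targets:
        return []
    hits = []
    # Slide a window of len(targets) words across the page text.
    for i in range(len(words) - len(targets) + 1):
        window = words[i:i + len(targets)]
        if [w.text.lower() for w in window] == targets:
            hits.append([w.box for w in window])
    return hits


# Illustrative data only: four words on one line of a scanned page.
page = [
    Word("Searchable", (100, 900, 260, 930)),
    Word("text", (270, 900, 330, 930)),
    Word("in", (340, 900, 365, 930)),
    Word("DjVu", (375, 900, 450, 930)),
]

for boxes in find_phrase(page, "text in"):
    print(boxes)
```

A viewer built along these lines would draw a highlight over each returned box, which is how a search hit ends up marked on the image of the page rather than in a separate text window.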
With searchable
text, DjVu behaves very much like Searchable-Image PDF files. Although the searchable text
cannot properly be classified as a layer, it is nevertheless useful to think of it as a
hidden text layer. We have presented a general discussion of text-image maps in another
news article, which you can view by clicking here.
When a page image
is scanned, it contains just the image of the text, and does not contain any
"computer" text that can be searched or copied. The searchable text is generated
from an image of a page using a technology called Optical Character Recognition (OCR).
OCR is a complex
operation, and there are relatively few OCR software vendors. Often, OCR engines from one
company are used in products from other companies. Such was the case with DjVu products
from LizardTech, which were compiled in the October, 2000 upgrade with the Expervision OCR
engine, for the English language only. The products that included the Expervision
OCR engine were: DjVu Solo Professional, DjVu Workgroup, and DjVu Enterprise Edition (also
called DjVu Command Line Encoder).
In March, 2002,
LizardTech announced a 3.5 upgrade to DjVu Enterprise Edition, with support for OCR in
Japanese, German, French and Dutch in addition to English. This was a long overdue
improvement, since the Expervision OCR engine had always supported these additional
languages, but they had not been enabled in the DjVu products.
In June, 2002,
LizardTech discontinued these existing DjVu products and announced a new, "Document
Express" line of products. We thought we understood the switching of product names
and assumed that OCR remained a part of the re-branded products. But then, in the fall of
2002, a user in the Forum of PlanetDjVu complained that LizardTech had ceased to offer OCR
for DjVu in their products.
This was enough
of a concern that we searched through the LizardTech website looking for information about the
OCR feature for DjVu. To our surprise, we found that it is no longer referenced as a feature
at all! Does this mean that LizardTech no longer offers an OCR engine in its DjVu
products?
We cannot tell
from experience: we have not seen any of the Document Express products, no users
have reported to us that they are using them, and our requests to
evaluate these products have been turned down.
Here at
PlanetDjVu, we are able to use JRAPublish for the OCRing of DjVu in support of our DjVu
portal site. JRAPublish is a commercial application that we developed but are not able to
release because of DjVu licensing issues that remain outstanding. This product uses the
ABBYY FineReader OCR engine and it supports 176 languages! The supported languages
are listed here.
Many examples in different languages are presented in the International Collection of the Gallery of PlanetDjVu.
Fortunately, you
have the option to perform OCR and create a searchable DjVu file at the Any2DjVu
Conversion Server. This uses an outdated copy of the Expervision OCR engine,
however, and performs OCR in English only. It is also a tool for one-at-a-time file
conversion. But it is available to you free of charge.