Creating OCRed Text in DjVu
A report by PlanetDjVu, February 11, 2003
In October, 2000, the DjVu file format was upgraded to 3.0, and the big new addition to the format was searchable text. Searchable text in DjVu version 3.0 and above is mapped to the image, so that when you search for a word or phrase, it will be highlighted in the image of the page.
With searchable text, DjVu behaves very much like Searchable-Image PDF files. Although the searchable text cannot properly be classified as a layer, it is nevertheless useful to think of it as a hidden text layer. We have presented a general discussion of text-image maps in another news article.
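To make the idea of text mapped to the image concrete, here is a minimal sketch in Python of a text-image map: each recognized word carries the rectangle it occupies on the page, and a search returns the rectangles a viewer would highlight. This is an illustration only, not the way DjVu encodes its hidden text; the names `Word` and `find_phrase` and the sample coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A rectangle on the page image: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Word:
    """One recognized word and the region of the page image it covers."""
    text: str
    box: Box


def find_phrase(words: List[Word], phrase: str) -> List[List[Box]]:
    """Return, for each match of the phrase, the boxes to highlight.

    Matching is case-insensitive and compares whole words in reading order.
    """
    targets = phrase.lower().split()
    if not targets:
        return []
    hits = []
    for start in range(len(words) - len(targets) + 1):
        window = words[start:start + len(targets)]
        if all(w.text.lower() == t for w, t in zip(window, targets)):
            hits.append([w.box for w in window])
    return hits


# Hypothetical OCR result for one line of a scanned page.
page = [
    Word("Searchable", (100, 50, 260, 80)),
    Word("text", (270, 50, 330, 80)),
    Word("in", (340, 50, 365, 80)),
    Word("DjVu", (375, 50, 450, 80)),
]

matches = find_phrase(page, "text in")
```

Searching for "text in" yields one match made of two rectangles, which is what the viewer would draw over the page image. The text itself is never displayed; only its position is used.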
When a page is scanned, the result contains only an image of the text, and does not contain any "computer" text that can be searched or copied. The searchable text is generated from an image of a page using a technology called Optical Character Recognition (OCR).
OCR is a complex operation, and there are relatively few OCR software vendors. Often, OCR engines from one company are used in products from other companies. Such was the case with DjVu products from LizardTech, which, as of the October, 2000 upgrade, were built with the Expervision OCR engine, for the English language only. The products that included the Expervision OCR engine were: DjVu Solo Professional, DjVu Workgroup, and DjVu Enterprise Edition (also called DjVu Command Line Encoder).
In March, 2002, LizardTech announced a 3.5 upgrade to DjVu Enterprise Edition, with support for OCR in Japanese, German, French and Dutch in addition to English. This was a long overdue improvement, since the Expervision OCR engine had always supported these additional languages, but they were not enabled.
In June, 2002, LizardTech discontinued these existing DjVu products and announced a new "Document Express" line of products. We thought we understood the switching of product names and assumed that OCR remained a part of the re-branded products. But then, in the fall of 2002, a user in the Forum of PlanetDjVu complained that LizardTech had ceased to offer OCR for DjVu in their products.
This was enough of a concern that we searched through the LizardTech website looking for information about the OCR feature for DjVu. To our surprise, we found that it is no longer referenced as a feature at all! Does this mean that LizardTech no longer offers an OCR engine in its DjVu products?
We cannot tell from experience: we have not seen any of the Document Express products, no users have reported to us that they are using them, and our requests to evaluate these products have been turned down.
Here at PlanetDjVu, we are able to use JRAPublish for the OCRing of DjVu in support of our DjVu portal site. JRAPublish is a commercial application that we developed but are not able to release because of DjVu licensing issues that remain outstanding. This product uses the ABBYY FineReader OCR engine and it supports 176 languages! The supported languages are listed here. Many examples in different languages are presented in the International Collection of the Gallery of PlanetDjVu.
Fortunately, you have the option to perform OCR and create a searchable DjVu file at the Any2DjVu Conversion Server. This uses an outdated copy of the Expervision OCR engine, however, and performs OCR in English only. It is also a tool for one-at-a-time file conversion. But it is available to you free of charge.
We wish we could recommend software to you that OCRs DjVu files on your desktop or on your own server, but the truth is that, as far as we can tell, there is none!
We maintain that searchable text is as important to the usefulness of DjVu files as the compression methods are, and the fact that it is not mentioned at all by LizardTech is very sad. The appearance that OCR has been dropped by LizardTech for DjVu products is troubling, as is much else about the DjVu format these days.
LizardTech, if you are reading this article, and you are still offering OCR in your DjVu products, please let us know at PlanetDjVu and we will update this article with the corrections. Please also publish a feature list and maybe even spec sheets for your products so we know what is going on with OCR in particular, and DjVu products in general. It has been 7 months now since you re-branded your DjVu products, re-tooled your website, and went "silent". While you remain mute, what remains of the original initiative for public adoption and acceptance of the DjVu format is quickly slipping away. Public credibility of DjVu is being lost, and once gone, even superior engineering and limited open-source availability cannot save DjVu from the dustbin of now-obsolete file formats.