METAe and the absence of Text-Image Maps (e.g. DjVu)

A paper by PlanetDjVu, March 14, 2003

Introduction

At the CeBIT Exposition in Hannover, Germany, this week and next (Hall 5, Stand G36), the companies CCS GmbH, WebPark GmbH, Amenotec GmbH and Zissor AS are presenting themselves under the umbrella of the Socrates Gruppe. They invite businesses and organisations to acquaint themselves with the business advantages the group has jointly established for its clients.

http://www.ccs-gmbh.de/

The company CCS GmbH is charged with the commercialization of METAe, the Metadata Engine Project, a joint project of 14 organizations and libraries funded by the European Union.

The homepage of the METAe project is located at: http://meta-e.uibk.ac.at/

One component of the METAe project that we note here with interest is the use of FineReader OCR for "Fraktur" and other old typefaces:

The METAe software package will consist of a specialized omnifont Optical Character Recognition (OCR) engine adapted to recognize old typefaces and historical texts. The software will be based on the FineReader OCR, developed by ABBYY Europe. This is an overdue task, especially for the German typeface "Fraktur" (a German style of black-letter text type used in the majority of printed texts in Central and Northern European countries up to the middle of the 20th century). The OCR engine will be supported by five historical dictionaries representing the historical orthography of English, French, German, Italian, and Spanish. The OCR engine will be part of the METAe engine (application program interface) as well as an individual commercial product for the end-user.

The purpose of METAe is the recognition and extraction of textual data for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language.

Why just XML and not Text-Image Maps?

METAe offers the ability to make much of the European printed matter of the 19th and 20th centuries available in searchable form on the web. We must nevertheless ask a fundamental question of the esteemed bodies that make up the METAe project: why the exclusive focus on XML extraction, and why is the concept of Text-Image Maps not being addressed?

Please refer to our earlier News article on Text-Image Maps.

It seems to us here at PlanetDjVu that while the benefits of text extraction into structured XML are beyond dispute, the benefits of mapping the text UNDER THE IMAGE have been overlooked.

With Text-Image Maps, which are supported by both DjVu and PDF, text that is searched can be highlighted and copied IN-CONTEXT with the image of the original paper. This is a benefit that cannot be achieved in XML, where a link out to an external image is all that is possible.
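To make the idea concrete, here is a minimal sketch in Python of what a Text-Image Map amounts to: each recognized word is stored together with the rectangle it occupies on the page image, so a text search can return the regions a viewer would highlight. This is an illustration only and not the actual DjVu or PDF hidden-text layout; the `Word` class, the `find_phrase` function, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in page-image pixels; an assumed convention.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Word:
    """One OCR-recognized word and the image region it was read from."""
    text: str
    box: Box


def find_phrase(words: List[Word], phrase: str) -> List[List[Box]]:
    """Return, for each occurrence of `phrase`, the boxes of its words.

    Matching is case-insensitive and word-by-word, in reading order.
    """
    target = phrase.lower().split()
    if not target:
        return []
    hits = []
    for i in range(len(words) - len(target) + 1):
        window = words[i:i + len(target)]
        if [w.text.lower() for w in window] == target:
            hits.append([w.box for w in window])
    return hits


# Hypothetical OCR output for part of one printed line.
page_words = [
    Word("Es", (100, 200, 130, 230)),
    Word("war", (140, 200, 190, 230)),
    Word("einmal", (200, 200, 290, 230)),
]

# The rectangles a viewer could highlight on top of the scanned page.
highlights = find_phrase(page_words, "war einmal")
```

Because every hit carries image coordinates, the viewer can draw the highlight directly over the scan of the original paper, whereas plain extracted text has no such coordinates to draw from.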

We raise this important question for the METAe group, and also for LizardTech, the owner of the DjVu format. LizardTech, who might well be the champion of the Text-Image Map concept that is fundamental to the DjVu format, instead has dropped all references to OCR on their website presentation of the DjVu format! This is perhaps the greatest strength of the DjVu format, greater than image compression, and it is being completely ignored!

Demonstration of Text-Image Maps using DjVu

Our first demonstration is of an early German printing of the story "Snow White", printed in German Fraktur, one of the typefaces being addressed by the METAe project. This has been OCRed in German using the ABBYY FineReader engine. OCR results are poor because we did not have the ability to recognize Fraktur. We can only imagine at this point the OCR results that will come from the METAe OCR engine using a new Fraktur recognition dictionary. Try out the text-find and text-copy features as a demonstration of the benefit of Text-Image Mapping.

Snow White in Fraktur

Our final demonstration here is the five published newsletters of the METAe project! As you enjoy the benefits of Text-Image Mapping in these demonstration files, you can also read and learn more about the METAe project.

These are presented in DjVu format, and were converted from PDF. We converted using the Any2DjVu conversion server, which uses an early beta version of the DjVuDigital conversion software. This was capable of performing DjVuDigital segmentation and transferring hyperlinks, but was not capable of transferring the ASCII text. We then OCRed using ABBYY FineReader (English only) in JRAPublish software to get the final results.

Conclusion

Text-Image Mapping is well implemented in DjVu when a supporting OCR engine and OCR dictionaries are available, as with the ABBYY FineReader OCR engine used by JRAPublish for contemporary printed material. We hope that by raising this question, consideration will be given to supporting the OCR of historic printed material and to delivering the output of that OCR as Text-Image Maps, using the DjVu file format for effective web delivery.