Metadata Storage for DjVu files (A White Paper)

By James Rile, James Rile Associates, December 4, 2001

Introduction:

We have created a method and a standard architecture for embedding metadata in DjVu files, which is supported by the DjVu file parser of the DjVuSearch product and by the Metadata Import Utility that we have developed.

Metadata is information about documents, stored in individual data fields. These fields define, classify and categorize the documents. Historically these data fields have resided in databases, apart from the documents themselves. DjVuSearch embeds the data fields directly into the document, which makes the metadata portable. In other words, the documents “take their metadata with them”. As an alternative to database storage, DjVuSearch also permits you to store the metadata in “companion” text files.

Each document format supported by DjVuSearch has its own structure for storing embedded metadata. For example, HTML has metatags and PDF has the DocInfo field dictionary. The DjVu file format lacks a formal metadata structure, but the open architecture of the DjVu format permits us to create our own standard for metadata storage “inside” of DjVu files.

The benefit of storing metadata in the DjVu file structure is that it is not lost (left behind) during INDIRECT<->BUNDLED conversions or when a DjVu file is downloaded with the Viewer plug-in. Because metadata is an integrated part of the DjVu file, the DjVu file is portable and “carries” its metadata with it at all times. For example, a DjVu file can be downloaded from the web and re-indexed on the desktop or on CD, and the metadata will still be present. We call this form of metadata storage “Embedded Metadata”.

In our standard architecture, metadata fields are stored with tag delimiters. For example, this document in the DjVu format has title, author and date fields that are expressed and stored as:

<title>Metadata Storage for DjVu files</title>

<author>James Rile</author>
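The tag-delimited form above is simple to read back. The following is a minimal sketch in Python, assuming that each field is written as a matching pair of opening and closing tags and that field names are not case-sensitive; the function name `parse_fields` is hypothetical and is not part of DjVuSearch.

```python
import re

# Matches <name>value</name>, where the closing tag repeats the opening name.
FIELD_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_fields(text):
    """Return a dict mapping lower-cased field names to stripped values."""
    return {name.lower(): value.strip() for name, value in FIELD_RE.findall(text)}


sample = """<title>Metadata Storage for DjVu files</title>
<author>James Rile</author>"""

fields = parse_fields(sample)
print(fields["title"])
print(fields["author"])
```

If a field name appears more than once, this sketch keeps only the last value; a real parser might need a different rule.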

Metadata is stored in our standard architecture at three levels: document-level, shared-page-level and page-level.

Metadata Storage Architecture:

There are two global .iff files per document. The names of these files are docmeta.iff and pagemeta.iff.

The docmeta.iff file contains document-wide metadata and will be used when indexing the document as a whole. This file is mentioned in the DIRM chunk of the DjVu file as an included object.

The pagemeta.iff file contains shared page-level metadata. It is referenced by the metachunk of the DjVu page.

Each DjVu page carries a metachunk, stored in a new chunk called JRMD, which contains the unique page-level metadata along with the reference to pagemeta.iff, if present.

External Metadata Storage Option:

Metadata can also be stored outside the DjVu file, in external .txt files. We call this form of metadata storage “External Metadata”. These files will be “left behind” if the DjVu files are transformed, since they are not embedded in the DjVu file.

Instead of embedding document-wide metadata in “docmeta.iff”, it is possible to store this metadata externally in a docmeta.txt file.

Instead of embedding shared-page-level metadata in “pagemeta.iff”, it is possible to store this metadata externally in a pagemeta.txt file.

Instead of embedding unique page-level metadata in the metachunk of a DjVu page, it is possible to store this metadata externally in a page_xxx.djvu.txt file.

Combination Embedded and External Metadata Storage is Possible

It is possible for some metadata to be Embedded and some metadata to be External. For example, document-level metadata can be stored internally in a docmeta.iff file, while page-level data is stored externally in page_xxx.djvu.txt files.

We recommend embedding the metadata in the DjVu files unless you have a specific reason to store it externally.

External Metadata can be converted to Embedded Metadata, as detailed below.

Metadata Embedding Utility

The utility takes metadata that is stored in external text files and stores it internally in the DjVu file.

The program will expect the following as the input file structure for Embedding:

BUNDLED and SINGLE:

* <doc_name>.djvu

* <doc_name>.djvu.txt

* there are no restrictions on what <doc_name> should be.

INDIRECT:

index.djvu

*page_xxx.djvu (1 per page)

**page_xxx.djvu.txt

docmeta.txt

pagemeta.txt

* “page” can be any name.

** When a page is indexed at the page level, the name should end with an underscore followed by a number that is equal to the DjVu page number, and which is part of the DjVu page file name.
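The naming convention above pairs each DjVu file with a companion text file of the same name plus “.txt”. As a minimal sketch, assuming that file names are compared without regard to case and that index.djvu has no companion of its own, the check below lists the DjVu files in a folder that lack a companion. The function names are hypothetical and do not describe the Metadata Embedding Utility itself.

```python
def companion_name(djvu_name):
    """Return the companion text file name for a DjVu file."""
    return djvu_name + ".txt"


def missing_companions(names):
    """List DjVu files (other than index.djvu) that have no .txt companion."""
    present = {name.lower() for name in names}
    missing = []
    for name in names:
        lower = name.lower()
        if not lower.endswith(".djvu") or lower == "index.djvu":
            continue
        if companion_name(lower) not in present:
            missing.append(name)
    return missing


folder = ["Index.djvu", "Able_001.djvu", "Able_001.djvu.txt",
          "Able_002.djvu", "Docmeta.txt"]
print(missing_companions(folder))
```

With the sample folder, only Able_002.djvu is reported, because Able_001.djvu has its companion and Index.djvu is skipped.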

Example layouts for the Metadata Import Utility

Bundled:

Able.djvu

Able.djvu.txt

Bertha.djvu

Bertha.djvu.txt

Convey.djvu

Convey.djvu.txt

Indirect - Document Level:

Able (folder)

Index.djvu

Able_001.djvu

Able_002.djvu

Able_003.djvu

Docmeta.txt

Indirect - Page Level:

Able (folder)

Index.djvu

Able_001.djvu

Able_001.djvu.txt

Able_002.djvu

Able_002.djvu.txt

Able_003.djvu

Able_003.djvu.txt

Docmeta.txt

Pagemeta.txt

Sample metadata fields in the DjVuSearch Model Application:

Bundled and Indirect - Document Level:

Title

Author

Date

Subject

Summary

Publisher

Indirect - Page Level:

Docmeta.iff (when present, permits optional document-level search-indexing instead)

Title

Author

Date

Subject

Summary

Publisher

Pagemeta.iff (shared page-level metadata)

Author

Date

Subject

Publisher

Page_xxx.djvu (JRMD chunk) (unique page-level metadata)

Title

Summary

Methods of Search-Indexing DjVu Files with Embedded Metadata

This is what happens at the search-indexing stage:

BUNDLED documents:

The bundled DjVu file is indexed directly. The parser looks for a file with name docmeta.iff in the bundle. The contents of this file (if any) will be used as document-level metadata. Metachunks in the pages will be ignored along with the shared-page-level metafile.

INDIRECT documents indexed as a document:

The “index.djvu” file is indexed. The parser will recognize that index.djvu is an index for the INDIRECT document, will look for a docmeta.iff file, and will use it as document-level metadata. Metachunks in the pages will be ignored along with the pagemeta.iff file, if any.

INDIRECT documents indexed page-by-page:

The “index.djvu” file is not indexed. The individual DjVu pages are indexed instead. The parser finds the metachunk in the page file, decodes it, and retrieves the metadata. If the metachunk references pagemeta.iff, the parser loads that file next; if it does not, pagemeta.iff is not opened or processed. docmeta.iff is ignored completely in this scenario.

The extra overhead of opening the pagemeta.iff file occurs only when an INDIRECT document is indexed page-by-page.
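The three indexing cases can be summarized as a small lookup. The following is a minimal sketch, assuming mode labels of our own choosing (“bundled”, “indirect-document”, “indirect-page”); the function `metadata_sources` is hypothetical and only restates the rules described above.

```python
def metadata_sources(mode, page_references_pagemeta=False):
    """Return the metadata sources consulted for a given indexing mode.

    mode is one of "bundled", "indirect-document" or "indirect-page"
    (labels invented for this sketch).
    """
    if mode in ("bundled", "indirect-document"):
        # Page metachunks and pagemeta.iff are ignored at the document level.
        return ["docmeta.iff"]
    if mode == "indirect-page":
        # docmeta.iff is ignored; pagemeta.iff is opened only when referenced.
        sources = ["JRMD metachunk"]
        if page_references_pagemeta:
            sources.append("pagemeta.iff")
        return sources
    raise ValueError("unknown indexing mode: " + repr(mode))


print(metadata_sources("bundled"))
print(metadata_sources("indirect-page", page_references_pagemeta=True))
```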

Methods of Search-Indexing DjVu Files with External Metadata

The methods are the same as for Embedded Metadata, since if Embedded Metadata is not found, the search indexer looks for the external form of metadata storage in text files, as described above.
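The fallback from Embedded to External Metadata can be sketched as follows for an INDIRECT document. This is an illustration only: the label “JRMD” for the page metachunk and the function names are assumptions made for the sketch, while the file names come from the External Metadata Storage Option described earlier.

```python
# External equivalents of the two global embedded files.
EXTERNAL_EQUIVALENT = {
    "docmeta.iff": "docmeta.txt",
    "pagemeta.iff": "pagemeta.txt",
}


def external_name(embedded_source, page_file=None):
    """Return the external text file that stands in for an embedded source."""
    if embedded_source == "JRMD":
        if page_file is None:
            raise ValueError("a page file name is needed for page-level metadata")
        return page_file + ".txt"
    return EXTERNAL_EQUIVALENT[embedded_source]


def choose_source(embedded_source, embedded_found, page_file=None):
    """Prefer the embedded source; fall back to the external file otherwise."""
    if embedded_found:
        return embedded_source
    return external_name(embedded_source, page_file)


print(choose_source("docmeta.iff", embedded_found=False))
print(choose_source("JRMD", embedded_found=False, page_file="Able_001.djvu"))
```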

How to Index DjVu files in dtindexer.exe (the search-indexing application for DjVuSearch)

Assign the DjVu Viewer for use with DjVu files:

For all types of DjVu files, the DjVu Viewer must be defined as the external viewer for the DjVu file type. Otherwise, the search application will display a DjVu file as extracted text in HTML instead of as DjVu in the DjVu Viewer.

External Viewer definition for DjVu files

Indexing DjVu-Bundled files:

To build a search-index for DjVu-Bundled files, set up the index with an Include filter for “*.djvu”.

Indexing DjVu-Indirect files at the Document Level:

To index DjVu-Indirect files at the document level, we need to open and index the “index.djvu” file of each Indirect DjVu document, while excluding all of the page-level DjVu files that are “exposed” in the Indirect format.

The way to do this is to provide an Include filter for “index.djvu”, while also providing an Exclude filter for “*.djvu”.

Indexing DjVu-Indirect files at the Page Level:

To index DjVu-Indirect files at the page level, we need to open and index the page-level DjVu files that are “exposed” in the DjVu-Indirect format, while excluding the “index.djvu” files.

The way to do this is to provide an Include filter for “*.djvu”, while also providing an Exclude filter for “index.djvu”, as follows:

Setup for indexing Indirect DjVu Files at the Page Level
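The intended result of the two Indirect filter setups can be illustrated with a short sketch. It models only which files should end up in the index at each level; it does not reproduce how dtindexer.exe evaluates its Include and Exclude filters, and the function name `files_to_index` and the level labels are assumptions made for the illustration.

```python
from fnmatch import fnmatch


def files_to_index(names, level):
    """Select the DjVu files that should be indexed at the given level.

    level is "document" (index.djvu only) or "page" (every page file,
    but not index.djvu). Names are compared without regard to case.
    """
    djvu_files = [name for name in names if fnmatch(name.lower(), "*.djvu")]
    if level == "document":
        return [name for name in djvu_files if name.lower() == "index.djvu"]
    if level == "page":
        return [name for name in djvu_files if name.lower() != "index.djvu"]
    raise ValueError("unknown level: " + repr(level))


folder = ["Index.djvu", "Able_001.djvu", "Able_002.djvu", "Docmeta.txt"]
print(files_to_index(folder, "document"))
print(files_to_index(folder, "page"))
```

For the sample folder, the document level selects only Index.djvu, and the page level selects Able_001.djvu and Able_002.djvu.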

In DjVuSearch, there are two methods to open page-level DjVu files. The first is by creating a “simple” hyperlink directly to the DjVu page. In this method, the DjVu is treated as a stand-alone page and the internal page navigation buttons of the DjVu Viewer will be “grayed-out”.

The more advanced method is to create a hyperlink that opens the “index.djvu” file of the document instead of the page file, and then goes to the page using a CGI “page-open” argument. In this method, the DjVu page is opened with internal Viewer page navigation enabled.

Note: for this latter method to work, the names of the page-level DjVu files must end with a number that corresponds to the page number within the multi-page DjVu document. Here is an example:

Index.djvu

Page_001.djvu

Page_002.djvu

Page_003.djvu

The page number will be parsed from the filename, and will be used as the “goto page” argument in DjVuSearch.
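Parsing the page number from a file name can be sketched as follows. The function names are hypothetical, and the query form `?djvuopts&page=N` used to build the link is an assumption about the Viewer's page-open argument, not something this paper specifies; substitute the argument that your Viewer actually accepts.

```python
import re

# Trailing digits immediately before the .djvu extension, e.g. Page_003.djvu.
PAGE_NUMBER_RE = re.compile(r"(\d+)\.djvu$", re.IGNORECASE)


def page_number(filename):
    """Return the page number encoded in a page file name, or None."""
    match = PAGE_NUMBER_RE.search(filename)
    return int(match.group(1)) if match else None


def goto_link(index_url, filename):
    """Build a link that opens index.djvu at the page named by filename.

    The "?djvuopts&page=" form is an assumed page-open argument.
    """
    number = page_number(filename)
    if number is None:
        return index_url
    return f"{index_url}?djvuopts&page={number}"


print(page_number("Page_003.djvu"))
print(goto_link("Index.djvu", "Page_003.djvu"))
```

Leading zeros are dropped when the number is converted, so Page_003.djvu yields page 3.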