Metadata Storage for DjVu files
A White Paper
By James Rile, James Rile Associates, December 4, 2001
Introduction:
We have created a method and a standard architecture for embedding metadata in DjVu files, which is supported by the DjVu file parser of the DjVuSearch product and by the Metadata Import Utility that we have developed.
Metadata is information about documents, stored in individual data fields. These fields define, classify and categorize the documents. Historically these data fields have resided in databases, apart from the documents themselves. DjVuSearch embeds the data fields directly into the document, making the metadata portable. In other words, the documents take their metadata with them. As a further alternative to database storage of metadata, DjVuSearch permits you to store the metadata in companion text files.
Each document format supported by DjVuSearch has its own structure for storing embedded metadata. For example, HTML has metatags and PDF has the DocInfo field dictionary. The DjVu file format lacks a formal metadata structure, but the open architecture of the DjVu format permits us to create our own standard for metadata storage inside of DjVu files.
The purpose and benefit of storing metadata in the DjVu file structure is that it will not be lost (left behind) during INDIRECT<->BUNDLED conversions, as well as when downloading a DjVu file with the Viewer plug-in. Because metadata is an integrated part of the DjVu file, the DjVu file is portable, as it carries its metadata with it at all times. As a result, for example, a DjVu file can be downloaded from the web and re-indexed on the desktop or on CD, and the metadata will still be present. We call this form of metadata storage Embedded Metadata.
In our standard architecture, metadata fields are stored with tag delimiters. For example, this document in the DjVu format has title, author and date fields that are expressed and stored as:
<title>Metadata Storage for DjVu files</title>
<author>James Rile</author>
<date>20011204</date>
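As a minimal sketch of how such tag-delimited fields could be read back, the Python below parses them into a dictionary. The function name parse_metadata_fields is hypothetical, and the sketch assumes that each field sits inside one pair of matching tags with no nesting. It is not the DjVuSearch parser.

```python
import re

# Matches <name>value</name>. Assumes fields are never nested (an assumption
# of this sketch, not something the standard architecture states).
FIELD_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_metadata_fields(text):
    """Return a dict mapping each lower-cased field name to its stripped value."""
    return {name.lower(): value.strip()
            for name, value in FIELD_PATTERN.findall(text)}


sample = """<title>Metadata Storage for DjVu files</title>
<author>James Rile</author>
<date>20011204</date>"""

fields = parse_metadata_fields(sample)
for name, value in fields.items():
    print(f"{name}: {value}")
```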
Metadata is stored in our standard architecture at three levels: document-level, shared-page-level and page-level.
Metadata Storage Architecture:
There are two global .iff files per document. The names of these files are docmeta.iff and pagemeta.iff.
The docmeta.iff file contains document-wide metadata and will be used when indexing the document as a whole. This file is mentioned in the DIRM chunk of the DjVu file as an included object.
The pagemeta.iff file contains shared page-level metadata. It is referenced by the metachunk of the DjVu page.
Each DjVu page carries a metachunk, in a new section called JRMD, that contains the unique page-level metadata for that page, along with a reference to pagemeta.iff, if present.
External Metadata Storage Option:
Metadata can also be stored outside the DjVu file, in external .txt files. We call this form of metadata storage External Metadata. Because these files are not embedded in the DjVu file, they will be left behind if the DjVu files are transformed.
Instead of embedding document-wide metadata in docmeta.iff, it is possible to store this metadata externally in a docmeta.txt file.
Instead of embedding shared-page-level metadata in pagemeta.iff, it is possible to store this metadata externally in a pagemeta.txt file.
Instead of embedding unique page-level metadata in the metachunk of a DjVu page, it is possible to store this metadata externally in a page_xxx.djvu.txt file.
Combination Embedded and External Metadata Storage is Possible
It is possible for some metadata to be Embedded and some metadata to be External. For example, document-level metadata can be stored internally in a docmeta.iff file, while page-level data is stored externally in page_xxx.djvu.txt files.
We recommend that you embed all of the metadata in the DjVu files unless you have a specific reason for keeping some of it external.
External Metadata can be converted to Embedded Metadata, as detailed below.
Metadata Embedding Utility
The utility takes metadata that is stored in external text files and embeds it in the DjVu file.
The program will expect the following as the input file structure for Embedding:
BUNDLED and SINGLE:
* <doc_name>.djvu
* <doc_name>.djvu.txt
* there are no restrictions on what <doc_name> should be.
INDIRECT:
index.djvu
*page_xxx.djvu (1 per page)
**page_xxx.djvu
docmeta.txt
pagemeta.txt
* page can be any name.
** when a page is indexed at the page-level, then the name should end with an underscore followed by a number that is equal to the DjVu page number, and which is part of the DjVu page file name.
Example layouts for the Metadata Import Utility
Bundled:
Able.djvu
Able.djvu.txt
Bertha.djvu
Bertha.djvu.txt
Convey.djvu
Convey.djvu.txt
Indirect - Document Level:
Able (folder)
Index.djvu
Able_001.djvu
Able_002.djvu
Able_003.djvu
Docmeta.txt
Indirect - Page Level:
Able (folder)
Index.djvu
Able_001.djvu
Able_001.djvu.txt
Able_002.djvu
Able_002.djvu.txt
Able_003.djvu
Able_003.djvu.txt
Docmeta.txt
Pagemeta.txt
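The page-level layout above can be checked before running the utility. The following Python is a rough sketch, not part of the Metadata Import Utility: the function name companion_text_files is hypothetical, and the case-insensitive comparison of file names is an assumption made because the examples mix Index.djvu and index.djvu.

```python
def companion_text_files(file_names):
    """Pair each page-level .djvu file with its .djvu.txt companion.

    Returns a dict of page file name -> companion name, or None when the
    companion text file is absent from the listing. index.djvu is skipped
    because it is the document index, not a page.
    """
    present = {name.lower() for name in file_names}
    pairs = {}
    for name in file_names:
        lower = name.lower()
        if lower.endswith(".djvu") and lower != "index.djvu":
            companion = name + ".txt"
            pairs[name] = companion if companion.lower() in present else None
    return pairs


listing = [
    "Index.djvu",
    "Able_001.djvu", "Able_001.djvu.txt",
    "Able_002.djvu",
    "Docmeta.txt", "Pagemeta.txt",
]
pairs = companion_text_files(listing)
missing = [page for page, text_file in pairs.items() if text_file is None]
print("Pages without external metadata:", missing)
```

In this listing, Able_002.djvu has no companion file, so it is the only page reported as lacking external page-level metadata.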
Sample metadata fields in the DjVuSearch Model Application:
Bundled and Indirect - Document Level:
Title
Author
Date
Subject
Summary
Publisher
Indirect - Page Level:
Docmeta.iff (permits optional document-level search-indexing instead when present)
Title
Author
Date
Subject
Summary
Publisher
Pagemeta.iff (shared page-level metadata)
Author
Date
Subject
Publisher
Page_xxx.djvu (JRMD chunk) (unique page-level metadata)
Title
Summary
Methods of Search-Indexing DjVu Files with Embedded Metadata
This is what happens at the search-indexing stage:
BUNDLED documents:
The bundled DjVu file is indexed directly. The parser looks for a file named docmeta.iff in the bundle. The contents of this file (if any) will be used as document-level metadata. Metachunks in the pages will be ignored, along with the shared-page-level metafile.
INDIRECT documents indexed as a document:
The index.djvu file is indexed. The parser will recognize that index.djvu is an index for the INDIRECT document, will look for a docmeta.iff file, and will use it as document-level metadata. Metachunks in the pages will be ignored along with the pagemeta.iff file, if any.
INDIRECT documents indexed page-by-page:
The index.djvu file is not indexed. The individual DjVu pages are indexed instead. The parser will find a metachunk in the page file, will decode it, retrieve the metadata, and notice the name pagemeta.iff, which it will load next. docmeta.iff will be completely ignored in this scenario. If the name pagemeta.iff has not been found (referenced) in the metachunk, it won't be opened and processed.
The extra overhead of opening the pagemeta.iff file is incurred only when indexing an INDIRECT document page-by-page.
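The three indexing cases can be summarized in a short Python sketch. It assumes that the metadata has already been decoded into dictionaries; the function name, the mode strings and the references_pagemeta flag are hypothetical. The paper does not say what happens when the same field appears in both pagemeta.iff and the page metachunk, so letting the unique page-level value win is an assumption of this sketch.

```python
def select_metadata(mode, docmeta=None, pagemeta=None, page_chunk=None,
                    references_pagemeta=False):
    """Return the metadata the indexer would use for one indexed item."""
    if mode in ("bundled", "indirect_document"):
        # Document-level indexing: only docmeta.iff is used; page metachunks
        # and pagemeta.iff are ignored.
        return dict(docmeta or {})
    if mode == "indirect_page":
        merged = {}
        # pagemeta.iff is loaded only if the page metachunk references it.
        if references_pagemeta and pagemeta:
            merged.update(pagemeta)
        # Assumption: unique page-level fields override shared ones.
        merged.update(page_chunk or {})
        return merged
    raise ValueError(f"unknown indexing mode: {mode}")


docmeta = {"title": "Able", "author": "James Rile"}
pagemeta = {"author": "James Rile", "publisher": "James Rile Associates"}
page_chunk = {"title": "Able, page 1", "summary": "Opening page"}

print(select_metadata("bundled", docmeta, pagemeta, page_chunk))
print(select_metadata("indirect_page", docmeta, pagemeta, page_chunk,
                      references_pagemeta=True))
```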
Methods of Search-Indexing DjVu Files with External Metadata
The methods are the same as for Embedded Metadata, since if Embedded Metadata is not found, the search indexer looks for the external form of metadata storage in text files, as described above.
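The fallback order described above can be sketched in Python as follows. The function name resolve_metadata_source is hypothetical, and the UTF-8 encoding of the external text file is an assumption, since the paper does not specify one. The sketch returns the raw text rather than parsed fields so that it makes no claim about the layout of the external file.

```python
import os
import tempfile
from pathlib import Path


def resolve_metadata_source(embedded_text, external_path):
    """Prefer Embedded Metadata; fall back to the external text file."""
    if embedded_text:
        return "embedded", embedded_text
    path = Path(external_path)
    if path.is_file():
        # Assumption: the external file is UTF-8 text.
        return "external", path.read_text(encoding="utf-8")
    return "none", ""


# Usage: a document with no embedded metadata but a docmeta.txt beside it.
with tempfile.TemporaryDirectory() as folder:
    external = os.path.join(folder, "docmeta.txt")
    Path(external).write_text("<title>Able</title>", encoding="utf-8")

    from_embedded = resolve_metadata_source("<title>Embedded</title>", external)
    from_external = resolve_metadata_source(None, external)
    from_nothing = resolve_metadata_source(None, os.path.join(folder, "absent.txt"))

print(from_embedded[0], from_external[0], from_nothing[0])
```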
How to Index DjVu files in dtindexer.exe (the search-indexing application for DjVuSearch)
Assign the DjVu Viewer for use with DjVu files:
For all types of DjVu files, the DjVu Viewer must be defined as the external viewer for the DjVu file type. Otherwise, the search application will display a DjVu file as extracted text in HTML instead of as DjVu in the DjVu Viewer.
Indexing DjVu-Bundled files:
To build a search-index for DjVu-Bundled files, set up the index with an Include filter for *.djvu, as follows:
Indexing DjVu-Indirect files at the Document Level:
To index DjVu-Indirect files at the document level, we need to open and index the index.djvu file of each Indirect DjVu document, while excluding all of the page-level DjVu files that are exposed in the Indirect format.
The way to do this is to provide an Include filter for index.djvu, while also providing an Exclude filter for *.djvu, as follows:
Indexing DjVu-Indirect files at the Page Level:
To index DjVu-Indirect files at the page-level, we need to open and index the page-level DjVu files that are exposed in the DjVu-Indirect format, while excluding the index.djvu files.
The way to do this is to provide an Include filter for *.djvu, while also providing an Exclude filter for index.djvu, as follows:
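The filter settings themselves are not reproduced in this text. As a stand-in, the effect of the three Include/Exclude combinations in this section can be simulated in Python with the standard fnmatch module. This is an illustration only and does not call dtindexer.exe. How dtindexer.exe resolves a file that matches both filters is not stated here; the sketch assumes that an exact file name takes precedence over a wildcard pattern, which is the reading under which both Indirect configurations behave as described. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def is_wildcard(pattern):
    """True if the pattern contains any glob metacharacter."""
    return any(ch in pattern for ch in "*?[")


def is_indexed(name, include, exclude=""):
    """Decide whether one file is indexed under the given filters."""
    name, include, exclude = name.lower(), include.lower(), exclude.lower()
    included = fnmatchcase(name, include)
    excluded = fnmatchcase(name, exclude)
    if included and excluded:
        # Assumption: an exact Include name beats a wildcard Exclude;
        # in every other conflict the Exclude filter wins.
        return is_wildcard(exclude) and not is_wildcard(include)
    return included


def select_files(names, include, exclude=""):
    return [name for name in names if is_indexed(name, include, exclude)]


folder = ["Index.djvu", "Able_001.djvu", "Able_002.djvu", "Docmeta.txt"]

bundled = select_files(["Able.djvu", "Able.djvu.txt"], "*.djvu")
document_level = select_files(folder, "index.djvu", "*.djvu")
page_level = select_files(folder, "*.djvu", "index.djvu")

print(bundled, document_level, page_level)
```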
In DjVuSearch, there are two methods to open page-level DjVu files. The first is by creating a simple hyperlink directly to the DjVu page. In this method, the DjVu is treated as a stand-alone page and the internal page navigation buttons of the DjVu Viewer will be grayed-out.
The more advanced method is to create a hyperlink that opens the index.djvu file instead of the page file, and then goes to the page using a CGI page-open argument. In this method, the DjVu page is opened with internal Viewer page navigation enabled.
Note: for this latter method to work, the names of the page-level DjVu files must end with a number that corresponds to the page number within the multi-page DjVu document. Here is an example:
Index.djvu
Page_001.djvu
Page_002.djvu
Page_003.djvu
The page number will be parsed from the filename, and will be used as the goto page argument in DjVuSearch.
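A minimal Python sketch of this filename parsing follows. The function names are hypothetical, and the sketch stops short of building the hyperlink, because the exact form of the CGI page-open argument is not given in this paper.

```python
import re
from pathlib import PurePosixPath

# Digits at the very end of the file name, before the extension.
TRAILING_NUMBER = re.compile(r"(\d+)$")


def page_number_from_filename(path):
    """Return the page number encoded in a page file name, or None."""
    stem = PurePosixPath(path).stem  # "Page_003.djvu" -> "Page_003"
    match = TRAILING_NUMBER.search(stem)
    return int(match.group(1)) if match else None


def goto_target(path):
    """Return (index file path, page number) for a page-level hit, or None."""
    page = page_number_from_filename(path)
    if page is None:
        return None
    index_path = str(PurePosixPath(path).with_name("index.djvu"))
    return index_path, page


print(goto_target("docs/Able/Page_003.djvu"))
print(goto_target("docs/Able/Index.djvu"))
```

A file whose name does not end in a number, such as Index.djvu, yields no target, which corresponds to falling back to the simple direct hyperlink.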