Multiple Index Files for a Single INDIRECT DjVu

A report by PlanetDjVu, February 8, 2003

DjVu BUNDLED and INDIRECT Formats

DjVu files can be stored in either BUNDLED (one file) or INDIRECT (many files) format. The INDIRECT format is better suited to web viewing, because it avoids the need to open one large multipage file; opening single-page files is much faster. A file created in one storage format can be opened and saved in the other (BUNDLED or INDIRECT) using the DjVu web browser plugin.

INDIRECT Format Structure

With the INDIRECT storage format, each page is stored as a separate DjVu file, and the index information that identifies all the pages in the document is stored in a separate file, also with a .djvu file extension. Opening the index file in effect opens all the pages that it identifies.

If you look at the files of an INDIRECT DjVu document in a file manager, it can be confusing, because the index file has the same file extension as the page files. Normally, though, the index file is named either index.djvu or directory.djvu, and the page files are named something like page001.djvu, so you can tell them apart by the naming conventions used. An index file can have any name, and this is a useful feature, as we will see below.

OBSOLETE INDEXED Structure

Early versions of DjVu files did not store the index information in a separate file, but instead the index information was stored inside the DjVu file of the first page of the document. This architecture meant that only one index could be created for a multipage document. This was seen as a limitation by AT&T Labs developers, and so this design for the INDIRECT format was made obsolete.

For the past three years, all DjVu files stored as INDIRECT format files store the index information in a separate file. All of the DjVu files in the Gallery of PlanetDjVu are stored in the modern INDIRECT format. To see examples of DjVu files that are still stored in the OBSOLETE INDEXED format, visit the DjVu Examples section of the LizardTech website. The example files there are more than three years old, created with early DjVu software from AT&T Labs when the OBSOLETE INDEXED format was still being created. If you encounter an OBSOLETE INDEXED DjVu file, you can use the DjVu plugin to save as INDIRECT, and the saved file will then be in the modern INDIRECT format.

No Path Information in an Index File for INDIRECT DjVu

An index file that is created for a DjVu document stored in INDIRECT format must be kept in the same directory or folder as the page DjVu files. It cannot be moved to a different folder. This is because there is no provision for storing a file path in the index file. The index file always expects the page DjVu files to be in the same location.

Multiple Index Files for a DjVu INDIRECT Document

The modern architecture for INDIRECT DjVu files keeps the index information in a separate file, and there can be more than one index file for a set of pages in a document. Let's consider a few examples of documents where having multiple index files will be useful.

First, consider a bound volume of magazines for a year. We have all the pages of the volume stored as DjVu pages. Now, for this volume, we could have an index file for the entire volume, an index file for each of the individual issues in the volume, and an index file for each article within each issue.

Second, consider a scanned book that has a public section and a confidential section. There can be one index file for the entire book, and one index file for the public section only. Readers without security clearance are only given a link to the index file for the public section of the book.

Multiple Index Files Demonstrated

Now we will demonstrate the use of multiple index files!

We have taken an article on Amy Grant from an issue of the Saturday Evening Post. We have made 5 DjVu pages - the cover of the issue, a full-page photo of Amy preceding the article, and the 3 pages of the actual article. Here are thumbnail images of the 5 pages in this demonstration, and links to open them up directly as single-page DjVu files without using an index file:


page001.djvu	page002.djvu	page003.djvu	page004.djvu	page005.djvu

Now, suppose that this article is one of thousands of articles for which we are generating search-indexes. We want one search-index for just issue covers, one for just photos, one for article text only, and one for complete articles. We can do this by creating and using multiple index files, and the pages can be shared between these index files. This means that only one copy of each page is stored on disk: each single-page DjVu file exists once and is referenced by multiple index files.

We will use a naming convention for the index files, starting the names with an underscore. This way, when the files are viewed in a file manager, the index files always sort to the top of the folder, before the page files (which all start with a letter or number that sorts after the underscore). We will also give the index files standard names that correspond to the types of search-indexes we are creating.

Here is a list of the multiple index files and pages in our folder, which is stored on the web as http://www.searchpdf.com/gallery/amy/, and which contains 4 index files and 5 single-page DjVu files:

_article.djvu

_complete.djvu

_cover.djvu

_photo.djvu

page001.djvu

page002.djvu

page003.djvu

page004.djvu

page005.djvu

Now we have a file structure that allows us to build multiple search-indexes. For the search-index of complete articles, we will apply an INCLUDE filter of "_complete.djvu". For the search-index of covers, we will apply an INCLUDE filter of "_cover.djvu", and so on. Search-indexing an index file covers only the pages referenced by that index file.
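
The relationship between index files and shared pages can be pictured as a small lookup table. The Python sketch below is illustrative only: INDEX_FILES, pages_for_filter, and indexes_referencing are hypothetical names, not part of any DjVu tool, and the page assignments are simply those of this demonstration folder.

```python
# Hypothetical model of the demonstration folder: each index file maps to
# the single-page DjVu files it references. Pages are shared, not copied.
INDEX_FILES = {
    "_article.djvu": ["page003.djvu", "page004.djvu", "page005.djvu"],
    "_complete.djvu": ["page001.djvu", "page002.djvu", "page003.djvu",
                       "page004.djvu", "page005.djvu"],
    "_cover.djvu": ["page001.djvu"],
    "_photo.djvu": ["page002.djvu"],
}


def pages_for_filter(include_filter):
    """Return the pages a search-index would cover for one INCLUDE filter."""
    return list(INDEX_FILES[include_filter])


def indexes_referencing(page):
    """Return, in sorted order, every index file that references a page."""
    return sorted(name for name, pages in INDEX_FILES.items() if page in pages)
```

In this model, page003.djvu is stored once but is reachable through both _article.djvu and _complete.djvu.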

Let's take a look!

Click on a link to open that index file for this document:

Article Text Only	_article.djvu	Pages 3 - 5
Complete Article with Cover	_complete.djvu	Pages 1 - 5
Cover Only	_cover.djvu	Page 1 only
Photo Only	_photo.djvu	Page 2 only

Making Multiple Index files for an INDIRECT DjVu

You can make multiple index files for an INDIRECT DjVu file using JRAConvert, a commercial application created by JRA and used in this demonstration. Unfortunately, it is not a released product at this time because there are DjVu licensing issues to be resolved.

There are no other GUI applications that you can use to create multiple index files, but you may be able to get the job done by writing a custom script for the Command Line Encoder (now called Document Express Enterprise Edition, from LizardTech), or by compiling the DjVuLibre open source reference library for DjVu, and writing a custom script for that.
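
As a rough idea of what such a custom script might look like with DjVuLibre, the Python sketch below only builds command lines and does not run them. It assumes the DjVuLibre tools djvm (with -c to create a bundled document from pages) and djvmcvt (with -i to write an INDIRECT copy with a named index file). The function name build_commands and the scratch directory are hypothetical, and whether the resulting index file can be placed next to the original shared pages without further adjustment has not been verified here.

```python
def build_commands(index_name, pages, scratch_dir="scratch"):
    """Build (but do not run) two DjVuLibre command lines for one index file.

    Assumption: `djvm -c` bundles the listed pages into a temporary document,
    and `djvmcvt -i` then writes an INDIRECT copy (pages plus index file)
    into scratch_dir. Only the index file would then be copied next to the
    existing shared pages.
    """
    bundle = f"{scratch_dir}/bundle.djvu"
    create = ["djvm", "-c", bundle, *pages]
    convert = ["djvmcvt", "-i", bundle, scratch_dir, index_name]
    return [create, convert]
```

The returned lists could be passed to subprocess.run one at a time once the approach has been checked against a test folder.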

Embedded Metadata Stored in each Index File

JRA Software allows you to embed metadata fields in index files. In our example, the Title field of the index file _cover.djvu will be "The Saturday Evening Post Cover", while the Title field of the index file _photo.djvu will be "Amy Grant Photo". This metadata is then used in search and retrieval operations.

Embedded metadata, combined with multiple index files for an INDIRECT document, allows you to search collections of articles, while the files are actually stored on the computer at the volume level. This is ideal for magazines and newspapers. One page may contain more than one article, but this is no problem, because more than one index file can reference the single-page DjVu file.

Admittedly, there is a shortcoming in the design of embedded metadata storage in JRA Software, when multiple index files are used. We assume that there is only one index file in an INDIRECT folder, and we therefore store embedded metadata in a file called "docmeta.iff". The .iff file is referenced by the index file. We need to change this so that the name of the .iff file corresponds to the index file name. Then, using our example, we would have the following files in the folder:

_article.djvu

_article.iff

_complete.djvu

_complete.iff

_cover.djvu

_cover.iff

_photo.djvu

_photo.iff

page001.djvu

page002.djvu

page003.djvu

page004.djvu

page005.djvu
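
The proposed naming rule is simple enough to state in code. This is a minimal Python sketch of the rule only; metadata_name is a hypothetical helper and says nothing about how JRA Software reads or writes .iff files.

```python
from pathlib import Path


def metadata_name(index_file):
    """Derive the per-index metadata file name, e.g. _cover.djvu -> _cover.iff."""
    return Path(index_file).with_suffix(".iff").name
```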

Index File Generator for DjVu?

Here is a great idea for an application to automate the generation of index files for an INDIRECT DjVu folder. A database can be read that identifies the name of each index file to generate, and the pages that are to be included with each index file. If we are generating an index file for each article in a magazine issue, for example, then the include pages will be those for each article.
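
The planning half of such a generator can be sketched in a few lines of Python. Here a small CSV text stands in for the database; the column names (index_file, first_page, last_page), the function read_index_plan, and the sample rows are all hypothetical, and the page names follow the page001.djvu convention used in this report. Writing the actual index files is left out.

```python
import csv
import io


def read_index_plan(csv_text):
    """Parse rows of (index_file, first_page, last_page) into a page plan."""
    plan = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        first, last = int(row["first_page"]), int(row["last_page"])
        # Expand the inclusive page range into page file names.
        plan[row["index_file"]] = [
            f"page{number:03d}.djvu" for number in range(first, last + 1)
        ]
    return plan


# Hypothetical sample data: one row per index file to generate.
SAMPLE = """index_file,first_page,last_page
_article.djvu,3,5
_cover.djvu,1,1
"""
```

Calling read_index_plan(SAMPLE) yields one entry per index file, each listing the page files that index should reference.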

A great idea to be sure. Too bad that LizardTech does not license the DjVu Library for third-party application development... Today, sadly, there are no commercial third-party applications for DjVu that are available on the market (unlike PDF, for which there are 1,000+ third-party applications).

Conclusion

The DjVu file format designers at AT&T Labs made sure that INDIRECT DjVu files could be created with multiple index files. This fact was overlooked and nearly forgotten, and so it has not been implemented in the DjVu products from LizardTech, and it has not been presented until now. It is an important feature of the INDIRECT storage format for DjVu, and it can be highly useful in digital publishing projects that use multiple search-indexes, or those that use hyperlink (bookmark) trees to access documents at multiple levels.

Real-Life Example

2/22/2003 - Jeffery Triggs has provided the following link to a real-life example:

Illustrated Shakespeare Online contains a three-volume edition of the histories, comedies, and tragedies, as well as a volume of non-Shakespearean Elizabethan drama. There is an index for each volume, but also a separate index for each individual play, which allows the plays to be printed off one at a time, searched efficiently, etc.

http://www.leoyan.com/djvu-editions.com/SHAKESPEARE/COMPLETE/
