DjVu as Digital Surrogate for Paper Publications
by James Rile, PlanetDjVu, August, 2001

Digital Surrogates Needed Despite Basic Differences
Paper-based publications are constructed of atoms, while digital publications are constructed of bits. By their basic construction, one clearly cannot equal the other.

Another irrefutable difference lies in the environments in which they reside. You cannot curl up with a digital book (not quite yet), and you cannot search a paper publication (although you can leaf through the pages).

Despite the differences, we now have very important reasons to create digital publications that are faithful surrogates (stand-ins) for paper publications.

We can distribute and provide access to digital surrogates globally and instantly via the web, and in large collections on CD and DVD.

What makes a digital file a surrogate for paper?
Let's begin by stating that the layout and page formatting of the paper publication must be faithfully maintained, as well as the text contained on the pages. This disqualifies HTML, which cannot retain the full formatting layout of the paper.

We must use a raster (digital) image of the page in our digital publication for it to be a faithful surrogate.

We have already come a long way towards accepting raster images as surrogates, with our acceptance of photocopies and faxes as replacements for original paper. While these are paper surrogates for paper, the equipment that produced them created them from raster images.

How are raster images used in a digital surrogate file format?
Quite simply, the raster image must be what is viewed when the page is presented on the display device, and what is printed when the page is sent to a printer.

Is Color a requirement for a faithful digital surrogate?
The obvious answer is yes. When we look at the paper we see color, and our digital display device can show us color (and the web is rich with color), so what's the problem? The basic problem has been that color raster image formats have been too large for the internet bandwidth available today. The DjVu format solves this problem by providing superior compression of color raster images so that they can be easily transmitted via the web. We are approaching the day when black and white page images will seem as antique (and unrealistic) as black and white movies.
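
To make the bandwidth problem concrete, here is a rough back-of-the-envelope sketch of how large one uncompressed color page image is and how long it would take to send over a dial-up connection. The page size, scan resolution, and connection speed are assumptions chosen for illustration; none of them come from the article, and the sketch says nothing about the sizes DjVu itself achieves.

```python
# Rough size estimate for one uncompressed color page scan.
# Every constant below is an illustrative assumption, not a figure from the article.

PAGE_WIDTH_IN = 8.5           # assumed US Letter page width, in inches
PAGE_HEIGHT_IN = 11.0         # assumed US Letter page height, in inches
DPI = 300                     # assumed scan resolution
BYTES_PER_PIXEL = 3           # 24-bit RGB color
MODEM_BITS_PER_SEC = 56_000   # assumed dial-up connection speed


def uncompressed_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Size in bytes of a raw raster image of the page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def transfer_seconds(size_bytes, bits_per_sec):
    """Ideal transfer time, ignoring protocol overhead."""
    return size_bytes * 8 / bits_per_sec


size = uncompressed_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI, BYTES_PER_PIXEL)
seconds = transfer_seconds(size, MODEM_BITS_PER_SEC)
print(f"{size / 1_000_000:.1f} MB uncompressed, about {seconds / 60:.0f} minutes at 56 kbit/s")
# prints "25.2 MB uncompressed, about 60 minutes at 56 kbit/s"
```

Under these assumptions a single raw color page takes about an hour to download, which is why strong compression matters for page images delivered over the web.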

The Concept and Importance of Digital Bookbinding
When a paper publication becomes digital, it loses the physical binding.  With no spine or page-flipping ability, we must find other ways to bind digital paper together.  

The binding of digital pages creates a paginated format, as opposed to non-paginated formats like HTML and XML. Because our digital surrogate publication uses page images, we cannot reformat the page breaks anyway.

Within the paginated digital file, then, we must have means of navigation that are the best that the digital environment can provide.

Meta-Information is the Digital Bookbinding
Our fingers cannot flip pages in the digital environment; they press keys and operate pointer devices instead. These actions control the navigation through a digital publication, using meta-information.

The types of meta-information stored with the page images (to comprise a complete digital surrogate publication) are:

Bookmarks (table of contents hyperlinks)
Thumbnail hyperlinks
OCRed text for search and retrieval navigation
Page number and word coordinate references

Without these elements, the user cannot navigate a digital paginated publication, so they are essential components.
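
As an illustration of how these four kinds of meta-information could sit alongside the page images, here is a minimal sketch of an in-memory model. It is not the DjVu (or PDF) file layout; every class, field, and file name in it is hypothetical and chosen only to mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str    # one OCRed word
    box: tuple   # word coordinates in page-image pixels: (left, top, right, bottom)


@dataclass
class Page:
    number: int          # page number reference
    image_file: str      # raster page image that is displayed and printed (hypothetical name)
    thumbnail_file: str  # small preview that links to this page (hypothetical name)
    words: list = field(default_factory=list)  # OCRed text for search and retrieval


@dataclass
class Bookmark:
    title: str        # table-of-contents entry
    page_number: int  # page the entry links to


@dataclass
class Publication:
    pages: list
    bookmarks: list

    def search(self, term):
        """Return (page number, word box) for every OCRed word equal to term, ignoring case."""
        needle = term.lower()
        return [
            (page.number, word.box)
            for page in self.pages
            for word in page.words
            if word.text.lower() == needle
        ]

    def goto(self, title):
        """Resolve a bookmark title to its page number, or None if there is no such bookmark."""
        for mark in self.bookmarks:
            if mark.title == title:
                return mark.page_number
        return None


# Usage with made-up data.
book = Publication(
    pages=[
        Page(1, "p0001.img", "t0001.img", [Word("Preface", (100, 80, 260, 110))]),
        Page(2, "p0002.img", "t0002.img", [Word("binding", (90, 400, 200, 425))]),
    ],
    bookmarks=[Bookmark("Preface", 1), Bookmark("Chapter 1", 2)],
)
print(book.search("Binding"))   # prints [(2, (90, 400, 200, 425))]
print(book.goto("Chapter 1"))   # prints 2
```

The point of the sketch is that the page image alone carries no navigation: the bookmarks, thumbnails, and OCRed words with coordinates are what let a viewer jump to a page or highlight a search hit on the image.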

Image + Text format is the answer for a faithful digital surrogate
There are only two digital document formats today that both present paginated page images for display and printing, and also support OCRed text for search and retrieval. These are the PDF format and the DjVu format.  Of these two, only DjVu is capable of presenting color images in a file size that is compact enough for web delivery.

We therefore conclude that the DjVu format is the digital format that best qualifies as a faithful digital surrogate publication format, for delivery on the web today.

