University of Louisiana Digital Library
PDF Normal from Image compared to DjVu with hidden text
This comparison is made using academic articles published at the University of Louisiana Digital Library in PDF Normal format. The PDF files were created between 1996 and 1998. The Normal form of PDF is the smallest form of PDF that can be created from scanned page images.

In the Normal form of PDF, the text is converted to visible ASCII text with fonts, point sizes and text attributes such as bold and italics. Graphic illustrations, equations and symbols are left as graphic objects in the PDF. These graphic objects are "snippets" of the original complete scanned page image. Because the page holds only these snippets rather than the complete image, the PDF Normal file is smaller.

The downside of PDF Normal is that it is impossible to represent the layout of the original page with complete fidelity.  Also, defects can occur in the ASCII text presentation, as you can see in these PDF examples.

DjVu takes a different approach from PDF Normal in producing a small full-text-searchable file. High-contrast analysis is performed to separate background graphics from foreground text. The text is compressed with the JB2 compression method, which is superior to the supported PDF methods for bitonal images. The graphics are compressed with IW44, which is superior to the supported PDF methods for graphics. The background and foreground layers are merged when the image is presented in the viewer.
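
The layer separation described above can be illustrated with a toy sketch. This is not the DjVu segmenter, and it does not implement JB2 or IW44. It assumes a grayscale page stored as nested lists and uses a fixed brightness threshold, and the names `segment`, `merge` and `threshold` are hypothetical. It only shows the idea of splitting a page into a bitonal text mask and a smooth background, then merging them again for display.

```python
def segment(gray, threshold=128):
    """Split a grayscale page into a bitonal text mask and a background layer.

    Assumption: dark pixels (below `threshold`) are text. A real segmenter
    uses a far more careful analysis than a single fixed threshold.
    """
    mask = [[1 if px < threshold else 0 for px in row] for row in gray]

    # Fill the holes left by the text with the mean of the non-text pixels,
    # so the background layer is smooth and compresses well.
    bg_pixels = [px for row, mrow in zip(gray, mask)
                 for px, m in zip(row, mrow) if not m]
    fill = sum(bg_pixels) // len(bg_pixels) if bg_pixels else 255
    background = [[fill if m else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(gray, mask)]
    return mask, background


def merge(mask, background, ink=0):
    """Recombine the layers as a viewer would: ink where the mask is set."""
    return [[ink if m else b for m, b in zip(mrow, brow)]
            for mrow, brow in zip(mask, background)]


# A tiny 3x4 "page": light paper with three dark text pixels.
page = [
    [250, 250, 10, 250],
    [240, 20, 15, 245],
    [250, 250, 250, 250],
]
mask, background = segment(page)
restored = merge(mask, background)
print(mask)
print(restored)
```

In this sketch the mask would go to a bitonal coder and the background to a continuous-tone coder. The two layers have very different statistics, which is why coding them separately can pay off.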

Like PDF Image + Text (now called Searchable Image PDF), DjVu has a searchable text layer hidden under the image, and in JRASearch both forms can be searched with resulting search term highlighting.

In the table below, click on the size to open that file:

Library PDF Filename     PDF Normal   DjVu
cdx00107
cdx00108
cdx00111
cdx00114
cdx00115
cdx01237
cdx01240
cdx01242
Total File Size          8.39         3.18
Reduction Percentage     100%         37%


Note that the DjVu file for cdx01240 is OCRed in three languages: English, German and French!



PDF Color Image-Only compared to DjVu Photo and Segmented

Historic weather logs from the University of Louisiana are presented as JPEG-compressed images in a "PDF Wrapper".  We compare these to DjVu Photo (background IW44 compression only) and DjVu Segmented (background and foreground layers).

Weather Log PDF Filename   PDF Image   DjVu Photo   DjVu Segmented
rj00141
rj00142
rj00143
rj00145
rj00147
rj00148
rj00149
rj00150
Total File Size            16.42       7.73         0.86
Reduction Percentage       100%        47%          5%
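
The "Reduction Percentage" rows in both tables appear to give each total as a share of the PDF total, not the amount saved. The sketch below recomputes those shares from the totals shown, assuming both totals in a table use the same unit. The function name `relative_size_percent` is hypothetical.

```python
def relative_size_percent(baseline, candidate):
    """Return the candidate size as a percentage of the baseline size."""
    if baseline <= 0:
        raise ValueError("baseline size must be positive")
    return 100.0 * candidate / baseline


# Totals copied from the two tables (units as given in the source).
totals = {
    "PDF Normal vs DjVu": (8.39, 3.18),
    "PDF Image vs DjVu Photo": (16.42, 7.73),
    "PDF Image vs DjVu Segmented": (16.42, 0.86),
}

for label, (pdf_total, djvu_total) in totals.items():
    pct = relative_size_percent(pdf_total, djvu_total)
    print(f"{label}: {pct:.1f}% of PDF size, {100 - pct:.1f}% smaller")
```

With the rounded totals, the first ratio comes out a little under 38%, which the table shows as 37%. The small difference may come from rounding in the per-file sizes.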


The defects in the DjVu Segmented version (text in the background) will be eliminated when the uncompressed color image file is used for DjVu encoding. In this presentation, the compressed color image file in the PDF was used, limiting the effectiveness of the segmenter. For an example of good segmentation of handwritten text using uncompressed color images as input, see: http://www.planetdjvu.com/gallery/jones.djvu.







