M.L.K. Newspaper comparison using JRAPublish
a study by PlanetDjVu, February 12, 2004
This study is inspired by both the recent release of JRAPublish 2.0 and a return visit to the NewspaperArchive.com website. NewspaperArchive.com was an early adopter of the DjVu format for the conversion of newspaper microfilm to digital web documents, but a year or two ago, for reasons not explained (perhaps a "political" decision), they switched to PDF using CCITT Group 4 compression. One thing we can say for sure is that this decision was not made for the purpose of delivering content faster to the end user, because the current format and compression method are much slower, as we demonstrate.
Study Method
We downloaded the 5 PDF newspaper pages and then used them as input files in JRAPublish. We generated dual searchable-image DjVu and PDF files, applying the JBIG2 compression method to the PDF files. It was necessary to use the "force pages to bitonal" method for processing the input PDF pages, because even though the page images were bitonal, there was unnecessary colorspace information stored in the PDF. This is an all-too-frequent problem with bitonal-image PDF files created by other software.
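To illustrate what forcing a page to bitonal means in general, here is a minimal sketch that reduces color pixels to 1-bit values with a simple luminance threshold. This is not JRAPublish's method, which the study does not describe. The function name `force_bitonal`, the threshold of 128, and the tiny sample page are all assumptions made for illustration.

```python
def force_bitonal(pixels, threshold=128):
    """Map rows of (r, g, b) pixels to rows of 0 (black) or 1 (white).

    Hypothetical helper for illustration only; not JRAPublish's algorithm.
    """
    result = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            # Integer approximation of the ITU-R BT.601 luma weights.
            luma = (299 * r + 587 * g + 114 * b) // 1000
            out_row.append(1 if luma >= threshold else 0)
        result.append(out_row)
    return result


# A made-up 2x2 "page" stored as RGB even though it is visually black and white.
page = [
    [(255, 255, 255), (0, 0, 0)],
    [(250, 250, 250), (10, 10, 10)],
]
bitonal = force_bitonal(page)
print(bitonal)
```

Once every pixel is a single bit, the page no longer needs any colorspace information and can be handed to a bitonal encoder such as JBIG2 or CCITT Group 4.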
Click on any size in the chart below to open that file. Compare the time it takes to open each version. You will see that by far the fastest performers are the DjVu files, followed by the JBIG2-compressed PDF files generated by JRAPublish 2.0.
Featured Article Title
Manhunt On for King's Assassin; Nation Grieves
Violence follows King's Death
King to Ignore Ban on March
New Looting, U.S. Moves More Troops Into Capital
Taft Urges King to Cancel March
Warning Signals Fly
3.614 Mb
2.432 Mb
1.683 Mb
100 %
67 %
47 %
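The percentages above appear to be each size expressed relative to the largest one. The short sketch below reproduces that arithmetic, assuming the first figure is the baseline; the variable names are illustrative, and the surviving text does not label which file version each figure belongs to.

```python
# Sizes as listed in the chart, in the order given (units as printed: "Mb").
sizes_mb = [3.614, 2.432, 1.683]

# Assumption: the first (largest) size is the 100 % baseline.
baseline = sizes_mb[0]
percentages = [round(100 * size / baseline) for size in sizes_mb]

for size, pct in zip(sizes_mb, percentages):
    print(f"{size:.3f} Mb -> {pct} %")
```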
OCR Comparison
The OCR layer of the PDF from NewspaperArchive.com has more recognition errors than the OCR layer produced by JRAPublish. The NewspaperArchive.com text also suffers from words running together, and the highlight rectangles do not map correctly over the text when you select it. In PDF and DjVu files produced by JRAPublish, the highlight rectangles are very accurate.
In the second table, we show the results of OCR from Document Express 4.1 and Adobe Acrobat 6.01 Capture Plug-in. Both results have significant errors that do not occur in the OCRed text generated by JRAPublish.
[Image of text]

OCRed Text generated by JRAPublish:

MEMPHIS, Tenn. (AP) —
Authorities pressed a manhunt
today for the killer of Dr.
Martin Luther King Jr. whose
assassination yesterday
touched off Negro violence in
a number of American cities
and brought a national out
pouring of grief and sorrow.
King. 39, leading advocate
of nonviolence and Nobel
Prize winner, died in a Mem
phis hospital last night less
than an hour after he was shot
in the neck by a white gun
man while standing on the
balcony of his motel.
Police director Frank Hollo-
man said today that a single
white man, following an ap
parently well planned proce
dure, was the assassin.

OCRed Text from NewspaperArchive.com:

MEMPHIS, Tenn. (AP) ???
Authoritiespressed a manhunt
todayfor the killer of Dr.
Martin Luther KingJr. whose
assassination yesterday
touchedoff Negro violence in
a number of American cities
and brought a national outpouring
of grief and sorrow.
King. 39, leadingadvocate
of nonviolence and Nobel
Prize winner, died in a Memphis
hospital last night less
than an hour after he was shot
in the neck bya whit* gunman
while standing on the
balcony of hismotel.
PolicedirectorFrank Holloman
said todaythat a single
white man, following an apparently
well planned procedure,
was the assassin.
[Image of text]

OCRed Text generated by Document Express:

MEMPHIS, Tenn. tAP) --
Authorities pressed a manhunt
today for the killer of Dr.
Martin Luther King Jr. whose
assassination y e s t e r d a y
touched off Negro violence in
a number of American cites
and brought a national out-
pouring o( grief and sorrow.
King. , leading advocate
of nonviolence and N o b e 1
Prize winner, died in a Mem-
phis hospital last night less
than an hour after he was shot
in the neck by a whit gun-
Dog ay..have o.
m balcony of his motel.
Police director Frank Hollo-
man said today that a single
white man, following an ap.
y'a- ngs parenfly well planned proee-
seen s dure, was the assassin.

OCRed Text generated by Adobe Acrobat Capture:

MEMPHIS, Tenn. (AP? -
Authorities pressed a manhunt
today for the killer of Dr.
Martin Luther King Jr. whose
assassination y e s t e r d a y
touched off Negro violence in
a number of American cities
and brought a national outpouring
r$ grief and sorrow.
King. 3B. leading advocate
of nonviolence and N o b e 1
Prize winner, died in a Memphis
hospital last night less
than an hour after he was shot
in the neck by a white gunman
while standing on the
balcony of his motel.
Police director Frank H o b
man said today that a single
white man, following an ag
psrently well planned procedure,
%as the assassin.
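A rough way to put a number on differences like these is to compare each OCR output against a trusted transcription, word by word. The sketch below uses Python's standard `difflib` for this. It is not part of the original study: the function name `word_error_ratio` is illustrative, and the hand-typed reference line is an assumption. The two sample lines are taken from the first OCR table.

```python
import difflib


def word_error_ratio(reference: str, ocr: str) -> float:
    """Return 1 minus the word-level similarity ratio (0.0 means identical)."""
    ref_words = reference.split()
    ocr_words = ocr.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    return 1.0 - matcher.ratio()


# Hand-typed reference for two lines of the clipping (an assumption, not study data).
reference = "Authorities pressed a manhunt today for the killer of Dr."

# The same two lines as they appear in the first OCR table.
samples = {
    "JRAPublish": "Authorities pressed a manhunt today for the killer of Dr.",
    "NewspaperArchive.com": "Authoritiespressed a manhunt todayfor the killer of Dr.",
}

scores = {name: word_error_ratio(reference, text) for name, text in samples.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Splitting on whitespace means that each pair of run-together words counts as a mismatch, which is exactly the kind of error that hurts full-text search. A character-level comparison would be more forgiving of that failure mode.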
Converting microfilm to bitonal image files (TIFF, G4 compressed) and then converting those files to searchable-image web documents is a sound approach, but care must be taken to keep file size to a minimum while maximizing the quality of the recognized, searchable text. JRAPublish 2.0 excels at this task, producing higher-quality files than those generated and used in the commercial NewspaperArchive.com portal. The JBIG2-PDF files encoded by JRAPublish are a big improvement, but for best results use DjVu, or use the dual-format generating capability of JRAPublish to make both file types and give the choice of file type to the end user!