RLG DigiNews
  February 15, 2003, Volume 7, Number 1
ISSN 1093-5371


Table of Contents

Feature Article 1
Debunking of Specsmanship: Progress on ISO/TC42 Standards for Digital Capture Imaging Performance, by Don Williams

Feature Article 2
Risk Management for Web Resources: A Case Study on Southeast Asian Web Sites, by Peter Botticelli

Highlighted Web Site
Wilhelm Imaging Research

FAQ
Squeezing more life out of bitonal files: a study of black and white. Part I, by Richard Entlich

Calendar of Events

Announcements




Debunking of Specsmanship

Don Williams
Eastman Kodak Company
don.williams@kodak.com

Introduction

And so it is with the claims of the capture performance of digital imaging devices. One is beckoned by a cacophony of vendor specifications. Loud and confusing (and unregulated), they are, ironically, seductive. The greater the numerical extreme, the greater the allure. For the inexperienced, evaluating these assertions in a consistent scientific sense is futile. Indeed, even the experienced are disadvantaged without the appropriate tools and guidelines. Let’s face it—for the most part we use these marketing specifications, along with brand name and price, as imaging performance guides.

For serious amateurs and professionals with demanding projects or clients, relying on this information as an imaging performance indicator is precarious, especially under the constraints of a high-productivity workflow. The difference between sampling frequency (dpi) and true resolution is confusing. Improved "optical" resolution claims remain suspect. Bit depth alone is far from a sufficient criterion for specifying dynamic range, and the existence of artifacts and noise is dismissed with a shrug. Unlike the world of analog imaging, where one could confidently rely on the history-rich reputation of a few manufacturers for performance integrity, today’s digital imaging landscape offers far fewer assurances.

The efforts of ISO/TC42 toward imaging performance standardization are slowly, but surely, changing this free-for-all. Through the participation of scientists and device manufacturers, a unified architecture of objective signal- and noise-based metrics is evolving to help remove device performance ambiguity and permit robust cross-device comparison. Adapted from proven approaches over a half-century of analog imaging experience, these metrics can be used as figures of merit in their own right or may be extended as weighted inputs into higher-order models of image quality. Some are still the subject of ongoing research. They are not perfect, but nonetheless offer the best compromise between technical rigor and ease of use.

Of course, the simple issuance of a standard will not ensure its adoption. For this, education, enablement, and improvement efforts are necessary. These have not been forgotten by the TC42 committee members who complement the standard itself by way of classes, technical papers, free software, benchmark testing, and target creation. These are perhaps more important than the documentation itself because they provide practical exercising of the standard by interested users who in turn provide feedback that allows improvements to the standard’s practice.

The progress, status, and content of the following TC42 imaging performance standards will be discussed in the indicated groupings. Associated with each standard is an ISO status that ranks, in order, its progression toward full ISO adoption.

I. Terminology ISO/DIS 12231
II. Opto-electronic conv. function ISO 14524
III. Resolution: still picture cameras ISO 12233
     Resolution: print scanners ISO/FDIS 16067-1
     Resolution: film scanners ISO/CD 16067-2
IV. Noise: still picture cameras ISO/FDIS 15739
     Dynamic range: film scanners ISO/CD 21550

WD Working draft
CD Committee draft
DIS Draft international standard
FDIS Final draft international standard
ISO International standard

Terminology

Frequently forgotten among the techniques and practices outlined in technical standards is the definition, or terminology, section. Although individual standards typically carry their own terminology section, ISO 12231 is a collective document that draws from a number of TC42 electronic-imaging working groups (WG 18, JWG 20, JWG 23). As such, it provides a broad perspective on electronic-imaging terms. Occasionally, definitions from working groups involved with traditional imaging are also included for completeness.

Opto-Electronic Conversion Function—OECF

At the foundation of nearly all of the ISO/TC42 performance standards is the opto-electronic conversion function (OECF). Similar to a film’s characteristic curve, which plots the transfer of exposure into optical film density, the OECF defines the relationship between exposure, or reflectance, and the digital count value of a capture device. By itself the OECF appears low tech, but it allows one to evaluate the effective gamma applied to an image, as well as any unusual tonal manipulations. Its real power, though, lies in acting as a tonal Rosetta Stone for remapping count values back to a common, physical image-evaluation space so that meaningful cross-device performance evaluation can be done. It is also the hub for all of TC42's performance standards, which is why it is cited and used so frequently in them.

The only standard dedicated to the OECF, ISO 14524, is written for digital cameras. Nevertheless, the use of the OECF for film and print scanners is described and required in the annexes of the standards particular to those devices. Though the OECF is intimately tied to the other performance metrics, its calculation was historically made from separate image captures rather than from the same captures used for the metrics of prime interest. This led to inconsistent results between captured frames because of the auto-contrast or scene-balance algorithms associated with capture devices. For this reason, gray patches for OECF calculation are now being integrated into the targets for all the other performance metrics.

Although performance metrics tend to be considered from a benchmarking perspective, monitoring the OECF on a periodic basis for QC purposes is perhaps its greatest benefit. A good example of this can be found in a paper by Johnston.[1] Unless one is confident that auto-contrast features are not being invoked, capture devices cannot be counted on to have a unique OECF. This is why small gray-scale patches are frequently placed alongside documents at the time of digitization and remain part of the digitized images. If designed correctly or included as metadata, they will provide an unambiguous tone path to the source document for faithful future rendering.
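
To make the remapping role of the OECF concrete, here is a minimal Python sketch (ours, not part of any TC42 standard) that treats measured gray-patch data as an OECF and inverts it to carry pixel counts back to reflectance or optical density. The patch reflectances and count values are hypothetical.

import numpy as np

# Reflectances of the gray patches on the target (known from the target specification)
patch_reflectance = np.array([0.02, 0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 0.90])
# Mean digital counts measured for each patch in the captured image (8-bit device)
patch_counts = np.array([12, 30, 58, 98, 150, 190, 225, 240])

def counts_to_reflectance(counts):
    """Invert the OECF: interpolate measured counts back to scene reflectance."""
    return np.interp(counts, patch_counts, patch_reflectance)

# Remap an arbitrary pixel value into the common, physical evaluation space
r = counts_to_reflectance(128)
print(r)                 # estimated reflectance for a count of 128
print(-np.log10(r))      # the same value expressed as optical density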

Resolution

The observation that "resolution" is among the most ambiguous and abused terms in imaging was made with respect to traditional analog imaging more than a decade before digital imaging became popular. Now, the sampling and interpolation associated with digital imaging have made the term even more ambiguous.

The advertising of device resolution in terms of finished image-file size is perhaps the most misleading of all. Through interpolation an infinite amount of "empty" resolution can be synthetically created that has no physical bearing on spatial detail detection (i.e., real resolution). Short of removing the detector from the camera and physically counting the sensor sites (ugh!), there is no way for the casual user to know the difference. Fortunately, through education, litigation, and standards this practice is becoming less common.

Though simple pixel count (e.g., Mpixels) and sampling frequency (e.g., dpi) are always cited and easy to understand, Mother Nature frowns at such laziness. She requires that optics, motion, image processing, and electronics contributions also be considered as factors influencing a device's true resolution. Then, and only then, is realistic spatial resolution determined. For this, the measurement of spatial frequency response (SFR) or modulation transfer function (MTF) of a device is required. These measurements unify the spatial resolution standards for electronic capture devices under TC42 and are described for cameras (ISO 12233), reflection scanners (ISO 16067-1), and film scanners (ISO/CD 16067-2). Each of these standards adopts a common slanted-edge-gradient MTF analysis technique especially suited for digital capture devices. Its accuracy has been benchmarked [2] with both synthetic and real image data. Its chief advantages are ease of use, durability, and analytical insight.

The suitability of MTF as an objective tool to characterize spatial imaging performance is well documented and has been used as an image-quality prediction tool for more than fifty years.[3] By characterizing contrast loss with respect to spatial frequency, one of its many uses can be to objectively establish the limiting resolution of a device. This is done by determining the spatial frequency associated with a given MTF value, typically 0.1. This frequency is then translated into limiting resolution for a given set of scan conditions and compared to the manufacturer’s claim to determine compliance. An example of this for a reflection scanner at three different sampling frequencies is shown in figure 1.

Fig. 1. MTFs of a reflection scanner at sampling frequencies of 250, 300, and 500 dpi.

Notice that the MTFs for each sampling frequency (250, 300, and 500 dpi) are essentially identical. The individual curves of figure 1 are difficult to identify because they literally overlay. This indicates no real resolution advantage at 300 and 500 dpi compared to the 250 dpi scan. This is indisputable. The 0.1 modulation level corresponds to 4 cycles/mm. Translating this to an effective resolution (dpi = cycles/mm × 50.8 ≈ 200 dpi), one finds that this scanner is really no better than a 200 dpi scanner, no matter what the advertising claims or sampling frequency. This analysis was performed with tools provided through the TC42/WG18 standards group and is one of many examples where they have been used to objectively clarify resolution performance.
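
As an illustration of the arithmetic above, the following Python sketch reads a limiting resolution off an MTF curve by finding the frequency at which the response falls to 0.1 and converting cycles/mm to dpi. The curve values here are hypothetical; the TC42/WG18 slanted-edge tools are what produce such a curve in practice.

import numpy as np

freq_cy_mm = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])      # cycles/mm
mtf        = np.array([1.0, 0.78, 0.45, 0.22, 0.10, 0.05, 0.03])

def limiting_resolution_dpi(freq, mtf, threshold=0.1):
    """Frequency at which the MTF crosses `threshold`, expressed in dpi."""
    f_limit = np.interp(threshold, mtf[::-1], freq[::-1])  # interpolate on the falling curve
    return f_limit * 50.8    # 2 pixels per cycle x 25.4 mm per inch

print(limiting_resolution_dpi(freq_cy_mm, mtf))   # ~203 dpi for this hypothetical curve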

Parenthetically, informative sections of ISO 16067-1 detail methods to extract sub-pixel color-channel registration errors from the same slanted-edge data used for the MTF.[4] This artifact is often a problem with linear-array scanners and is quantified in the analysis tools provided through the I3A Web site. Color misregistration as large as 1.5 pixels was calculated in the scanner of figure 1 with the same tools used for MTF calculation.

In the past, MTF measurement was confined to laboratory settings and never matured as a particularly field-friendly method for objectively determining resolution—a requirement for widespread adoption and credibility as a standard. This hurdle to acceptance has now been largely removed. Through the efforts of TC42 members, free automated software, debugging, affordable high-bandwidth targets, technique documentation, and educational workshops have been provided. The remaining challenges lie in the manufacturing and design of robust targets for film scanners and in improvements to target design for cameras.

Noise and Dynamic Range

Part of the seduction of digital imaging is the myth that it is noise free. Proclaiming a lack of film grain implicitly suggests that this is so. To demystify this, two ISO/TC42 standards are in development that define noise and dynamic range measurements: ISO/FDIS 15739 is intended for digital still cameras, and ISO/CD 21550 for film/print scanners. The camera standard (15739) is primarily intended to measure noise, but it also makes recommendations on dynamic range. Similarly, the film/print scanner standard (21550) is primarily intended for dynamic range measurement, for which noise characterization is required. Both standards use identical techniques for characterizing dynamic range and noise, described next.

For the uninitiated, assessing dynamic range in the context of noise may not be obvious. After all, most claims for dynamic range are typically tied to device bit depth alone: the higher the number, the better. For instance, 12 bits/color (4096 levels/color) would indicate a precision of 1 part in 4096, or a maximum optical density of 3.6.[5] These simple calculations of dynamic range may be suitable for tutorials on concept capability but are far from sufficient for real imaging performance. To understand why, a qualitative definition of dynamic range as applied to imaging applications is needed.

I propose the following: Dynamic range—the extent of light over which a digital capture device can reliably detect signals—reported as either a normalized ratio (xxx:1) or in equivalent optical density units.

The operative words in this definition are reliably detect. Detection is a function of signal strength (think contrast), so the stronger the better, in this case. The reliability, or probability, of that detection is a function of the noise associated with that signal, so the lower the better. This logic suggests that maximizing the signal-to-noise ratio (SNR) is appropriate for increasing the dynamic range of a device. This was not lost on the members of TC42/WG18. Thus, SNR is integral to measurements of dynamic range under the cited standards. They marry signal with the probability of detecting that signal; that is, noise. So far, so good. We now know what to measure. Knowing how to measure it is more complex.

Both standards have taken the high road and adopted an incremental SNR approach to the metrology mechanics. Of all the ways to measure signal, incremental signal is probably the most informative for realistic imaging use. An example of this for an 8-bit reflection scanner is given in the top of figure 2. Its utility lies in quantifying how well a given intensity, Io, can be distinguished from another intensity differing by an arbitrarily small amount, ΔI. Unlike the simple counting of bits, which is a capability measure, this usage of dynamic range is a performance measure as dictated by everyday needs. In the context of noise it will answer questions like, "Can this capture device distinguish between an optical density of 1.00 and 1.10?"

Fig. 2

The other portion of dynamic range measurement is device noise characterization. This is determined through a "noise cracking" technique.[6] This step is extremely important for scanners because noise due to the target often accounts for the majority of the total noise. This target noise must be discounted so that the scanner itself is not discredited. The center graph of figure 2 illustrates the noise function.


Taking the ratio of the incremental signal to noise at each OECF patch yields the incremental SNR function. An example of this for a reflection print scanner is illustrated at the bottom of figure 2. Dynamic range is then determined from the incremental SNR by noting the density at which a prescribed SNR value is met. For instance, using a typical value of six, the scanner of figure 2 would have a dynamic range of roughly 1.5 or 32:1. This measure of dynamic range is significantly lower than the noiseless and flare-free capability measure of 2.4 that a simple bit count yields.
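
The bookkeeping can be spelled out in a short Python sketch that contrasts the bit-count "capability" density with a dynamic range read off an incremental-SNR curve at a threshold of six. The patch densities, incremental signals, and noise values below are hypothetical stand-ins for measured data.

import numpy as np

bit_depth = 8
capability_density = np.log10(2 ** bit_depth)     # ~2.4: the noiseless, flare-free figure

# Per-patch measurements (device noise after discounting target noise)
density      = np.array([0.1, 0.4, 0.7, 1.0, 1.3, 1.6, 1.9])
incr_signal  = np.array([60.0, 42.0, 26.0, 15.6, 10.4, 6.0, 2.4])
device_noise = np.array([1.3, 1.3, 1.3, 1.3, 1.3, 1.2, 1.2])

incr_snr = incr_signal / device_noise

def dynamic_range(density, snr, threshold=6.0):
    """Highest density at which the incremental SNR still meets the threshold."""
    d_max = np.interp(threshold, snr[::-1], density[::-1])
    return d_max, 10 ** d_max                      # density units and the x:1 ratio

d, ratio = dynamic_range(density, incr_snr)
print(capability_density)    # ~2.4
print(d, ratio)              # ~1.5 and ~32:1 for these hypothetical values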

Conclusion

Ultimately, the success of any standard is measured by its level of adoption. And adoption is achieved through regulation, education, and enablement. The regulation component, a metrology for digital imaging performance, is well on its way through the vetting and review process provided by ISO/TC42. This paper details the content and status of these standards and includes a view of the scientific rationale for each. With an aim of combining technical rigor and utility, a sound architecture of signal and noise metrology techniques has been established. Though not perfect, the goal is to have them evolve so that one day they may nearly be so.

Education and enablement can accomplish this. Automated, easy-to-use software; economical targets; publications; and presentations through I3A and committee members have provided a good start toward this goal. The use of these tools and resources is beginning to allow users to accept, refute, or at least question manufacturers' "siren calls" in a scientifically sound and unified manner.

Acknowledgments
Gratitude to Jack Holm, Peter Burns, Sabine Süsstrunk, Ken Parulski, Sean Kelly, and Bruce Pillman for insightful talks on the theoretical and practical aspects of the above standards.

Footnotes
[1] Daniel L. Johnston, "A Simplified Method of Digital Image Tonal Capture for Archival Projects," IS&T PICS Conf., p. 210-213 (2002). [back]
[2] Don Williams, "Benchmarking of the ISO 12233 Slanted-Edge Spatial Frequency Response Plug-in," IS&T PICS Conf., p. 133-136 (1998). [back]
[3] T.H. James, "The Theory of the Photographic Process," p. 604-606 (1977). [back]
[4] Peter Burns and Don Williams, "Using Slanted-Edge Analysis for Color Registration Measurement," IS&T PICS Conf., p. 51-53 (1999). [back]
[5] Anne R. Kenney and Oya Y. Rieger, "Moving Theory into Practice, Digital Imaging for Libraries and Archives," Research Library Group, p. 40-41 (2000). [back]
[6] Peter Burns and Don Williams, "Distilling Noise Sources for Digital Capture Devices," IS&T PICS Conf., p. 132-136 (2001). [back]

 


Risk Management for Web Resources: A Case Study on Southeast Asian Web Sites

Peter Botticelli
Cornell University Library
pkb4@cornell.edu

In recent years libraries and other cultural institutions have become increasingly concerned about the tendency for Web sites to lose content over time, especially those that are managed informally and without strong institutional backing. Cornell's Project Prism has been exploring ways to detect risks to Web resources as the first step toward developing a toolset for managing risks without necessarily requiring libraries to capture and archive the Web resources themselves. Thus, over the past year we have been monitoring Web sites and documenting changes in their status that may indicate short- and long-term risks to content.

In early 2002 Allen Riedy, the curator of Cornell Library’s Echols Collection on Southeast Asia, offered the Project Prism team a sample list of fifty-four Web sites of political and nonprofit organizations covering Southeast Asia (hereafter referred to as the "Asia sites") that he considered valuable for long-term preservation as a natural extension of the library’s world-class Southeast Asia holdings. The sites were chosen because they were rich in timely, original content on political issues that was subject to major changes as events unfolded in the region. All the sites chosen by Riedy had been cataloged and made available through the library’s catalog, as in the example below, for http://www.orchestraburma.org:


Figure 1. Cornell University Library Catalog Entry for http://www.orchestraburma.org

Many of the Asia sites represent political parties or advocacy groups for such causes as human rights, government reform, independence for indigenous peoples, and environmental protection. Among the sites dedicated to Myanmar (Burma), for instance, is a site, http://www.dassk.com, representing Daw Aung San Suu Kyi, the prominent dissident leader. Riedy noted that the selection and cataloging of Web resources like the Asia sites required a significant investment of time by library staff and that methods and policies are urgently needed to ensure their long-term viability. Thus we have spent the past year studying risk management issues for the Asia sites as part of a larger effort involving test sets of other Web sites.

The content of the Asia sites represents seven different countries: Cambodia, East Timor, Indonesia, Laos, Malaysia, Myanmar, and the Philippines. At the outset we hoped to track the physical location of the servers used for each site. But we discovered that the available information on domain name owners and domain servers was fragmentary at best. We did find anecdotal evidence that a significant number of the Asia sites may actually be managed or published outside Southeast Asia, in the U.S., Europe, Japan, and Australia, for instance. Almost half the domain names in this study were originally registered in the U.S., and, through queries to the various "whois" databases, we were able to link roughly a third of the domain servers with American or other Western ISPs. For risk management purposes, it would be a great advantage to have a complete, up-to-date registry for sites intended for long-term preservation.[1]

In monitoring the Asia sites, we were given access to Mercator, a powerful Web crawler developed by researchers at Compaq (now Hewlett-Packard). In crawling sites, we used a strict “politeness” algorithm designed to avoid overloading servers with requests for pages. And we did not attempt to crawl any pages excluded by robots.txt directives, which are commonly used to keep crawlers out. In addition, our study was designed for “passive” monitoring only, using data that was freely available on the Web and without making any contact with the owners or system operators of the Asia sites.

All Web crawlers are programmed to search the Web and download pages according to predetermined criteria. Thus every crawl begins with a "seed URL" as a starting point and a set of URL filters designed to limit the crawl to a desired site or domain within the Web. The following example illustrates how we set these criteria for crawling the Asia sites.

Seed URL: http://www.bigpond.com.kh/users/ngoforum/
Filter: ".bigpond.com.kh" AND "ngoforum"

In this case the seed URL is the home page for this site. The filter has two parts, limiting the crawl to pages in the "bigpond.com.kh" host domain (a Cambodian ISP), and specifically in the subdirectory labeled "ngoforum," the particular site we hoped to monitor.
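
A minimal sketch of this kind of scope rule, written in Python rather than Mercator's own configuration syntax (the helper below is hypothetical), might look like this:

from urllib.parse import urlparse

SEED_URL = "http://www.bigpond.com.kh/users/ngoforum/"

def in_scope(url, host_suffix=".bigpond.com.kh", path_token="ngoforum"):
    """Keep a discovered URL only if it stays on the target host and site."""
    parts = urlparse(url)
    return ("." + parts.netloc).endswith(host_suffix) and path_token in parts.path

print(in_scope("http://www.bigpond.com.kh/users/ngoforum/reports/2002.html"))  # True
print(in_scope("http://www.bigpond.com.kh/users/othersite/index.html"))        # False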

Changes in a URL, especially in its host domain, were an ongoing problem in our monitoring efforts. For instance, between November and December 2002, one site (http://www.easttimorpress.qut.edu.au) moved its base of operations from Australia to East Timor, and hence the site was renamed http://www.easttimorpress.com. The site began as a community service project at Queensland University of Technology in Australia. After East Timor became an independent country in May 2002, students and staff went to Dili to train a group of Timorese journalists to run the site themselves. Our last two crawls of the old site, indicated below, yielded a page linking to the new site, the URL for which we were able to document automatically. This was important given the fact that by January 20, 2003, the old site no longer functioned.

Thus, through regular crawling we were able to document the changing provenance of this site, as well as to reveal the possible risk of content loss as the site’s organizational structure changed due to evolving political circumstances. 


Table 1. Crawls of East Timor Press

*Both the old and new URLs were crawled on 1/14/03, at which time http://www.easttimorpress.qut.edu.au still registered a single 200. As a test, we recrawled the old URL a week later and discovered that it was now a 404.

Using Mercator, we crawled each of the Asia sites ten times in roughly an eight-month period, between late April 2002 and January 2003. We were able to successfully crawl all of the sites at least once, with one exception: http://www.freemalaysia.com, which was apparently shut down sometime between late January 2002 (the last available entry we were able to find in the Internet Archive) and the time of our first crawl, in April 2002. [2]

Once Mercator has crawled a site, it automatically generates a series of reports derived from the HTTP and HTML data that a server returns any time a Web page is requested by a client. We programmed Mercator to capture the HTTP status code for each discovered page, the full set of HTTP headers, and the full text of the HTML source for each page. We were particularly interested in HTTP headers as potential sources of information for risk management, given that they were designed in part to ensure that cached Web pages are complete, authentic, and up-to-date. We have also begun to examine HTML META (metadata) elements as possible sources of information for risk management. Besides our interest in META tags, a colleague, Hye Yeon Hann, has carried out a pilot study in which she documented the incidence of HTML tags used for dynamic features such as applets, scripts, and interactive forms. In the near future we plan to automate our data gathering on the full spectrum of HTML elements and thus to compare the reliability of pages with dynamic or multimedia features versus all other pages.
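
For readers without access to Mercator, a rough sketch of this per-page capture (HTTP status code, response headers, and HTML source) using only Python's standard library might look like the following; it illustrates the kind of data gathered, not Mercator itself.

import urllib.request
from urllib.error import HTTPError, URLError

def fetch_record(url):
    """Return (status, headers, html) for a URL, or an error marker."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status, dict(resp.headers), resp.read().decode("utf-8", "replace")
    except HTTPError as e:          # 4xx / 5xx responses still carry a code and headers
        return e.code, dict(e.headers), ""
    except URLError:                # socket-level failure: could not connect at all
        return None, {}, ""

status, headers, html = fetch_record("http://www.example.org/")
print(status, headers.get("Last-Modified"), headers.get("Content-Type"))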

The most basic data we've been able to track is the number of Web pages for each site. None of the Asia sites is very large by Web standards, as the chart below indicates. Half of the sites have between 100 and 1,000 pages, while only a quarter have more than 1,000. Taken as a whole, the Asia sites consist of just over 70,000 total pages. By contrast, we discovered about 1.4 million pages (averaging over 12,000 pages per site) while crawling a list of library Web sites for members of the Association of Research Libraries.


Chart 1. Distribution of Web Pages Discovered for Fifty-four Asia Sites (January 2003)

Despite the relatively small size of the Asia sites, preserving 70,000 Web pages across fifty-four different sites is clearly a nontrivial matter, making it necessary to monitor these sites closely and to have a robust and at least partially automated set of tools, including a powerful crawler, to detect risks and, as much as possible, to rescue content before permanent losses occur. Our work with Mercator is part of a larger effort to identify and classify risks to Web resources and ultimately to enforce preservation policies developed for specific collections and types of resources.[3]

At the outset of the project we were naturally concerned that a collection like the Asia sites might have a very high rate of content loss, raising the potential cost of implementing effective risk management for these types of resources. In the course of crawling, which began at the end of April 2002, we were able to record the disappearance of four sites (besides http://www.freemalaysia.com, as noted above). Two sites, http://www.orchestraburma.org (recall the catalog entry for this site, above) and http://www.partikeadilan.org, were lost by our second crawl on June 6, 2002 (our first crawl was on April 30, 2002), and months later we documented the fact that these domains had been acquired by an American ISP and a pornography site, respectively. A third site, http://www.barisanalternatif.org, was lost by our third crawl, in July 2002. As of January 2003 this domain was available for sale. A fourth site, http://www.laskarjihad.or.id, became defunct by our November 2002 crawl. This site represented the militant Islamic group Laskar Jihad, which disbanded immediately before the October 2002 bombing of a nightclub in Bali.[4] The examples above highlight the complexities of preserving information in the dynamic and transitory environment of the Web. They also highlight the value of automated data-gathering methods, including crawling, supplemented by other, more qualitative sources, in discovering risks to the integrity of Web-based materials.

In spite of the handful of catastrophic losses we encountered, our overall results indicate a significant but relatively low failure rate for Web page downloads, as indicated by HTTP status codes. On average, across ten crawls we found that 92% of all pages discovered returned a 200 ("OK").[5] Of the remainder, 7% were reported as 404 ("Not Found") errors, and the other 1% were a combination of server errors (500s), socket-level errors (failure to connect to server), access-restricted pages (401, 403), robot-excluded pages (blocked from crawling), and redirected pages (300s).

We also found that a significant percentage of the Asia sites had relatively low rates of missing pages.[6] In every crawl at least one-third of the sites were missing less than 1% of their pages. In seven crawls at least one-fifth of the sites had no 404s whatsoever. In our January 2003 crawl we found twelve sites for which at least 99% of pages returned a 200 ("OK") code and no more than 1% returned 400-range (client error) codes. While more work is needed to test the completeness of pages having a 200 code, our data thus far suggests that a significant percentage of the Asia sites are generally reliable in responding to requests for Web pages. Also, we found that the sites registered in Southeast Asia had only a slightly higher rate of missing pages than sites registered in the U.S. or other Western countries, as the table below indicates.


Table 2. 404 Errors by Region (January 2003)

However, we did find many sites exhibiting danger signs for content loss. On average, across ten crawls we found that about one-fifth of the sites had at least 10% of their pages missing and that 10% of the sites were missing at least 20% of their pages. In our latest crawl we found five sites missing 25-40% of their total pages. These potential losses should be viewed in absolute as well as relative terms. Thus, one site was recently missing 1,005 out of a total of 9,206 pages, or 11%. Another site was missing 690 out of 1,740 pages, or 40%. For large Web sites, even a small percentage of missing pages could mean thousands of pages lost.

We've also found it interesting to track changes in the number of pages for each site discovered by Mercator over successive crawls. As we gather more data over time, we are correlating these numbers with other potential indicators of risk. For instance, in January 2003 we found that one site had shrunk 91% from our previous crawl in December. Three other sites showed declines between 11% and 38%. Taking these four sites together, our crawl results show a potential loss of about 2,000 pages out of nearly 7,500 total pages.

Besides decreases in pages discovered for sites, we've also documented substantial increases in the number of pages discovered from crawl to crawl. In January 2003 we found two sites whose page totals grew by 34% (adding 1,000 pages) and 56% (adding 53 pages) from the previous month. Although an increase in pages does not by itself indicate a risk factor for a site, it could be a sign of organizational changes that might put some older content at risk. As we gather more data, we plan to investigate possible correlations between such changes in the composition of a site and possible risks of content loss.

Changes in the number of pages discovered for a site can indicate major organizational risks. For instance, after East Timor received its independence in May 2002, the East Timor government site, http://www.gov.east-timor.org, showed a drastic decline in pages discovered as shown in Table 3. Closer examination revealed that the site was under construction and its previous content off-line. As the site continued to exist, it was not clear that the old content had been discarded, but it clearly was in danger of being lost to the user. A similar phenomenon occurred after the 2000 Presidential election. In late 2000, George W. Bush's transition team instructed all federal agencies to remove information from their Web sites specifically related to the Clinton administration. Cases like this show the value of regular monitoring of valuable Web sites, as early detection of organizational changes may leave enough time to negotiate archiving agreements with the owners of a still-functioning site. We are also working on methods for capturing the full content of Web pages (including dynamic features and images automatically linked to pages) as part of our crawling routine, making it possible to preserve a complete "snapshot" of a site at a point in time. That way, if a site disappears, we would be able to archive the last available version of the site. Moreover, as we discover particular risks to sites (e.g., a major change in a country’s political climate), we could step up crawling for affected sites to increase the likelihood that we could archive sites before they disappeared.


Table 3. Crawls of http://www.gov.east-timor.org

In the case of the East Timor government site, we immediately decided to investigate the site to determine what had changed after examining the results of our second crawl (6/6/02) for this site. Since both pages discovered by the crawl were 200s ("OK"), we had to look at the HTML source to find evidence of what had happened; namely, that the site was under construction. In successive crawls we were not able to explain the changes we detected between 8/12/02 and 10/3/02, except that the site was apparently still under construction. By December 2002 the site's home page indicated that the site was again fully functional, although it had obviously undergone substantial changes from April of that year. While we could not say that the site's old content had been lost, especially since we were monitoring the site passively, there was clearly a substantial risk of content loss owing to the magnitude of change in the number of pages on the site.

Besides tracking the number of pages and their HTTP status codes, we programmed Mercator to capture and report the HTTP headers that servers automatically send in response to requests for Web pages. We were interested in determining which headers were used and how frequently, because data of this type may indicate how well managed Web resources are, as well as provide information that may prove valuable in risk management efforts over time. Taking the Asia sites as a whole, we discovered fourteen total headers in use by all the sites, as indicated below. The table gives the number of instances of each header relative to the total number of pages detected by Mercator for this phase of the study.[7]


Table 4. HTTP Headers Discovered for the Asia Sites (December 2002)

The results above closely matched those of the other collections we tested, although the last-modified and content-length headers were implemented more often (by 30-40 percentage points) by the Asia sites than was the case for our overall sample of 6.3 million total header objects. Ultimately, we plan to compare the contents of header fields across successive crawls to detect potentially important changes in specific pages and across whole sites.

Of the fourteen header types we found in use by the Asia sites, we believe that the five listed below may be particularly valuable for risk management, although more study is needed to assess the consistency and reliability of information typically provided in these fields.


Table 5. Possible Risk Management Uses of Header Types

The Content-Type header is particularly important because it is used to record the MIME type, or "media type," as it is sometimes described, for a Web document. In our December crawl of the Asia sites we found documents representing thirty-two different media types, as listed below.


Table 6. MIME Types Discovered for Asia Sites (December 2002)

Even a quick glance at the above list reveals many potentially at-risk media types. Then again, we should keep in mind that 97.8% of all pages in this sample consist of just four very common document types: HTML, GIF and JPEG images, and PDF files. However, if we are concerned with preserving audio and video files, for instance, we have to be concerned with seven different formats, in spite of the fact that they make up less than 1% of the total number of pages in this sample. Given the potential cost of preserving out-of-date media types, it is important to monitor the use of at-risk formats and, whenever possible, to encourage Web site creators to consider migrating documents to common or standard formats. Our data shows the complexity of Web resources and the need for institutions to choose their priorities carefully in deciding what content needs to be preserved and what risks are most pressing at any given time.
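
A small Python sketch of the kind of tally behind Table 6, run over a hypothetical list of Content-Type values and a short list of common formats, shows how at-risk media types can be flagged automatically:

from collections import Counter

COMMON = {"text/html", "image/gif", "image/jpeg", "application/pdf"}

content_types = [                      # hypothetical values from crawled headers
    "text/html; charset=iso-8859-1",
    "image/jpeg",
    "audio/x-pn-realaudio",
    "application/pdf",
]

tally = Counter(ct.split(";")[0].strip().lower() for ct in content_types)
for mime, count in tally.most_common():
    flag = "" if mime in COMMON else "  <- review for preservation risk"
    print(f"{mime}: {count}{flag}")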

Our goal in tracking the Asia sites has been to identify and monitor preservation risks and to provide comparative data on the organizational and technical integrity of Web sites. Thus far we have been able to use the full spectrum of HTTP data provided by Web servers, and at present we are refining a set of tools to analyze the HTML source for each page, which will enable us to track links and to identify dynamic elements embedded in pages that may add to the risk of content loss over time.

In general, our results from the Asia sites highlight the need for a focused approach to preserving distributed information resources on the Web by gathering risk data about sites on a regular basis, classifying and comparing risks as they are discovered, and developing a robust set of tools to rescue content before it is permanently lost. Ultimately, Project Prism envisions a comprehensive system for risk management and preservation of Web resources, giving libraries the power to maintain lasting virtual collections out of diverse resources they may not own or control directly. We believe that as patrons come to depend more on all types of online information, libraries can add substantial value to Web resources by identifying and actively managing risks of content loss on the Web.

Acknowledgements
Support for this project came from the National Science Foundation (Grant No. IIS-9905955) and Cornell University Library through Project Prism, part of NSF’s Digital Libraries Initiative, Phase 2. Thanks especially to Nancy McGovern, Anne Kenney, Rich Entlich, Bill Kehoe, Hye Yeon Hann, and Carl Lagoze for their contributions to Project Prism and to the case study that is the focus of this article.

Footnotes
[1] The need for better registry information has been demonstrated by OCLC's Web Characterization Project, which has studied the volatility of IP addresses. Their results showed that only 13% of the sites studied had the same IP address in 2002 as they had in 1998. (back)

[2] This site had been under political pressure from the Malaysian government, as it published many reports critical of senior officials. See http://news.bbc.co.uk/2/hi/asia-pacific/country_profiles/1304569.stm and http://www.thestar.com.my/news/storyx1000.asp?file=/1999/8/9/nation/0908eede&sec=. (back)

[3] See "Preservation Risk Management for Web Resources: Virtual Remote Control in Cornell's Project Prism," D-Lib Magazine (9) 1 (January 2002). (back)

[4] See http://news.bbc.co.uk/2/hi/asia-pacific/770263.stm. (back)

[5] We should point out that 200 codes can effectively mask content losses in sites that are programmed to generate a page containing an error message indicating that content is missing. For a detailed discussion of the problem and its possible solutions, see http://www.rlg.org/preserv/diginews/v6_n6_faq.html. (back)

[6] By "missing pages" I'm actually referring to URLs that result in error codes in the 400 (client error) and 500 ("server error") ranges, as well as pages that result in socket-level errors (unable to connect to server). Strictly speaking, the presence of a 404 ("Page Not Found") error, for example, does not necessarily mean that a page is lost, as it may still exist under a different URL. But from the users' point of view, a bad URL with no redirect or other information provided is functionally lost content. Hence, for the purposes of this study we decided to label pages "missing" if we were unable to locate them by their URL, which most users still depend on to identify and distinguish Web pages. We are currently investigating alternative methods for identifying pages without relying on URLs, though from an archival point of view, more work is needed to show if we can determine the provenance of Web pages in the absence of clear information provided by a page's creator. (back)

[7] The number of pages in this sample is less than the total pages we discovered for the Asia sites because Mercator was only able to report HTTP headers for pages having the text/html MIME type. (back)

 


Highlighted Web Site

Wilhelm Imaging Research

Wilhelm Imaging Research, Inc., is a research and consulting firm specializing in photograph preservation. The company's Web site focuses on permanence issues facing digital inkjet printers, including desktop and large-format models. The site also provides access to technical articles by Henry Wilhelm and others on a number of photograph preservation issues. Of particular interest are reports giving a Display Permanence Rating (DPR) for a range of inkjet printers based on laboratory tests using different papers and framing conditions for each printer. This site should be a valuable source of information for institutions responsible for preserving photograph collections and especially for those contemplating the purchase of any type of inkjet printer. Also of interest is the site’s "Newsfeed" page, which offers links to current articles of interest on imaging equipment.





FAQ

Squeezing more life out of bitonal files: a study of black and white
Part I

Your editor's interview in the December 2002 RLG DigiNews states that JPEG 2000 can save space and replace the multitude of file formats being used for conversion and display of cultural heritage images, but that it isn't suitable for bitonal material. We have lots of bitonal images. Is there anything similar available for them?

Editors’ Note:
Addressing this question extends beyond the normal length of our typical responses, so we’re experimenting with a 3-part answer, part one of which appears in this issue. Parts two and three will follow in subsequent issues. We’d be interested in hearing from RLG DigiNews readers about their reaction to this approach.

There are significant parallels in recent developments for color and bitonal images. In both cases, many of the existing preservation master files held by institutions are TIFFs (Tagged Image File Format), which are almost always converted to some other format for access purposes (generally JPEGs or GIFs). In both cases, new file formats and compression schemes offer the potential for reduced storage space, improved transmission, and greater functionality for users. And in both cases, decisions about whether to take advantage of new developments, and if so, when and how, are difficult and complex.

Bitonal Scanning Background

A bitonal image epitomizes the notion of a bitmap. Each dot or pixel captured by the scanner is mapped to a single bit, which can take on the binary value of one or zero. Though those two values could conceivably correspond to any pair of display or output colors, traditionally they are represented by black and white.

Early experiments with scanning of cultural heritage materials (from around 1990) focused on methods for reformatting brittle books from the 19th and early 20th century. Bitonal scanning was a natural choice, because most of the materials being scanned consisted primarily of text, which is suitable for bitonal capture. However, technical and budgetary considerations also contributed heavily to the choice.

Scanning for preservation reformatting requires that the digitized image be a high-fidelity surrogate for the original document. Achieving such an outcome requires high-resolution scanning and the storage of resulting images with either no compression or lossless compression (compression that allows exact restoration of the original bitmap). Consider a book page approximately 6" wide and 9" high consisting primarily of ordinary text. In 1990, a 600 dpi scan of that page could be losslessly compressed to about 100 KB. A similar 24-bit color file could be scanned at lower resolution, but would losslessly compress only to about 5 MB.
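
The arithmetic behind these figures can be spelled out. In the sketch below, the 300 dpi figure for the color scan is our assumption (the text says only "lower resolution"); the compressed sizes in the comments are the ones quoted above.

width_in, height_in = 6, 9

bitonal_dpi = 600
bitonal_raw = width_in * bitonal_dpi * height_in * bitonal_dpi / 8   # 1 bit per pixel
print(bitonal_raw / 1e6)   # ~2.4 MB uncompressed; ~100 KB after lossless compression (~24:1)

color_dpi, bytes_per_pixel = 300, 3                                  # 24-bit color, assumed dpi
color_raw = width_in * color_dpi * height_in * color_dpi * bytes_per_pixel
print(color_raw / 1e6)     # ~14.6 MB uncompressed; ~5 MB lossless (~3:1)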

In 1990, the highest capacity magnetic drives commonly available held 1-2 GB and cost about $2/MB. It was simply cost-prohibitive to consider gray scale or color scanning of large collections. Additionally, high-speed networks were only just starting to be deployed, and few home users had greater than 2400 bit per second access to the Internet. There was no practical way to move such large images around.

Thus, despite its limitations, in 1990 bitonal scanning was the only affordable, technically feasible way to scan large collections. Since the early 1990s, cultural heritage institutions have bitonally scanned millions of pages of brittle monographs, mostly using the TIFF format with Group 4 fax compression (now known as ITU-T.6); hereafter we refer to these files as "TIFF G4s." Bitonal scanning has also been applied to other kinds of printed matter with less demanding requirements.

As the Web became widely available in the mid-1990s, institutions started making their bitonal images Web accessible. Even then it wasn't feasible to ship users the 600 dpi 1-bit image, owing to the lack of display hardware (i.e., high-resolution monitors), CPU power, and display software to handle such fare. Additionally, bandwidth limitations and the fact that TIFF images are not a native Web format dictated that the images be served up one at a time, scaled in size, and converted to a Web-native image format, such as GIF.

Fast Forward to the Present

A lot has changed since 1990. Many of the limitations from that time have been overcome. Mass storage costs have plummeted by a factor of 1,000. High-speed networks abound, and many end users have broadband access to the Internet. A number of new file formats and compression schemes have been developed, allowing 600 dpi, 1-bit image files to be compressed by an additional two to ten times over what was previously possible. Given that the latest major revision of TIFF dates from 1992 and the G4 compression scheme is even older (it became an official recommendation in 1988), it is perfectly valid to question the continuing use of these standards, and whether existing files should be migrated to newer formats.

In fact, in view of the above, some may question whether bitonal scanning is still merited, since gray scale and color scanning produce richer, more tonally subtle output. However, a number of countervailing forces suggest otherwise.

For example, our hunger for digital storage has managed to keep up with the decline in its cost. As digital collections grow, that magical time when mass storage will be so cheap it will be "unmetered" has remained elusive. In the realm of lossless compression, important strides in the compression of bitonal images have maintained the gap between their storage requirements and those of gray scale and color images.

Another influence is the desire to bundle multiple images together in the form of complete journal articles, book chapters, and pamphlets. Aggregating images is clearly desirable for certain kinds of publications, but can strain network capacity (and users' patience). Though network hardware is getting faster, increased demand tends to dull the impact of increased capacity, so that attention to efficient use of the resource remains important. In addition, emerging wireless networks often have lower bandwidth than their wired cousins.

Finally, much monograph and journal scanning requires only a high level of content fidelity, not perfect tonal reproduction. Properly executed, bitonal scanning is still quite appropriate for much source material. The Digital Library Federation has endorsed a minimum benchmark for digital reproductions of monographs and serials of 600 dpi 1-bit for black-and-white text, simple line drawings, and descreened halftones.

The Migration Dilemma

Even though bitonal images clearly still have an important role to play, the question of whether such images should continue to be created and stored as TIFF G4s remains. Institutions with large collections of TIFF G4 images face many potential motivations for migrating to newer file formats and compression schemes. They can

  • avoid obsolescence of existing collections
  • save on storage space (for both primary and backup copies of files)
  • reduce network traffic
  • use a single format for preservation and access versions
  • serve users better by bundling single page images into articles or chapters and offering faster transmission, higher-quality images, and more control over image viewing

Some questions that might be part of an institutional self-assessment of whether to migrate include

  • Is TIFF G4 becoming obsolete or endangered, and if so, how soon should migration take place?
  • Will it be worth the cost (in time, equipment, personnel, etc.) to migrate a large collection, considering the existing threat and the potential benefits of the new format?
  • Are the migration options sufficiently stable and do they offer the characteristics (flexible metadata, broad industry and user support, open specification) desirable for long-term preservation?
  • Do any of the new options offer the possibility of further consolidation of image formats (i.e., by offering solutions for gray scale and color images, in addition to bitonal)?

Despite getting somewhat long in the tooth, the TIFF file format and G4 compression scheme are still widely used and well supported by scanning hardware and software, as well as by image processing and display applications. So for now, the motivation for most institutions will center on the potential benefits of a new format, rather than fear of loss from continuing use of the old. Institutional circumstances and priorities will play a big role in weighing the pros and cons of migration.

For example, how large is the existing collection of TIFF G4s? How important is it to maintain absolute fidelity to the originals? Is there a commitment to long-term retention, or are the images for temporary use? Are users clamoring for more functionality or better performance? How important is reducing expenses for storage, backup, and network transmission? Does the institution want to make its high-resolution master files available to end users (many do not)?

We cannot answer these questions for you, but we can lay some of the groundwork to help you start considering a few of the motivating factors and potential consequences.

Migration Considerations

Retrospective conversion vs. going forward. If a decision is made to move to a new format, will all existing images be converted to it, or just newly created ones?

The former situation requires very careful consideration, such as whether existing metadata internal to the file will transfer and continue to function. TIFF files offer only a rudimentary header, but some of the newer formats have even less to offer. Institutions must also examine the impact of a wholesale migration on existing systems and processes and on the user population. Some level of disruption and inconvenience is inevitable.

Applying a new format only to new material has its own drawbacks, since it creates a divided collection and the need to maintain additional formats. It may also mean that part of the collection has different functionality than the rest. Will users be able to make sense of that?

Master vs. access versions. Will the change apply to master files, access files, or both?

One of the potential benefits of migrating is the ability to use the same file as master and access version. Most users now have hardware and software appropriate to handle high-resolution images. With sufficient compression, and the availability of a file viewer on the user's computer that handles scaling, zooming, gray enhancement [1], etc., one can contemplate using 600 dpi bitonal files for access.

However, many non-technical issues enter into decisions about master files, such as whether any degree of lossy compression can be tolerated, or whether the format is proprietary. Such dual use may be possible only for files that are not considered long-term institutional assets and that can be altered slightly without losing their value. Also, some institutions consider their high-resolution scans economic assets or may even be prohibited by copyright or contractual arrangements from making them available to end users.

Using a new format solely for access, especially if that new format must be maintained online rather than created on the fly, may lessen the appeal of migration. For example, potential storage savings would probably not be realized. In some cases, migration for access only may be desirable if substantial improvement to the users' experience is the primary motivation.

Single pages vs. bundles. Bundling can be a component of migrating to a new format. For instance, although it doesn't reduce their size significantly, individual TIFF G4s can be bundled into multipage PDF G4s, making them directly accessible to most Web users and automatically gaining all the navigation and display control offered by Adobe's Acrobat Reader. Though an image collection may lend itself to bundling, there may not be any one correct way to do it. Should journal images be bundled by article, by issue, or by volume? Viewing one page at a time may be constraining, but so may having pages bundled in what seems like too small, too large, or just the wrong configuration. It is also possible to create page bundles on the fly, giving users the option to customize.

To be continued….

Several image file formats and compression schemes, some old and some quite new, are potential migration targets for existing TIFF G4 files. In part II of this FAQ, to be published in the April issue, we'll provide a brief rundown of some of the most important alternatives and examine some of their pros and cons. In part III, to be published in the June issue, we'll take a closer look at specific implementations, including features and performance.

—Richard Entlich

Footnote
[1] Gray enhancement is a software technique that allows high-resolution bitonal images to be significantly scaled in size while retaining readability through the addition of gray tones. The technique, also called scale-to-gray, is widely used in image display software. [back]
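
For the curious, a minimal sketch of the scale-to-gray idea, reducing a 1-bit image by an integer factor and averaging each block of pixels into a gray level, might look like the following (plain NumPy, not any particular viewer's implementation):

import numpy as np

def scale_to_gray(bitonal, factor=4):
    """bitonal: 2-D array of 0 (black) / 1 (white); returns a smaller 8-bit gray image."""
    h, w = bitonal.shape
    h, w = h - h % factor, w - w % factor            # trim so dimensions divide evenly
    blocks = bitonal[:h, :w].reshape(h // factor, factor, w // factor, factor)
    gray = blocks.mean(axis=(1, 3))                  # fraction of white pixels per block
    return (gray * 255).astype(np.uint8)

page = np.random.randint(0, 2, size=(600, 400))      # stand-in for a scanned page
print(scale_to_gray(page, factor=4).shape)           # (150, 100), 8-bit grayscale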


Calendar of Events

The Archiving Forum: Preserving Digital Content (and the Opportunities it holds) for the Long Haul
March 4, 2003
Philadelphia, Pennsylvania

The American Medical Publishers Association and the National Library of Medicine (NLM) are co-sponsoring this forum. The program features presentations on three electronic journal archiving strategies—Elsevier's agreement with the Royal Library of the Netherlands, PubMed Central, and JSTOR—from the perspectives of publishers who have chosen each strategy and the organizations that are maintaining the archives.

International Symposium on Open Access and the Public Domain in Digital Data and Information for Science
10-11 March 2003,
Paris, France

This international symposium is being jointly organized by the Committee on Data for Science and Technology (CODATA); the National Academies, US; the International Council for Science (ICSU); UNESCO; and the International Council for Scientific and Technical Information (ICSTI).

Museums and the Web 2003
March 19-22, 2003
Charlotte, North Carolina

The seventh annual Museums and the Web conference—the premier international venue to review the state of the Web in arts, culture, and heritage—will address Web-related issues for museums, archives, libraries and other cultural institutions.

What Scholars Need to Know to Publish Today: Digital Writing and Access for Readers
April 8, 2003
Albany, New York

The advent of networked systems for scholarly communication promises to provide unprecedented access to scholarly works. Knowing which features electronic documents need in order to facilitate the most effective retrieval by fellow scholars is an increasingly important skill for scholars and advanced students. This program will examine the issues from a variety of perspectives and provide options for policies and activities.

ERPANET Workshop: Long-term Preservation of Databases
April 9-11, 2003
Bern, Switzerland

ERPANET is pleased to announce its workshop, Long-term Preservation of Databases, which is co-hosted by the Swiss Federal Archives. The workshop will cover the entire process of database preservation: selection, appraisal, preservation, description, and access. Technical solutions will occupy an important place in the workshop, and archival requirements will also be addressed.

LIBER Workshop: Microfilming and Digitisation for Preservation
April 14-15, 2003
Koninklijke Bibliotheek (KB), The Hague, The Netherlands

This workshop aims to clarify the position of microfilming and digitization by presenting an overview of the possibilities and requirements for combining the two approaches in preservation projects.

School for Scanning: Los Angeles
Creating, Managing, and Preserving Digital Assets
Presented by the Northeast Document Conservation Center
April 23-25, 2003
Los Angeles, California

This conference will cover project management, copyright, content selection, standards, quality control and costs, text and image digitization, metadata, and digital longevity and preservation.

Digitization for Cultural and Heritage Professionals Workshop
May 11-16, 2003
Chapel Hill, North Carolina

The School of Information and Library Science at the University of North Carolina, Chapel Hill, in conjunction with the Humanities Advanced Technology and Information Institute, University of Glasgow, and Rice University’s Fondren Library, is pleased to announce the fourth Digitization for Cultural and Heritage Professionals course. The one-week intensive course will consist of lectures, seminars, lab-based practicals (offering both guided tuition and an opportunity for individual practice), and visits to the UNC and Duke University libraries.

Eighth International Summer School on the Digital Library

This year the Summer School will consist of three courses that will be held right after the IFLA Conference in Berlin. Every year, the Summer School is updated to respond to the most recent developments.

ECDL 2003
August 17-22, 2003
Trondheim, Norway

Submissions are now being accepted for ECDL 2003, the seventh conference in the series of European Digital Library conferences. Paper, workshop, panel, and tutorial submissions are open until March 10, 2003; demonstration and poster submissions will be accepted until May 19, 2003.

AMIA 2003 Annual Conference
November 18-22, 2003
Vancouver, Canada

AMIA’s annual conference is the premier event on the calendar of the moving image archival community. Conference program proposals are due on Monday, February 3, 2003.


Announcements

New JISC Website
The Joint Information Systems Committee (JISC) announces the launch of its new Web site. The site contains information about all JISC collections, projects, funding opportunities, advice and guidance, resource guides, and user-based guides.

NSDL Whiteboard Report
Research news and notes from the National Science, Technology, Engineering, and Mathematics Education Digital Library (NSDL) Program.

MIT and Six Major Research Universities Announce DSpace Federation Collaboration
The Massachusetts Institute of Technology (MIT) Libraries have announced the DSpace Federation with six major research universities: Columbia University, Cornell University, Ohio State University, and the Universities of Rochester, Toronto, and Washington. DSpace, a digital repository for intellectual output, is an open-source system now in full production at MIT, which is seeking to extend the federation.

French version available for Cornell Library’s Moving Theory into Practice: Digital Imaging Tutorial
Cornell University Library is pleased to announce a French language version of its digital imaging tutorial, which also appears in English and Spanish. The new version is supported with funding from the Food and Agriculture Organization of the United Nations.

Revolutionizing Science and Engineering through Cyberinfrastructure
The report by the National Science Foundation's Advisory Committee for Cyberinfrastructure recommends that a cyberinfrastructure program encompass fundamental cyberinfrastructure research, research on science and engineering applications of the cyberinfrastructure, development of production-quality software, and equipment and operations.

RLG Guidelines for Preservation Microfilming Updated (PDF)
The RLG Preservation Microfilming Handbook (1992) and RLG Archives Microfilming Manual (1994) are the gold standard against which many funding agencies evaluate grant proposals. Now, Lars Meyer (Emory University) and Janet Gertz (Columbia University) have created RLG Guidelines for Microfilming to Support Digitization, a supplement to both publications. This Web-based publication advises how to create preservation microfilm that is most amenable to effective and efficient scanning to produce high-quality digital images.

Comma: Archives, Memory, and Knowledge in Central Europe
In 2004 an issue of the International Council on Archives journal Comma will be devoted to the theme Archives, Memory, and Knowledge in Central Europe. The editorial board seeks contributions in particular from archivists, historians, political analysts, cultural anthropologists, teachers of archival studies, state administrators, and other observers of Central Europe.

National Library of New Zealand Preservation Metadata Schema (PDF)
The National Library of New Zealand's Preservation Metadata Schema details data elements needed to support the preservation of digital objects, as well as elements needed to manage the metadata record itself. The document has been developed in light of international research, including a discussion draft from the National Library of Australia, the UK-based CEDARS program, OCLC/RLG activities, and the emerging consensus around the OAIS Reference Model.

Creating and Using Virtual Reality: A Guide for the Arts and Humanities
The Arts & Humanities Data Service is pleased to announce the publication of this new guide to good practice, edited by Julian Richards and Kate Fernie. It concentrates on accessible desktop virtual reality that may be distributed and viewed via the Web. It is geared to the needs of the creators of virtual reality (including artists, illustrators, and computer scientists) and of organizations that are commissioning virtual reality (including museums, galleries, heritage agencies, and university-based projects).

NISO Publishes Guide to Standards for Library Systems (PDF)
The National Information Standards Organization (NISO) announces the publication of The RFP Writer's Guide to Standards for Library Systems, a manual intended to aid library system request for proposal (RFP) writers and evaluators in understanding the relevant standards and determining a software product's compliance with standards.

EAD 2002 Released
The 2002 version is the second production release of the EAD DTD. It incorporates a small number of newly defined elements, deprecates eight previously used elements, and modifies the structure (content model) for a few elements to allow the inclusion of other valid EAD elements at different levels within a finding aid.


Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site. It will be published six times in 2003. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews.

Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.

RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Research, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editor, Martha Crowe; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello; Technical Assistant, Valerie Jacoski.

All links in this issue were confirmed accurate as of February 15, 2003.

Please send your comments and questions to RLG Diginews Editorial Staff.

   
 