RLG DigiNews
  June 15, 2003, Volume 7, Number 3
ISSN 1093-5371


Table of Contents


Editor's Interview
National Digital Information Infrastructure and Preservation Program: An Interview with Laura Campbell

Feature Article 2
Digital Preservation Like Russian Dolls: Nesting Standards for Digital Preservation, by Günter Waibel

Feature Article 3
Saving Digital Heritage—A UNESCO Campaign, by Colin Webb

Highlighted Web Site
Digital Dog

FAQ
Squeezing More Life Out of Bitonal Files: A Study of Black and White. Part III, by Richard Entlich

Calendar of Events

Announcements


Editor's Interview

National Digital Information Infrastructure and Preservation Program

Laura Campbell
Associate Librarian for Strategic Initiatives
Library of Congress

Editor’s Note
In January Congress approved the Library of Congress’s Plan for the National Digital Information Infrastructure and Preservation Program (NDIIPP), which will enable the Library to launch the initial phase of building a national infrastructure for the collection and long-term preservation of digital content. With this approval Congress also released $35 million for the next phase of NDIIPP, of which $15 million will be matched dollar-for-dollar from nonfederal sources. Following is an interview with Laura Campbell, the Associate Librarian for Strategic Initiatives, who is directing the work of this next phase of NDIIPP. Queries may be addressed to her special assistant, George Coulbourne.

It’s wonderful that Congress has authorized and financially supported the next phase of NDIIPP, but how will the funds be committed? What percentage of funding will be spent on research, planning, implementation, evaluation, and other core areas?

The majority of the funds will be used for testing various models that support the capture and preservation of content. The projects will focus on the preservation of a variety of digital media: e-books and e-journals, digital film, audio, and television. We will be working with other repositories as well as rights holders to test approaches that support a distributed digital preservation infrastructure for collecting and preserving content. This infrastructure will consist of a network of committed partners with defined roles and responsibilities working through a preservation architecture.

Other projects will test and help define the digital preservation architecture spelled out in the NDIIPP report. Approximately ten percent of the funds will support basic digital preservation research to help build solutions that are flexible and sustainable for the long term.

How will proposals be solicited and accepted?

We anticipate issuing calls for proposals through our Web site in late summer.

What outcomes do you expect from this phase, and how will you measure success in meeting your goals and objectives?

Outcomes expected from this phase include establishing the groundwork for what we call the "digital preservation infrastructure," which has two components.

  • The first is the "digital preservation network," which will comprise a group of partners committed to collecting and preserving digital information.
  • The second component is the "digital preservation architecture," or the technology that will support long-term preservation in a distributed environment. This phase will conclude with an advanced design for the architecture.

Copyright and the intellectual property issues associated with digital information will also be a focus of this phase. We will work closely with the U.S. Copyright Office, which is part of the Library of Congress, and many stakeholders in the broader community to address issues that advance or impede preservation of content.

Communication is a key component of NDIIPP. It is critical to convey information about the program to the stakeholders in digital preservation as well as to the general public. Currently, content creators and distributors understand to varying degrees what digital preservation is, why it is needed, and what their role in preservation should be. Unlike in the analog world, where preservation decisions may be made long after the content is created, in the digital world preservation decisions often need to be made coincident with creation. Think of all the Web sites, for example, that are no longer available.

We also know from experience that the success of any new technology requires support—and understanding—from the general public. That was the case when the Library began its National Digital Library Program. A large part of the success of that public-private partnership ($15 million from Congress; more than $45 million from private donors) was the result of the public’s awareness of the importance of having remote access to the riches of the Library of Congress’s high-quality educational content. The more Library materials we made available, the more the public wanted. From such a base of support came private sector support. We believe that as the public increases its awareness of the importance of digital preservation, support for the program will grow.

The metrics to measure success will vary according to the component we are examining. For example, at a base level we can measure the success of the preservation architecture the way any design program is evaluated—by testing it. Does the architecture support long-term preservation? Is it flexible enough to change as technology changes? Can users and donors of content rely on its integrity?

The success of the preservation network must be judged in more qualitative terms. We know that we cannot capture and preserve all digital information, nor is it desirable to do so. Partners will have to make decisions on what to keep and who should keep it. In many ways this is no different from the decisions that are made every day by the selecting officials at the Library of Congress. The Library retains for its collections only about 7,000 of the approximately 20,000 items it receives each business day. Other repositories make these same decisions. The hope is that, as with analog materials, we are collecting and preserving the information that will be most useful to the U.S. Congress, researchers, and lifelong learners for generations to come. It is the generations of tomorrow who will judge the success of the decisions we make today.

As far as communication is concerned, we will know we have succeeded when there is a national conversation about the importance of digital preservation such that the public and private sectors support the goals of NDIIPP.

Who are the key stakeholders for LC in this effort, and how will you involve them? What about the National Library of Medicine and the National Agriculture Library? Research libraries? Others?

In the broadest terms, anyone who creates or uses digital information is a stakeholder in NDIIPP. NLM and NAL are key stakeholders, as are all the libraries and other repositories in this nation and around the world. We formed the National Digital Strategy Advisory Board with the idea that its members are representatives for the various stakeholder communities. The NDIIPP legislation mandates that “the overall plan should set forth a strategy for the Library of Congress, in collaboration with other Federal and non-Federal entities, to identify a national network of libraries and other organizations with responsibilities for collecting digital materials that will provide access to and maintain those materials.”

How can other institutions participate in NDIIPP?

We are interested in hearing from institutions and organizations that are collecting and preserving digital content and are interested in becoming involved in the preservation network of committed partners. They can send inquiries to http://www.digitalpreservation.gov/ndiipp/contact.html.

How will cultural repositories—large and small—benefit from NDIIPP?

We hope to set forth, in collaboration with others, a national approach to sharing the responsibility for the collection and preservation of digital content, extending beyond what any one institution can do alone.

Desirable benefits of NDIIPP include

  1. shared responsibility for collection and selection development
  2. standards and best practices for managing content
  3. business models to support preservation and the shared responsibility for collection and selection development (no. 1 above)
  4. intellectual property agreements for use of rights-protected content
  5. a technical framework within which to work together

Ultimately there will be an operational environment that allows many institutions, big and small, to be part of a network that collects, preserves, and provides rights-protected access to digital content.

Digital preservation doesn’t stop at the border. Would you describe your plans for international collaboration?

The core mission of the Library of Congress is to make information available and useful and to sustain and preserve a universal collection of knowledge and creativity, regardless of format, for current and future generations of Congress and the American people. In support of that mission, the Library has a long history as a trusted convener, able to facilitate the development of standards and best practices in librarianship across the country and internationally.

The NDIIPP plan represents the fruits of intensive consultations with a wide range of American and international innovators, creators, and high-level managers of digital information in the private and public sectors. We achieved this by surveying national and international initiatives (Appendix 5 of the report) and through several stakeholder meetings with international participation, accompanied by ongoing interviews and consultation with a broad group of experts.

Nothing abroad is yet comparable to the congressional action taken and funding provided on behalf of digital preservation; however, areas of potential collaboration with the United States include

  • technical research
  • standards development
  • collection development
  • development of shared services needed by repositories

The Web site you have established is very helpful in conveying information on NDIIPP. How else will you keep individuals and organizations informed?

NDIIPP has already received broad coverage from the media in more than fifty publications, including the New York Times, the Washington Post, and the Chronicle of Higher Education. Their articles have been the direct result of our communications efforts. We will continue to work with major media—both general-interest and trade press—to keep NDIIPP in the public eye as it progresses in meeting its goals. We will also continue to participate in public presentations and forums, such as at the Library’s exhibit booth during the American Library Association meeting and in other appropriate venues.


Like Russian Dolls: Nesting Standards for Digital Preservation

Günter Waibel
Research Libraries Group


On February 14, 2003, the Library of Congress announced congressional approval of its plan to build a national infrastructure for the collection and long-term preservation of digital content. The establishment of the initiative, called the National Digital Information Infrastructure and Preservation Program (NDIIPP), formally recognizes the importance of digital preservation at the highest level and promises guidance through project outcomes and published research in the years to come (see Editor's Interview). While the Library of Congress engages in some heavy lifting to benefit the entire community, the constituents of that community cannot afford to remain passive. With digital preservation looming large on the national agenda, understanding the terminology and standards emerging in the field becomes the ticket for following or participating in the upcoming discussions.

This article introduces three standards for digital preservation, at least two of which feature prominently in the appendix of the plan Congress just approved.[1] Understanding what these standards are and what they can and cannot do provides a solid foothold in present and future discussions surrounding long-term retention of digital materials, as well as a leg up on implementation.

As the title suggests, the three standards nest like Russian dolls—one provides the larger framework within which the following, more granular, standard may be implemented.

  • The Open Archival Information System (OAIS) constitutes the largest Russian doll in the lineup. A standard that comes from the space data community, OAIS is a reference model specifying the responsibilities and data flow surrounding a digital archive at a conceptual level.
  • Metadata Encoding and Transmission Standard (METS), developed by the library community, provides a data structure for exchanging, displaying, and archiving digital objects. It nests within the larger framework of the OAIS as a possible mechanism for data transfer between entities inside and outside the OAIS archive.
  • Our smallest Russian doll has the rather long name NISO[2] Data Dictionary—Technical Metadata for Digital Still Images, and again the library community deserves credit for seeing this specification through its standardization process. The NISO Data Dictionary, also known as Z39.87, describes what fields are necessary in a database for preserving digital images. In its XML encoding called Metadata for Images in XML Standard (MIX), courtesy of the Library of Congress, Z39.87 finds its home in the METS context as an extension detailing a section of administrative metadata appropriately called “technical metadata.”

Although all this probably sounds confusing in bulleted shorthand, it actually makes a lot of sense when properly laid out. This article walks through the standards one by one and elaborates on their functionality and interaction. As it works its way through the standards from the most general to the very specific, it will also home in on digital images as the files to be preserved. The expansive OAIS applies to any type of media, even nondigital materials, whereas METS applies exclusively to the digital realm of images, audio, and video. The NISO Data Dictionary focuses on technical metadata for digital still images.

From a business perspective, digital preservation is a mechanism to ensure return on investment. Enormous amounts of money have been and are being spent on reformatting original materials or creating digital resources natively. If the cultural heritage community cannot sustain access to those resources or preserve them, the investment will not bear the envisioned returns. Although a basic understanding of the general problems surrounding preservation in an ever-changing technical environment has started to permeate memory institutions, practical solutions to the challenge are slow to emerge. The three standards, OAIS, METS, and Z39.87, converge as a sustainable system architecture for digital image preservation.

The space data community represents another group with enormous stakes in the long-term viability of its data. Capturing digital imagery of art or manuscripts may seem expensive, but the cost pales in comparison to that of gathering digital imagery from outer space. Under those circumstances, losing access to data is not an option. To foster a framework for preserving data gathered in space, the Consultative Committee for Space Data Systems (CCSDS) began work on an international standard in 1990. A good ten years later the OAIS was approved by the International Organization for Standardization (ISO).

The fledgling standard met with great interest from the library community. Among its first implementers were the CURL Exemplars in Digital Archives (CEDARS) project and the Networked European Deposit Library (NEDLIB); implicitly, the National Library of Australia (NLA) has also adopted the model.[3] The California Digital Library recently received an Institute of Museum and Library Services (IMLS) grant to take first steps toward a University of California-wide preservation repository implementing the OAIS.

In the standard’s own words, “[a]n OAIS is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.”[4] The standard formulates a framework for understanding and applying concepts in long-term preservation of digital information. It provides a common language for talking about these concepts and the organizational principles surrounding an archive. Though the OAIS pertains to both the digital and the analog realm, it has received the most attention for its applicability to digital data.

As a reference model, the OAIS in and of itself does not specify an implementation—it does not tell you which computers to buy, which software to load, or which storage medium to use. The standard does tell you, however, how an archive should be organized. In its so-called functional model, it defines the entities (or departments, if you will) in an archive, their responsibilities, and interactions. The data flows between those entities and the outside world are specified in the information model, which delineates how information gets into the archive, how it lives in the archive, and how it gets served to the public. The OAIS leaves it up to every distinct community to flesh out an implementation of the high-level guidelines. For the cultural heritage community a number of OAIS-related documents exploring the framework’s application to libraries, museums, and archives have come out of the joint OCLC-RLG Preservation Metadata Working Group.[5]


Fig. 1. The OAIS entities and information flows

As figure 1 illustrates, the OAIS stipulates that an archive (everything within the square box) interacts with a producer as well as a consumer. It takes in data from the producer through its ingest entity, and it serves out data to the consumer through its access entity. Within the archive itself, the data content submitted for preservation gets stored and maintained in the archival storage unit; data management maintains the descriptive metadata identifying the archive’s holdings.

The OAIS dubs the data flowing between the different players information packages, or IPs. The data flows sketched out in figure 1 contain the following information packages:

  • Submission information package (SIP): data flow between producer and archive (ingest)
  • Archival information package (AIP): data archived and managed within the OAIS
  • Dissemination information package (DIP): data flow between the archive (access) and the consumer

The data represented by the information packages may vary according to the specific needs at each station: an archival information package, for example, probably contains more data aimed at managing the object than its more light-weight counterpart on the access side, the dissemination information package. Furthermore, the OAIS details several categories of information comprising a complete information package, but in keeping with its role as a reference model, it stops short of suggesting specific data elements or a specific encoding for the entire bundle of information.[6]
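
The flow of information packages can be sketched in code. The following Python sketch is purely illustrative: the OAIS stops short of prescribing data elements, so the fields and transformations below are hypothetical placeholders, not part of the standard.

```python
import hashlib
from dataclasses import dataclass, field

# Purely illustrative: the OAIS does not prescribe data elements, so the
# fields and transformations below are hypothetical placeholders.
@dataclass
class InformationPackage:
    content: bytes                          # the digital object being preserved
    metadata: dict = field(default_factory=dict)

def ingest(sip: InformationPackage) -> InformationPackage:
    """Turn a submission package (SIP) into a richer archival package (AIP)."""
    aip = InformationPackage(sip.content, dict(sip.metadata))
    aip.metadata["fixity"] = hashlib.sha1(sip.content).hexdigest()  # integrity
    aip.metadata["ingest_event"] = "received from producer"         # provenance
    return aip

def access(aip: InformationPackage) -> InformationPackage:
    """Derive a lighter-weight dissemination package (DIP) for the consumer."""
    public = {k: v for k, v in aip.metadata.items()
              if k not in ("fixity", "ingest_event")}
    return InformationPackage(aip.content, public)
```

Note how the archival package carries management data (fixity, provenance) that the dissemination package drops: the content is the same, but each station packages it according to its own needs.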

Any community interested in implementing the OAIS has to identify or create a file-exchange format to function as an information package. For the cultural heritage community, METS shows great potential for filling that slot. METS wraps digital surrogates with descriptive and administrative metadata into one XML document. Digital surrogates in this context could be digital image files as well as digital audio or video. At the heart of each METS object sits the structural map, which becomes a table of contents for public access. The hierarchy of the structural map allows the navigation of media files embedded in, or referenced by, the METS object. It enables browsing through the individual pages of an artist’s book as well as jumping to specific segments in a time-based program, for example, a particular section of a video clip.

These so-called digital objects encoded in METS have three main applications that conveniently align with their potential as OAIS information packages.

  • File-exchange format. Since METS pulls together data about the item plus the digital surrogate information and encodes all that data into highly portable XML markup, the standard has been used to transfer data from local systems to union systems. At RLG, for example, a contributor to Cultural Materials can send a collection as individual METS objects, and our load program will ingest the data into a DB2 database. In OAIS terms, in this instance METS functions as a submission information package.
  • Management and preservation format. The METS specifications include an extensible section on administrative metadata that allows the digital object to carry information about administrative contexts such as legal access (intellectual property rights) or the technical environment in which the surrogate files were created (technical metadata). Because of its provisions for administrative metadata, METS lends itself to function as an archival information package in the OAIS framework.
  • Delivery format. Through use of a METS viewer utility the XML markup turns into a standards-based slide show or media player for cultural heritage content. By making digital images, audio, and video navigable, METS turns a multipart object such as a Chinese album consisting of ten leaves into a browsable object (see fig. 2) or provides efficient access to a forty-five-minute oral-history interview. The structural map divides the long audio clip into distinct sections that may be played back individually without playing the entire file. In OAIS lingo, providing public access turns METS into a dissemination information package.
 
Fig. 2. A METS object represented in the context of RLG Cultural Materials—a Chinese album from the Chinese Paintings Collection, contributed by the UC Berkeley Art Museum and Pacific Film Archive.

The METS XML schema divides the standard into a core component and several extension components. The METS core supports navigation and browsing of a digital object. It consists of a header, content files, and a structural map. The METS extension components support discovery and management of the digital object. They consist of descriptive metadata and administrative metadata, which in turn split into technical, source, digital provenance, and rights metadata.

Fig. 3. A graphical representation of a METS object (Chinese album with three leaves, and two details on leaf one).

Figure 3 details the components of a METS object and one possible set of relationships among them.

  • A header describes the METS object itself. It contains information along the lines of “who created this object, when, for what purpose.” The header information aids in managing the METS file proper.
  • The descriptive metadata section contains information describing the information resource represented by the digital object. Descriptive metadata enables discovery of the resource.
  • The structural map, represented by the individual leaves and details, orders the digital files of the object into a browsable hierarchy.
  • The content file section, represented by images one through five, declares which digital files constitute the object. Files may be either embedded in the object or referenced.
  • The administrative metadata section contains information about the digital files declared in the content file section. This section subdivides into
    • technical metadata, specifying the technical characteristics of a file
    • source metadata, specifying the source of capture (e.g., direct capture or reformatted 4 x 5 transparency)
    • digital provenance metadata, specifying the changes a file has undergone since its birth
    • rights metadata, specifying the conditions of legal access
      The sections on technical metadata, source metadata, and digital provenance metadata carry the information pertinent to digital preservation.
  • Honorary mention for the sake of comprehensiveness. A behavior section, not shown in figure 3, associates executables with a METS object. For example, a METS object may rely on a certain piece of code to instantiate for viewing, and the behavior section could reference that code.
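
The components above can be sketched as a minimal METS skeleton. The element names in the following Python sketch come from the METS schema, but the attributes and content are simplified placeholders rather than a complete, valid instance.

```python
import xml.etree.ElementTree as ET

# Minimal METS skeleton; element names are from the METS schema, but the
# attributes and content are simplified placeholders.
METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

def q(tag):                         # qualify a tag with the METS namespace
    return f"{{{METS_NS}}}{tag}"

mets = ET.Element(q("mets"))
ET.SubElement(mets, q("metsHdr"), CREATEDATE="2003-06-15T00:00:00")
ET.SubElement(mets, q("dmdSec"), ID="DMD1")           # descriptive metadata
amd = ET.SubElement(mets, q("amdSec"))                # administrative metadata
for section in ("techMD", "sourceMD", "digiprovMD", "rightsMD"):
    ET.SubElement(amd, q(section), ID=section.upper())
grp = ET.SubElement(ET.SubElement(mets, q("fileSec")), q("fileGrp"))
ET.SubElement(grp, q("file"), ID="IMG1")              # content file declaration
smap = ET.SubElement(mets, q("structMap"))            # browsable hierarchy
leaf = ET.SubElement(smap, q("div"), TYPE="leaf", ORDER="1")
ET.SubElement(leaf, q("fptr"), FILEID="IMG1")         # pointer into the fileSec

xml = ET.tostring(mets, encoding="unicode")
```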

The METS designers leveraged the combined power of the W3C specifications for XML schema and Namespaces in XML to create a flexible standard.

  • XML schema provides a way to specify the rules for a valid XML document.[7] The schema can be used to parse an XML document instance or, to put it in less technical terms, to verify that the XML markup conforms to the standard formalized by the schema. Using XML schema to define METS opened the doors to exploiting yet another W3C specification called Namespaces.[8]
  • Namespaces empowers METS to delegate certain metadata tasks to other XML extension schemas. For example, the METS schema itself does not dictate how you describe the resource represented by the digital object—it contains no elements for descriptive metadata. However, it contains a placeholder that may be realized through tags from an external XML schema for description.

In this way, each community can plug in its own preferred descriptive elements as long as they have been formalized into a schema.[9] The visual resources community, for example, may choose to extend METS using the VRA Core, while libraries might be more inclined to stick with Metadata Object Description Schema (MODS) from the Library of Congress. Others may decide the Dublin Core (DC) satisfies their access needs. The flexibility achieved through namespaces gives METS the potential for implementation across a wide range of communities.
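
As a sketch of this plug-in mechanism, the following builds a descriptive metadata section whose content comes from an external schema (here simple Dublin Core) wrapped inside METS's mdWrap/xmlData construct. The attributes are simplified and the sample title is invented.

```python
import xml.etree.ElementTree as ET

# Sketch: plugging an external descriptive schema (simple Dublin Core) into a
# METS dmdSec via mdWrap/xmlData. Attributes simplified; sample data invented.
METS = "{http://www.loc.gov/METS/}"
DC = "{http://purl.org/dc/elements/1.1/}"

dmd = ET.Element(METS + "dmdSec", ID="DMD1")
wrap = ET.SubElement(dmd, METS + "mdWrap", MDTYPE="DC")   # names the extension
data = ET.SubElement(wrap, METS + "xmlData")              # holds foreign markup
title = ET.SubElement(data, DC + "title")                 # Dublin Core element
title.text = "Chinese album, ten leaves"

markup = ET.tostring(dmd, encoding="unicode")
```

Swapping the Dublin Core namespace for MODS or the VRA Core changes the descriptive vocabulary without touching the METS structure around it.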

The same logic applies to all components of administrative metadata. Each community has the opportunity to specify what data it deems most important for the management of its information, formalize those requirements into an XML schema, and use that schema as an extension to the hub-standard METS. For an example of a project that has identified or created a comprehensive suite of METS extensions, consult the Library of Congress AV Prototyping project.

An alternative to embedding metadata for the extension components through XML Namespaces and external schemas consists in simply referencing the data from within the object. Descriptive or administrative metadata may live outside the XML markup in a database, to which the METS object can point. Even down to the level of media files, METS provides the dual option of referencing or embedding. The METS specification makes provisions for wrapping the actual bit stream of a digital file in the XML. In most cases, however, files live at online locations pointed to from within the object.
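
A sketch of the two options for media files, with simplified attributes and a hypothetical file location:

```python
import base64
import xml.etree.ElementTree as ET

# Sketch of the dual option for media files in METS: referencing a file by
# location (FLocat) versus embedding its bit stream (FContent/binData).
# The URL and image bytes are hypothetical placeholders.
METS = "{http://www.loc.gov/METS/}"
XLINK = "{http://www.w3.org/1999/xlink}"

# Option 1: the file lives at an online location pointed to from the object.
referenced = ET.Element(METS + "file", ID="IMG1")
ET.SubElement(referenced, METS + "FLocat",
              {XLINK + "href": "http://example.org/images/leaf1.tif"})

# Option 2: the actual bit stream is wrapped inside the XML, base64-encoded.
embedded = ET.Element(METS + "file", ID="IMG2")
content = ET.SubElement(embedded, METS + "FContent")
bindata = ET.SubElement(content, METS + "binData")
bindata.text = base64.b64encode(b"...image bytes...").decode("ascii")
```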

In the realm of technical metadata, a fledgling NISO standard takes center stage for describing the different parameters of digital image files. Known as the NISO Data Dictionary—Technical Metadata for Digital Still Images, or Z39.87, the standard specifies a list of metadata elements. The Library of Congress, motivated by its AV Prototyping project, created an XML schema encoding for Z39.87, called NISO Metadata for Images in XML Standard (MIX). The XML schema constitutes the smallest Russian doll in our series of nesting standards, as it may be plugged into the METS framework as an extension schema for technical metadata. The standard also proposes fields for the source and digital provenance sections of METS.

The NISO effort draws heavily on the Tagged Image File Format specifications, better known by the acronym TIFF. As the name implies, this format uses tags to define the characteristics of a digital file.[10] Image creation applications write the necessary parameters to the tags within the TIFF file, which means that the majority of the data Z39.87 covers already exists in file headers. To complete the metadata cycle, harvester utilities have to extract the information from the image file headers and import it into digital-asset-management systems for long-term preservation. By using the image file format specification as an integral part of the Data Dictionary, the standard leverages existing metadata to achieve cost savings.
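
A minimal sketch of this harvesting idea, using only the Python standard library: it packs a little-endian TIFF byte-order mark, magic number, and one image file directory (IFD) in memory, then reads the tagged values back out. The tag numbers are real baseline TIFF tags; the mapping to dictionary-style field names is illustrative.

```python
import struct

# Tag numbers are real baseline TIFF tags; the field-name mapping is
# illustrative, not the normative Z39.87 vocabulary.
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 259: "Compression"}

def make_tiff_header(entries):
    """Pack a TIFF byte-order mark, magic number, and one IFD of SHORT tags."""
    ifd = struct.pack("<H", len(entries))             # number of IFD entries
    for tag, value in entries:
        # tag, type 3 (SHORT), count 1, value left-justified in 4 bytes
        ifd += struct.pack("<HHI", tag, 3, 1) + struct.pack("<HH", value, 0)
    ifd += struct.pack("<I", 0)                       # no further IFDs
    return b"II" + struct.pack("<HI", 42, 8) + ifd    # header points to IFD

def harvest(data):
    """Read the first IFD and return named technical-metadata fields."""
    assert data[:2] == b"II", "only little-endian TIFFs in this sketch"
    magic, ifd_offset = struct.unpack_from("<HI", data, 2)
    assert magic == 42
    count = struct.unpack_from("<H", data, ifd_offset)[0]
    fields = {}
    for i in range(count):
        off = ifd_offset + 2 + i * 12                 # each entry is 12 bytes
        tag, ftype, n = struct.unpack_from("<HHI", data, off)
        if ftype == 3 and n == 1 and tag in TAG_NAMES:
            fields[TAG_NAMES[tag]] = struct.unpack_from("<H", data, off + 8)[0]
    return fields

header = make_tiff_header([(256, 4000), (257, 5000), (259, 1)])
metadata = harvest(header)
```

A production harvester would of course read real files and cover far more tags and value types, but the principle is the same: the technical metadata is already sitting in the file header, waiting to be extracted.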

On the other hand, in going beyond the TIFF specifications for some elements, the NISO standard acknowledges information outside the TIFF scope that plays an important role in digital preservation. From this vantage point, the Data Dictionary becomes an important tool for educating vendors about the metadata our community sees as invaluable to preserve our investment. RLG is investigating the formation of a group advocating among digital camera-back vendors for the cultural heritage community’s metadata needs.[11] An industry standard for consumer digital cameras called DIG35 already has broad support among vendors. DIG35 allows transfer of information from the camera to the software utility that consumers use to manage their holiday snapshots. Building on that model, NISO Z39.87 in its XML instantiation MIX could become the file-exchange format to go between high-end scanners or camera backs and sophisticated asset-management databases.

The Data Dictionary divides the technical metadata elements into four groups.

  • Basic image parameters record information crucial to displaying a viewable image. With this information alone a programmer should be able to build a viewing application for the image from scratch. Elements represented in this section include format (GIF, JFIF/JPEG, TIFF, etc.), compression, and photometric interpretation (color space).
  • Image creation metadata records information crucial to understanding the technical environment in which a digital image file was captured. Just as in humans, any number of characteristics or issues of an image can be traced back to its birth. Elements represented in this section include SourceType (the analog source of a capture), ScanningSystem (identification of the particular scanning device used), and DateTimeCreated (date of the image’s birth).
  • Imaging performance assessment metadata records information that allows evaluation of the digital image’s quality, or output accuracy. This data aids in uncovering the characteristics of the original source of the image and functions as a benchmark for displaying or printing the file. Elements represented in this section include width and height of the digital image and the source, as well as various parameters for capturing the sampling frequency of an image and its color characteristics. Furthermore, the section hosts information about any color targets (such as GretagMacbeth or Q60) included in a capture.
  • Change history metadata records information about the processes applied to an image over its life cycle. This data tracks any changes to the original file, for example, during the course of preservation activities such as refreshing (copying the file to a new storage format) or migration (saving the file from an eclipsing file format to an emerging file format). Elements represented in this section include DateTimeProcessed, ProcessingAgency, and ProcessingSoftware.

For any institution just starting out on the path of digital preservation, managing technical metadata through the NISO Data Dictionary is a great first step. The term data dictionary itself comes from the database community; it refers to a file defining the basic organization of a database down to its individual fields and field types. NISO Z39.87 represents a blueprint for a database or a database module that can be implemented fairly quickly—all the intellectual legwork has already been done by the standards committee.
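
As a sketch of such a blueprint, the following creates a small database module whose columns are modeled on the Data Dictionary's four groups. The column names and sample row are illustrative, not the normative Z39.87 element names.

```python
import sqlite3

# Column names are modeled on the Data Dictionary's four groups but are
# illustrative, not the normative Z39.87 element names.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE technical_metadata (
        image_id            TEXT PRIMARY KEY,
        format              TEXT,    -- basic image parameters
        compression         TEXT,
        source_type         TEXT,    -- image creation
        scanning_system     TEXT,
        date_time_created   TEXT,
        pixel_width         INTEGER, -- imaging performance assessment
        pixel_height        INTEGER,
        date_time_processed TEXT,    -- change history
        processing_software TEXT
    )
""")
conn.execute(
    "INSERT INTO technical_metadata VALUES (?,?,?,?,?,?,?,?,?,?)",
    ("leaf1", "TIFF", "none", "4x5 transparency", "hypothetical scanner",
     "2003-06-15", 4000, 5000, None, None),
)
```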

For expanding the database to include structural metadata relating files to each other, plus a descriptive record, as well as rights metadata, the database could be augmented by looking at METS and its extension schemas. Again, the Library of Congress AV Prototyping project offers a model implementation of a database using the METS approach. Scaling up to the bigger picture, this database could find its home in an archival environment specified by the OAIS.

Fig. 4. OAIS, METS, and NISO Z39.87 as nesting standards

To summarize: as figure 4 illustrates, the OAIS stipulates information packages, which are instantiated in METS; METS in turn stipulates an extension schema for technical metadata, which is instantiated by Z39.87’s XML schema, MIX. After this detailed review, the first bulleted list in this article should make much more sense.
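The nesting can be illustrated with a toy XML construction: a METS-style wrapper whose technical-metadata section carries MIX-style elements. Note that the namespace URIs and element names below are simplified stand-ins, not the official METS or MIX schemas.

```python
import xml.etree.ElementTree as ET

# Stand-in namespace URIs; the real METS and MIX schemas define their own.
METS = "http://example.org/METS"
MIX = "http://example.org/MIX"
ET.register_namespace("mets", METS)
ET.register_namespace("mix", MIX)

# METS wrapper: an administrative-metadata section holding technical metadata.
mets = ET.Element(f"{{{METS}}}mets")
amd = ET.SubElement(mets, f"{{{METS}}}amdSec")
tech = ET.SubElement(amd, f"{{{METS}}}techMD", ID="TMD1")

# Inside the technical-metadata slot, MIX elements describe the image.
mix = ET.SubElement(tech, f"{{{MIX}}}mix")
ET.SubElement(mix, f"{{{MIX}}}ImageWidth").text = "3120"
ET.SubElement(mix, f"{{{MIX}}}ImageHeight").text = "5056"

xml = ET.tostring(mets, encoding="unicode")
print(xml)
```

The same wrapper-and-slot pattern repeats one level up: the METS document itself would travel inside an OAIS information package.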

In broad strokes, digital preservation with the nesting standards OAIS, METS, and Z39.87 looks like a puzzle with all the pieces neatly falling into place. In the details, however, some harmonization issues between the standards remain. For example, the OAIS model breaks an information package into different subcomponents than the METS schema; the NISO Data Dictionary and its XML encoding MIX cover not only the technical metadata extension of METS, but also some elements that the digital object standard relegates to sections on source and digital provenance. Nevertheless, the convergence of three standards developed independently illustrates that a holistic view of digital preservation is emerging. Only widespread implementation will tell whether the theory as outlined by the standards can hold up in practice.

Footnotes
[1] Both OAIS and METS are referenced by multiple essays.

[2] National Information Standards Organization.

[3] For a review of CEDARS, NEDLIB, and NLA implementations of the OAIS, see “Preservation Metadata for Digital Objects: A Review of the State of the Art,” published by the OCLC/RLG Working Group on Preservation Metadata.

[4] See http://wwwclassic.ccsds.org/documents/pdf/CCSDS-650.0-B-1.pdf, p. 1-1.

[5] RLG also maintains an OAIS Web site with further links.

[6] For an introduction to the OAIS that goes beyond the present article, consult “Meeting the Challenges of Digital Preservation: The OAIS Reference Model,” by Brian Lavoie.

[7] Some of you may be more familiar with document type definitions (DTDs) to specify rules for valid SGML and XML markup.

[8] One of the key differences between XML schemas and DTDs is that only schemas allow extensions through the use of XML namespaces.

[9] For more information on XML in the cultural heritage community, see “DigiCult Technology Watch Briefing 7: The XML Family of Technologies.”

[10] For the full TIFF tag library, see Appendix A of the format specifications.

[11] For more information about this fledgling initiative, please contact the author.


Saving Digital Heritage—A UNESCO Campaign

Colin Webb
National Library of Australia

Considering that the disappearance of heritage in whatever form constitutes an impoverishment of the heritage of all nations …
Recognising that … resources of information and creative expression are increasingly produced, distributed, accessed and maintained in digital form, creating a new legacy—the digital heritage …
Understanding that this digital heritage is at risk of being lost and that its preservation for the benefit of present and future generations is an urgent issue of worldwide concern …

So begins an important new document being prepared for submission to the General Conference of UNESCO, the United Nations Educational, Scientific and Cultural Organisation. The Draft Charter on the Preservation of the Digital Heritage was positively received by a recent session of the UNESCO Executive Board, which asked for further consultations during preparation of a final draft for consideration. The Draft Charter is one very visible element in an international campaign to address the barriers to digital continuity and to head off the emergence of a second “digital divide,” in which the tools of digital preservation are restricted to the heritage of a well-resourced few.

As well as the Charter, other elements of UNESCO’s strategy for promoting digital preservation include widespread consultations, the development of practical and technical guidelines, and a range of pilot projects. UNESCO has been critical in fostering the understanding and preservation of other kinds of heritage through avenues such as the World Heritage Convention and the Memory of the World program. Given the organisation’s commitment to the safeguarding of recorded knowledge evident in its Information for All program, it is not surprising that UNESCO has been concerned at the prospect of the loss of vast amounts of digital information.

Digital technology’s immense potential for human benefit in so many areas—communication, expression, knowledge sharing, education, community building, accountability, to name just a few—is a tantalizing promise so easily denied by the lack of means, knowledge, or will to deal with its other great potential: rapid loss of access.

The impetus for this campaign was embedded in a resolution passed by the UNESCO General Conference at its previous meeting in 2000. That resolution, drafted in part by the Council of Directors of National Libraries (CDNL), highlighted the need to safeguard endangered digital memory. Following that, as a basis for developing a UNESCO strategy, the European Commission on Preservation and Access (ECPA) was commissioned to prepare a discussion paper outlining the issues in digital preservation for debate.

Consultation Process

As well as circulating for comment the draft papers produced in the campaign to governments and nongovernment organisations and experts all over the world, the campaign has featured a number of regional consultation meetings convened specifically to raise issues of regional concern and to provide comment on the Preliminary Draft Charter and Draft Guidelines on the Preservation of Digital Heritage. The meetings were held between November 2002 and March 2003, in Canberra, Australia (for Asia and the Pacific); in Managua, Nicaragua (for Latin America and the Caribbean); in Addis Ababa, Ethiopia (for Africa); in Riga, Latvia (for the Baltic states); and in Budapest, Hungary (for Eastern Europe).

All the meetings confirmed the need for urgent action and the great distance to be traveled before preservation of digital heritage is a reality in most countries. In total, around 175 experts and stakeholders from eighty-six countries participated in the five meetings, representing libraries, records archives, museums, audiovisual archives, data archives, producers and publishers of digital content, lawyers, universities and academies, governments, standardization agencies, community development organisations, computer industries, and researchers, among others.

Draft Charter on the Preservation of the Digital Heritage

Charters and declarations promulgated by UNESCO are meant to be “normative” documents that member states agree to through a vote of acceptance rather than by individual ratification. They are not binding and do not require any specific action on the part of governments, but they do express aspirations and priorities. In this case the purpose of the Draft Charter is to focus worldwide attention on the issues at stake and to encourage responsible preservation action wherever it can be taken.

The Draft Charter explains that the digital heritage

consists of unique resources of human knowledge and expression, whether cultural, educational, scientific or administrative, while embracing technical, legal, medical and other kinds of information that more and more are being created digitally, or converted into digital form from existing analogue resources.… Many of these resources have lasting value and significance, and therefore constitute a heritage that should be protected and preserved for current and future generations. This heritage may exist in any language, in any part of the world, and in any area of human knowledge or expression.

The purpose of preserving this heritage is to ensure that it can be accessed. The Draft Charter recognizes that this involves a tension and seeks a “fair balance between the legitimate rights of creators and other rights holders and the interests of the public to access digital heritage materials” in line with existing international agreements. It recognizes that some digital information is sensitive or of a personal nature and that some restrictions on access and on opportunities to tamper with information are necessary. Sensibly, it asserts the responsibility of each member state to work with “relevant organisations and institutions in encouraging a legal and practical environment which would maximise accessibility of the digital heritage.”

Threats to this digital heritage are highlighted, including rapid obsolescence of the technologies for access, an absence of legislation that fosters preservation, and international uncertainties about resources, responsibilities, and methods. Urgent action is called for, ranging from awareness raising and advocacy to practical programs that address preservation threats throughout the digital life cycle.

In discussing the measures that are needed, the Draft Charter emphasizes the importance of deciding what should be kept, taking account of the significance and enduring value of materials, and noting that the digital heritage of all regions, countries, and communities should be preserved and made accessible. It discusses the legislative and policy frameworks that will be needed and calls on member states to designate agencies with coordinating responsibility. It also calls on governments to provide adequate resources for the task.

Many agencies have a role to play, both within and outside governments. Agencies are urged to work together to pursue the best possible results and to democratize access to digital preservation methods and tools. The Draft Charter proposes a UNESCO commitment to foster cooperation, build capacity, and establish standards and practices that will help. Although this document is meant to inspire rather than dictate action, its adoption by UNESCO will be an important opportunity to raise digital preservation issues with governments and others who can influence how laws, budgets, and expectations are framed to help or hinder continuity of the digital heritage.


Guidelines for the Preservation of Digital Heritage

While the Charter focuses on advocacy and public policy issues, the Guidelines present practical principles on which technical decisions can be based throughout the life cycle of a wide range of digital materials. The Guidelines, prepared by the National Library of Australia on commission from the UNESCO Division of Information Society, have been published on the UNESCO CI (Communication and Information) Web site.

The guidelines address at least four kinds of readers with different but overlapping needs:

  • policy makers looking for information on which to base policy commitments regarding digital preservation
  • high-level managers who are seeking to understand the concepts of digital preservation and the key management issues their programs will face
  • line managers involved in making day-to-day decisions who need a more-detailed understanding of practical issues
  • operational practitioners responsible for implementing programs who need a perspective on how various practical issues and processes fit together as an integrated whole.

The structure of the guidelines is intended to make it easy for readers to find the information most relevant to their needs. The regional consultation process highlighted the fact that many people who feel they have a preservation responsibility are operating with very limited resources. Specific suggestions have been included to provide some starting points, although comprehensive, reliable digital preservation is a resource-intensive business.

Material in the Guidelines is organized around two approaches: basic concepts behind digital preservation (explaining concepts of digital heritage, digital preservation, preservation programs, responsibility, management, and cooperation) and more-detailed discussion of processes and decisions involved in various stages of the digital life cycle, including deciding what to keep, working with producers, taking control and documenting digital objects, managing rights, protecting data, and maintaining accessibility.

Although the Guidelines were produced by the National Library of Australia, they were extensively informed by comments from a wide range of contacts, in addition to responses gathered at the formal consultation meetings. The text does not reflect any new research, but it does try to reflect current thinking about the maintenance of accessibility, the core issue in digital preservation (although certainly not the only important one).

For some readers the level of technical detail will be disappointing. The detail required to meet all the needs of practitioners is very situation-specific and quickly dated. As the Guidelines are intended to be useful in a very wide range of sectors and circumstances, the emphasis is on technical and practical principles that should enable practical decisions. It is to be hoped that UNESCO will complement the Guidelines with a Web site offering a growing body of technical details and tips aimed at specific sectors.

To give readers a sense of the approaches taken, a few of the principles asserted in the Guidelines are appended to this paper. The UNESCO Guidelines for the Preservation of Digital Heritage will be published in a number of languages. At the time of writing, they are available in English from the UNESCO Web site.

Sample Principles from the UNESCO Guidelines for the Preservation of Digital Heritage

1. Not all digital materials need to be kept, only those that are judged to have ongoing value: these form the digital heritage.

3. Digital materials cannot be said to be preserved if access is lost. The purpose of preservation is to maintain the ability to present the essential elements of authentic digital materials.

4. Digital preservation must address threats to all layers of the digital object: physical, logical, conceptual, and essential.

5. Digital preservation will happen only if organisations and individuals accept responsibility for it. The starting point for action is a decision about responsibility.

6. Everyone does not have to do everything; everything does not have to be done all at once.

7. Comprehensive and reliable preservation programs are highly desirable, but they may not be achievable in all circumstances of need. Where necessary, it is usually better for noncomprehensive and nonreliable action to be taken than no action at all. Small steps are usually better than no steps.

8. In taking action, managers should recognize that there are complex issues involved. It is important to do no harm. Managers should seek to understand the whole process and the objectives that eventually need to be achieved and avoid steps that will jeopardize later preservation action.

15. Preservation programs must clarify their legal right to collect, copy, name, modify, preserve, and provide access to the digital materials for which they take responsibility.

24. Authenticity is best protected by measures that ensure the integrity of data is not compromised and by documentation that maintains the clear identity of the material.

26. The goal of maintaining accessibility is to find cost-effective ways of guaranteeing access whenever it is needed, in both the short- and long-term.

27. Standards are an important foundation for digital preservation, but many programs must find ways to preserve access to poorly standardised materials, in an environment of changing standards.

28. Preservation action should not be delayed until a single ‘digital preservation standard’ appears.

29. Digital data is always dependent on some combination of software and hardware tools for access, but the degree of dependence on specific tools determines the range of preservation options.

30. It is reasonable for programs to choose multiple strategies for preserving access, especially to diverse collections. They should consider the potential benefits of maintaining the original data streams of materials as well as any modified versions, as insurance against the failure of still-uncertain strategies.

32. Preservation programs are often required to judge acceptable and unacceptable levels of loss in terms of items, elements, and user needs.

33. Waiting for comprehensive, reliable solutions to appear before taking responsible action will probably mean material is lost.

34. Preservation programs require good management that consists largely of generic management skills combined with enough knowledge of digital preservation issues to make good decisions at the right time.

35. Digital preservation incorporates the assessment and management of risks.

39. While suitable service providers may be found to carry out some functions, ultimately responsibility for achieving preservation objectives rests with preservation programs and with those who oversee and resource them.

 


Highlighted Web Site

Digital Dog

Digital Dog is a training, consulting, and service business dedicated to digital imaging, electronic photography, and color management. The Web site provides a variety of free digital imaging tutorials, including a color management primer, scanner interface review, tips on calibrating digital cameras, and an "in the trenches" guide to image resolution. Many of the articles available on the Digital Dog site were written for Photo Electronic Imaging magazine, and contain reliable technical content presented in an accessible, down-to-earth style.

This site should be a valuable source of information for institutions involved in scanning projects or looking for good digital imaging training materials. Some of the older articles are out-of-date, but there is a great deal of practical content available. Most of the tutorials are PDF documents and will require the Acrobat Reader plug-in.

[Errata added 1 July 2003: The Digital Dog Tips section is once again available.]
[Errata added 17 June 2003:
Dear Reader: Our choice of this site has proven to be an unintended object lesson in the volatility of Web resources. Within two days of our final URL check, the Digital Dog site was completely reorganized, changed domain names and most of its tutorial content disappeared. Though RLG DigiNews often covers efforts to preserve Web sites, we too are sometimes caught off guard by the swiftness and suddenness of their transformations.]




FAQ

Squeezing More Life Out of Bitonal Files: A Study of Black and White. Part III.

Your editor's interview in the December 2002 RLG DigiNews states that JPEG 2000 can save space and replace the multitude of file formats used for conversion and display of cultural heritage images but that it isn't suitable for bitonal material. We have lots of bitonal images. Is there anything similar available for them?

Part I of this three-part FAQ discussed general considerations for migration of scanned bitonal images away from TIFF G4, while Part II examined the characteristics of several alternative bitonal file formats and compression schemes that have become available during the past decade. In this, the final installment, we present the results of our experiences with several products for converting individual and multipage bitonal high-resolution TIFF G4s. Our coverage includes product specifications, general impressions, compression data, and sample images. Please note that some of the files require special plug-ins to be viewed. Instructions for downloading the necessary viewers are given below.

Test Image Selection

Though a bitonal image may seem like a simple affair, how well a particular image compresses depends on how it was scanned, the nature of its content, and the design of the compression scheme. Characteristics of the source image that can affect the rate of compression include:

  • the "cleanliness" of the scan (extraneous speckles lower compression)
  • the resolution of the scan (lower resolution lowers compression)
  • the use of multiple sizes and styles of text (more variation lowers compression)
  • the density of information present (less white space lowers compression)
  • the presence of fine detail, e.g., engravings or halftones (high complexity lowers compression)
  • the number of pages (fewer pages lowers compression)

Why do these factors affect compression? It helps to understand a little about how image compression is accomplished. Lossless compression depends on the recognition of patterns and the replacement of repeated elements with compact representations that exactly describe the feature being compressed. For example, instead of storing every bit in a scan line of all white bits, simply store a count of the white bits. Thus, sparse printing that leaves a lot of white space compresses well, while dense printing or highly speckled pages result in more transitions between black and white and thus less efficient compression.
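The run-length idea can be sketched in a few lines of Python. (G4 itself is more elaborate, coding runs relative to the previous scan line, but the principle, and its sensitivity to speckle, is the same.)

```python
from itertools import groupby

def rle_encode(scanline):
    """Run-length encode a bitonal scan line (a sequence of 0s and 1s):
    store (bit value, run length) pairs instead of every bit."""
    return [(bit, len(list(run))) for bit, run in groupby(scanline)]

# A mostly white line (0 = white) collapses to just three runs...
sparse = [0] * 60 + [1] * 4 + [0] * 60
# ...while a speckled line of the same length yields a run per transition.
speckled = [0, 1] * 62

print(len(rle_encode(sparse)))    # 3 runs for 124 bits
print(len(rle_encode(speckled)))  # 124 runs for 124 bits
```

The speckled line actually grows under this scheme, which is exactly why extraneous speckles and dense black-white transitions lower compression.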

The more sophisticated compression algorithms tested here take advantage of the fact that higher level elements are repeated within printed documents, including the symbols that make up the text. Thus, if a 12-point, Times Roman, non-bold, non-italic, non-underlined, lowercase 'a' appears in a document, its bitmap can be stored in a database and a subsequent appearance of the identical character can be replaced by a pointer to the database. This explains why clean, uniform typography compresses better than irregular, highly variant typography. Longer documents have an advantage because the algorithm "learns" more and more of the characters as it processes the text.
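The symbol-dictionary idea behind JB2 and JBIG2 can be sketched as follows. Real codecs also match glyphs that are merely *similar*, encode glyph positions on the page, and compress the dictionary itself; this toy version requires exact bitmap matches and is only meant to show why repetition and "learning" across pages pay off.

```python
def symbol_compress(glyphs):
    """Toy symbol-dictionary compression: store each distinct glyph
    bitmap once, and replace every repeat with an index into the
    dictionary."""
    dictionary, indices = [], []
    for bitmap in glyphs:
        if bitmap not in dictionary:
            dictionary.append(bitmap)
        indices.append(dictionary.index(bitmap))
    return dictionary, indices

# Two hypothetical 2x2 glyph bitmaps, repeated as in running text.
a = ((0, 1), (1, 1))
b = ((1, 0), (0, 1))
page = [a, b, a, a, b]

dictionary, indices = symbol_compress(page)
print(len(dictionary), indices)  # 2 [0, 1, 0, 0, 1]
```

Five glyphs are stored as two bitmaps plus five small indices; the longer and more typographically uniform the document, the better that trade becomes.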

Halftones deserve a special mention. Bitonal halftoning is a printing process that simulates shades of gray by varying the size and spacing of black dots. Avoiding problems such as moiré (interference patterns) and poor contrast when scanning halftones bitonally requires the use of special processing algorithms (e.g. dithering or error diffusion). When done properly, the typical scanned halftone will be densely packed with data of a somewhat random nature, presenting a real challenge to compression algorithms.

Lossy but "visually lossless" compression attempts to remove elements that are redundant for human visual perception, producing an image that contains less information, but doesn't appear degraded.

We selected four images for in-depth testing, representing a variety of content types. We also tested 20-page sequences derived from the same works in order to average out anomalies and give the compression algorithms a chance to show off their "learning curves."

The images are from three of Cornell's older collections: historic math books, NEH agriculture, and historic monographs. All images are bitonal 600 dpi TIFF G4s. If you follow the links for the individual pages from Table 1, you'll be taken to the image as it appears within the Cornell Digital Library—converted from TIFF to GIF, scaled down by a factor of six and enhanced with gray for improved legibility. The links for the 20-page groupings will bring up all 20 pages in GIF thumbnail mode, from which larger GIFs of the individual pages can then be accessed.

Table 1. Details of Test Images

An elementary treatise on elliptic functions / Arthur Cayley / 1895
  Important characteristics: variable-sized text, heavy use of math symbols, clean scans with a fair amount of white space
  Individual page: p. 16; 3120x5056 pixels (5.2" x 8.43")
  20-page grouping: pp. 1-20 (images 19-38)
  Collection source: Historic math books

The Modern Farmer in His Business Relations / Edward F. Adams / 1899
  Important characteristics: very uniform text, fairly dense, fairly clean
  Individual page: p. 33; 3424x5184 pixels (5.71" x 8.64")
  20-page grouping: pp. 22-41 (images 26-45)
  Collection source: NEH agriculture collection

The Mushroom, Edible and Otherwise: Its Habitat and Its Time of Growth / M. E. Hard / 1908
  Important characteristics: about 45% text and 55% halftone illustrations, heavily speckled scans; the individual page consists almost entirely of a halftone
  Individual page: p. lxvii (67); 4080x6000 pixels (6.8" x 10")
  20-page grouping: pp. lxvii-lxxxvi (67-86) (images 69-88)
  Collection source: NEH agriculture collection

The steam turbine, the Rede lecture, 1911 / Charles A. Parsons / 1911
  Important characteristics: about 50% text, 35% complex line art, and 15% halftone illustrations; the individual page consists almost entirely of a complex line drawing
  Individual page: p. 27; 2832x4368 pixels (4.72" x 7.28")
  20-page grouping: pp. 13-32 (images 23-42)
  Collection source: Historic monograph collection

 

Conversion software selection

Our software testing was limited to products that are open source or for which free evaluation copies are available. In some areas of computing, that might greatly constrain the selections, but in the specialized market niche of bitonal image conversion, it hardly cramped our style at all. We were able to test most of the important packages without spending a dime on software acquisition, which bodes well for anyone who wants to test these products on their own image collections.

As indicated in part II, we focused testing on products supporting three main technologies:

CPC (Cartesian Perceptual Compression): a file format and lossy compression scheme for bitonal images.

We fully tested CPC Tool from Cartesian Products, the only product available to encode this proprietary format.

DjVu: a file format supporting several compression schemes for bitonal, gray level, and color images. It supports both lossy and lossless bitonal compression. The bitonal compression algorithm is called JB2 and is similar to JBIG2.

We fully tested Any2DjVu, a Web service that allows files of many different formats (including TIFF G4) to be uploaded and converted to the DjVu format.

We also tested cjb2, a bitonal DjVu encoder that is part of the DjVuLibre package, an open source implementation of DjVu. Cjb2 only converts single pages. Although DjVuLibre comes with a utility (called djvm) that combines single DjVu pages into multipage DjVu files, it does not support font learning across pages. Thus we tested cjb2 only for the encoding of single pages.

There is also a commercial DjVu encoder, made by the format's owner, LizardTech, Inc., which we did not test. Currently available as part of LizardTech's Document Express 4.0, it is available in a trial version from LizardTech's Web site. The trial became available fairly late in our testing cycle and requires a special page cartridge (allowing the encoding of 250 pages) which we requested, but still had not received ten days later.

JBIG2: a lossless and lossy compression scheme for bitonal images only. JBIG2 does not specify a file format, but is often associated with PDF.

We fully tested two JBIG2 in PDF encoders, PdfCompressor from CVision Technologies and SILX from PARC (Palo Alto Research Center).

Another option for JBIG2 that we did not test is Adobe's Acrobat Capture with Compression PDF Agent.

Tables 2 and 3 provide additional details on the products tested.

Table 2. Product information (general)

All viewer software listed is freely available for downloading.

CPC Tool 5.1.x
  Producer: Cartesian Products, Inc.
  Type: commercial
  Demo available/terms: yes; 1000-file limitation when converting to or from CPC
  Platforms: Windows (95 and up), Mac OS X, Linux, various Unixes
  Viewer software: CPC Lite (Windows); CoPyCat (Mac, Linux, and several Unixes; also requires Acrobat Reader)

Any2DjVu
  Producer: DjVu Zone
  Type: free Web service
  Demo available/terms: yes; response time will depend on the size of the file and how busy the service is; not meant for production use
  Platforms: any platform that supports a graphical Web browser
  Viewer software: LizardTech DjVu browser plug-in (Windows, Mac Classic, Mac OS X, and Unix); DjVuLibre browser plug-in (Linux and Unix; part of the DjVuLibre distribution)

cjb2 3.5.x (from the DjVuLibre package)
  Producer: (open source project)
  Type: open source
  Demo available/terms: freeware
  Platforms: Windows 95 and up (also available as part of the full DjVuLibre package); Linux and Unix
  Viewer software: same as above

CVista PdfCompressor 2.1
  Producer: CVision Technologies
  Type: commercial
  Demo available/terms: yes; 30 days or 1000 files (whichever comes first); output has watermark and footer
  Platforms: Windows (95 and up)
  Viewer software: Adobe Reader or PDF Viewer browser plug-in (most platforms; must be at least version 5)

Silx 3.1 (previously called DigiPaper)
  Producer: PARC Solutions (Xerox PARC)
  Type: commercial
  Demo available/terms: yes; 90 days; output files have watermarks
  Platforms: Windows (95 and up), Linux, and Sun Solaris
  Viewer software: same as above

 

Table 3. Product features

Testing Protocols

How we tested

For each tool, we converted the four individual test pages from TIFF G4 to the supported target format. In the case of cjb2, the open source bitonal DjVu encoder, we first had to convert the TIFF G4s to pbm (portable bitmap) format, which we did with the free Windows application Irfanview. We also converted 20-page groupings derived from the same works as the individual test pages, except for cjb2, which only handles single pages.

Other than PdfCompressor, which runs only under Windows (CVision says the product will eventually support Solaris), all the tested products can be run under Windows, Linux or Unix. We ran the Windows version of cjb2, and the Solaris versions of CPC Tool and Silx. However, results should be the same regardless of the platform on which the conversions are carried out.

Each product offers options that affect the speed of conversion, display speed, display quality, etc. We attempted to test the major compression options of each product. We always tested lossless mode (if available), in addition to two or three lossy modes, sometimes in combination. As a rule, we turned off features that would result in faster compression or faster display at the cost of lower compression. This allowed each product to show off the maximum compression of which it is capable.

What we didn't test

As already mentioned, we did not test every product on the market capable of converting scanned TIFF G4 images to other bitonal formats. We limited our testing to three output formats, and only a subset of the products in those markets.

Of the products we did profile, our evaluation permits comparison of 1) compression efficiency in various modes (though see caveats, below), 2) quality of the compressed image (by visual inspection of the output files) and 3) general ease of use. You the reader, if you choose to examine and compare the test images, may also be able to evaluate the speed of decompression and how readily the viewers can navigate and manipulate the files, as long as you are viewing all the files on the same computer. We did not evaluate compression speed, decompression speed, or other performance issues.

Some of the products support OCR (optical character recognition). We did no evaluation of OCR capability and left OCR functions turned off for all conversions.

Caveats

As detailed as these tests are, they cannot be a substitute for individual testing on your own documents. We only tested TIFF files from the Cornell Digital Library, and only a limited range of content type consisting of printed material from the 19th and early 20th centuries. The results might not apply to other kinds of content, such as earlier (and more varied or more broken) typography or other illustration types suitable for bitonal scanning, such as woodcuts. Also, there are many different ways to build a valid TIFF file, and not all variants are recognized by all conversion software.

There are limits to the degree that the compression numbers in the table below can be compared. Lossless modes should be pretty much directly comparable, but not lossy modes. For example, even if two products use the same compression scheme, what one calls "loose" and another calls "aggressive" may or may not be similar in the degree of compression or the degree of "loss." This is not only because the terms used are subjective descriptions of highly technical underpinnings, but because different implementations of the same compression technique may do certain things more or less well than others. Therefore, it is essential to examine the images and evaluate their quality, in addition to looking at the compression numbers. From what we've seen, it is not possible to make any predictions about the performance of a product based on the file format or compression scheme supported. Each has strengths and weaknesses and must be evaluated separately.

Test Results for Individual Pages

Benchmarks: Table 4 begins with some size data about the test files and the effectiveness of G4 compression on each one. The first line of data shows the size of each file in MB as an uncompressed TIFF and as a percentage relative to G4 compression. On these files, G4 achieves reductions running from about 2.6 times (258%) for the halftone all the way up to 31.5 times (3147%) for the variable text example. The next line shows the size of the files in KB after G4 compression, assigning these values a nominal 100% against which to measure the effectiveness of the tested products.

The active links in the table show percentages, rounded to the nearest whole number, relative to the size of the TIFF G4 files. Thus, a value of 50 indicates compression to 50% of the size of the TIFF G4. Values above 100 indicate files that were larger than the original TIFF G4 after conversion.
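The arithmetic behind these percentages is straightforward; the sketch below shows the convention with hypothetical file sizes, not figures from our tests:

```python
def pct_of_g4(converted_bytes: int, g4_bytes: int) -> int:
    """Size of a converted file as a whole-number percentage of its
    TIFF G4 baseline: 100 means the same size, below 100 means better
    compression than G4, above 100 means worse."""
    return round(100 * converted_bytes / g4_bytes)

# Hypothetical sizes: a 62 KB G4 page converted to a 31 KB file.
print(pct_of_g4(31_000, 62_000))   # 50: half the size of the TIFF G4
print(pct_of_g4(74_400, 62_000))   # 120: larger than the original G4
```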

The "GIF 200 dpi" files are those delivered by the Cornell Digital Library when "View as100%" is chosen as the display format. The "GIF 100 dpi" files are those delivered by the Cornell Digital Library when "View as 50%" is chosen as the display format.

For information about the test files linked to from this table, see the sidebar.

Table 4. Test results for individual files

Results Interpretation for Individual Files (Table 4)

Lossless: The first of the lossless benchmarks, with a file format and compression scheme of "PDF/G4," shows what can be expected by moving a G4 datastream into a PDF envelope. It is not so much a conversion as a mild transformation that leaves the G4 image data intact, but changes the file format to one that is accessible to a much larger Web audience. As can be seen, there is little change in file size. The main advantage of such a conversion is improved access.

The other lossless tests demonstrate the great variation amongst the different products and the different content types. The two DjVu products excelled in lossless compression of the two text-only pages, achieving nearly twice the compression of G4. PdfCompressor had the best results with the halftone image, with three times the compression of G4. None of the products did very well with the line drawing, with the two DjVu products and PdfCompressor all managing only to reach about three quarters the size of the G4, and the others doing even worse.

The performance of Silx in lossless mode deserves attention. According to its manufacturer, although Silx includes a lossless mode, it was not designed for optimal performance in that mode. As with the other products in these tests, Silx is geared to provide its best compression in lossy modes that are intended to be visually or perceptually lossless, rather than bit-for-bit lossless.

Note that CPC lacks a true lossless mode, and so is not included in this set of comparisons.

Lossy (modest): The "modest" category of lossy compression includes both default lossy modes and modes limited to minor lossy procedures, such as cleaning (removal of tiny bit protrusions) and despeckling (removal of very small, extraneous dots). This should be considered a loose grouping, since we can't really know how one product's default mode compares with another's. This is especially true with the Any2DjVu service, where the software performing the conversions resides on a Web server and is hidden from the user. The software package that runs on the Any2DjVu service appears to be called DjVuDigital and is not, to the best of our knowledge, available as a commercial product.

With those limitations understood, Any2DjVu did the best compression on the variable text page, closely followed by PdfCompressor and CPC. PdfCompressor achieved the highest compression in this class on the uniform text example, with about five times the compression of G4 when its clean and despeckle filters are on. Any2DjVu in its lossy normal mode was almost as good. The open source cjb2 converter, which had top-notch results on the two text pages in lossless mode, was worst in class in clean mode.

CPC was best in this class with the halftone image, narrowly besting PdfCompressor and cjb2. At least that's the story the compression figures tell. In examining the images, however, it seems that the "clean" routine in cjb2 is indiscriminate and unaware of halftones. This is a good example of a lossy algorithm producing output that doesn't qualify as perceptually lossless. The image is extremely contrasty and lacks the subtle tonal impressions of the original halftone. That's why it's essential to examine the images and not just look at the compression numbers. (The images produced by CPC and PdfCompressor look fine, by the way).

The subtle impact of some compression enhancement filters can only be appreciated with close inspection of the images. Comparing PdfCompressor in lossy default as opposed to lossy clean and despeckle modes, the greatest difference in compression is seen for the line drawing (71% of G4 without filters vs. 66% with). Examining the line drawing in both modes at high magnification, it is clear that there isn't much speckling to be removed, but the abundant horizontal lines have many fewer protrusions in the filtered version. The mushroom image is heavily speckled. This is most visible in the margin area. The despeckle filter didn't have a great impact on compression, because most of the image is a halftone where despeckling isn't desirable (see above). However, a close comparison of the filtered and unfiltered mushroom images will show that the despeckle filter was quite effective. As can be seen in Table 5, across twenty pages, the despeckle filter can have a significant impact on compression.
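To make the despeckling idea concrete, here is a toy filter that removes isolated black pixels. It is purely illustrative and is not the algorithm used by any of the tested products:

```python
def despeckle(img):
    """Toy despeckle: clear any black pixel (1) that has no black
    8-neighbor. Real products use more sophisticated filters; this
    only illustrates the idea of removing isolated dots."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] == 1:
                has_neighbor = any(
                    img[ny][nx] == 1
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))
                    if (ny, nx) != (y, x)
                )
                if not has_neighbor:
                    out[y][x] = 0   # isolated speck: drop it
    return out

page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],   # lone dot at the right edge
    [0, 0, 0, 0, 0],
]
clean = despeckle(page)   # the 2x2 block survives; the lone dot is gone
```

A real filter would also consider speck size, and would be disabled over halftone regions, for the reasons discussed above.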

Despite some improvement over lossless modes, none of the tested products was able to achieve high levels of compression on the line drawing in lossy mode.

Lossy (aggressive): This class includes modes that go beyond minor cleanup and that loosen up the font matching algorithms. With sufficient slack in character matching, it is possible for an incorrect character to be substituted. Though we didn't note any cases where that occurred, we didn't search exhaustively either. As seen in the table, not all the products offer aggressive lossy modes.
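The substitution risk can be illustrated with a toy matcher that treats two glyph bitmaps as the same symbol when they differ by at most a few pixels. This is not the actual JB2 or JBIG2 matching algorithm, just a sketch of why loosening the threshold is risky:

```python
def hamming(a, b):
    """Number of differing pixels between two equal-size glyph bitmaps."""
    return sum(p != q for p, q in zip(a, b))

def matches(a, b, slack):
    """Treat two glyphs as 'the same symbol' if they differ in at most
    `slack` pixels. Illustrative only, not any product's algorithm."""
    return hamming(a, b) <= slack

# 3x3 toy glyphs, flattened row by row (hypothetical shapes).
c = (1, 1, 1,  1, 0, 0,  1, 1, 1)   # a 'c'-like shape
e = (1, 1, 1,  1, 1, 0,  1, 1, 1)   # an 'e'-like shape: one extra pixel

print(matches(c, e, slack=0))   # strict matching keeps the glyphs distinct
print(matches(c, e, slack=2))   # loose matching would substitute one for the other
```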

Any2DjVu's compression in this mode was only marginally better than its lossy normal mode. Cjb2 improved more noticeably, but still wasn't as good as other products in their modestly lossy modes.

Silx improved markedly in this mode. It went from next to last to first for the variable text page, and from next to last to second for the uniform text page, suggesting that its real forte is text compression.

Not surprisingly, since aggressive modes in JB2 and JBIG2 are primarily aimed at improving compression of text, this mode did not offer much by way of improved compression of either the halftone or line drawing.

Halftone: Both JBIG2 products (PdfCompressor and Silx) have special halftone modes designed to provide even better compression when compressing halftones. Silx improved considerably over its other modes, but still could not match CPC or PdfCompressor in their modest lossy modes. On the other hand, PdfCompressor, which already had one of the best results on the halftone, really shone with its halftone filter on, producing a file five times smaller than the TIFF G4. If the image is examined closely, especially in comparison with a lossless version, some loss in image quality is evident, but it is surprisingly small given the considerable improvement in compression relative to G4.

Overall: The best combined total for the four images in lossless mode was achieved by PdfCompressor, thanks to its excellent lossless compression of the halftone image. Without use of special filters, the best combined lossy result was produced by CPC at 34% of TIFF G4. However, if PdfCompressor's special lossy halftone mode is included, then its overall lossy number drops to 25%.

For the two text pages, the best combined lossless compression was achieved by the two DjVu encoders, while the best lossy compression came from Silx in its aggressive mode.


Test Results for Grouped Pages (Table 5)

Benchmarks: Table 5 (below) begins with some size data about the test files and the effectiveness of G4 compression on each one. The first line of data shows the combined size of each set of 20 files in MB as uncompressed TIFFs and as a percentage relative to G4 compression. On these groups of files, G4 achieved reductions running from about 5.4 times (539%) for the halftone-dominated set, all the way up to 30 times (2993%) for the variable text set. The next line shows the size of the files in MB after G4 compression, assigning these values a nominal 100% against which to measure the effectiveness of the tested products.

The other data in the table are all percentages, rounded to the nearest whole number, relative to the size of the TIFF G4 files. Thus, a value of 50 indicates compression to 50% the size of TIFF G4. Values above 100 indicate files that were larger than the original TIFF G4 after conversion. Each value represents the 20-page sequence from the work described in the column heading. Click on the title/author link in the column heading to see thumbnails of all 20 pages.

Table 5. Test results for multipage files

Results Interpretation for Multipage Files (Table 5)

Testing multiple pages is important for several reasons. It forces some averaging on the results (few works consist entirely of halftones or line drawings, for example). It also minimizes any exaggeration of the file wrapper relative to the image contents that might occur with a single page. Finally, as mentioned previously, it allows the compression schemes that learn fonts to work on a larger set of characters and take full advantage of that feature.
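The benefit of font learning on longer documents can be sketched with a crude cost model: each distinct glyph is stored once in a shared dictionary, and later occurrences cost only a short reference. The bit costs below are invented for illustration and bear no relation to actual JB2 or JBIG2 encodings:

```python
def dictionary_cost(pages, glyph_bits=64, ref_bits=8):
    """Rough cost model for symbol-dictionary ('font learning')
    compression: each distinct glyph bitmap is stored once
    (glyph_bits), and every occurrence costs a short reference
    (ref_bits). Numbers are illustrative only."""
    dictionary = set()
    total = 0
    for page in pages:
        for glyph in page:
            if glyph not in dictionary:
                dictionary.add(glyph)
                total += glyph_bits
            total += ref_bits
    return total

one_page = [tuple("hello world")]   # glyphs modeled as characters
twenty = one_page * 20              # the same text repeated on 20 pages

per_page_single = dictionary_cost(one_page)
per_page_multi = dictionary_cost(twenty) / 20
print(per_page_single, per_page_multi)   # the dictionary cost amortizes
```

With the dictionary amortized over twenty pages, the per-page cost drops sharply, which mirrors the gains we saw for the multipage text sets.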

Lossless: The first of the lossless benchmarks, with a file format and compression scheme of "PDF/G4" shows what can be expected by moving a G4 datastream into a PDF envelope. Confirming what was seen for the individual pages, these groupings show that for most kinds of content, moving TIFF G4s into PDF envelopes results in little change in file size. However, given that multipage TIFFs are even less well-supported than individual ones, those wanting to distribute aggregations of individual TIFF G4s via the Web without any substantive change to the image data may want to consider this route.
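To show how thin the PDF envelope is, here is a sketch that wraps a raw G4 datastream in a minimal one-page PDF without re-encoding it. The datastream, width, and height are assumed to have been extracted from a TIFF already (a real converter must parse the TIFF to obtain them), and production PDFs carry more structure than this:

```python
def wrap_g4_in_pdf(g4_data: bytes, width: int, height: int) -> bytes:
    """Minimal one-page PDF around a raw CCITT Group 4 datastream.
    The image bytes pass through untouched; only the envelope changes."""
    content = b"q %d 0 0 %d 0 0 cm /Im0 Do Q" % (width, height)
    objs = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] "
        b"/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>"
        % (width, height),
        b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
        b"/ColorSpace /DeviceGray /BitsPerComponent 1 "
        b"/Filter /CCITTFaxDecode "
        b"/DecodeParms << /K -1 /Columns %d /Rows %d >> /Length %d >>\n"
        b"stream\n" % (width, height, width, height, len(g4_data))
        + g4_data + b"\nendstream",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objs, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objs) + 1)
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF"
            % (len(objs) + 1, xref_pos))
    return bytes(out)

# A placeholder stands in for real G4 bytes taken from a TIFF.
pdf = wrap_g4_in_pdf(b"<raw G4 bytes>", 2481, 3508)
```

Given a valid datastream, the resulting file displays the same image data as the TIFF; only the wrapper differs, which is why the table shows almost no size change for PDF/G4.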

PdfCompressor supports Web optimization of PDFs, which allows a multipage document to be viewed as soon as the first page is loaded, and also permits selective downloading of individual pages within multipage bundles. There are other products on the market that specialize in converting TIFF G4s to PDFs including Aquaforest's Tiff Junction (Aquaforest makes other TIFF conversion utilities that may be of interest to RLG DigiNews readers who use Microsoft's IIS as their Web server) and the open source utility c42pdf. As acknowledged by the keepers of c42pdf, "only part of TIFF 6.0 specification is used." We were unable to test it on our images because c42pdf couldn't handle our particular TIFF G4s.

Lossless conversions to other compression schemes held few surprises. With a few exceptions, the multipage sets compressed slightly better than the individual pages. The only major improvement from single page to multipage set occurred with "The Steam Turbine." In that case, the individual page consisted almost entirely of a hard-to-compress line drawing, while the 20-page set was only about 35% line drawings and 50% text. Therefore, the 20-page set compressed quite a bit better, with PdfCompressor turning in the best numbers.

Lossy (modest): In lossy mode, the impact of font learning on compression became quite apparent. The products using JB2 and JBIG2 compression all showed gains in the multipage samples of the text works. Topping the list was Silx, which overall compressed the multipage samples about twice as well as the single page, and had the best overall modestly lossy text results. Any2DjVu also showed significant improvement though not as pronounced as Silx, while PdfCompressor had even more modest improvements, though its results were still quite respectable, at about a quarter the size of the TIFF G4s. CPC's results were mixed, with the variable text pages improving and the uniform text pages compressing less well.

The page sets dominated by halftones and line drawings showed some improvement as well, but this can be largely attributed to the fact that the groups contained considerably more text than the single pages from the same works.

Lossy (aggressive): Aggressive mode compression of the multipage sets produced the best numbers we saw. For the two text sets, Silx achieved about eight and a half times the compression of TIFF G4. Any2DjVu also lowered its numbers, but not as much. However, the two more graphical sets (each only about 50% text) showed little change from modestly to aggressively lossy mode. In contrast to its very effective compression of text-only pages, Silx managed only very modest improvement relative to TIFF G4 for the more graphical works.

Halftone: The two graphical works both contain some halftones, so we tested both with the specialized halftone modes offered by PdfCompressor and Silx. With its halftone filter on, PdfCompressor turned in the best compression results for the multipage sets containing graphical content. We had problems using Silx for multipage files containing multiple halftones created with the halftone filter on. When viewed in Adobe Acrobat Reader (v.5 or 6), some of the halftones would not appear, and Reader sometimes crashed. We also had a problem with multipage, multi-halftone files in PdfCompressor v.2.0 (the documents would appear blank), but it was resolved with the release of v.2.1.

Overall: Again, PdfCompressor using JBIG2 had the best combined compression (over the 80 pages of our four 20-page sets) in lossless mode, with Any2DjVu only slightly behind. In modestly lossy mode with no special filters, PdfCompressor, CPC and Any2DjVu all achieved combined results in the low-mid 30% range. The best overall lossy results came from PdfCompressor with its halftone filter on, at 24%, more than four times smaller than TIFF G4.

For the two text-only sets, the best combined lossless compression was achieved by Any2DjVu, while the best lossy compression came from Silx in its aggressive mode.

Product Summaries (in alphabetical order by compression technology)

CPC: CPC isn't flashy. It wasn't a particular standout on any type of image content, yet even without special filters, it emerged at or near the top when the totals were added. This suggests it might be a good choice for mixed collections. Though it lacks a true lossless mode, we saw no specific quality problems. However, the appropriateness of using a lossy format would depend on whether the images are for preservation or access, and the nature of their content.

The main drawbacks to CPC have nothing to do with its compression performance or output quality. CPC is a proprietary format with only a single source for encoders and decoders. It is not Web native nor is it handled by any widely-used plug-in. Consequently, its best use may be for saving on local disk storage with image delivery in other formats, as is done by JSTOR.

DjVu: Any2DjVu is a great service for experimenting with the DjVu format. It's free, available from any Web browser and extremely flexible in the range of inputs it handles. Unfortunately, the results from it are merely suggestive of what DjVu is capable of, since the product behind the service isn't available for purchase.

Cjb2 is available for free, but has obvious limitations. Its best use would be to produce lossless DjVu versions of single page, text-only scans. Beyond that, its lossy compression is unexceptional and its cleaning routine damages halftone images.

We would like to have done at least some testing of LizardTech's Document Express 4.0, but, as mentioned earlier, we were unable to obtain the evaluation cartridge necessary to unlock the encoder. Nevertheless, we would encourage those interested in DjVu to download their own copy and try it out.

JBIG2: PdfCompressor appears to be a very solid, conservatively built product. The use of a graphical front end gives it a leg up in ease of use, but also limits the amount of tinkering the user can do. On the other hand, it also limits the amount of damage the user can do by selecting inappropriate compression settings that might lead to unexpected loss, including mismatched symbols. PdfCompressor produces impressive results, particularly for lossless halftone compression, but its overall results for all content types were very good. One of PdfCompressor's drawbacks is that it is currently available only for the Windows platform.

Silx, though outputting files in the same PDF/JBIG2 class as PdfCompressor, is a very different application. Silx appears to have been optimized for aggressive compression of text documents. In that mode, it produced the highest compression ratios we observed. Silx has numerous user-adjustable compression parameters, though many were not well documented in the evaluation copy we downloaded. Some could be damaging to image quality in the hands of an inexperienced user. In its current state, we see Silx as primarily a product for experienced users who need very high levels of compression of text-only documents. Its compression of halftones and line drawings was modest, even with its halftone filter on. One reason to keep an eye on Silx is the potential availability of an arithmetic encoder. Currently, Silx ships with a Huffman encoder; an arithmetic encoder that requires separate licensing is available. The arithmetic encoder can improve Silx's compression by 10-50% over the Huffman encoder's results. In some cases, that would move the Silx results to the head of the class. If and when the arithmetic encoder becomes a standard part of Silx, the product would merit another round of testing with a wide variety of content.

Recommendations

The original question that motivated this lengthy, three-part FAQ was whether any technology existed for bitonal files that might supersede TIFF G4 as the standard for bitonal scans of library and archive materials. In part I, we made it clear that TIFF G4 is the accepted standard for preservation master files of bitonal images. That doesn't seem likely to change right away, but it would be imprudent not to carefully consider possible alternatives.

Both TIFF and G4 are older imaging standards. Newer technologies are likely to chip away at their existing market. For color and grayscale images, JPEG2000 is gaining acceptance, and unlike its predecessor, JPEG, it supports both lossy and lossless compression well. The draft specification for v1.5 of Adobe's PDF says it will incorporate a decoder for JPEG2000, a step that, assuming it comes to pass, will undoubtedly speed the acceptance of JPEG2000 for Web use. Even if it is not initially embraced by the library and archive community, other current users of TIFF will undoubtedly migrate, weakening the TIFF market.

A similar scenario may well play out for bitonal images. JBIG2 is being embraced much more quickly and enthusiastically than JBIG1. It is already supported in Adobe Acrobat Reader and several applications exist to embed JBIG2 datastreams in PDFs. Many current users of TIFF G4 will undoubtedly be happy with the "visually lossless" lossy compression offered by JBIG2, especially given its considerably better compression. Again, the TIFF market will lose support.

Then consider the nature of the TIFF format. Unlike JPEG2000 and JBIG2, it is not a recognized international standard. Adobe owns the rights to the TIFF specification, and has not announced any plans for updating it. TIFF is a large specification, and though adopted as a de facto standard by many, much TIFF software implements only a portion of the specification. We encountered this problem when we found that c42pdf would not read our TIFF files, even though they are fully compliant with the specification.

Then there is the TIFF header. Long seen as an advanced feature (compared to file formats offering no structured metadata capability), it is now starting to seem rather quaint. Only a few TIFF header tags are truly standard, while large numbers of custom ones that have been registered over the years are not widely supported. Newer metadata standards for image formats such as JPEG2000 and PDF/A are based on XML and will offer far greater flexibility and Web integration.
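The structure under discussion, a small set of numeric tags in a binary directory, is easy to see with a bare-bones IFD reader. This sketch handles only the first directory and ignores tag values; real TIFF software must cope with far more of the specification:

```python
import struct

def tiff_tags(buf: bytes):
    """List (tag, type, count) entries from the first IFD of a TIFF.
    A minimal reader to illustrate the header structure only."""
    endian = {b"II": "<", b"MM": ">"}[buf[:2]]        # byte order
    magic, ifd_off = struct.unpack(endian + "HI", buf[2:8])
    assert magic == 42, "not a TIFF"
    (count,) = struct.unpack_from(endian + "H", buf, ifd_off)
    entries = []
    for i in range(count):
        # Each 12-byte entry: tag (2), type (2), count (4), value/offset (4).
        tag, typ, n = struct.unpack_from(endian + "HHI", buf, ifd_off + 2 + 12 * i)
        entries.append((tag, typ, n))
    return entries

# Synthetic little-endian TIFF with one IFD entry:
# tag 256 (ImageWidth), type 3 (SHORT), count 1, value 100.
tiny = (b"II" + struct.pack("<HI", 42, 8)
        + struct.pack("<H", 1)
        + struct.pack("<HHI", 256, 3, 1) + struct.pack("<HH", 100, 0)
        + struct.pack("<I", 0))   # offset of next IFD: none
print(tiff_tags(tiny))   # [(256, 3, 1)]
```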

The steps often taken to deal with TIFF's poor usability as a distribution format also merit consideration. Typically the format is changed to GIF, and the image is heavily scaled and then gray-enhanced for legibility. The resulting image is often larger than the master file from which it was derived, despite being of poorer quality. The modern formats tested here obviate the need for such tradeoffs. They offer file viewers that do automatic scaling and gray enhancement at lower resolution, along with more efficient bitonal compression. This allows a single image to provide legibility at low resolution without compromising detail at higher resolution, at little cost in disk storage or transmission delay. Servers are relieved of the burden of on-the-fly conversion, since the client machine handles all the processing.

Nevertheless, from a preservation perspective, there are a number of perfectly valid reasons to be very cautious about the new bitonal formats. CPC is a proprietary, closed specification. DjVu is also proprietary; though it has been open enough that an open source encoder exists, that encoder's performance lags well behind the commercial one's. As we discussed in part II of this FAQ, the future of DjVu is uncertain.

JBIG2 is an international standard that is gaining wide support. There will probably be one or more open source encoders available eventually. However, it must be acknowledged that JBIG2 is a big step up in sophistication from G4. Even for files maintained in lossless mode, successful migration of JBIG2 files to another format will require people with an in-depth understanding of its inner workings.

Another concern surrounds the issue of lossy compression. Every one of these new compression schemes claims that its lossy mode is visually or perceptually lossless. That in itself is not necessarily the main issue. All scanning, and especially bitonal scanning, is inherently a lossy process. Scanning inevitably introduces noise and artifacts. Bitonal scanning relies on thresholding, whereby each pixel must ultimately be rendered in either black or white. Even though most bitonal scanning is done by first scanning in gray scale, allowing for more intelligent thresholding, the process still has an arbitrary component. Most printing, and especially that in nineteenth century monographs, is not truly black and white, and capturing it bitonally involves some degree of loss.
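The thresholding step can be sketched in a few lines; real scanners use adaptive thresholds, but the arbitrariness is the same:

```python
def threshold(gray, t=128):
    """Convert a grayscale scan (0=black .. 255=white) to bitonal.
    Pixels darker than the threshold become black (1). The cutoff is
    inherently arbitrary: mid-gray pixels land on one side or the
    other, which is the loss bitonal scanning always incurs."""
    return [[1 if px < t else 0 for px in row] for row in gray]

row = [0, 60, 127, 128, 200, 255]    # a gradient from black to white
print(threshold([row], t=128))       # [[1, 1, 1, 0, 0, 0]]
print(threshold([row], t=100))       # [[1, 1, 0, 0, 0, 0]]
```

Note that moving the threshold changes which mid-gray pixels survive; neither answer is more "correct" than the other.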

More research is needed to understand the potential impact of lossy bitonal compression on scans of library and archive holdings. For example, at what point, for what quality of typography, does "loose" JB2 or JBIG2 character matching run the risk of substituting an incorrect character? How big a risk is generational loss from lossy bitonal compression? For what kinds of materials might these risks be minimal or acceptable? (A good discussion of the pros and cons of lossless vs. lossy compression appeared in RLG DigiNews in February 1999.)

These more modern compression schemes may be well-suited for certain kinds of digital library holdings, particularly those with less artifactual value. Use of cleaning and despeckling algorithms may be considered anathema by some, but in some cases may actually produce images that are closer to the original. The ability to send a high-resolution image rather than a fuzzy, scaled down, gray-enhanced image could enhance scholarship in some disciplines. Bundling of single page TIFFs improves utility and lessens user frustration.

Thus, our primary recommendation is to not dismiss these alternatives out of hand, but to carefully consider the pros and cons, examine our data, and conduct your own tests on your own images. Given that TIFF G4 will likely not be supported forever, something has to replace it. Of the currently available alternatives, JBIG2 in PDF looks to have the best chance of superseding TIFF G4. However, regardless of whether any of these becomes the new standard for bitonal imaging, it makes sense for libraries and archives to be aware of the developments, understand the implications, and proactively respond to them, rather than waiting for market forces to lead the way.

Richard Entlich


Calendar of Events

Sixth International Digitisation Summer School for Cultural Heritage Professionals
July 6-11
University of Glasgow, United Kingdom

Topics include digitization of image and text material, digitizing audiovisual material, preservation, metadata, and digital asset management systems.

Digital Preservation Management: Short-Term Solutions to Long-Term Problems
August 3-8, 2003
Cornell University Library, Ithaca, NY

Three places are still available for Cornell University Library's new digital preservation training program on digital preservation management. This limited enrollment workshop partially funded by the National Endowment for the Humanities has a registration fee of $750 per participant. Registration is now open for the August workshop. A second workshop is scheduled for October 13-17 (registration will open this summer). There will be three workshops in 2004.

Advanced XML: Data Transformation with XSLT
August 3-15
Charlottesville, Virginia

Taught by seasoned XML/XSLT developers from the Brown University and University of Virginia libraries, this three-day workshop will explore XSLT with focus on the role of XSLT in digital library projects. The workshop will be a mix of lecture and hands-on demonstration and experimentation culminating in the creation of an XSLT-based library application.

ECDL 2003, 7th European Digital Libraries Conference
August 17-22
Trondheim, Norway

The conference will include sessions on usability evaluation of digital libraries, how to build a geospatial digital library, Web technologies, subject gateways, topical crawling, and digital preservation.

Eighth International Summer School on the Digital Library: Libraries, Electronic Resources, and Electronic Publishing
August 24–27
Tilburg University, the Netherlands

The course aims to support university and research libraries in a transitional phase and to identify new roles and opportunities for them. Topics include e-publishing, intellectual property, and open archives.

DRH 2003: Digital Resources for the Humanities
August 31-September 3
University of Gloucestershire, United Kingdom

A forum for all those involved in, and affected by, the digitization of cultural heritage materials.

ERPANET Seminar: Metadata in Digital Preservation, Getting What You Want, Knowing What You Have, and Keeping What You Need
September 3–5
Marburg, Germany

The seminar will discuss various perspectives on metadata to facilitate preservation, issues of interoperability, the role of standards and schemas, cost, and other state-of-the-art developments.

IS&T Archiving Conference Meeting Announcement and Call for Papers
April 20-23, 2004
San Antonio, Texas

The conference will focus on techniques for preserving, cataloging, indexing and retrieving images and documents in both digital and human readable formats. Goals are to benchmark systems that might be in place to preserve digital and print information for the future, as well as to identify areas where further research is necessary.

 


Announcements

Directory of Open Access Journals
The aim of the Directory of Open Access Journals is to increase the visibility and ease of use of open access scientific and scholarly journals. Its goal is to include all such journals that use a quality control system to guarantee the content, regardless of language or subject matter.


OCLC launches Digitization & Preservation Online Resource Center

OCLC has built a collection of resources relating to digital preservation initiatives, containing links to resources on copyright, digitization, grants assistance, and preservation issues. Also included is a section highlighting digitization projects at other institutions.

Library of Congress records available for harvesting
The Library of Congress has recently created a page of collections for which records are available for harvesting through the Open Archives Initiative Protocol for Metadata Harvesting. Item-level metadata is available for selected collections presented through the American Memory collection and through the Prints & Photographs Division online catalog.

Open eBook Forum: Consumer Survey on Electronic Books released
This survey attempts to determine consumer preferences for electronic and paper books and to measure attitudes toward eBooks by people who read “paper” books. To access the report, you are required to fill out a short form.

CLIR and Library of Congress release National Digital Preservation
Initiatives Report

Digital preservation initiatives in four countries and related multinational initiatives are highlighted in this new report, which is intended to help inform the development of a national strategy for digital preservation in the U.S.

Sun Education and Research posts new resources
Several new PDF whitepapers have been posted on the Sun Web site, including The Digital Library Toolkit, Version Three, The Digital Campus Primer, The E-Learning Architectural Framework, Digital Library Technology Trends, and Information Technology Advances in Libraries.

Harvard-Smithsonian Digital Video Library

The Harvard-Smithsonian Center for Astrophysics, in partnership with the American Association for the Advancement of Science, will establish a library of 350 hours of digital video materials supporting science, technology, engineering, and math education. The video clips will include demonstrations of phenomena, case studies of instruction or research, and interviews with experts.


Publishing Information

RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell (jlh@notes.rlg.org), RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Erica Olsen; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello; Assistant: Valerie Jacoski.

All links in this issue were confirmed accurate as of June 13, 2003.

 

   
 