June 15, 2003, Volume 7, Number 3
ISSN 1093-5371
Editor's Interview
National Digital Information Infrastructure and Preservation Program
Laura Campbell

Editors’ Note
It’s wonderful that Congress has authorized and financially supported the next phase of NDIIPP, but how will the funds be committed? What percentage of funding will be spent on research, planning, implementation, evaluation, and other core areas?
Like Russian Dolls: Nesting Standards for Digital Preservation

This article introduces three standards for digital preservation, at least two of which feature prominently in the appendix of the plan Congress just approved.[1] Understanding what these standards are, what they can and cannot do, provides a solid foothold in present and future discussions surrounding long-term retention of digital materials, as well as a leg up on implementation. As the title suggests, the three standards nest like Russian dolls—one provides the larger framework within which the following, more granular, standard may be implemented.
Although all this probably sounds confusing in bulleted shorthand, it actually makes a lot of sense when properly laid out. This article walks through the standards one by one and elaborates on their functionality and interaction. As it works its way through the standards from the most general to the very specific, it will also home in on digital images as the files to be preserved. The expansive OAIS applies to any type of media, even nondigital materials, whereas METS applies exclusively to the digital realm of images, audio, and video. The NISO Data Dictionary focuses on technical metadata for digital still images.

From a business perspective, digital preservation is a mechanism to ensure return on investment. Enormous amounts of money have been and are being spent on reformatting original materials or creating digital resources natively. If the cultural heritage community cannot sustain access to those resources or preserve them, the investment will not bear the envisioned returns. Although a basic understanding of the general problems surrounding preservation in an ever-changing technical environment has started to permeate memory institutions, practical solutions to the challenge are slow to emerge. The three standards, OAIS, METS, and Z39.87, converge as a sustainable system architecture for digital image preservation.

The space data community represents another group with enormous stakes in the long-term viability of its data. Capturing digital imagery of art or manuscripts may seem expensive, but the cost pales in comparison to that of gathering digital imagery from outer space. Under those circumstances, losing access to data is not an option. To foster a framework for preserving data gathered in space, the Consultative Committee for Space Data Systems (CCSDS) began work on an international standard in 1990. A good ten years later the OAIS was approved by the International Organization for Standardization (ISO).

The fledgling standard met with great interest from the library community. Among its first implementers were the CURL Exemplars in Digital Archives (CEDARS) project and the Networked European Deposit Library (NEDLIB); implicitly, the National Library of Australia (NLA) has also adopted the model.[3] The California Digital Library recently received an Institute of Museum and Library Services (IMLS) grant to take first steps toward a University of California-wide preservation repository implementing the OAIS.

In the standard’s own words, “[a]n OAIS is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.”[4] The standard formulates a framework for understanding and applying concepts in long-term preservation of digital information. It provides a common language for talking about these concepts and the organizational principles surrounding an archive. Though the OAIS pertains to both the digital and the analog realm, it has received the most attention for its applicability to digital data.

As a reference model, the OAIS in and of itself does not specify an implementation—it does not tell you which computers to buy, which software to load, or which storage medium to use. The standard does tell you, however, how an archive should be organized. In its so-called functional model, it defines the entities (or departments, if you will) in an archive, their responsibilities, and interactions.
The data flows between those entities and the outside world are specified in the information model, which delineates how information gets into the archive, how it lives in the archive, and how it gets served to the public. The OAIS leaves it up to every distinct community to flesh out an implementation of the high-level guidelines. For the cultural heritage community a number of OAIS-related documents exploring the framework’s application to libraries, museums, and archives have come out of the joint OCLC-RLG Preservation Metadata Working Group.[5]
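To make the functional entities and their data flows more concrete, here is a minimal, illustrative sketch in Python. The class and method names are hypothetical shorthand for the model's concepts (submission, archival, and dissemination information packages; ingest; access), not part of the OAIS standard or of any particular repository's software.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class InformationPackage:
    """Toy stand-in for an OAIS information package (SIP, AIP, or DIP)."""
    content_files: Dict[str, bytes]        # the digital objects themselves
    descriptive_metadata: Dict[str, str]   # what Data Management indexes
    preservation_metadata: Dict[str, str]  # fixity, provenance, technical details

class Archive:
    """Illustrative OAIS-style archive: Ingest -> Archival Storage / Data Management -> Access."""

    def __init__(self):
        self._storage = {}   # stands in for Archival Storage
        self._catalog = {}   # stands in for Data Management

    def ingest(self, object_id: str, sip: InformationPackage) -> None:
        # Ingest turns the producer's submission package into an archival
        # package, typically adding management metadata along the way.
        aip = InformationPackage(
            content_files=dict(sip.content_files),
            descriptive_metadata=dict(sip.descriptive_metadata),
            preservation_metadata={**sip.preservation_metadata, "ingest_event": "recorded"},
        )
        self._storage[object_id] = aip
        self._catalog[object_id] = aip.descriptive_metadata

    def access(self, object_id: str) -> InformationPackage:
        # Access derives a lighter-weight dissemination package for the consumer.
        aip = self._storage[object_id]
        return InformationPackage(
            content_files=aip.content_files,
            descriptive_metadata=aip.descriptive_metadata,
            preservation_metadata={},  # consumers rarely need the full management record
        )

archive = Archive()
archive.ingest("page-001", InformationPackage(
    content_files={"page-001.tif": b"..."},
    descriptive_metadata={"title": "Example page"},
    preservation_metadata={"checksum_md5": "0" * 32},
))
dip = archive.access("page-001")
```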
As figure 1 illustrates, the OAIS stipulates that an archive (everything within the square box) interacts with a producer as well as a consumer. It takes in data from the producer through its ingest entity, and it serves out data to the consumer through its access entity. Within the archive itself, the data content submitted for preservation gets stored and maintained in the archival storage unit; data management maintains the descriptive metadata identifying the archive’s holdings. The OAIS dubs the data flowing between the different players information packages, or IPs. The data flows sketched out in figure 1 contain the following information packages:
The data represented by the information packages may vary according to the specific needs at each station: an archival information package, for example, probably contains more data aimed at managing the object than its more lightweight counterpart on the access side, the dissemination information package. Furthermore, the OAIS details several categories of information comprising a complete information package, but in keeping with its role as a reference model, it stops short of suggesting specific data elements or a specific encoding for the entire bundle of information.[6]

Any community interested in implementing the OAIS has to identify or create a file-exchange format to function as an information package. For the cultural heritage community, METS shows great potential for filling that slot. METS wraps digital surrogates with descriptive and administrative metadata into one XML document. Digital surrogates in this context could be digital image files as well as digital audio or video.

At the heart of each METS object sits the structural map, which becomes a table of contents for public access. The hierarchy of the structural map allows the navigation of media files embedded in, or referenced by, the METS object. It enables browsing through the individual pages of an artist’s book as well as jumping to specific segments in a time-based program, for example, a particular section of a video clip. These so-called digital objects encoded in METS have three main applications that conveniently align with their potential as OAIS information packages.
Fig. 2. A METS object represented in the context of RLG Cultural Materials—a Chinese album from the Chinese Paintings Collection, contributed by the UC Berkeley Art Museum and Pacific Film Archive.

The METS XML schema divides the standard into a core component and several extension components. The METS core supports navigation and browsing of a digital object. It consists of a header, content files, and a structural map. The METS extension components support discovery and management of the digital object. They consist of descriptive metadata and administrative metadata, which in turn split into technical, source, digital provenance, and rights metadata.
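To show how these sections hang together, the sketch below assembles a bare-bones METS-like document in Python. The element names follow the METS schema, but the object itself is only an illustration: the image URL and the Dublin Core record are hypothetical, the header and administrative sections are omitted, and the output is not validated against the schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("mets", METS), ("xlink", XLINK), ("dc", DC)):
    ET.register_namespace(prefix, uri)

mets = ET.Element(f"{{{METS}}}mets", OBJID="example-001", LABEL="Sample album")

# Descriptive metadata section: a Dublin Core record plugged in via its own namespace.
dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="DMD1")
wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
ET.SubElement(xml_data, f"{{{DC}}}title").text = "Chinese album (illustrative record)"

# File section: one referenced image file (the URL is hypothetical).
file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
file_grp = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", USE="master")
file_el = ET.SubElement(file_grp, f"{{{METS}}}file", ID="FILE1", MIMETYPE="image/tiff")
ET.SubElement(file_el, f"{{{METS}}}FLocat",
              {f"{{{XLINK}}}href": "http://example.org/images/leaf01.tif"},
              LOCTYPE="URL")

# Structural map: the table of contents that points at the file above.
struct_map = ET.SubElement(mets, f"{{{METS}}}structMap", TYPE="physical")
album = ET.SubElement(struct_map, f"{{{METS}}}div", TYPE="album", LABEL="Album")
leaf = ET.SubElement(album, f"{{{METS}}}div", TYPE="leaf", LABEL="Leaf 1", DMDID="DMD1")
ET.SubElement(leaf, f"{{{METS}}}fptr", FILEID="FILE1")

print(ET.tostring(mets, encoding="unicode"))
```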
Figure 3 details the components of a METS object and one possible set of relationships among them.
The METS designers leveraged the combined power of the W3C specifications for XML schema and Namespaces in XML to create a flexible standard.
In this way, each community can plug in its own preferred descriptive elements as long as they have been formalized into a schema.[9] The visual resources community, for example, may choose to extend METS using the VRA Core, while libraries might be more inclined to stick with the Metadata Object Description Schema (MODS) from the Library of Congress. Others may decide the Dublin Core (DC) satisfies their access needs. The flexibility achieved through namespaces gives METS the potential for implementation across a wide range of communities. The same logic applies to all components of administrative metadata. Each community has the opportunity to specify what data it deems most important for the management of its information, formalize those requirements into an XML schema, and use that schema as an extension to the hub-standard METS. For an example of a project that has identified or created a comprehensive suite of METS extensions, consult the Library of Congress AV Prototyping project.

An alternative to embedding metadata for the extension components through XML Namespaces and external schemas consists in simply referencing the data from within the object. Descriptive or administrative metadata may live outside the XML markup in a database, to which the METS object can point. Even down to the level of media files, METS provides the dual option of referencing or embedding. The METS specification makes provisions for wrapping the actual bit stream of a digital file in the XML. In most cases, however, files live at online locations pointed to from within the object.

In the realm of technical metadata, a fledgling NISO standard takes center stage for describing the different parameters of digital image files. Known as the NISO Data Dictionary—Technical Metadata for Digital Still Images, or Z39.87, the standard specifies a list of metadata elements. The Library of Congress, motivated by its AV Prototyping project, created an XML schema encoding for Z39.87, called NISO Metadata for Images in XML Standard (MIX). This XML schema constitutes the smallest Russian doll in our series of nesting standards, as it may be plugged into the METS framework as an extension schema for technical metadata. The standard also proposes fields for the source and digital provenance sections of METS.

The NISO effort draws heavily on the Tagged Image File Format specifications, better known by the acronym TIFF. As the name implies, this format uses tags to define the characteristics of a digital file.[10] Image creation applications write the necessary parameters to the tags within the TIFF file, which means that the majority of the data Z39.87 covers already exists in file headers. To complete the metadata cycle, harvester utilities have to extract the information from the image file headers and import it into digital-asset-management systems for long-term preservation. By using the image file format specification as an integral part of the Data Dictionary, the standard leverages existing metadata to achieve cost savings. On the other hand, in going beyond the TIFF specifications for some elements, the NISO standard acknowledges information outside the TIFF scope that plays an important role in digital preservation. From this vantage point, the Data Dictionary becomes an important tool for educating vendors about the metadata our community sees as invaluable to preserve our investment.
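As an illustration of that harvesting step, a small sketch follows that reads a few baseline TIFF header tags and relabels them with Z39.87-flavored field names. It assumes the Pillow imaging library is installed and that the input is a TIFF it can open; the field labels and the file name are informal stand-ins chosen for readability, not the exact element names defined by the Data Dictionary or MIX.

```python
from PIL import Image

# Baseline TIFF tag numbers (from the TIFF 6.0 specification).
TAGS_OF_INTEREST = {
    256: "imageWidth",
    257: "imageHeight",
    258: "bitsPerSample",
    259: "compressionScheme",
    271: "scannerManufacturer",
    272: "scannerModelName",
    282: "xSamplingFrequency",
    283: "ySamplingFrequency",
    296: "samplingFrequencyUnit",
    306: "dateTimeCreated",
}

def harvest_technical_metadata(path: str) -> dict:
    """Pull a handful of header tags from a TIFF and relabel them with
    Z39.87-style field names (illustrative labels, not normative ones)."""
    with Image.open(path) as img:
        tags = img.tag_v2  # Pillow's dictionary of TIFF header tags
        return {
            label: tags[tag_id]
            for tag_id, label in TAGS_OF_INTEREST.items()
            if tag_id in tags
        }

if __name__ == "__main__":
    # Hypothetical master file; substitute one of your own scans.
    print(harvest_technical_metadata("master_0001.tif"))
```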
RLG is investigating the formation of a group advocating among digital camera-back vendors for the cultural heritage community’s metadata needs.[11] An industry standard for consumer digital cameras called DIG35 already has broad support among vendors. DIG35 allows transfer of information from the camera to the software utility that consumers use to manage their holiday snapshots. Building on that model, NISO Z39.87 in its XML instantiation MIX could become the file-exchange format to go between high-end scanners or camera backs and sophisticated asset-management databases. The Data Dictionary divides the technical metadata elements into four groups.
For any institution just starting out on the path of digital preservation, managing technical metadata through the NISO Data Dictionary is a great first step. The term data dictionary itself comes from the database community; it refers to a file defining the basic organization of a database down to its individual fields and field types. NISO Z39.87 represents a blueprint for a database or a database module that can be implemented fairly quickly—all the intellectual legwork has already been done by the standards committee. To expand the database to include structural metadata relating files to each other, a descriptive record, and rights metadata, look to METS and its extension schemas. Again, the Library of Congress AV Prototyping project offers a model implementation of a database using the METS approach. Scaling up to the bigger picture, this database could find its home in an archival environment specified by the OAIS.
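A minimal sketch of such a database module, using Python's built-in sqlite3, might look like the following. The table and column names are an illustrative subset loosely modeled on the Data Dictionary's element groups, not the standard's complete or exact element set.

```python
import sqlite3

# One row per image master; the columns are a small, illustrative subset of
# the kinds of elements Z39.87 defines (basic image parameters, image
# creation, and imaging performance assessment).
SCHEMA = """
CREATE TABLE IF NOT EXISTS technical_metadata (
    file_name              TEXT PRIMARY KEY,
    mime_type              TEXT,
    byte_size              INTEGER,
    checksum_md5           TEXT,
    image_width            INTEGER,
    image_height           INTEGER,
    bits_per_sample        TEXT,
    compression_scheme     TEXT,
    color_space            TEXT,
    x_sampling_frequency   INTEGER,
    y_sampling_frequency   INTEGER,
    scanner_manufacturer   TEXT,
    scanner_model_name     TEXT,
    date_time_created      TEXT
);
"""

conn = sqlite3.connect("image_masters.db")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT OR REPLACE INTO technical_metadata "
    "(file_name, mime_type, image_width, image_height, bits_per_sample, compression_scheme) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("master_0001.tif", "image/tiff", 4096, 6144, "8,8,8", "Uncompressed"),
)
conn.commit()
conn.close()
```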
To summarize: as illustrated by figure 4, the OAIS stipulates information packages, which find instantiation in METS; METS stipulates an extension schema for technical metadata, which finds an instantiation in Z39.87’s XML schema, MIX. Now, after the detailed review, the first bulleted list in this article should make a lot more sense. In
broad strokes, digital preservation with the nesting standards OAIS, METS,
and Z39.87 looks like a puzzle with all the pieces neatly falling into
place. In the details, however, some harmonization issues between the
standards remain. For example, the OAIS model breaks an information package
into different subcomponents than the
METS schema; the NISO Data Dictionary and its XML encoding MIX cover not
only the technical metadata extension of METS, but also some elements
that the digital object standard relegates to sections on source and digital
provenance. Nevertheless, the convergence of three standards developed
independently illustrates that a holistic view of digital preservation
is emerging. Only widespread implementation will tell whether the theory
as outlined by the standards can hold up in practice.

Footnotes
[2] National Information Standards Organization.
[10] For the full TIFF tag library, see Appendix A of the format specifications.
[11] For more information about this fledgling initiative, please contact the author.

Saving Digital Heritage—A UNESCO Campaign
Colin Webb
So begins an important new document being prepared for submission to the General Conference of UNESCO, the United Nations Educational, Scientific and Cultural Organisation. The Draft Charter on the Preservation of the Digital Heritage was positively received by a recent session of the UNESCO Executive Board, which asked for further consultations during preparation of a final draft for consideration.

The Draft Charter is one very visible element in an international campaign to address the barriers to digital continuity and to head off the emergence of a second “digital divide,” in which the tools of digital preservation are restricted to the heritage of a well-resourced few. As well as the Charter, other elements of UNESCO’s strategy for promoting digital preservation include widespread consultations, the development of practical and technical guidelines, and a range of pilot projects.

UNESCO has played a critical role in fostering the understanding and preservation of other kinds of heritage through avenues such as the World Heritage Convention and the Memory of the World program. Given the organisation’s commitment to the safeguarding of recorded knowledge evident in its Information for All program, it is not surprising that UNESCO has been concerned at the prospect of the loss of vast amounts of digital information. Digital technology’s immense potential for human benefit in so many areas—communication, expression, knowledge sharing, education, community building, accountability, to name just a few—is a tantalizing promise so easily denied by the lack of means, knowledge, or will to deal with its other great potential: rapid loss of access.

The impetus for this campaign was embedded in a resolution passed by the UNESCO General Conference at its previous meeting in 2000. That resolution, drafted in part by the Council of Directors of National Libraries (CDNL), highlighted the need to safeguard endangered digital memory. Following that, as a basis for developing a UNESCO strategy, the European Commission on Preservation and Access (ECPA) was commissioned to prepare a discussion paper outlining the issues in digital preservation for debate.

Consultation Process
As well as circulating the campaign's draft papers for comment to governments, nongovernment organisations, and experts all over the world, the campaign has featured a number of regional consultation meetings convened specifically to raise issues of regional concern and to provide comment on the Preliminary Draft Charter and Draft Guidelines on the Preservation of Digital Heritage. The meetings were held between November 2002 and March 2003, in Canberra, Australia (for Asia and the Pacific); in Managua, Nicaragua (for Latin America and the Caribbean); in Addis Ababa, Ethiopia (for Africa); in Riga, Latvia (for the Baltic states); and in Budapest, Hungary (for Eastern Europe). All the meetings confirmed the need for urgent action and the great distance to be traveled before preservation of digital heritage is a reality in most countries. In total, around 175 experts and stakeholders from eighty-six countries participated in the five meetings, representing libraries, records archives, museums, audiovisual archives, data archives, producers and publishers of digital content, lawyers, universities and academies, governments, standardization agencies, community development organisations, computer industries, and researchers, among others.
Draft Charter on the Preservation of the Digital Heritage
Charters and declarations promulgated by UNESCO are meant to be “normative” documents that member states agree to through a vote of acceptance rather than by individual ratification. They are not binding and do not require any specific action on the part of governments, but they do express aspirations and priorities. In this case the purpose of the Draft Charter is to focus worldwide attention on the issues at stake and to encourage responsible preservation action wherever it can be taken. The Draft Charter explains that the digital heritage
The purpose of preserving this heritage is to ensure that it can be accessed. The Draft Charter recognizes that this involves a tension and seeks a “fair balance between the legitimate rights of creators and other rights holders and the interests of the public to access digital heritage materials” in line with existing international agreements. It recognizes that some digital information is sensitive or of a personal nature and that some restrictions on access and on opportunities to tamper with information are necessary. Sensibly, it asserts the responsibility of each member state to work with “relevant organisations and institutions in encouraging a legal and practical environment which would maximise accessibility of the digital heritage.”

Threats to this digital heritage are highlighted, including rapid obsolescence of the technologies for access, an absence of legislation that fosters preservation, and international uncertainties about resources, responsibilities, and methods. Urgent action is called for, ranging from awareness raising and advocacy to practical programs that address preservation threats throughout the digital life cycle.

In discussing the measures that are needed, the Draft Charter emphasizes the importance of deciding what should be kept, taking account of the significance and enduring value of materials, and noting that the digital heritage of all regions, countries, and communities should be preserved and made accessible. It discusses the legislative and policy frameworks that will be needed and calls on member states to designate agencies with coordinating responsibility. It also calls on governments to provide adequate resources for the task. Many agencies have a role to play, both within and outside governments. Agencies are urged to work together to pursue the best possible results and to democratize access to digital preservation methods and tools. The Draft Charter proposes a UNESCO commitment to foster cooperation, build capacity, and establish standards and practices that will help.

Although this document is meant to inspire rather than dictate action, its adoption by UNESCO will be an important opportunity to raise digital preservation issues with governments and others who can influence how laws, budgets, and expectations are framed to help or hinder continuity of the digital heritage.
While the Charter focuses on advocacy and public policy issues, the Guidelines present practical principles on which technical decisions can be based throughout the life cycle of a wide range of digital materials. The Guidelines, prepared by the National Library of Australia on commission from the UNESCO Division of Information Society, have been published on the UNESCO CI (Communication and Information) Web site. The guidelines address at least four kinds of readers with different but overlapping needs:
The structure of the guidelines is intended to make it easy for readers to find the information most relevant to their needs. The regional consultation process highlighted the fact that many people who feel they have a preservation responsibility are operating with very limited resources. Specific suggestions have been included to provide some starting points, although comprehensive, reliable digital preservation is a resource-intensive business.

Material in the Guidelines is organized around two approaches: basic concepts behind digital preservation (explaining concepts of digital heritage, digital preservation, preservation programs, responsibility, management, and cooperation) and more detailed discussion of processes and decisions involved in various stages of the digital life cycle, including deciding what to keep, working with producers, taking control and documenting digital objects, managing rights, protecting data, and maintaining accessibility.

Although the guidelines were directly produced by the National Library of Australia, they were extensively informed by reading and by comments from a wide range of contacts, in addition to responsive comments from the formal consultation meetings. The text does not reflect any new research, but it does try to reflect current thinking about the maintenance of accessibility, the core issue in digital preservation (although certainly not the only important issue).

For some readers the level of technical detail will be disappointing. The detail required to meet all the needs of practitioners is very situation-specific and quickly dated. As the Guidelines are intended to be useful in a very wide range of sectors and circumstances, the emphasis is on technical and practical principles that should enable practical decisions. It is to be hoped that UNESCO will complement the Guidelines with a Web site offering a growing body of technical details and tips aimed at specific sectors.

To give readers a sense of the approaches taken, a few of the principles asserted in the Guidelines are appended to this paper. The UNESCO Guidelines for the Preservation of Digital Heritage will be published in a number of languages. At the time of writing, they are available in English from the UNESCO Web site.
Highlighted Web Site

FAQ

Squeezing More Life Out of Bitonal Files: A Study of Black and White. Part III.

Your editor's interview in the December 2002 RLG DigiNews states that JPEG 2000 can save space and replace the multitude of file formats used for conversion and display of cultural heritage images but that it isn't suitable for bitonal material. We have lots of bitonal images. Is there anything similar available for them?

Part
I of this three-part FAQ discussed general considerations for migration
of scanned bitonal images away from TIFF G4, while Part
II examined the characteristics of several alternative bitonal file
formats and compression schemes that have become available during the
past decade. In this, the final installment, we present the results of
our experiences with several products for converting individual and multipage
bitonal high-resolution TIFF G4s. Our coverage includes product specifications,
general impressions, compression data, and sample images. Please note
that some of the files require special plug-ins to be viewed. Instructions
for downloading the necessary viewers are given below.

Test Image Selection
Though a bitonal image may seem like a simple affair, how well a particular image compresses depends on how it was scanned, the nature of its content, and the design of the compression scheme. Characteristics of the source image that can affect the rate of compression include:
Why do these factors affect compression? It helps to understand a little about how image compression is accomplished. Lossless compression depends on the recognition of patterns and the replacement of repeated elements with compact representations that exactly describe the feature being compressed. For example, instead of storing every bit in a scan line of all white bits, simply store a count of the white bits. Thus, sparse printing that leaves a lot of white space compresses well, while dense printing or highly speckled pages result in more transitions between black and white and thus less efficient compression.

The more sophisticated compression algorithms tested here take advantage of the fact that higher level elements are repeated within printed documents, including the symbols that make up the text. Thus, if a 12-point, Times Roman, non-bold, non-italic, non-underlined, lowercase 'a' appears in a document, its bitmap can be stored in a database and a subsequent appearance of the identical character can be replaced by a pointer to the database. This explains why clean, uniform typography compresses better than irregular, highly variant typography. Longer documents have an advantage because the algorithm "learns" more and more of the characters as it processes the text.

Halftones deserve a special mention. Bitonal halftoning is a printing process that simulates shades of gray by varying the size and spacing of black dots. Avoiding problems such as moiré (interference patterns) and poor contrast when scanning halftones bitonally requires the use of special processing algorithms (e.g., dithering or error diffusion). When done properly, the typical scanned halftone will be densely packed with data of a somewhat random nature, presenting a real challenge to compression algorithms. Lossy but "visually lossless" compression attempts to remove elements that are redundant for human visual perception, producing an image that contains less information, but doesn't appear degraded.

We selected four images for in-depth testing, representing a variety of content types. We also tested 20-page sequences derived from the same works in order to average out anomalies and give the compression algorithms a chance to show off their "learning curves." The images are from three of Cornell's older collections: historic math books, NEH agriculture, and historic monographs. All images are bitonal 600 dpi TIFF G4s. If you follow the links for the individual pages from Table 1, you'll be taken to the image as it appears within the Cornell Digital Library—converted from TIFF to GIF, scaled down by a factor of six and enhanced with gray for improved legibility. The links for the 20-page groupings will bring up all 20 pages in GIF thumbnail mode, from which larger GIFs of the individual pages can then be accessed.

Table 1. Details of Test Images
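To make the run-length idea described above concrete, here is a toy sketch of lossless run-length coding of single scan lines. It is a deliberate simplification of the principle, not the actual G4, JB2, or JBIG2 algorithm.

```python
from itertools import groupby

def encode_scanline(bits):
    """Replace runs of identical bits with (bit, run_length) pairs."""
    return [(bit, sum(1 for _ in run)) for bit, run in groupby(bits)]

def decode_scanline(runs):
    """Rebuild the original bit sequence exactly (lossless)."""
    return [bit for bit, length in runs for _ in range(length)]

# A mostly white scan line (0 = white, 1 = black) collapses to a few pairs...
sparse = [0] * 200 + [1] * 4 + [0] * 300
print(encode_scanline(sparse))         # [(0, 200), (1, 4), (0, 300)]

# ...while a speckled line produces many short runs and compresses poorly.
speckled = [i % 2 for i in range(20)]
print(len(encode_scanline(speckled)))  # 20 runs for 20 bits

assert decode_scanline(encode_scanline(sparse)) == sparse
```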
Conversion software selection
Our software testing was limited to products that are open source or for which free evaluation copies are available. In some areas of computing, that might greatly constrain the selections, but in the specialized market niche of bitonal image conversion, it hardly cramped our style at all. We were able to test most of the important packages without spending a dime on software acquisition, which bodes well for anyone who wants to test these products on their own image collections. As indicated in part II, we focused testing on products supporting three main technologies:

CPC: We fully tested CPC Tool from Cartesian Products, the only product available to encode this proprietary format.

DjVu: a file format supporting several compression schemes for bitonal, gray level, and color images. It supports both lossy and lossless bitonal compression. The bitonal compression algorithm is called JB2 and is similar to JBIG2. We fully tested Any2DjVu, a Web service that allows files of many different formats (including TIFF G4) to be uploaded and converted to the DjVu format. We also tested cjb2, a bitonal DjVu encoder that is part of the DjVuLibre package, an open source implementation of DjVu. Cjb2 only converts single pages. Although DjVuLibre comes with a utility (called djvm) that combines single DjVu pages into multipage DjVu files, it does not support font learning across pages. Thus we tested cjb2 only for the encoding of single pages. There is also a commercial DjVu encoder, made by the format's owner, LizardTech, Inc., which we did not test. Currently available as part of LizardTech's Document Express 4.0, it is available in a trial version from LizardTech's Web site. The trial became available fairly late in our testing cycle and requires a special page cartridge (allowing the encoding of 250 pages), which we requested but still had not received ten days later.

JBIG2: a lossless and lossy compression scheme for bitonal images only. JBIG2 does not specify a file format, but is often associated with PDF. We fully tested two JBIG2 in PDF encoders, PdfCompressor from CVision Technologies and SILX from PARC (Palo Alto Research Center). Another option for JBIG2 that we did not test is Adobe's Acrobat Capture with Compression PDF Agent.

Tables 2 and 3 provide additional details on the products tested.

Table 2. Product information (general)
How we tested
For each tool, we converted the four individual test pages from TIFF G4 to the supported target format. In the case of cjb2, the open source bitonal DjVu encoder, we first had to convert the TIFF G4s to pbm (portable bitmap) format, which we did with the free Windows application Irfanview. We also converted 20-page groupings derived from the same works as the individual test pages, except for cjb2, which only handles single pages.

Other than PdfCompressor, which runs only under Windows (CVision says the product will eventually support Solaris), all the tested products can be run under Windows, Linux, or Unix. We ran the Windows version of cjb2 and the Solaris versions of CPC Tool and Silx. However, results should be the same regardless of the platform on which the conversions are carried out.

Each product offers options that affect the speed of conversion, display speed, display quality, etc. We attempted to test the major compression options of each product. We always tested lossless mode (if available), in addition to two or three lossy modes, sometimes in combination. As a rule, we turned off features that would result in faster compression or faster display at the cost of lower compression. This allowed each product to show off the maximum compression of which it is capable.

What we didn't test
As already mentioned, we did not test every product on the market capable of converting scanned TIFF G4 images to other bitonal formats. We limited our testing to three output formats, and only a subset of the products in those markets. Of the products we did profile, our evaluation permits comparison of 1) compression efficiency in various modes (though see caveats, below), 2) quality of the compressed image (by visual inspection of the output files), and 3) general ease of use. You, the reader, if you choose to examine and compare the test images, may also be able to evaluate the speed of decompression and how readily the viewers can navigate and manipulate the files, as long as you are viewing all the files on the same computer. We did not evaluate compression speed, decompression speed, or other performance issues. Some of the products support OCR (optical character recognition). We did no evaluation of OCR capability and left OCR functions turned off for all conversions.

Caveats
As detailed as these tests are, they cannot be a substitute for individual testing on your own documents. We only tested TIFF files from the Cornell Digital Library, and only a limited range of content types consisting of printed material from the 19th and early 20th centuries. The results might not apply to other kinds of content, such as earlier (and more varied or more broken) typography or other illustration types suitable for bitonal scanning, such as woodcuts. Also, there are many different ways to build a valid TIFF file, and not all variants are recognized by all conversion software.
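As a side note to the conversion step described under "How we tested": readers who prefer a scripted route over Irfanview could batch the TIFF-to-pbm conversion with something like the sketch below. It assumes the Pillow library is installed with support for reading Group 4 TIFFs; the folder names are hypothetical, and this is offered as a convenience sketch rather than the procedure we used.

```python
from pathlib import Path
from PIL import Image

SRC = Path("tiff_g4")   # hypothetical folder of single-page TIFF G4 masters
DST = Path("pbm")
DST.mkdir(exist_ok=True)

for tif in sorted(SRC.glob("*.tif")):
    with Image.open(tif) as img:
        # Force 1-bit mode so Pillow writes a bitonal (P4) PBM file.
        img.convert("1").save(DST / (tif.stem + ".pbm"))
        print(f"wrote {tif.stem}.pbm")
```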
There are limits to the degree that the compression numbers in the table below can
be compared. Lossless modes should be pretty much directly comparable,
but not lossy modes. For example, even if two products use the same compression
scheme, what one calls "loose" and another calls "aggressive"
may or may not be similar in the degree of compression or the degree of
"loss." This is not only because the terms used are subjective
descriptions of highly technical underpinnings, but because different
implementations of the same compression technique may do certain things
more or less well than others. Therefore, it is essential to examine the
images and evaluate their quality, in addition to looking at the compression
numbers. From what we've seen, it is not possible to make any predictions
about the performance of a product based on the file format or compression
scheme supported. Each has strengths and weaknesses and must be evaluated
separately.

Test Results for Individual Pages

Benchmarks: Table 4 begins with some size data about the test files and the effectiveness of G4 compression on each one. The first line of data shows the size of each file in MB as an uncompressed TIFF and as a percentage relative to G4 compression. On these files, G4 achieves reductions running from about 2.6 times (258%) for the halftone all the way up to 31.5 times (3147%) for the variable text example. The next line shows the size of the files in KB after G4 compression, assigning these values a nominal 100% against which to measure the effectiveness of the tested products. The active links in the table show percentages, rounded to the nearest whole number, relative to the size of the TIFF G4 files. Thus, a value of 50 indicates compression to 50% the size of TIFF G4. Values above 100 indicate files that were larger than the original TIFF G4 after conversion. The "GIF 200 dpi" files are those delivered by the Cornell Digital Library when "View as 100%" is chosen as the display format. The "GIF 100 dpi" files are those delivered by the Cornell Digital Library when "View as 50%" is chosen as the display format. For information about the test files linked to from this table, see the sidebar.

Table 4. Test results for individual files

Results Interpretation for Individual Files (Table 4)

Lossless: The first of the lossless benchmarks, with a file format and compression scheme of "PDF/G4," shows what can be expected by moving a G4 datastream into a PDF envelope. It is not so much a conversion as a mild transformation that leaves the G4 image data intact, but changes the file format to one that is accessible to a much larger Web audience. As can be seen, there is little change in file size. The main advantage of such a conversion is improved access. The other lossless tests demonstrate the great variation amongst the different products and the different content types. The two DjVu products excelled in lossless compression of the two text-only pages, achieving nearly twice the compression of G4. PdfCompressor had the best results with the halftone image, with three times the compression of G4. None of the products did very well with the line drawing, with the two DjVu products and PdfCompressor all managing only to reach about three quarters the size of the G4, and the others doing even worse. The performance of Silx in lossless mode deserves attention. According to its manufacturer, although Silx includes a lossless mode, it was not designed for optimal performance in that mode. As with the other products in these tests, Silx is geared to provide its best compression in lossy modes that are intended to be visually or perceptually lossless, rather than bit-for-bit lossless. Note that CPC lacks a true lossless mode, and so is not included in this set of comparisons.

Lossy (modest): We included in the "modest" category of lossy both default lossy modes and those that concentrate on minor lossy procedures, such as cleaning (removal of tiny bit protrusions) and despeckling (removal of very small, extraneous dots). This should be considered a loose grouping, since we can't really know how one product's default mode compares with another's. This is especially true with the Any2DjVu service, where the software performing the conversions resides on a Web server and is hidden from the user.
The software package that runs on the Any2DjVu service appears to be called DjVuDigital and is not, to the best of our knowledge, available as a commercial product. Understanding those limitations, Any2DjVu did the best compression on the variable text page, closely followed by PdfCompressor and CPC. PdfCompressor achieved the highest compression in this class on the uniform text example, with about five times the compression of G4 when its clean and despeckle filters are on. Any2DjVu in its lossy normal mode was almost as good. The open source cjb2 converter, which had topnotch results on the two text pages in lossless mode, was worst in class in clean mode.

CPC was best in this class with the halftone image, narrowly besting PdfCompressor and cjb2. At least that's the story the compression figures tell. In examining the images, however, it seems that the "clean" routine in cjb2 is indiscriminate and unaware of halftones. This is a good example of a lossy algorithm producing output that doesn't qualify as perceptually lossless. The image is extremely contrasty and lacks the subtle tonal impressions of the original halftone. That's why it's essential to examine the images and not just look at the compression numbers. (The images produced by CPC and PdfCompressor look fine, by the way).

The subtle impact of some compression enhancement filters can only be appreciated with close inspection of the images. Comparing PdfCompressor in lossy default as opposed to lossy clean and despeckle modes, the greatest difference in compression is seen for the line drawing (71% of G4 without filters vs. 66% with). Examining the line drawing in both modes at high magnification, it is clear that there isn't much speckling to be removed, but the abundant horizontal lines have many fewer protrusions in the filtered version. The mushroom image is heavily speckled. This is most visible in the margin area. The despeckle filter didn't have a great impact on compression, because most of the image is a halftone where despeckling isn't desirable (see above). However, a close comparison of the filtered and unfiltered mushroom images will show that the despeckle filter was quite effective. As can be seen in Table 5, across twenty pages, the despeckle filter can have a significant impact on compression. Despite some improvement over lossless modes, none of the tested products was able to achieve high levels of compression on the line drawing in lossy mode.

Lossy (aggressive): This class includes modes that go beyond minor cleanup and that loosen up the font matching algorithms. With sufficient slack in character matching, it is possible for an incorrect character to be substituted. Though we didn't note any cases where that occurred, we didn't search exhaustively either. As seen in the table, not all the products offer aggressive lossy modes. Any2DjVu's compression in this mode was only marginally better than its lossy normal mode. Cjb2 improved more noticeably, but still wasn't as good as other products in their modestly lossy modes. Silx improved markedly in this mode. It went from next to last to first for the variable text page, and from next to last to second for the uniform text page, suggesting that its real forte is text compression. Not surprisingly, since aggressive modes in JB2 and JBIG2 are primarily aimed at improving compression of text, this mode did not offer much by way of improved compression of either the halftone or line drawing.
Halftone: Both JBIG2 products (PdfCompressor and Silx) have special halftone modes designed to provide even better compression when compressing halftones. Silx improved considerably over its other modes, but still could not match CPC or PdfCompressor in their modest lossy modes. On the other hand, PdfCompressor, which already had one of the best results on the halftone, really shone with its halftone filter on, producing a file five times smaller than the TIFF G4. If the image is examined closely, especially in comparison with a lossless version, some loss in image quality is evident, but it is surprisingly small given the considerable improvement in compression relative to G4.

Overall: The best combined total for the four images in lossless mode was achieved by PdfCompressor, thanks to its excellent lossless compression of the halftone image. Without use of special filters, the best combined lossy result was produced by CPC at 34% of TIFF G4. However, if PdfCompressor's special lossy halftone mode is included, then its overall lossy number drops to 25%. For the two text pages, the best combined lossless compression was achieved by the two DjVu encoders, while the best lossy compression came from Silx in its aggressive mode.
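Before turning to the multipage results, here is a quick sketch of how the percentage figures reported in Tables 4 and 5 are derived; the file sizes in the example are hypothetical, not values from our tests.

```python
def percent_of_g4(converted_bytes: int, g4_bytes: int) -> int:
    """Size of the converted file as a whole-number percentage of the TIFF G4,
    so 50 means half the size of G4 and values over 100 mean the file grew."""
    return round(converted_bytes / g4_bytes * 100)

# Hypothetical sizes in bytes for one page.
g4 = 120_000
print(percent_of_g4(60_000, g4))    # 50  -> half the size of TIFF G4
print(percent_of_g4(130_000, g4))   # 108 -> larger than the original G4
```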
Test Results for Multipage Files

Benchmarks: Table 5 (below) begins with some size data about the test files and the effectiveness of G4 compression on each one. The first line of data shows the combined size of each set of 20 files in MB as uncompressed TIFFs and as a percentage relative to G4 compression. On these groups of files, G4 achieved reductions running from about 5.4 times (539%) for the halftone-dominated set, all the way up to 30 times (2993%) for the variable text set. The next line shows the size of the files in MB after G4 compression, assigning these values a nominal 100% against which to measure the effectiveness of the tested products. The other data in the table are all percentages, rounded to the nearest whole number, relative to the size of the TIFF G4 files. Thus, a value of 50 indicates compression to 50% the size of TIFF G4. Values above 100 indicate files that were larger than the original TIFF G4 after conversion. Each value represents the 20-page sequence from the work described in the column heading. Click on the title/author link in the column heading to see thumbnails of all 20 pages.

Table 5. Test results for multipage files

Results Interpretation for Multipage Files (Table 5)

Testing multiple pages is important for several reasons. It forces some averaging on the results (few works consist entirely of halftones or line drawings, for example). It also minimizes any exaggeration of the file wrapper relative to the image contents that might occur with a single page. Finally, as mentioned previously, it allows the compression schemes that learn fonts to work on a larger set of characters and take full advantage of that feature.

Lossless: The first of the lossless benchmarks, with a file format and compression scheme of "PDF/G4," shows what can be expected by moving a G4 datastream into a PDF envelope. Confirming what was seen for the individual pages, these groupings show that for most kinds of content, moving TIFF G4s into PDF envelopes results in little change in file size. However, given that multipage TIFFs are even less well-supported than individual ones, those wanting to distribute aggregations of individual TIFF G4s via the Web without any substantive change to the image data may want to consider this route. PdfCompressor supports Web optimization of PDFs, which allows a multipage document to be viewed as soon as the first page is loaded, and also permits selective downloading of individual pages within multipage bundles. There are other products on the market that specialize in converting TIFF G4s to PDFs, including Aquaforest's Tiff Junction (Aquaforest makes other TIFF conversion utilities that may be of interest to RLG DigiNews readers who use Microsoft's IIS as their Web server) and the open source utility c42pdf. As acknowledged by the keepers of c42pdf, "only part of TIFF 6.0 specification is used." We were unable to test it on our images because c42pdf couldn't handle our particular TIFF G4s.

Lossless conversions to other compression schemes held few surprises. With a few exceptions, the multipage sets compressed slightly better than the individual pages. The only major improvement from single page to multipage set occurred with "The Steam Turbine." In that case, the individual page consisted almost entirely of a hard-to-compress line drawing, while the 20-page set was only about 35% line drawings and 50% text. Therefore, the 20-page set compressed quite a bit better, with PdfCompressor turning in the best numbers.
Lossy (modest): In lossy mode, the impact of font learning on compression became quite apparent. The products using JB2 and JBIG2 compression all showed gains in the multipage samples of the text works. Topping the list was Silx, which overall compressed the multipage samples about twice as well as the single page, and had the best overall modestly lossy text results. Any2DjVu also showed significant improvement though not as pronounced as Silx, while PdfCompressor had even more modest improvements, though its results were still quite respectable, at about a quarter the size of the TIFF G4s. CPC's results were mixed, with the variable text pages improving and the uniform text pages compressing less well. The page sets dominated by halftones and line drawings showed some improvement as well, but this can be largely attributed to the fact that the groups contained considerably more text than the single pages from the same works.

Lossy (aggressive): Aggressive mode compression of the multipage sets produced the best numbers we saw. For the two text sets, Silx compressed about eight and a half times better than TIFF G4. Any2DjVu also lowered its numbers, but not as much. However, the two more graphical sets (each only about 50% text) showed little change from modestly to aggressively lossy mode. In contrast to its very effective compression of text-only pages, Silx managed only very modest improvement relative to TIFF G4 for the more graphical works.

Halftone: The two graphical works both contain some halftones, so we tested both with the specialized halftone modes offered by PdfCompressor and Silx. With its halftone filter on, PdfCompressor turned in the best compression results for the multipage sets containing graphical content. We had problems using Silx to create multipage files containing multiple halftones with the halftone filter on. When viewed in Adobe Acrobat Reader (v.5 or 6), some of the halftones would not appear, and Reader sometimes crashed. We also had a problem with multipage, multi-halftone files in PdfCompressor v.2.0 (the documents would appear blank), but it was resolved with the release of v.2.1.

Overall: Again, PdfCompressor using JBIG2 had the best combined compression (over the 80 pages of our four 20-page sets) in lossless mode, with Any2DjVu only slightly behind. In modestly lossy mode with no special filters, PdfCompressor, CPC, and Any2DjVu all achieved combined results in the low-mid 30% range. The best overall lossy results came from PdfCompressor with its halftone filter on, at 24%, more than four times smaller than TIFF G4. For the two text-only sets, the best combined lossless compression was achieved by Any2DjVu, while the best lossy compression came from Silx in its aggressive mode.

Product Summaries (in alphabetical order by compression technology)

CPC: CPC isn't flashy. It wasn't a particular standout on any type of image content, yet even without special filters, it emerged at or near the top when the totals were added. This suggests it might be a good choice for mixed collections. Though it lacks a true lossless mode, we saw no specific quality problems. However, the appropriateness of using a lossy format would depend on whether the images are for preservation or access, and the nature of their content. The main drawbacks to CPC have nothing to do with its compression performance or output quality. CPC is a proprietary format with only a single source for encoders and decoders. It is not Web native, nor is it handled by any widely used plug-in.
Consequently, its best use may be for saving on local disk storage with image delivery in other formats, as is done by JSTOR.

DjVu: Any2DjVu is a great service for experimenting with the DjVu format. It's free, available from any Web browser, and extremely flexible in the range of inputs it handles. Unfortunately, the results from it are merely suggestive of what DjVu is capable of, since the product behind the service isn't available for purchase. Cjb2 is available for free, but has obvious limitations. Its best use would be to produce lossless DjVu versions of single page, text-only scans. Beyond that, its lossy compression is unexceptional and its cleaning routine damages halftone images. We would like to have done at least some testing of LizardTech's Document Express 4.0, but, as mentioned earlier, we were unable to obtain the evaluation cartridge necessary to unlock the encoder. Nevertheless, we would encourage those interested in DjVu to download their own copy and try it out.

JBIG2: PdfCompressor appears to be a very solid, conservatively built product. The use of a graphical front end gives it a leg up in ease of use, but also limits the amount of tinkering the user can do. On the other hand, it also limits the amount of damage the user can do by selecting inappropriate compression settings that might lead to unexpected loss, including mismatched symbols. PdfCompressor produces impressive results, particularly for lossless halftone compression, but its overall results for all content types were very good. One of PdfCompressor's drawbacks is that it is currently available only for the Windows platform.

Silx, though outputting files in the same PDF/JBIG2 class as PdfCompressor, is a very different application. Silx appears to have been optimized for aggressive compression of text documents. In that mode, it produced the highest compression ratios we observed. Silx has numerous user-adjustable compression parameters, though many were not well documented in the evaluation copy we downloaded. Some could be damaging to image quality in the hands of an inexperienced user. In its current state, we see Silx as primarily a product for experienced users where the need is for very high levels of compression of text-only documents. Its compression of halftones and line drawings was modest, even with its halftone filter on. One reason to keep an eye on Silx is the potential availability of an arithmetic encoder. Currently, Silx comes with a Huffman encoder, but an arithmetic encoder that requires separate licensing is available. The arithmetic encoder can improve Silx's compression by 10-50% over its Huffman encoder's results. In some cases, that would move the Silx results to the head of the class. If and when the arithmetic encoder becomes a standard part of Silx, the product would merit another round of testing with a wide variety of content.

Recommendations

The original question that motivated this lengthy, three-part FAQ was whether any technology existed for bitonal files that might supersede TIFF G4 as the standard for bitonal scans of library and archive materials. In part I, we made it clear that TIFF G4 is the accepted standard for preservation master files of bitonal images. That doesn't seem likely to change right away, but it would be imprudent not to carefully consider possible alternatives. Both TIFF and G4 are older imaging standards. Newer technologies are likely to chip away at their existing market.
For color and grayscale images, JPEG2000 is gaining acceptance, and unlike its predecessor, JPEG, it supports both lossy and lossless compression well. The draft specification for v1.5 of Adobe's PDF says it will incorporate a decoder for JPEG2000, a step that, assuming it comes to pass, will undoubtedly speed the acceptance of JPEG2000 for Web use. Even if it is not initially embraced by the library and archive community, other current users of TIFF will undoubtedly migrate, weakening the TIFF market.

A similar scenario may well play out for bitonal images. JBIG2 is being embraced much more quickly and enthusiastically than JBIG1. It is already supported in Adobe Acrobat Reader, and several applications exist to embed JBIG2 datastreams in PDFs. Many current users of TIFF G4 will undoubtedly be happy with the "visually lossless" lossy compression offered by JBIG2, especially given its considerably better compression. Again, the TIFF market will lose support.

Then consider the nature of the TIFF format. Unlike JPEG2000 and JBIG2, it is not a recognized international standard. Adobe owns the rights to the TIFF specification and has not announced any plans for updating it. TIFF is a large specification, and though adopted as a de facto standard by many, much TIFF software implements only a portion of the specification. We encountered this problem when we found that c42pdf would not read our TIFF files, even though they are fully compliant with the specification.

Then there is the TIFF header. Long seen as an advanced feature (compared to file formats offering no structured metadata capability), it is now starting to seem rather quaint. Only a few TIFF header tags are truly standard, while large numbers of custom ones that have been registered over the years are not widely supported. Newer metadata standards for image formats such as JPEG2000 and PDF/A are based on XML and will offer far greater flexibility and Web integration.

The steps that are often taken to deal with TIFF's poor usability as a distribution format also merit consideration. Typically the format is changed to GIF, and the image is heavily scaled and then gray enhanced for legibility. The resulting image is often larger than the master file from which it was derived, despite being of poorer quality. The modern formats tested here obviate the need for such tradeoffs. They offer file viewers that do automatic scaling and gray enhancement at lower resolution along with more efficient bitonal compression. This allows a single image to provide legibility at low resolution without compromising detail at higher resolution, at little cost in disk storage or transmission delay. Servers are relieved of the burden of on-the-fly conversion, since the client machine handles all the processing details.

Nevertheless, from a preservation perspective, there are a number of perfectly valid reasons to be very cautious about the new bitonal formats. CPC is a proprietary, closed specification. DjVu is also proprietary; though it has been open enough to produce an open source encoder, that encoder's performance lags well behind the commercial one. As we've discussed in part II of this FAQ, the future of DjVu is uncertain. JBIG2 is an international standard that is gaining wide support. There will probably be one or more open source encoders available eventually. However, it must be acknowledged that JBIG2 is a big step up in sophistication from G4.
Even for files maintained in lossless mode, successful migration of JBIG2 files to another format will require people with an in-depth understanding of its inner workings.

Another concern surrounds the issue of lossy compression. Every one of these new compression schemes claims that its lossy mode is visually or perceptually lossless. That in itself is not necessarily the main issue. All scanning, and especially bitonal scanning, is inherently a lossy process. Scanning inevitably introduces noise and artifacts. Bitonal scanning relies on thresholding, whereby each pixel must ultimately be rendered in either black or white. Even though most bitonal scanning is done by first scanning in gray scale, allowing for more intelligent thresholding, the process still has an arbitrary component. Most printing, and especially that in nineteenth-century monographs, is not truly black and white, and capturing it bitonally involves some degree of loss.

More research is needed to understand the potential impact of lossy bitonal compression on scans of library and archive holdings. For example, at what point, for what quality of typography, does "loose" JB2 or JBIG2 character matching run the risk of substituting an incorrect character? How big a risk is generational loss from lossy bitonal compression? For what kinds of materials might these risks be minimal or acceptable? (A good discussion of the pros and cons of lossless vs. lossy compression appeared in RLG DigiNews in February 1999.)

These more modern compression schemes may be well suited for certain kinds of digital library holdings, particularly those with less artifactual value. Use of cleaning and despeckling algorithms may be considered anathema by some, but in some cases may actually produce images that are closer to the original. The ability to send a high-resolution image rather than a fuzzy, scaled-down, gray-enhanced image could enhance scholarship in some disciplines. Bundling of single-page TIFFs improves utility and lessens user frustration.

Thus, our primary recommendation is not to dismiss these alternatives out of hand, but to carefully consider the pros and cons, examine our data, and conduct your own tests on your own images. Given that TIFF G4 will likely not be supported forever, something has to replace it. Of the currently available alternatives, JBIG2 in PDF looks to have the best chance of superseding TIFF G4. However, regardless of whether any of these becomes the new standard for bitonal imaging, it makes sense for libraries and archives to be aware of the developments, understand the implications, and proactively respond to them, rather than waiting for market forces to lead the way.

—Richard
Entlich

Calendar of Events

Sixth International Digitisation Summer School for Cultural Heritage Professionals
Topics include digitization of image and text material, digitizing audiovisual material, preservation, metadata, and digital asset management systems.

Digital Preservation Management: Short-Term Solutions to Long-Term Problems
Three places are still available for Cornell University Library's new training program on digital preservation management. This limited-enrollment workshop, partially funded by the National Endowment for the Humanities, has a registration fee of $750 per participant. Registration is now open for the August workshop. A second workshop is scheduled for October 13-17 (registration will open this summer). There will be three workshops in 2004.

Advanced XML: Data Transformation with XSLT
Taught by seasoned XML/XSLT developers from the Brown University and University of Virginia libraries, this three-day workshop will explore XSLT with a focus on the role of XSLT in digital library projects. The workshop will be a mix of lecture and hands-on demonstration and experimentation culminating in the creation of an XSLT-based library application.

ECDL 2003, 7th European Digital Libraries Conference
The conference will include sessions on usability evaluation of digital libraries, how to build a geospatial digital library, Web technologies, subject gateways, topical crawling, and digital preservation.

Eighth International Summer School on the Digital Library: Libraries, Electronic Resources, and Electronic Publishing
The course aims to support university and research libraries in a transitional phase and to identify new roles and opportunities for them. Topics include e-publishing, intellectual property, and open archives.

DRH 2003: Digital Resources for the Humanities
A forum for all those involved in, and affected by, the digitization of cultural heritage materials.

ERPANET Seminar: Metadata in Digital Preservation, Getting What You Want, Knowing What You Have, and Keeping What You Need
The seminar will discuss various perspectives on metadata to facilitate preservation, issues of interoperability, the role of standards and schemas, cost, and other state-of-the-art developments.

IS&T Archiving Conference Meeting Announcement and Call for Papers
The conference will focus on techniques for preserving, cataloging, indexing, and retrieving images and documents in both digital and human-readable formats. Goals are to benchmark systems that might be in place to preserve digital and print information for the future, as well as to identify areas where further research is necessary.
Announcements

Directory of Open Access Journals
Library of Congress records available for harvesting
Open eBook Forum: Consumer Survey on Electronic Books released
CLIR and Library of Congress release National Digital Preservation
Sun Education and Research posts new resources

Publishing Information

RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Erica Olsen; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello; Assistant: Valerie Jacoski.

All links in this issue were confirmed accurate as of June 13, 2003.