RLG DigiNews
  April 15, 2003, Volume 7, Number 2
ISSN 1093-5371


Table of Contents

Feature Article 1
Digitizing, Archiving, and Preserving Japanese Cultural Heritage, by Hisayoshi Harada

Feature Article 2
The Paradigma Project, by Carol van Nuys

Feature Article 3
CAMiLEON: Emulation and BBC Domesday, by Phil Mellor

Highlighted Web Site
MetaMap

FAQ
Squeezing More Life Out of Bitonal Files: A Study of Black and White. Part II, by Richard Entlich

Calendar of Events

Announcements

RLG News




Digitizing, Archiving, and Preserving Japanese Cultural Heritage

Hisayoshi Harada
National Diet Library

Activities and Initiatives at the National Diet Library

One of the primary duties of the National Diet Library (NDL) of Japan is to collect and preserve Japanese publications as the nation's cultural and intellectual assets. For this purpose the NDL depends greatly on the legal deposit system. It also collects, through purchase and donation, books published before the legal deposit system came into existence, as well as older materials and foreign reference and academic publications. In addition to these traditional activities, the NDL must handle new materials that are increasingly created and disseminated in digital form. "The National Diet Library Electronic Library Concept," promulgated in fiscal year 1998, defines the digital library as "the provision by a library of primary information (actual materials) and secondary information (information about the materials) electronically, via communications networks, together with the infrastructure for this purpose." Since this concept was established, the NDL has been preparing its own digital library. As primary information, the library already provides the Full-Text Database System for the Minutes of the Diet, in cooperation with the House of Representatives and the House of Councillors, as well as the Rare Books Image Database and the online exhibitions called NDL Gallery, created by digitizing our collections. As secondary information, bibliographic data for Japanese and Western books has been provided via the NDL-OPAC. In the autumn of 2002 we offered several new services to the public.

Digital Library from the Meiji Era

Some 140 years ago, East met West. As a result of the encounter, quite a number of cultural assets were produced that had a great impact on building modern Japanese society. NDL has approximately 102,000 titles and 169,000 volumes of books published in the Meiji era (1868-1912), the period of the westernization of Japan. Since these books are fragile, we converted them to microfiche for public use starting in 1993. Access to those materials was limited to people who were able to come to the library to use the microfiche.

In recent years NDL has been harnessing information and communication technologies to offer its digital library as a new service. One of the pillars of this service is to digitize the NDL collections and provide public access to them. As of October 2002, we have supplied digital images of our Meiji collections whose copyrights have expired under the title Digital Library from the Meiji Era.

The contents of the collection range from philosophy, history, and social sciences to art and literature. So far we have reached a greater audience than expected, people who had been very interested in seeing the materials, but who had never been able to come in person. We have also enjoyed good responses from people abroad, who say that this access will contribute to Japanese studies on a large scale. There have been around 760,000 hits in the four months since the system was implemented.

The texts and illustrations of the books are put into a digital image format, in both GIF and our own high-compression format (LINDRA), for convenient use. Using a plug-in customized for this system as an NDL viewer, users can freely navigate through the images, change the size from 25% to 300%, and print on paper at exactly the right size. In addition, the system offers efficient, detailed searches with features like searchable tables of contents and bibliographic records, as well as a function to bookmark texts.

As of now, around 20,000 titles and 30,000 volumes are available for access via the Internet. The files come to about 350 GB in size. We are planning to add another 10,000 titles and 15,000 volumes in the coming months. By the end of fiscal year 2004 most of our Meiji collections will be available to the public through the Internet.

One of the most difficult challenges in building this database system is clearing copyright. Although the system can manage copyrights page by page, we have been able to identify only about one-third of the copyright holders for the 169,000 volumes. We have therefore begun asking the public, through our Web site, to help us reach copyright holders we have not yet found. If we cannot find copyright holders and obtain their permission, we will ultimately need to apply to the Director-General of the Agency for Cultural Affairs to clear the copyrights of those books. We will also need some fine-tuning based on user feedback to keep the system up-to-date and easy to use.

WARP/Dnavi

As Japan's only depository library, NDL has been collecting publications in Japan, including maps, phonographic discs, and microfilms, with the help of the legal deposit system mandated by the National Diet Library Law. CD-ROMs and other "packaged" electronic publications became subject to the legal deposit system in the autumn of 2000. As for digital information on telecommunications networks, in March 2002 the Librarian of the National Diet Library asked the Legal Deposit System Council, an advisory panel of outside experts, to consider whether "networked digital publications" could be put into the legal deposit system, and, if not, what kind of legal framework would make it possible for the NDL to collect online information.

Until the Legal Deposit System Council comes to a conclusion, the NDL will implement experimental projects for acquiring and storing online information by contract, as well as for the navigation of databases on the Internet. These projects have been planned as a part of the NDL’s Digital Library Project.

One of the projects is WARP (the Web Archiving Project). Because much of the information on the Web is updated or deleted daily, the NDL is collecting and preserving information from the Web sites of organizations that have agreed to participate in the project. WARP also allows us to collect and preserve digital editions of periodicals and born-digital periodicals on the Internet. The results of this project will be submitted to the Legal Deposit System Council for reference as it considers a possible legal framework that would allow the collection of domestic networked information. We have already collected over 460 titles of online periodicals and a dozen Web sites. Although we are now taking a selective approach, we are looking for ways to collect in bulk and are investigating a couple of overseas projects.
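
The bookkeeping behind a selective archive of this kind, dated snapshots per site, with unchanged content stored only once, can be sketched as follows. This is an illustration only; the class and its data structures are assumptions, not WARP's actual implementation.

```python
# Sketch of versioned, deduplicating capture records for a selective web
# archive (hypothetical structure; not NDL's actual system).
import hashlib
from datetime import date

class SelectiveArchive:
    def __init__(self):
        self.snapshots = {}  # url -> list of (capture_date, content digest)
        self.store = {}      # digest -> raw bytes, stored once

    def capture(self, url, content, capture_date):
        """Record one harvest of `url`; skip storage if content is unchanged."""
        digest = hashlib.sha256(content).hexdigest()
        history = self.snapshots.setdefault(url, [])
        if history and history[-1][1] == digest:
            return False  # identical to the previous round; nothing stored
        self.store[digest] = content
        history.append((capture_date, digest))
        return True

archive = SelectiveArchive()
archive.capture("http://example.no/journal", b"issue 1", date(2002, 11, 1))
archive.capture("http://example.no/journal", b"issue 1", date(2002, 12, 1))  # unchanged
archive.capture("http://example.no/journal", b"issue 2", date(2003, 1, 1))
print(len(archive.snapshots["http://example.no/journal"]))  # 2 distinct versions
```

Deduplicating by content hash matters when, as here, many periodical pages change only occasionally between harvesting rounds.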

The second project is Dnavi, the NDL Database Navigation Service. Until now, NDL has offered its wealth of library resources through a number of research and reference services. It is now crucial that we make use of the digital information resources on the Internet. While we are exploring the best systems and technologies for Web archiving with WARP, the databases still cannot be archived because they are in the so-called Deep Web.

The wealth of databases on the Internet provides indispensable information resources for academic research and other forms of study and surveys. For these databases Dnavi creates such records as title, creator, category, and content. Users can access the NDL Web site and be linked to them. Dnavi, which just started in November 2002, is a portal that has recorded a large amount of information from Web sites in Japan and that helps users to navigate a variety of databases. It already contains more than 5,000 databases.
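
The kind of record Dnavi keeps for each database, and the category-based navigation it enables, might look like this in outline. The field names follow the article (title, creator, category, content); the structure itself is hypothetical, not NDL's actual schema.

```python
# Hypothetical sketch of a Dnavi-style portal record and category lookup.
from dataclasses import dataclass

@dataclass
class DatabaseRecord:
    title: str
    creator: str
    category: str
    content: str
    url: str

def navigate(records, category):
    """Return records in a given category, as a portal listing would."""
    return [r for r in records if r.category == category]

records = [
    DatabaseRecord("Statistics Portal", "Ministry X", "Government",
                   "official statistics", "http://example.go.jp/stats"),
    DatabaseRecord("Art Index", "Museum Y", "Culture",
                   "artwork database", "http://example.or.jp/art"),
]
print([r.title for r in navigate(records, "Culture")])  # ['Art Index']
```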

Long-Term Preservation

Although the importance of preserving digital information has been recognized in intellectual communities worldwide, and many recent projects and studies have been aimed at preservation, we must admit that few in Japan recognize that digital preservation is crucial for future generations. Thus few projects have been implemented specifically for born-digital materials.

As already mentioned, we have been focusing on digitizing the printed materials in our collections to provide access to them as one of our services to the public via the Internet, not for long-term preservation. This point of view seems to be the same for other organizations, institutions, and businesses in Japan. We know that digitizing rare books or images is an important part of the preservation of our heritage but also recognize that it is not enough for this day and age.

Given this situation, NDL has begun research and study for long-term preservation of digital information to make the public aware of its importance. We are going to establish a group to discuss issues in this field and improve our skills, technologies, and collaborations in conjunction with the communities concerned.

Fiscal year 2002 is the first year of a three-year term for research and study on the preservation of digital information in the NDL. The main purpose of this project is to set up comprehensive guidelines to fix our long-term strategy.

The guidelines should include the following policies:

  • What kind of digital information NDL should preserve
  • What kind of processes and technologies should be applied to different kinds of digital information
  • What kind of media and environment should be chosen for preservation
  • A set of rules for collaborating with the creators of digital information

By setting up our own guidelines, we will be able to handle increasing amounts of digital information both in physical media and networked information under an established policy. In addition, announcing our guidelines will help to increase awareness of the importance of preserving digital information in our society.

We plan to apply the following timeline:

Fiscal year 2002

  • Compile a report based on research and study of the projects, guidelines, policies, and other related achievements of the countries active in this field, including hard facts about preservation activities in Japan
  • Publicize the results of research and studies

Fiscal year 2003

  • Research what kinds of digital materials have been collected in NDL
  • Wrap up a draft version of the guidelines based on research and studies conducted in fiscal year 2002
  • Set up a test environment for experiments on preserving the digital materials in physical media archived in NDL

Fiscal year 2004

  • Establish the guidelines
  • Develop an action plan for the following years according to the guidelines
  • Identify ways of organizing a consortium in Japan
  • Conduct experiments

All the projects we have mentioned have just started. As the saying goes, "This is just the prelude."


The Paradigma Project

Carol van Nuys
National Library of Norway

Growth of Web Archiving in Europe

Digital documents of all kinds are disappearing daily, and with them the opportunity for new generations of readers to study and enjoy today's documents in the future. The preservation of our digital cultural heritage is an increasingly important and challenging issue. In response to the situation, about fifteen European countries have started some type of Web archiving activity.[1]

Different countries have chosen different collection strategies: Denmark and Australia have taken the selective approach; Sweden, Iceland, and Finland have harvested their entire national Web spaces; and the National Library of the Netherlands has made an agreement with the Dutch Publishers' Association (NUV)[2] for the deposit of electronic publications offline and online. Only five of the countries that are involved in Web archiving can base their work on legal deposit legislation, and Norway is one of them.[3]

Background on Legal Deposit in Norway

Legal deposit has a long tradition in Norway. The first Legal Deposit Act for Denmark/Norway was passed in 1697, and censorship undoubtedly played an important role in its establishment. The law remained in force until the Union with Denmark was dissolved in 1814. A royal decree on legal deposit was passed in 1815, followed by a new Legal Deposit Act in 1882. This again was succeeded by the Legal Deposit Act of 9 June 1939. The common denominator of all these acts was that they included printed material only. However, as new media developed, the need to pass a new and extended law became more and more evident. The present Legal Deposit Act was thus passed on 9 June 1989, and, of course, the main intent of this law was no longer censorship, but cultural preservation.

The National Library of Norway's current Web archiving work is strongly influenced by the Norwegian Legal Deposit Act. The purpose of this act is to

ensure that documents containing generally available information are deposited in national collections, so that these records of Norwegian cultural and social life may be preserved and made available as source material for purposes of research and documentation. (§ 1)

Considered extremely modern when it was passed in 1989, the act covers all generally available Norwegian documents stored in any medium, including paper, microforms, photographs, combined documents, sound recordings, films, video, electronic publications, and broadcast programs. It also covers documents published abroad for Norwegian publishers and those specially adapted for a Norwegian public.

The act does not cover documents found in closed networks, computer software, documents accessible only through a company or organization's intranet, net communications (i.e., e-mail or closed discussion and chat groups of a private nature), archival material covered by other legislation, or official governmental publications.

Chapter 9 of the act's regulations (§ 30, second subsection) states:

Electronic documents that are available by means of online transmission on a telecommunications, television, or data communications network or the like shall be deposited in two copies at the specific request of the depository in each individual case.

We can easily see that the act and its regulations were written before the World Wide Web arrived. Fulfilling the request for two copies of each generally available Norwegian Web document is simply impossible. Today, the National Library is investigating the most effective ways to fulfill the intent of the act as applied to digital documents and is considering a combination of different collection approaches.

Overview of the Paradigma Project

The Paradigma Project[4] began in August 2001. Its goals are to develop and establish routines for the selection, collection, description, identification, and storage of all types of digital documents and to give users access to these publications in compliance with the Legal Deposit Act. The project is scheduled to end on December 31, 2004.

Paradigma's activities fall within the bibliographic, technical, and legal areas, as reflected in its eight work packages:

1. Selection Criteria for Digital Documents
  • Discover the nature of the "digital document universe"
  • Develop a typology and selection criteria
  • Suggest harvesting frequencies and procedures
2. Legal Framework
  • Survey the current legal framework for collecting, archiving, and providing access
  • Negotiate agreements with publishers for deposit of dynamic documents (e.g., databases)
  • Look at existing legislation in light of today's technology
  • Suggest how to give access to deposited digital documents in the near future
3. Harvesting Tools
  • Choose, refine, and test software for collecting Web pages
4. Access Tools
  • Use the Nordic Web Archive's (NWA) Tool Kit to assess the National Library's Web Archive
  • Adapt the NWA Access Module to fit the project's needs
  • Develop and test several user interfaces
5. Unique Identification and Description of Digital Documents
  • Conduct a survey of standards for unique identifiers and descriptive metadata
  • Recommend standards for the library
6. Promotion of Recommended Standards for Identification and Description
  • Develop a Web service based on the standards recommended in work package 5, to be promoted by the library to help publishers and other user groups provide metadata and unique identifiers prior to deposit
7. Test Beds
  • Complement the Legal Deposit Division's ongoing, selective harvesting activity
  • Test and adapt software, methods, etc., until the entire Web archiving process functions in a satisfactory manner
8. Organizational and Economic Consequences
  • Report the organizational and financial consequences of the project to the National Librarian

The project builds on the National Library's earlier work in several of these areas; activities from several of the work packages are under way or completed. The following sections highlight the work connected to the legal deposit of Web materials.

Aspects of the Collection Strategy

Selection Criteria

Based on recommendations from the Paradigma Project, and with the Ministry of Culture and Church Affairs' approval, the National Library has decided to start the general harvesting of all generally available digital documents from the Norwegian Web space (".no"). In time, documents found on domains such as .com, .org, and .net will also be harvested.

There are several reasons for taking this general harvesting approach. First, we cannot predict which documents will be of value in future research and documentation. Second, digital storage is becoming cheaper every day. Third, unfiltered harvesting saves us from resource-consuming manual selection at harvesting time. Finally, a Web Archive user can find documents via free-text search functions, thus being able to review all documents, including those that do not qualify for manual cataloging. Selection criteria for any use, such as further bibliographic description, can be challenged and changed at any time. This would, of course, be impossible if the material were excluded at harvesting time.
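
The free-text search that makes unfiltered harvesting usable can be illustrated with a minimal inverted index: every harvested document becomes findable even if it never qualifies for manual cataloging. This is a sketch only; the National Library's actual search engine is not described in the article.

```python
# Minimal inverted index over harvested documents (illustrative sketch).
from collections import defaultdict

def build_index(docs):
    """docs: {url: text}. Returns a mapping word -> set of urls containing it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

docs = {
    "http://example.no/a": "election results for Oslo",
    "http://example.no/b": "fishing quotas and election debate",
}
index = build_index(docs)
print(sorted(index["election"]))  # both documents match, cataloged or not
```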

Total harvesting of the Norwegian Web space does not exclude the library's use of other collection strategies as well. The Legal Deposit Division carries out event-based collecting. It has collected, for example, the Web sites belonging to political parties prior to, during, and after elections. This type of capture activity will continue to supplement future routine harvesting rounds. A selection of Web documents is currently harvested semi-manually using the HTTrack software, and these are cataloged for the National Library's catalog (BIBSYS). This activity will continue until the Paradigma Project's general harvesting activity and related procedures are fully established.

In many cases other methods must be used to collect digital documents. The Legal Deposit Division has already contacted Norwegian publishers about the deposit of e-books, and the library's Sound and Image Archive is working with the Norwegian Broadcasting Corporation on solutions for the deposit of "born-digital" radio and television programs. However, a large amount of administrative, legal, and technical work remains, and the deposit of dynamic publications (e.g., Web newspapers and electronic materials of all types that are stored in databases) is especially challenging. The Paradigma Project will address these problems as the project continues.

Bibliographic Description

Today the National Library of Norway registers different types of material in various ways. Ephemeral material is given an abbreviated cataloging treatment, while books and serials are given a full bibliographic description, both in the library's catalog and in the National Bibliography.

The Paradigma Project estimates that less than 1% of the material collected from the Norwegian Web space may be subject to individual manual treatment or registration at some level. After surveying selection criteria used in other countries and in the National Library's own divisions, the project suggested selection criteria and harvesting frequencies for new types of electronic publications that are based on content (genre). We also suggested a typology based on Shepherd and Watters's[5] work and have used three main types of digital documents: traditional, i.e., similar to printed documents (monographs, periodicals, reference works, etc.); transient, i.e., based on traditional forms but extended with new functionality (net newspapers, Internet novels, etc.); and new, i.e., previously nonexistent, such as blogs and Web portals.

Automatic Processing and Analysis

We are currently investigating the use of automatic analysis and extraction of information (metadata) from Web documents. Such analysis can be used to generate "weighted" hit lists, thus helping librarians to select documents for manual registration. The technology is not yet good enough to determine a document's type automatically, but it can help to reduce the number of documents that require human intervention. For documents that are not evaluated manually, properties of a document type that are automatically captured can be made available for structured searching in the Web Archive. The value of these properties will be limited but, in combination with other search criteria, may indeed prove useful.
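
A minimal sketch of such automatic extraction and weighting follows, pulling a document's title and meta elements and scoring it for review. The scoring heuristic here is an assumption for illustration, not the project's actual algorithm.

```python
# Sketch: extract title/meta metadata from HTML and compute a crude
# review weight (heuristic is hypothetical, for illustration only).
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and "name" in a and "content" in a:
            self.meta[a["name"]] = a["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def weight(html):
    """Crude score: documents with a title and author metadata rank higher."""
    p = MetadataExtractor()
    p.feed(html)
    score = 0
    if p.title.strip():
        score += 2
    if "author" in p.meta:
        score += 1
    return score, p.title.strip(), p.meta

html = ('<html><head><title>Annual Report</title>'
        '<meta name="author" content="NB"></head></html>')
print(weight(html)[0])  # 3: has both a title and an author
```

Documents scoring high would surface at the top of the "weighted" hit lists for librarians; the extracted properties remain searchable for everything else.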

Metadata and Unique Identification

The Paradigma Project is also surveying metadata standards for the description of digital documents and for the exchange of bibliographic data. These recommendations may form the basis for a service to publishers and other interested parties, allowing them to generate metadata descriptions for their digital documents before legal-deposit delivery.

The library must be able to handle a huge number of small data objects automatically, and it will need to identify each component (text file, picture file, sound file) in a single Web document. We are currently surveying standards for identification, and we will suggest how to improve the library's existing identifier allocation service. One enhancement would be the ability to handle chronological versions of a Web document.

Scope of the Norwegian Internet Domain

Size

The exact size of the Norwegian Internet domain is unknown at this time. The first harvesting round, in December 2002, resulted in some 3.1 million URLs, of which approximately 53% were images (.jpg, .gif, .png). The NEDLIB-harvester[6] started with about 1,000 initial URLs, and harvesting was limited to the HTTP protocol, to the Norwegian national domain (".no"), and to URLs without a search query attached.
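
Given the harvested URL list, the image share reported above can be computed by file extension. This is a sketch under the assumption that the analysis simply classifies URL paths; the project's actual pipeline is not described.

```python
# Sketch: compute the share of image URLs (.jpg, .gif, .png) in a harvest.
from urllib.parse import urlparse

IMAGE_EXTS = (".jpg", ".gif", ".png")

def image_share(urls):
    paths = [urlparse(u).path.lower() for u in urls]
    images = sum(p.endswith(IMAGE_EXTS) for p in paths)
    return images / len(urls)

urls = [
    "http://example.no/index.html",
    "http://example.no/logo.gif",
    "http://example.no/photo.jpg",
    "http://example.no/map.png",
]
print(image_share(urls))  # 0.75
```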

Assuming a distribution similar to that found in Sweden and Finland, we expect to find 45% to 55% of the Norwegian Internet sites in domains outside .no. We expect future rounds to span roughly ten million URLs, especially when we include Norwegian sites in domains like .org, .net, and .com, as well as URLs with search queries.

Volume

The first harvesting round retrieved files requiring 140 GB of space in the National Library's Long-Term Preservation Repository. File sizes will probably grow in the future. The space requirement estimates for the Norwegian Web space are based on an average of 100 KB per URL. We expect the first complete harvesting round to cover approximately 10 million URLs, thus filling around 1 TB.

One terabyte represents roughly 1% of the total capacity of the Long-Term Preservation Repository. We expect that less than 10% of the storage capacity will be used by the Web Archive, even if both the number of objects and their average size grow drastically in the future.
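
The storage estimate works out as follows, using the figures given above (100 KB per URL, about 10 million URLs per complete round):

```python
# Back-of-the-envelope storage arithmetic for one complete harvesting round,
# using the article's own figures and decimal units.
avg_size_kb = 100            # average size per harvested URL, in KB
urls_per_round = 10_000_000  # expected URLs in a complete round
total_kb = avg_size_kb * urls_per_round
total_tb = total_kb / 1_000_000_000  # KB -> TB
print(total_tb)  # 1.0, i.e., about 1 TB per round
```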

Issues of Access Strategy

Providing access for users to the deposited collection of digital documents is a complex matter that is regulated by legislation and relies on technical mechanisms.

Legal Deposit Act

Section 1 of the Legal Deposit Act restricts access to source material for purposes of "research and documentation." These terms are not defined in the act itself, so the underlying intent of the act must be studied in a bill from 1988-89.[7] Loosely translated, that document says:

The purpose [of the act] is to ensure that documents containing available information are deposited, registered, and preserved, so that they may be used for purposes of study, research, documentation, and investigation today and in the future.

Using this document as a guide, the National Library interprets research to mean investigation or inquiry at a certain scholarly or scientific level and documentation to be investigation or study without the same status as research in the traditional meaning of the word, but based on a systematic use of source material.

The general public has never been defined as a user of the traditional legal-deposit materials, but because public libraries generally do not maintain collections of previously published digital documents in the same way they maintain collections of traditional material, the Paradigma Project has recommended that a larger user group be given access to the deposited digital collection in the future.

Copyright Act

The National Library strives to be the nation's foremost source of knowledge about Norway, Norwegians, and Norwegian conditions at home and abroad. We should consider whether Norwegian legislation permits a researcher on the other side of the globe to gain access to the Web Archive. Such access is technically possible, but it must be considered from a legal perspective. The Copyright Act[8] regulates a copyright owner's intellectual and economic rights. We are painfully aware of the "conflict" between the Norwegian Legal Deposit Act (saying that digital source materials must be made available for research and documentation) and the Copyright Act (strictly limiting user access to digital documents). This concern is especially relevant to digital documents available via networks.

The conflict is understandable, considering that many digital documents are associated with commercial interests. A single electronic item on the loose can quickly be distributed all over the globe, possibly resulting in economic loss for the copyright owner. Digital documents can easily be misused (copied, manipulated, etc.). For that reason the National Library can give access to the Web Archive only to users defined in the Legal Deposit Act and then only from a PC designated for such use on the library's premises.

Project Manager Carol van Nuys in the Long Term Preservation Repository (photo: Kjetil Iversen).

Norway is bound by several international copyright conventions. The recently passed Common Market Directive 2001/29/EF (22 May 2001) on the harmonization of copyright law is scheduled to be implemented legally in Norway this year. We are watching this process closely, as it can influence the way in which the National Library allows access to its Web Archive.

Personal Data Act

The purpose of the Personal Data Act[9] is to protect persons from violations of their right to privacy through the processing of personal data. The National Library must process the digital documents that have been collected from the Norwegian Web space. Because many of these documents may contain personal data, the library received permission from the Data Inspectorate before initiating the first harvesting round. We are now authorized to collect and store Web material in 2003, but before giving access to the collection, we must secure permanent permission to do so.

Nordic Web Archive (Access Module)

For user access to the Web Archive, the Paradigma Project selected the Access Module developed by the Nordic Web Archive (NWA) Project.[10] The five Nordic national libraries have now embarked on the next project, NWA-II, in which this software will be further developed. We plan to adapt the NWA Access Module to accommodate several special user functions, including tailored interfaces for catalogers, program operators, and library patrons. This user interface will show a timeline enabling users to select different versions of the same document as captured on specific dates.
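
The timeline lookup such an interface needs, returning the version of a document that was current on a given date, can be sketched as follows. The data model is an assumption for illustration, not the NWA Access Module's actual design.

```python
# Sketch: given a URL's capture history, find the snapshot that was
# current on a requested date (hypothetical data model).
from bisect import bisect_right
from datetime import date

def version_on(captures, when):
    """captures: chronologically sorted list of (capture_date, snapshot_id).
    Returns the snapshot captured on or before `when`, or None."""
    dates = [d for d, _ in captures]
    i = bisect_right(dates, when)
    return captures[i - 1][1] if i else None

history = [(date(2002, 12, 1), "v1"), (date(2003, 2, 1), "v2")]
print(version_on(history, date(2003, 1, 15)))  # v1
```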

The NWA Access Module may play an important part in the collaboration between the Internet Archive and several national libraries in their combined efforts to develop software in the projected National Library Web Archive Consortium.

Our Digital Cultural Heritage

The Paradigma Project's work will be finished in two years. By then, we hope, the National Library of Norway will have the technology, methods, and organization necessary to enforce the Legal Deposit Act, including for the many documents that are born digital.

Footnotes
[1] Halgrímsson, Torsteinn (February 28, 2003). Web Archiving in Europe [discussion]. NWA [online].
[2] National Library of the Netherlands (PDF).
[3] The other four countries are Denmark, Iceland, Lithuania, and Sweden. See Halgrímsson, Torsteinn. Web Archiving in Europe.
[4] The Paradigma Project.
[5] Shepherd, M., and C. R. Watters (1998). "The Evolution of Cybergenres." Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences, Big Island of Hawaii, 6-9 January 1998, v. 2, 97-109.
[6] According to Halgrímsson's survey, Norway is one of ten European countries using the NEDLIB-harvester software.
[7] Ot.prp.nr. 52 (1988-89).
[8] Act no. 2 of May 12, 1961, relating to Copyright in Literary, Scientific and Artistic Works, etc., with subsequent amendments up to June 30, 1995 (the Copyright Act).
[9] Act of 14 April 2000, no. 31, relating to the Processing of Personal Data.
[10] Nordic Web Archive (NWA).




CAMiLEON: Emulation and BBC Domesday

Phil Mellor
University of Leeds

In December 2002 a group convened at the University of Leeds to demonstrate and discuss CAMiLEON's work in preserving the BBC Domesday project[1] (a social record of UK life in the 1980s[2]), which is now in danger of being lost through technological obsolescence. BBC Domesday was created to celebrate the 900th anniversary of the Domesday book of 1086, the original record of William the Conqueror's survey of England.[3]

The meeting brought together some of the original BBC Domesday videodisc developers, including Peter Armstrong from the BBC, Ecodisc's Roger Moore[4] (who brought an original glass master of one of the Domesday discs), two of the editors, and some individuals who had contributed to the content as schoolchildren; experts on digital preservation; and others interested in developing modern interfaces to the original Domesday data, as well as some nostalgic computing enthusiasts.

The Presentation

There were three speakers. Armstrong, as the chairman of BBC Domesday, had been heavily involved in its production. He presented several interesting anecdotes and background material about the highs and lows of the making of Domesday. Then Dr. Tom Graham, from the Consortium of University Research Libraries (CURL),[5] explained that digital preservation is our duty to future generations for both historical and technological reasons. Digital preservation needs to be understood by people at all levels, from data creators to end users, whether national, institutional, or individual. The final speaker, David Holdsworth, from CAMiLEON, has worked in Information Technology since the mid-sixties and is now an expert in digital preservation and storage. He described the choice of Domesday as a test case to demonstrate the many problems in digital preservation and how CAMiLEON has developed strategies to solve these.

There followed demonstrations of the BBC Domesday system running on the original hardware and also of CAMiLEON's modern emulation that provides an accurate reproduction of almost all the original functionality.

A Brief History

Armstrong, who had established the BBC's Interactive Unit to make educational multimedia, wondered if it would be possible to celebrate the 900th anniversary of the original Domesday book by producing a modern-day equivalent. It was an ambitious idea, but it captured the imagination. The plan was to give the first copy to Prince William, the "poetic successor to William the Conqueror."

Funding for the project—an estimated £2 million—was relatively easy to obtain. Multimedia was an exciting, upcoming technology, and people involved in education and national archiving, as well as computing, were keen to push it forward. The BBC put together a team of around sixty staff to develop the project and recruited pupils from over half the schools in the country to help produce the content. In all, around a million children were involved from 14,000 schools.

Resource Gathering

The map of the UK was divided into blocks, each measuring 4 x 3 km (not coincidentally, the aspect ratio of a television screen), and each block was adopted by a school. As the UK consists of over 25,000 blocks, it was practical to cover only about half of these. It was difficult to find schools in the more remote areas of Scotland and Wales, but the majority of England was well accounted for. Pupils investigated the land use; counted the number of doctors, post offices, and so on; and wrote articles about the people and buildings in their blocks. Each area was allotted twenty screens of BBC text and three photographs.


Back in the classroom, data entry was done on the school computers, drawing on the BBC Micro's large installed base in schools. The articles were sent to the BBC on floppy discs. The text was left unedited: any spelling mistakes or typing errors remained in the final product. The only alterations were prompted by the lawyers, who found that some descriptions of local characters "could cause us some problems."


Hardware and Software

Developing the hardware took two years. The BBC's Interactive Unit approached Philips, the only manufacturer of videodisc players in Europe, to produce the laserdisc player. This was actually a SCSI device—the original SCSI specification had only just been confirmed—which meant a SCSI interface had to be developed for the BBC Master. To the computer, the player looked like a large, slow hard disc. The BBC Master had a special read-only version of its Disc Filing System called VFS (Videodisc Filing System), and the player could be controlled using similar commands.[6]

The laser videodisc player produced PAL video, and the BBC Master also produced a PAL-like video signal. The player carried a genlock and video mixing board to combine the computer and disc pictures. Other hardware was developed for the BBC Master, including a coprocessor and a trackerball. The trackerball featured three buttons, although the BBC Domesday's graphical interface made use of only two.

Logica wrote the software using BCPL, a forerunner of C. In total, over 70,000 lines of custom code were written.


Into Production

"If we'd known the problems involved, we would never have attempted it," said one of the staff. The project was completed on time and on budget, thanks to the remarkable work of the team. Over 24,000 maps and 200,000 photos were processed. Remember that there were no digital copies to work from—the paper originals of the maps were quite literally "cut and pasted" together. Each map and photo was captured as a single frame of continuous videotape. These then had to be captioned and have their copyright cleared. In addition, over 8,000 data sets (traffic congestion, radiation levels, etc.) were stored.

The size of the Domesday project was overwhelming. The budget of £2 million sounds like a lot, but the real cost must have been far, far more than that when the dedicated work of all the schoolchildren and volunteers is considered. It has been estimated that if you worked a forty-hour week viewing Domesday, it would take seven years to see all the information. One source calculated that it would have cost a quarter-million pounds for institutions to access that amount of data, which made the price tag of the Domesday system sound like a bargain.

When the plans for the project were announced, the estimated price was £1,100, but when Domesday came to market, it had increased to over £4,000. As this was too expensive for most libraries and schools, Domesday became a commercial flop. The first set of discs was presented to the keeper of records at the Public Record Office, to be placed alongside the original Domesday book.

Life went on. The BBC Interactive Unit developed a few other ideas but eventually folded when the director general decided there was no future in multimedia. Armstrong and a group of colleagues bought out the department and set up the MultiMedia Corporation. Reworkings of the Domesday ideas appeared in other forms: the 3D World Atlas (Domesday on a global scale) sold over a million copies; Oneworld.net [7] features Another Domesday, which focuses on global justice issues and is one of Kofi Annan's favorite Web sites. BBC Domesday became an icon, the granddaddy of interactive multimedia. And then it became obsolete.

How to Preserve a Time Machine

The CAMiLEON project (Creative Archiving at Michigan and Leeds Emulating the Old on the New) has spent three years developing strategies for digital preservation and testing them with materials such as the BBC Domesday system. The BBC Domesday project encapsulated many difficult problems encountered by those working in the field: a huge amount of multimedia data, technological complexities, and intellectual property rights (IPR) issues.

There are several aspects to preserving BBC Domesday. First is the decay of the media—discs get scratched during use and become less reliable. The hardware to read the discs is rare, and the few remaining laserdisc players are prone to break down (and require very specialized repair). All the hardware is long past its shelf life. BBC computers have always been durable, but not many were produced with the special Domesday extras. The Domesday system also has a particular look and feel that requires preservation in addition to the actual content.

Rescuing the Resource

CAMiLEON obtained access to a semi-working Domesday system donated by the School of Geography at the University of Leeds. One of the first tasks of preservation was to transfer the data files from the twelve-inch laserdiscs to modern hardware, storing the bytestreams in a media-neutral form. A Linux PC could be connected to the laserdisc player using a SCSI cable, allowing the PC to read the text articles and database. Images, including still-frame video, were transferred to a PC using a standard video frame-grabber card at maximum resolution. These images were stored in an uncompressed format to avoid quality loss or the introduction of artifacts (as can occur with JPEG compression). In total, around 70GB of image data was transferred per side of each laserdisc.
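The transfer described above—reading raw bytestreams off the laserdiscs and storing them in a media-neutral form—can be sketched as a chunked, checksummed copy. This is a minimal illustration, not CAMiLEON's actual tooling; the function name, chunk size, and device path are assumptions.

```python
import hashlib

def transfer_bytestream(src_path, dest_path, chunk_size=1 << 20):
    """Copy a raw bytestream (e.g., from a block device) to a file,
    returning an MD5 checksum so the copy can be verified later."""
    digest = hashlib.md5()
    with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dest.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()

# On the Linux PC described above, the source would have been the SCSI
# laserdisc player, e.g. (hypothetical device path):
# transfer_bytestream("/dev/sdb", "domesday-side1.bin")
```

Storing a checksum alongside the copy allows future custodians to confirm that the bytestream has not been corrupted in subsequent migrations.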

The next step was to develop software that emulates the adapted BBC Master computer and the laserdisc player on which the original BBC Domesday system ran. An open-source emulator—BeebEm[8]—was used as the starting point for this software. Emulation of the specific Domesday system hardware had to be incorporated by CAMiLEON, which included the coprocessor, SCSI communication, and the many functions of the laserdisc player.

Preservation Strategy

CAMiLEON's philosophy is to preserve the data in its original, unmodified format (i.e., the original abstract bytestream, not in the same physical medium). Software can then be written to use this data: perhaps an emulation of the original system, perhaps a tool that reformats it into a modern format, or perhaps software that provides a new interface to the data. This view builds on the ideas of the CEDARS project.[9] For BBC Domesday CAMiLEON developed an emulation of the original system in which knowledge of how the original system worked is encapsulated. The emulation software, together with the abstracted data, provides a record of the original BBC Domesday system. A "black box" emulation of the laserdisc player was written to allow the emulated BBC Master to access the data recovered from the original laserdiscs.
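The idea of keeping the original abstract bytestream unmodified, with separate software layered on top of it, can be sketched as a bytestream plus a small descriptive manifest. The manifest fields here are illustrative assumptions, not a CEDARS or CAMiLEON schema.

```python
import hashlib
import json

def archive_bytestream(data: bytes, name: str, description: str) -> dict:
    """Store a bytestream unmodified, alongside a manifest that future
    software (an emulator, a reformatting tool, a new interface) can
    consult. Field names are illustrative only."""
    with open(name + ".bin", "wb") as f:
        f.write(data)
    manifest = {
        "filename": name + ".bin",
        "description": description,
        "length": len(data),
        "sha1": hashlib.sha1(data).hexdigest(),
    }
    with open(name + ".json", "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

The point of the separation is that the data file is never rewritten; only the software that interprets it changes over time.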

To avoid the problem of emulation software becoming obsolete, it was important to ensure that the software was not chained to any specific operating system or machine architecture. Careful development with a clear focus on the goal of longevity will make it easier to run this software on a future (as yet unknown) computer, needing only a few simple (and well documented) modifications.[10] This also means that it should be possible to port the emulation software to any current machine.

Currently the CAMiLEON BBC Domesday emulator runs only on Windows because, owing to time constraints, a Windows-based emulator, BeebEm, was used as the starting point. Because this was not written to follow guidelines for software longevity, it is tied to the Windows platform. The CAMiLEON team is currently seeking a small amount of funding to complete the software-longevity work and prepare the emulator for archiving.

Distribution and Copyright

Sadly, it is unlikely that Domesday will become available to the general public unless the IPR problems can be solved. The contents of the discs are heavily tied up in copyright—parts are owned by the BBC, the Ordnance Survey, and possibly the Local Education Authorities and schools. However, it may be possible for owners of original BBC Domesday laserdiscs to gain access to the preserved data and to make the emulator software publicly available. This would allow access in library reading rooms and some schools, for example. CAMiLEON is interested in examining and solving Domesday's IPR issues. Andrew Charlesworth discusses the issues in detail in "Legal Issues Arising from the Work Aiming to Preserve Elements of the Interactive Multimedia Work Entitled 'The BBC Domesday Project'"[11].

The recent auction of a BBC Domesday system on eBay is evidence of the revived interest in the project. There are also a couple of people working to produce modern interfaces to the original Domesday data—they met each other for the first time at the CAMiLEON meeting.

CAMiLEON is keen to hear from any participants who worked on, or contributed information toward, BBC Domesday and would be interested in their views on the issue of making it available to the public. Please contact Paul Wheatley if you can help.

Footnotes
[1] CAMiLEON, "BBC Domesday." [back]
[2] Finney, Andy, "The Domesday Project—November 1986" (2003). [back]
[3] King William (the Conqueror) et al., "Greater Domesday" (1086). [back]
[4] "The EcoDisc—BBC Enterprises Introduce a New Disc for Their Advanced Interactive Video System" (1988). [back]
[5] CURL. [back]
[6] Coll, John, "BBC Microcomputer User Guide" (1982). British Broadcasting Corporation. [back]
[7] OneWorld. [back]
[8] "Emulators: BeebEm." [back]
[9] Cedars, "Cedars Guide to the Distributed Digital Archiving Prototype" (2002). [back]
[10] Holdsworth, David, and Wheatley, Paul, "Emulation, Preservation and Abstraction" (2001). RLG DigiNews, v. 5, no. 4. [back]
[11] Charlesworth, Andrew, "Legal Issues Arising from the Work Aiming to Preserve Elements of the Interactive Multimedia Work Entitled 'The BBC Domesday Project'" (2002). [back]


Highlighted Web Site

MetaMap

The MetaMap emerged as a reaction to the alphabet soup of acronyms related to World Wide Web metadata initiatives. It is an instructional interactive graphic that uses the metaphor of a subway map to convey a variety of standards, organizations, initiatives, file types, and issues related to metadata. The map is divided into subway lines, each line representing a different category of information, such as "dissemination," "organizations," "libraries," and "still images." Each line contains several stations that relate to that category, such as the "W3C" station on the "organizations" line. Stations can be expanded to reveal a basic definition, relevant links, and related topics. This site provides an entertaining and innovative way to track, explore, and compare metadata standards and developments. It is also an interesting way to present a complex and multidimensional set of information.

The MetaMap is sponsored by the Groupe départemental de recherche en information visuelle (Visual Information Research Group) at the École de bibliothéconomie et des sciences de l'information.

*Plug-in required. Currently, most browsers require you to download a plug-in to view the MetaMap, which was created using SVG (Scalable Vector Graphics). This is a simple process, but the plug-in works best on Internet Explorer. When you go to the MetaMap, you will be automatically prompted to download the plug-in. For more information or to find out more about the site without installing the plug-in, visit the site's FAQ.




FAQ

Squeezing More Life Out of Bitonal Files: A Study of Black and White. Part II.

Your editor's interview in the December 2002 RLG DigiNews states that JPEG 2000 can save space and replace the multitude of file formats used for conversion and display of cultural heritage images but that it isn't suitable for bitonal material. We have lots of bitonal images. Is there anything similar available for them?

In part I of this FAQ we examined the rationale for bitonal scanning going back to 1990 and reaffirmed its continuing relevance for digital capture of certain types of cultural heritage materials. We also considered the potential advantages and disadvantages of migrating collections away from the popular but aging TIFF G4 bitonal imaging standard. Here in part II, we'll take a first look at some of the alternative bitonal file formats and compression schemes. Part III, to appear in the June 2003 issue of RLG DigiNews, will compare the quality and performance of some specific products on a range of document content, including text, halftones, and complex graphics.

The Contenders

Several image file formats and compression schemes are potential migration targets for existing TIFF G4 files. Here's a rundown of some of the most important options, presented in alphabetical order.

CPC (Cartesian Perceptual Compression)

Overview. Patented in 1991 by Cartesian Products, Inc., CPC is a proprietary compression scheme and image file format for bitonal images. Cartesian Products claims that CPC can compress substantially better than G4 and, though particularly well suited for text, that it outperforms G4 for all kinds of document content, including halftones. Unique amongst the technologies presented here, CPC does not have a lossless mode. Cartesian Products calls its method "nondegrading," meaning that after conversion to CPC, the original file can no longer be restored, but the differences cannot be perceived by the human eye (other vendors use the terms "visually lossless" and "perceptually lossless" for the same concept).

Advantages. CPC is a proven technology that has been adopted for some large bitonal image collections, such as JSTOR, which converted all its online journal holdings to CPC in 1997[1]. There is a list of major CPC users available online (scroll to the bottom of the Web page). Cartesian's claims of nondegraded compression have been verified by user preference tests conducted by ISO. CPC supports single- and multi-page documents. Its viewer is available for all major platforms.

Disadvantages. CPC is proprietary, though Cartesian Products makes available APIs (application programming interfaces) to facilitate development of software using the scheme. Cartesian also claims that it is “working with a number of vendors who will be releasing CPC-enabled products, encompassing a broad range of applications including Internet fax services, document distribution, educational assistance, and electronic libraries.” However, at the moment Cartesian is the sole source of CPC encoders and viewers. CPC's lack of a true lossless mode could be an issue for demanding preservation applications. CPC is only for bitonal images. The format is not Web native and requires the installation of a special viewer if the CPC files are to be used for display purposes. CPC offers limited metadata capacity.

DjVu

Overview. DjVu (pronounced like déjà vu) was developed by AT&T Labs in 1996 with the first publicly released products coming in 1998. DjVu is designed to be a comprehensive, all-in-one document solution, suitable for bitonal text as well as gray scale and color content. DjVu defines a document format and encompasses several different compression schemes. A layering scheme allows documents that combine text and continuous tone content to treat each separately for optimal compression and display. AT&T Labs sold the rights to DjVu to LizardTech, Inc. in 2000. The independent PlanetDjVu Web site is an excellent source of information on all things DjVu.

Advantages. Lossy (claimed visually lossless) and true lossless compression of bitonal images, both claiming considerably better compression than G4. Also handles gray scale and color. A viewer is available for all major platforms. Handles single- and multi-page documents. In December 2001 LizardTech released the v3.0 DjVu Reference Library as partial open source, and others have further enhanced that library.

Disadvantages. DjVu is proprietary, though LizardTech makes available SDKs (software developer kits) to facilitate development of software for encoding and decoding. LizardTech will license the DjVu Reference Library only for noncommercial use. At this time LizardTech is the sole source of commercial DjVu products that adhere to the current standard. Though DjVu clearly has some very enthusiastic supporters, its adoption has been spotty. The DjVu Zone Web site (which has not been updated in over a year) includes an outdated list of current users. Two of the largest users cited, Heritage Microfilm's Historical Newspaper Archive and UMI's Early English Books Online, have abandoned display of DjVu images in favor of PDF and GIF, respectively. DjVu also offers limited metadata capacity. The format is not Web native and requires the installation of a browser plug-in for display purposes.

JBIG2

Overview. Developed by the Joint Bi-Level Interest Group, JBIG2 is a new compression scheme for bitonal images that became an ISO standard at the end of 2001. It is the only contender mentioned here that is an international standard. According to the introduction of the draft JBIG2 standard, "the design goal for JBIG2 was to allow for lossless compression performance better than that of the existing standards, and to allow for lossy compression at much higher compression ratios than the lossless ratios of the existing standards, with almost no visible degradation of quality." JBIG2 is a relatively new standard that is now starting to appear in commercial products.

Advantages. Nonproprietary. Supports both lossy and lossless compression of bitonal images, including a special mode for halftones. Considerably better compression than G4, especially for halftone images. Can theoretically be incorporated into several existing file formats, such as TIFF and PDF.

Disadvantages. JBIG2 is strictly a compression scheme, so it is up to developers to incorporate it into existing file formats. Certain functionality, such as metadata, depends on what file format is used. Some applications can now produce JBIG2-encoded PDFs, but only Acrobat Reader 5.0 and later can decode them, potentially limiting user access. An open source decoder is being worked on but is in the very early stages of development.

PDF

Overview. Adobe's PDF is itself neither an image file format nor an image compression scheme. However, PDF can serve as a container for digital images compressed with several schemes, including G4 and JBIG2. PDF has been around since 1993 and has evolved over the years as the leading format for online distribution of complex documents.

Advantages. PDF supports single- or multi-page documents. For example, although it doesn't reduce their size significantly, individual TIFF G4s can be bundled into multi-page PDF G4s, making them directly accessible to most Web users and automatically gaining all the navigation and display control offered by Acrobat Reader. Though the format is proprietary, Adobe has maintained PDF as an open specification, resulting in a substantial level of third-party support. A free viewer is available for all major computing platforms. PDF is well established and is now the subject of a fledgling effort called PDF/A to "develop an international standard that defines the use of the Portable Document Format (PDF) for archiving and preserving documents." See also "Archiving and Preserving PDF Files," by John Mark Ockerbloom, in the February 2001 issue of RLG DigiNews.
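The bundling just described can be sketched with the Pillow imaging library (a third-party tool chosen here for illustration; the article does not endorse any particular converter, and output size and encoding details will vary by tool):

```python
from PIL import Image  # Pillow, used here purely for illustration

def bundle_tiffs_to_pdf(tiff_paths, pdf_path):
    """Bundle single-page bitonal TIFFs into one multi-page PDF."""
    pages = [Image.open(p).convert("1") for p in tiff_paths]
    first, rest = pages[0], pages[1:]
    first.save(pdf_path, save_all=True, append_images=rest)

# Example: create two 1-bit pages, save them as G4-compressed TIFFs,
# then bundle them into a single PDF.
for n in (1, 2):
    Image.new("1", (200, 100), 1).save(f"page{n}.tif", compression="group4")
bundle_tiffs_to_pdf(["page1.tif", "page2.tif"], "bundle.pdf")
```

As the text notes, tools vary considerably in output quality and efficiency, so any converter should be evaluated against the collection's own documents.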

Disadvantages. Despite the open specification, PDF is still a proprietary format. Acrobat Reader can decode JBIG2-encoded images, but only since version 5.0. Users who haven't upgraded to version 5 will get an error message if they attempt to read a JBIG2-encoded PDF. There are hundreds of tools for converting to PDF, but they must be evaluated carefully since there is considerable variation in the quality of output and efficiency of display. It needs better metadata capability (PDF/A may address this). The format is not Web native and requires the installation of a special viewer, though Acrobat Reader and the Acrobat browser plug-in are widely deployed.

To be continued ….

Are any of these migration targets for existing TIFF G4 images appropriate for your collection? Much will depend on institutional circumstances and priorities and the nature of the documents in question. In part III of this FAQ, to be published in the next issue of RLG DigiNews, we'll look at some specific implementations and reassess migration risk in light of all our findings.

Richard Entlich

Footnote

[1] JSTOR still scans to TIFF G4 and considers those files its preservation masters. The CPC files are used to reduce storage requirements for its online collection but are converted to GIF for display and to PDF for printing. (back)


Calendar of Events

ERPANET Seminar on the Preservation of Web Material
May 22-24, 2003
Kerkira (Corfu), Greece

Co-hosted by the University of Kerkira and ERPANET, the seminar intends to provide a detailed analysis of the main initiatives in the area of Web archiving and to discuss the long-term preservation of material obtained from Internet sources.

Electronic Media Group of AIC Meeting
June 5-10, 2003
Arlington, Virginia

EMG program activities will take place at the American Institute for Conservation's 31st annual meeting. Activities include the workshop "Identification and Care of Videotapes," a special joint session with the Photographic Materials Group, and a presentation of "PLAYBACK," a new interactive DVD on the subject of analog video preservation.

Sound Savings: Preserving Audio Collections
July 24-26, 2003
Austin, Texas

Sound Savings will feature talks by experts in audio preservation on topics ranging from assessing the preservation needs of audio collections to creating, preserving, and making publicly available digitally reformatted audio recordings. The School of Information at the University of Texas at Austin, the Library of Congress Preservation Directorate, the National Recording Preservation Board, and the Association of Research Libraries are co-sponsors.

Digital Preservation Management: Short-Term Solutions to Long-Term Problems
August 4-8, 2003
Cornell University Library, Ithaca, NY

Cornell University Library will offer a new digital preservation training program with funding from the National Endowment for the Humanities. Institutions are encouraged to send a pair of participants to realize the maximum benefit from the managerial and technical tracks that will be incorporated into the program. This limited enrollment workshop has a registration fee of $750 per participant. Registration is now open for the August workshop. A second workshop is scheduled for October 13-17 (registration will open this summer). There will be three workshops in 2004.

Digital Resources for the Humanities Conference
August 31-September 3, 2003
University of Gloucestershire, Cheltenham

The annual Digital Resources for the Humanities conference will concentrate on the following themes:

  • The impact of access to digital resources on teaching and learning
  • Digital libraries, archives and museums
  • Time-based media and multimedia studies in performing arts
  • Network technologies to support international community programs
  • The anticipated convergence between televisual, communication and computing media and its effect on the humanities
  • Knowledge representation, including visualization and simulation

Seventh Summer Institute at the University of New Brunswick
August 24-29, 2003
New Brunswick, Canada

The course will focus on creating a set of electronic texts and digital images. The program is designed primarily for librarians and archivists who are planning to develop electronic text and imaging projects, for scholars who are creating electronic texts as part of their teaching and research, and for publishers who are looking to move publications to the Web.

Balancing Museum Technology and Transformation
November 5-8, 2003
Las Vegas, Nevada

The Museum Computer Network annual meeting will focus on ways in which technology influences work, museum programming, and the way we think about museums and cultural heritage.


Announcements

Task Force on Digital Repository Certification
RLG and the National Archives and Records Administration (NARA) have jointly created this task force to produce certification requirements for establishing and selecting reliable digital information repositories.

Library of Congress Announces Approval of Plan to Preserve America's Digital Heritage
The Library of Congress has received approval from Congress for its Plan for the National Digital Information Infrastructure and Preservation Program (NDIIPP), which will enable the Library to launch the initial phase of building a national infrastructure for the collection and long-term preservation of digital content.

Final Report for the Netarchive.dk Pilot Project
The final report of the Netarchive.dk project to study the acquisition and archiving of the Danish Internet is available. The project was carried out by the Royal Library; the Copenhagen State and University Library, Aarhus; and the University of Aarhus's Centre for Internet Research. The report describes the experience gained from a pilot study in which existing software was used to harvest and subsequently test out materials relating to the county and district elections of 2001.

A Joint Digital Repository Project for Cambridge University and MIT
Cambridge University Library and the Massachusetts Institute of Technology Libraries have embarked on a joint project to establish a digital repository for Cambridge University. The two libraries, working with Cambridge University Computing Service, will jointly receive £1.7 million over two years from the Cambridge-MIT Institute (CMI) to install an open source computer system called DSpace™.

OCLC Research Publishes White Paper on the Economics of Digital Preservation (PDF)
OCLC researcher Brian Lavoie has published a white paper entitled The Incentives to Preserve Digital Materials: Roles, Scenarios, and Economic Decision-Making.

Pitt University Library System Launches Archive on Formation of European Union
Pitt’s University Library System has established the Archive of European Integration, an OAI-compliant e-print repository for literature related to integration in Europe in the 20th and 21st centuries. The archive, launched with 78 papers, is accessible with no restrictions.


RLG News

Selection and Collaboration in Digital Preservation: an RLG-JISC Symposium
Over 140 people from Europe, North America, and Taiwan convened in Washington, DC, March 24-25, 2003, for an international conference on selection and collaboration in digital preservation. Sponsored by the Joint Information Systems Committee (JISC) and RLG, and hosted by the Library of Congress, the conference provided a venue to share, disseminate, and discuss the critical issues of selection and collaboration in preserving digital materials.

Leading speakers from the USA and Europe described their experiences and future plans. Through presentations, discussion, and breakout groups, participants had opportunities to contrast different approaches, consider which approaches were relevant for their own institution and interests, and further explore opportunities for collaboration in digital preservation across organizational and national boundaries.

This event was the latest in a series of collaborations between JISC and RLG begun in 1996 and resulting in conferences, research projects and publications. Full proceedings from the symposium will be available on the RLG web site in the next week.


Publishing Information

RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell (jlh@notes.rlg.org), RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Erica Olsen; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello; Assistant: Valerie Jacoski.

All links in this issue were confirmed accurate as of April 15, 2003.
