Council on Library and Information Resources

Appendix D

Case Study for Image File Format

1. Collection and Analysis of Source and Target File Format Related Information

Investigation Test Bed

To assess the risks associated with file format migration for digital image collections, the project team selected one of Cornell University Library's digital image collections as a test bed. The Ezra Cornell Papers consist of correspondence, financial and legal records, court proceedings, and other documents pertaining principally to the Cornell family, the telegraph industry, and the founding of Cornell University. The collection is composed of 30,000 images stored on small computer system interface (SCSI) disks. The pages were scanned as 600 dpi, 1-bit TIFF 5.0 ITU Group 4 images. Tag(ged) Image File Format (TIFF) is one of the most popular raster image file formats and is often the format of choice for master image files. It is platform-independent and supports 1- to 24-bit imaging using a variety of compression methods.

The Ezra Cornell materials were scanned in-house using a Xerox scanning system. This system organizes and stores the structuring information (e.g., page number, folder number) in a format called Raster Document Object (RDO), which is Xerox's adoption of the International Office Document Architecture (ODA) and Interchange Format.1

Goals of the File Format Migration Investigation

The goals of the file format migration investigation for image files were to:

  • identify the TIFF file format attributes at risk during migration,
  • assess the need to move these TIFF 5.0 image files to the current version (6.0),
  • evaluate the risks involved in converting TIFF 5.0 files to TIFF 6.0 files,
  • investigate the status of the upcoming revision to TIFF (7.0),
  • assess the risks involved in skipping a generation (TIFF 6.0) and waiting for the release of TIFF 7.0, and
  • assess risks and data loss associated with converting from RDO format to the open Cornell Digital Library (CDL) format.

Collection and Analysis of Source and Target File Format Related Information

To identify digital image format attributes at risk, the project staff collected and analyzed information on different versions of TIFF file format. The research process included the following:

  • Conducting a literature search on digital archiving issues pertaining to digital image collections, with a specific focus on migration and the effects of file format choice in the migration chain.
  • Investigating new digital preservation research and initiatives, such as ISO's Open Archival Information System (OAIS) (International Organization for Standardization 1998), WGBH's Universal Preservation Format (UPF) (Shepard and MacCarn 1999), and the Digital Rosetta Stone Model (Heminger and Robertson 1998), among others.
  • Conducting a literature and projects survey to determine the extent of work performed on developing risk analysis based on image files.
  • Reviewing risk-assessment tools developed for various purposes, focusing on the form and functionality of these tools and how they can be adapted for the purposes of this project.
  • Exploring the dependencies that extend beyond basic image file format attributes, such as internal and external relationships between images and their accompanying metadata files (viewing images as "digital objects" and examining their metadata, associated scripts, programs, etc.).
  • Identifying the attributes of digital images that are at risk during format migration, including the effects of migration on metadata, and various scripts and programs that support retrieval and management of the collection.
  • Investigating the existing and emerging bitmap image file formats with a focus on their longevity and other archival attributes.
  • Exploring vulnerabilities associated with file format migration and identifying risks associated with "migrating" or "not migrating" these files, with a focus on TIFF files.
  • Analyzing the factors involved in decision making in migration projects, such as reformatting a collection of images from TIFF 4.0 to TIFF 5.0 format.
  • Examining and comparing the TIFF file format specifications for Versions 4.0, 5.0, and 6.0.
  • Exploring the future of TIFF as a file format, with a focus on the characteristics of the TIFF 7.0 format under development.
  • Investigating the issues introduced by storing structuring metadata in Xerox RDO format.
  • Identifying the risks involved in converting RDO files to the CDL format (http://www2.hunter.com/docs/rfc/rfc1691.html).

One outcome of this research process is summarized in Table 1, which categorizes the risks associated with file-format-based migration.

RISK CATEGORY: EXAMPLES

Content fixity (bit configuration, including bit stream, form, and structure):

  • Bits/bit streams are corrupted by software bugs or mishandling of storage media, mechanical failure of devices, etc.
  • File format is accompanied by new compression that alters the bit configuration.
  • File header information does not migrate or is partially or incorrectly migrated.
  • Image quality (e.g., resolution, dynamic range, color spaces) is affected by alterations to the bit configuration.
  • New file format specifications change byte order.

Security:

  • Format migration affects watermark, digital stamp, or other cryptographic techniques for "fixity."

Context and integrity (the relationship and interaction with other related files or other elements of the digital environment, including hardware/software dependencies):

  • Because of different hardware and software dependencies, reading and processing the new file format require a new configuration.
  • Linkages to other files (e.g., metadata files, scripts, derivatives such as marked-up or text versions or on-the-fly conversion programs) are altered during migration.
  • New file format reduces the file size (because of file format organization or new compression) and causes denser storage and potential directory-structuring problems if one tries to consolidate files to use extra storage space.
  • Media become more dense, affecting labels and file structuring. (This might also be caused by file organization protocols of the new storage medium or operating system.)

References (the ability to locate images definitively and reliably over time among other digital objects):

  • File extensions change because of file format upgrade and its effect on URLs.
  • Migration activity is not well documented, causing provenance information to be incomplete or inaccurate (a potential problem for future migration activities).

Cost:

  • Long-term costs associated with migration are unpredictable because each migration cycle may involve different procedures, depending on the nature of the migration (routine migration vs. paradigm shift).
  • The value of the collection may be insufficiently determined, making it impossible to set priorities for migration.
  • Costs may be unscalable unless there is a standard architecture (e.g., centralized storage, metadata standards, file format/compression standards) that encompasses the image collections so that the same migration strategy can be easily implemented for other similar collections.

Staffing:

  • Staff turnover and lack of continuity in migration decisions can hurt long-term planning, especially if insufficient preservation metadata is captured and the migration path is not well documented.
  • Decisions must be made whether to hire full-time, permanent staff or use temporary workers for rescue operations.
  • Staff may have insufficient technical expertise.
  • The unpredictability of migration cycles makes it difficult to plan for staffing requirements (e.g., skills, time, funding).

Functionality:

  • Features introduced by the new file format may affect derivative creation, such as printing.
  • If the master copy is also used for access, changes may cause decreased or increased functionality and require interface modifications (e.g., static vs. multiresolution image, inability of the Web to support the new format).
  • Unique features that are not supported in other file formats may be lost (e.g., the progressive display functionality when Graphics Interchange Format [GIF] files are migrated to another format).
  • The artifactual value (original use context) may be lost because of changes introduced during migration; as a result, the "experience" may not be preserved.

Legal:

  • Copyright regulations may limit the use of new derivatives that can be created from the new format (e.g., the institution is allowed to provide images only at a certain resolution so as not to compete with the original).

Table 1. Risks associated with file-format-based migration for image collections
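
The first content-fixity risk in Table 1 (bits corrupted by software bugs, media mishandling, or device failure) is commonly detected by recording a checksum for each file and recomputing it later. The report does not describe such a procedure; the following is only a minimal sketch in Python, and the function names sha256_of and changed_files, as well as the manifest layout, are assumptions made for illustration. Because a format migration deliberately changes the bits, a comparison of this kind is meaningful only for files that are expected to stay unchanged, such as master images left in their current format or copies moved between storage media.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest, root):
    """List files that are missing or whose digest differs from the manifest.

    manifest maps a file name (relative to root) to the digest recorded
    when the file was last known to be good. This layout is hypothetical.
    """
    root = Path(root)
    changed = []
    for name, recorded in sorted(manifest.items()):
        target = root / name
        if not target.is_file() or sha256_of(target) != recorded:
            changed.append(name)
    return changed
```

A manifest built once with sha256_of can be stored alongside the provenance documentation, so that later checks report exactly which page images need to be restored from another copy.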

Conclusions of the Source and Target File Format Analysis

Because most of the specifications are publicly available on the Adobe FTP site, the project staff was able to gather a substantial amount of information about the different versions of TIFF. TIFF was developed by Aldus and Microsoft, and the specification was owned by Aldus, which later merged with Adobe Systems, Incorporated. Consequently, Adobe now holds the copyright for the TIFF specifications. TIFF is a highly flexible, platform-independent file format that is supported by numerous image-processing applications. A great strength of the format is its file header option, which enables a wide variety of metadata (descriptive, administrative, and structural) to be recorded within the file itself. The set of fields or "tags" in TIFF is extensive, making it the format of choice for most archival reformatting. However, a large number of TIFF fields are not defined by the standard. Therefore, while TIFF offers the advantage of being open and usable, there is the danger that different institutions will define these fields in different ways, leading to problems of compatibility. Byte order is another area in which TIFF's flexibility can cause confusion: the format permits data to be stored in either MSB ("Motorola") or LSB ("Intel") byte order, with a header item indicating which order is used.
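
The byte-order indicator mentioned above sits in the first eight bytes of every TIFF file: two bytes holding "II" (LSB first) or "MM" (MSB first), a two-byte identifying number (42), and a four-byte offset to the first image file directory. The sketch below, in Python, shows how a reader might interpret that header. The function name read_tiff_header and the shape of its return value are assumptions made for illustration, not part of the project's tooling. Note that the header does not record the specification revision, so it cannot by itself distinguish a TIFF 5.0 file from a TIFF 6.0 file.

```python
import struct


def read_tiff_header(data):
    """Interpret the 8-byte header at the start of a TIFF file."""
    if len(data) < 8:
        raise ValueError("a TIFF header is 8 bytes long")
    order_mark = data[:2]
    if order_mark == b"II":
        prefix, label = "<", "LSB first (Intel)"
    elif order_mark == b"MM":
        prefix, label = ">", "MSB first (Motorola)"
    else:
        raise ValueError("unrecognized byte-order mark")
    # The same byte order applies to every multi-byte value in the file,
    # starting with the identifying number and the first IFD offset.
    magic, ifd_offset = struct.unpack(prefix + "HI", data[2:8])
    if magic != 42:
        raise ValueError("identifying number is not 42")
    return {"byte_order": label, "first_ifd_offset": ifd_offset}
```

Calling read_tiff_header on the first eight bytes of each master file is a cheap way to confirm that a whole collection uses one byte order before any batch processing is attempted.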

Tracking the TIFF 7.0 development turned out to be a challenging task. The project team's attempts to contact TIFF 7.0 developers, Adobe, and even TIFF listserv subscribers were fruitless. The TIFF 7.0 development group seems to be determined not to release any information regarding their work. Therefore, the project team was unable to make any comparisons between TIFF 7.0 and the earlier versions. After conducting an extensive evaluation and comparison of TIFF 5.0 and TIFF 6.0 specifications, the team ran several tests to compare the quality and utility of a subset of TIFF 5.0 images before and after conversion to TIFF 6.0. This exploration revealed no major differences between the versions. The project team concluded that there were no risks involved at this point in leaving the testbed images in TIFF 5.0 format. After reaching this conclusion, the team shifted its focus for the risk-assessment study for image files to an examination of storing structural metadata in the proprietary Xerox RDO format. The team will continue to monitor the development of TIFF 7.0.

Raster Document Object Files

An RDO file contains information about the structure of an image document as well as a file location pointer for each page image in that document. A single TIFF file represents each page in the document. The TIFF files each contain the digital data from the scanned page and a header that describes the characteristics of the image file. Because the Xerox Documents on Demand (XDOD) system is proprietary, the structure of image documents can be displayed only by using the appropriate Xerox software.

2. Selection and Evaluation of Conversion Software

Since a decision was made to maintain the files in TIFF 5.0 format, evaluation of the TIFF conversion software was unnecessary. There are several conversion programs on the market for converting TIFF files to various TIFF versions and other file formats (e.g., TIFF to GIF, TIFF to PNG). TIFF 5.0 to TIFF 6.0 conversion could be interpreted as an update rather than as a migration process.

In 1994, Cornell undertook a project to convert the proprietary RDO files to an open CDL format. The specifications for the CDL, which were released in August 1994 through a Request for Comments (#1691), define an architecture for the storage and retrieval of Cornell University Library's image collection. Similar to RDO files, the CDL document structure provides direct access to the components of image collections (e.g., pages, sections, and chapters).

While the project team's main interest was exploring the export of files created on XDOD 3.0, its immediate concern was with the older RDOs, especially in light of the Y2K compliance issues (i.e., concern that the XDODs would no longer work unless an expensive upgrade were implemented).

The conversion from XDOD RDO to CDL format involved two steps. First, Cornell used a Xerox-supplied tool (XDOD Export Tool) to convert the RDO files into a series of ASCII metadata files. This tool is old and can run only in Windows 3.1, and its dissemination is authorized "only pursuant to a valid written license from Xerox." Second, a Perl script written by the Cornell University Library information technology staff converted the ASCII metadata files to the CDL format. These CDL-formatted structural metadata files are used for navigating through a document (http://moa.cit.cornell.edu/MOA/EZRA.html).

RDO-to-CDL conversion cannot be achieved through a single software tool since Xerox has not released any RDO specifications.

3. Development of Tools for Assessing the Source-To-Target Format Transfer

No specific software tool was developed to analyze the effects of migration from RDO to CDL format, because all files created using the XDOD scanning system possess identical information fields.

4. Comparison and Analysis after Conversion to Source File Format

The comparison was performed manually: the team checked the list of structural metadata elements captured in the RDO files during scanning against the CDL structuring requirements. All the structural elements mapped to the CDL structure, and there was no loss. Even if there had been a loss, the project team decided that it was much riskier (actually detrimental) to leave the structuring information in an unsupported proprietary format.
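
A comparison of this kind can also be expressed as a simple mapping check. The team worked by hand, so the following Python sketch is only an illustration of the idea. The element names, the mapping, and the function name find_unmapped are all hypothetical; the report does not list the actual RDO fields or their CDL counterparts.

```python
# Hypothetical source elements; the actual RDO field list is not given here.
SOURCE_ELEMENTS = ["page number", "folder number", "section", "chapter"]

# Hypothetical mapping from each source element to a target component.
ELEMENT_MAP = {
    "page number": "page",
    "folder number": "folder",
    "section": "section",
    "chapter": "chapter",
}


def find_unmapped(source_elements, element_map):
    """Return the source elements that have no target component."""
    return [name for name in source_elements if not element_map.get(name)]


# An empty result corresponds to the finding that nothing was lost.
missing = find_unmapped(SOURCE_ELEMENTS, ELEMENT_MAP)
print("unmapped elements:", missing)
```

Any element reported by find_unmapped would mark structural information at risk of being dropped during conversion.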

5. Releasing the Export Tool to Other Institutions

As part of this project, Cornell investigated the possibility of further developing the Export Tool and making it available to other institutions that have legacy collections in the proprietary Xerox RDO format. This investigation was spurred by two concerns. First, several institutions had requested access to the tool over the past few years, but only Yale University had secured permission from Xerox to use it. Second, in early summer 1999, Xerox informed Cornell that the XDOD 2.x scanning workstations would not be Y2K-compliant without an expensive upgrade. Because Cornell had begun to phase out use of the XDOD systems and had converted all RDO files to the CDL format, our concerns over the millennium focused on our sister institutions' collections.

We initially considered developing the Export Tool into more generic software for external use, but quickly concluded that this would be both expensive and time-consuming. Cornell did not receive any specifications from Xerox for the proprietary tool, and the software developer at Xerox indicated that he doubted that the company still had the tools and specifications to make the system work. We decided to focus on securing permission to release the current version of the Export Tool. A two-year effort to obtain a blanket permission from Xerox to make the tool broadly accessible had stalled, so we turned to documenting the extent of the problem, concluding that Xerox might be more amenable to a very limited release.

In late April 1999, Cornell posted the following announcement on 11 listservs.

Export Tool to Convert Xerox RDO Files to Open Digital Library Format
Has your institution created digital image files using the proprietary Xerox Documents on Demand software that generates Raster Document Objects (RDOs) to store structural metadata? Cornell University is seeking feedback from these institutions to determine what demand there would be for freeware to convert those RDOs for use in other metadata applications. Cornell has used the RDO2CDL export tool to migrate RDOs to ASCII metadata files that recreate the logical and physical structure format of the RDO (called CDL). If your institution is interested in utilizing such an Export Tool, please send contact information and a brief description of your needs to: Anne R. Kenney (ark3@cornell.edu).

By early June, surprisingly few responses had been received. Universities with files created on XDOD 2.5 or older versions included Harvard, Penn State, the University of Tennessee-Knoxville, and Yale. Those responding with files created using XDOD 3.01 or DigiPath included the Hein Publishing Company, Illinois State Library, the National Document Center (Athens), Indiana University, the University of Toronto, and the National Oceanic and Atmospheric Administration (NOAA) Miami Regional Library (which was considering using the technology).

Inquiries to Xerox about releasing the tool to this group resulted in further clarification that the RDO Export Tool software would work only as configured on XDOD Version 2.x systems. The format of the RDO changed slightly from version 2 to version 3, and the Export Tool would not convert the structural data on version 3 or higher systems. The Hein Publishing Company had used the tool with version 3 files through a collaborative project with Cornell, but only page labels, not structuring information, were exported. William Anderson, the Xerox software engineer who created the tool, suggested that it would be possible to get the structure information out of the version 3 RDO files, but it would take a programmer with knowledge of the Office Document Architecture (of which RDO is a variant), fair knowledge of Unix tools, and a copy of the RDO Version 3 specification, which Xerox seemed unwilling or unable to make available publicly. Anderson suggested that, "If customers are looking to buy DigiPath today, and they need that facility, they should ask for it." Xerox decided to grant access to this software only to XDOD 2.x customers who were not migrating to DigiPath.

From June to early September, efforts continued to reach legal agreement with Xerox over the release of the Export Tool software to XDOD 2.x users. Cornell received a copy of a proposed Software License Agreement on August 26, 1999. The agreement granted the institution a nonexclusive, perpetual, royalty-free license to use the software and the right to provide a sublicense only to those institutions that had reported using the XDOD Version 2.x systems, collectively referred to as "Authorized Educational Institutions" (AEI). Lee Cartmill, the chief financial officer at Cornell University Library, expressed concern about the indemnity clause in the agreement, which required Cornell to "defend, indemnify and hold Xerox harmless from and against any and all third party claims that arise from or relate to the Software and their respective use of the Software." Cornell attempted to have this clause modified. When Xerox remained adamant, Cartmill drafted a Software Sublicense Agreement that would require the AEIs to extend the indemnity and limitation of liability to Cornell University. As of this writing, the four institutions have been notified of these stipulations, and their legal advisers are reviewing copies of the agreements. It remains to be seen whether any or all of these institutions will agree to these license stipulations, but Cornell will not sign the agreement with Xerox unless they do so.

References

International Organization for Standardization. 1998. ISO Reference Model for an Open Archival Information System (OAIS). Available from http://ssdoo.gsfc.nasa.gov/nost/isoas/us/overview.html.

Heminger, Alan R., and Steven B. Robertson. 1998. Digital Rosetta Stone: A Conceptual Model for Maintaining Long-Term Access to Digital Documents. Available from http://crack.inesc.pt/events/ercim/delos6/papers/rosetta.doc.

Shepard, Thom, and Dave MacCarn. 1999. UPF: Universal Preservation Format. Available from http://info.wgbh.org/upf/.

Footnotes

1 ODA, which became an ISO standard in 1988, has been developed to represent and allow the interchange of office documents. It contains facilities that allow both the structure and content of complex multimedia documents to be represented. Although ODA is an open standard, specifications for the RDO architecture are proprietary.
