Appendix D
Case Study for Image File Format
1. Collection and Analysis of Source and Target File Format Related Information

Investigation Test Bed

To assess the risks associated with file format migration for digital image collections, the project team selected one of Cornell University Library's digital image collections as a test bed. The Ezra Cornell Papers consist of correspondence, financial and legal records, court proceedings, and other documents pertaining principally to the Cornell family, the telegraph industry, and the founding of Cornell University.

The collection is composed of 30,000 images stored on small computer system interface (SCSI) disks. The images were scanned as 600 dpi, 1-bit TIFF 5.0 ITU Group 4 images. Tag(ged) Image File Format (TIFF) is one of the most popular raster image file formats and is often the format of choice for master image files. It is platform-independent and supports 1- to 24-bit imaging using a variety of compression methods.

The Ezra Cornell materials were scanned in-house using a Xerox scanning system. This system organizes and stores the structuring information (e.g., page number, folder number) in a format called Raster Document Object (RDO), which is Xerox's adoption of the International Office Document Architecture (ODA) and Interchange Format (see footnote 1).

Goals of the File Format Migration Investigation

The goals of the file format migration investigation for image files were to:
Collection and Analysis of Source and Target File Format Related Information

To identify digital image format attributes at risk, the project staff collected and analyzed information on different versions of the TIFF file format. The research process included the following:
The outcome of this research process is summarized in Table 1, which categorizes the risks associated with file-format-based migration.
Table 1. Risks associated with file-format-based migration for image collections

Conclusions of the Source and Target File Format Analysis

Because most of the specifications are publicly available on the Adobe FTP site, the project staff was able to gather a substantial amount of information about the different versions of TIFF. TIFF was developed by Aldus and Microsoft, and the specification was owned by Aldus, which in turn merged with Adobe Systems, Incorporated. Consequently, Adobe now holds the copyright for the TIFF specifications.

TIFF is a highly flexible, platform-independent file format that is supported by numerous image-processing applications. A great strength of the format is its file header option, which allows a wide variety of metadata (descriptive, administrative, and structural) to be recorded within the file itself. The set of fields, or "tags," in TIFF is extensive, making it the format of choice for most archival reformatting. However, a large number of TIFF fields are not defined by the standard. Therefore, while TIFF offers the advantage of being open and usable, there is a danger that different institutions will define these fields in different ways, leading to compatibility problems. Byte order is another source of flexibility that can cause confusion: the TIFF format permits data to be stored in either MSB ("Motorola") or LSB ("Intel") byte order, with a header item indicating which order is used.

Tracking the development of TIFF 7.0 turned out to be a challenging task. The project team's attempts to contact the TIFF 7.0 developers, Adobe, and even TIFF listserv subscribers were fruitless. The TIFF 7.0 development group seems determined not to release any information about its work. Therefore, the project team was unable to make any comparisons between TIFF 7.0 and the earlier versions.
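The byte-order header item mentioned above can be illustrated with a minimal Python sketch. It is not part of the project's tooling. It assumes the standard 8-byte TIFF header layout (a two-byte order mark, "II" for LSB-first or "MM" for MSB-first, then the number 42 and the offset of the first image file directory), and the function name `read_tiff_header` is hypothetical.

```python
import struct


def read_tiff_header(data: bytes):
    """Return (byte_order, first_ifd_offset) from the 8-byte TIFF header.

    Hypothetical helper for illustration; it inspects only the header
    and does not validate the rest of the file.
    """
    if len(data) < 8:
        raise ValueError("too short to contain a TIFF header")
    order_mark = data[:2]
    if order_mark == b"II":
        endian = "<"  # LSB first ("Intel")
    elif order_mark == b"MM":
        endian = ">"  # MSB first ("Motorola")
    else:
        raise ValueError("missing TIFF byte-order mark")
    # The remaining header fields are stored in the declared byte order.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("unexpected TIFF magic number")
    return ("little" if endian == "<" else "big"), ifd_offset
```

A reader that ignores the order mark would misread every multi-byte field that follows, which is why software handling TIFF files from mixed sources needs to check it first.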
After conducting an extensive evaluation and comparison of the TIFF 5.0 and TIFF 6.0 specifications, the team ran several tests to compare the quality and utility of a subset of TIFF 5.0 images before and after conversion to TIFF 6.0. This exploration revealed no major differences between the versions. The project team concluded that, at this point, there was no risk in leaving the test bed images in TIFF 5.0 format. After reaching this conclusion, the team shifted the focus of its risk-assessment study for image files to the storage of structural metadata in the proprietary Xerox RDO format. The team will continue to monitor the development of TIFF 7.0.

Raster Document Object Files

An RDO file contains information about the structure of an image document as well as a file location pointer for each page image in that document. A single TIFF file represents each page in the document. Each TIFF file contains the digital data from the scanned page and a header that describes the characteristics of the image file. Because the Xerox Documents on Demand (XDOD) system is proprietary, the structure of image documents can be displayed only by using the appropriate Xerox software.

2. Selection and Evaluation of Conversion Software

Because the team decided to keep the files in TIFF 5.0 format, evaluating TIFF conversion software was unnecessary. There are several programs on the market for converting TIFF files to other TIFF versions and to other file formats (e.g., TIFF to GIF, TIFF to PNG). TIFF 5.0 to TIFF 6.0 conversion could be interpreted as an update rather than as a migration process.

In 1994, Cornell undertook a project to convert the proprietary RDO files to an open Cornell Digital Library (CDL) format. The specifications for the CDL, which were released in August 1994 as a Request for Comments (RFC 1691), define an architecture for the storage and retrieval of Cornell University Library's image collection.
Like RDO files, the CDL document structure provides direct access to the components of image collections (e.g., pages, sections, and chapters). While the project team's main interest was exploring the export of files created on XDOD 3.0, its immediate concern was with the older RDOs, especially in light of Y2K compliance issues (i.e., concern that the XDODs would no longer work unless an expensive upgrade were implemented).

The conversion from XDOD RDO to CDL format involved two steps. First, Cornell used a Xerox-supplied tool (the XDOD Export Tool) to convert the RDO files into a series of ASCII metadata files. This tool is old and runs only in Windows 3.1, and its dissemination is authorized "only pursuant to a valid written license from Xerox." Second, a locally developed Perl script converted the ASCII metadata files to the CDL format. These CDL-formatted structural metadata files are used for navigating through a document (http://moa.cit.cornell.edu/MOA/EZRA.html). The Cornell University Library information technology staff wrote the ASCII RDO-to-CDL program. RDO-to-CDL conversion cannot be achieved through a single software tool because Xerox has not released any RDO specifications.

3. Development of Tools for Assessing the Source-To-Target Format Transfer

No specific software tool was developed to analyze the effects of migration from RDO to CDL format, because all files created using the XDOD scanning system possess identical information fields.

4. Comparison and Analysis after Conversion to Source File Format

The comparison was done manually. The team compared the list of structural metadata elements captured in the RDO files during scanning with the CDL structuring requirements. All the structural elements mapped to the CDL structure, and there was no loss.
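The manual comparison described above amounts to checking that every source element has a counterpart in the target structure. A minimal Python sketch of that check follows. The element names and the mapping are hypothetical placeholders, not the actual RDO or CDL field names, and the report states that no such tool was built for this project.

```python
def unmapped_elements(source_elements, mapping):
    """Return, sorted, the source elements that have no target in the mapping."""
    return sorted(e for e in source_elements if not mapping.get(e))


# Hypothetical element names, for illustration only.
rdo_elements = ["page_number", "folder_number", "section"]
rdo_to_cdl = {
    "page_number": "page",
    "folder_number": "folder",
    "section": "section",
}

missing = unmapped_elements(rdo_elements, rdo_to_cdl)
print("no loss" if not missing else f"unmapped: {missing}")
```

An empty result corresponds to the outcome reported here: every structural element found a place in the target format.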
Even if there had been a loss, the project team decided that it would have been much riskier (indeed detrimental) to leave the structuring information in an unsupported proprietary format.

5. Releasing the Export Tool to Other Institutions

As part of this project, Cornell investigated the possibility of further developing the Export Tool and making it available to other institutions that have legacy collections in the proprietary Xerox RDO format. This investigation was spurred by two concerns. First, several institutions had requested access to the tool over the past few years, but only Yale University had secured permission from Xerox to use it. Second, in early summer 1999, Xerox informed Cornell that the XDOD 2.x scanning workstations would not be Y2K-compliant without an expensive upgrade. Because Cornell had begun to phase out use of the XDOD systems and had converted all RDO files to the CDL format, our concerns over the millennium focused on our sister institutions' collections.

We initially considered developing the Export Tool into more generic software for external use, but quickly concluded that this would be both expensive and time-consuming. Cornell did not receive any specifications from Xerox for the proprietary tool, and the software developer at Xerox indicated that he doubted that the company still had the tools and specifications to make the system work. We decided to focus on securing permission to release the current version of the Export Tool. A two-year effort to obtain blanket permission from Xerox to make the tool broadly accessible had stalled, so we turned to documenting the extent of the problem, reasoning that Xerox might be more amenable to a very limited release. In late April 1999, Cornell posted the following announcement on 11 listservs.

Export Tool to Convert Xerox RDO Files to Open Digital Library Format

By early June, surprisingly few responses had been received. Universities with files created on XDOD 2.5 or older versions included Harvard, Penn State, the University of Tennessee, Knoxville, and Yale. Those responding with files created using XDOD 3.01 or DigiPath included the Hein Publishing Company, Illinois State Library, the National Document Center (Athens), Indiana University, the University of Toronto, and the National Oceanic and Atmospheric Administration (NOAA) Miami Regional Library (which was considering using the technology).

Inquiries to Xerox about releasing the tool to this group resulted in further clarification that the RDO Export Tool software would work only as configured on XDOD Version 2.x systems. The format of the RDO changed slightly from version 2 to version 3, and the Export Tool would not convert the structural data on version 3 or higher systems. The Hein Publishing Company had used the tool with version 3 files through a collaborative project with Cornell, but only page labels, not structuring information, were exported. William Anderson, the Xerox software engineer who created the tool, suggested that it would be possible to get the structure information out of the version 3 RDO files, but that doing so would take a programmer with knowledge of the Office Document Architecture (of which RDO is a variant), fair knowledge of Unix tools, and a copy of the RDO Version 3 specification, which Xerox seemed unwilling or unable to make publicly available. Anderson suggested that, "If customers are looking to buy DigiPath today, and they need that facility, they should ask for it." Xerox decided to grant access to this software only to XDOD 2.x customers who were not migrating to DigiPath.

From June to early September, efforts continued to reach a legal agreement with Xerox over the release of the Export Tool software to XDOD 2.x users. Cornell received a copy of a proposed Software License Agreement on August 26, 1999.
The agreement granted the institution a nonexclusive, perpetual, royalty-free license to use the software and the right to provide a sublicense only to those institutions that had reported using the XDOD Version 2.x systems, collectively referred to as "Authorized Educational Institutions" (AEI). Lee Cartmill, the chief financial officer at Cornell University Library, expressed concern about the indemnity clause in the agreement, which required Cornell to "defend, indemnify and hold Xerox harmless from and against any and all third party claims that arise from or relate to the Software and their respective use of the Software." Cornell attempted to have this clause modified. When Xerox remained adamant, Cartmill drafted a Software Sublicense Agreement that would require the AEIs to extend the indemnity and limitation of liability to Cornell University. As of this writing, the four institutions have been notified of these stipulations, and their legal advisers are reviewing copies of the agreements. It remains to be seen whether any or all of these institutions will agree to these license stipulations, but Cornell will not sign the agreement with Xerox unless they do so.
Footnotes

1. ODA, which became an ISO standard in 1988, was developed to represent and allow the interchange of office documents. It contains facilities that allow both the structure and content of complex multimedia documents to be represented. Although ODA is an open standard, specifications for the RDO architecture are proprietary.
