Case Study for Image File Format


1. Collection and Analysis of Source and Target File Format Related Information

Investigation Test Bed

To assess the risks associated with file format migration for digital
image collections, the project team selected one of Cornell University
Library’s digital image collections as a test bed. The Ezra Cornell
Papers consist of correspondence, financial and legal records, court
proceedings, and other documents pertaining principally to the Cornell
family, the telegraph industry, and the founding of Cornell University.
The collection is composed of 30,000 images stored on small computer
system interface (SCSI) disks. The pages were scanned at 600 dpi as 1-bit
TIFF 5.0 ITU Group 4 images. Tag(ged) Image File Format (TIFF) is
one of the most popular raster image file formats and is often the
format of choice for master image files. It is platform-independent
and supports 1- to 24-bit imaging using a variety of compression
methods.

The Ezra Cornell materials were scanned in-house using a Xerox scanning
system. This system organizes and stores the structuring information
(e.g., page number, folder number) in a format called Raster Document
Object (RDO), which is Xerox’s adoption of the International Office
Document Architecture (ODA) and Interchange Format.1

Goals of the File Format Migration Investigation

The goals of the file format migration investigation for image files
were to:

  • identify the TIFF file format attributes at risk during migration,
  • assess the need to move these TIFF 5.0 image files to the current
    version (6.0),
  • evaluate the risks involved in converting TIFF 5.0 files to TIFF
    6.0 files,
  • investigate the status of the upcoming revision to TIFF (7.0),
  • assess the risks involved in skipping a generation (TIFF 6.0)
    and waiting for the release of TIFF 7.0, and
  • assess risks and data loss associated with converting from RDO
    format to the open Cornell Digital Library (CDL) format.

Collection and Analysis of Source and Target File Format Related
Information

To identify digital image format attributes at risk, the project
staff collected and analyzed information on different versions of
TIFF file format. The research process included the following:

  • Conducting a literature search on digital archiving issues pertaining
    to digital image collections, with a specific focus on migration
    and the effects of file format choice in the migration chain.
  • Investigating new digital preservation research and initiatives,
    such as ISO’s Open Archival Information System (OAIS) (International
    Organization for Standardization 1998), WGBH’s Universal Preservation
    Format (UPF) (Shepard and MacCarn 1999), and the Digital Rosetta
    Stone Model (Heminger and Robertson 1998), among others.
  • Conducting a literature and projects survey to determine the
    extent of work performed on developing risk analysis based on image
    files.
  • Reviewing risk-assessment tools developed for various purposes,
    focusing on the form and functionality of these tools and how they
    can be adapted for the purposes of this project.
  • Exploring the dependencies that extend beyond basic image file
    format attributes, such as internal and external relationships
    between images and their accompanying metadata files (viewing images
    as “digital objects” and examining their metadata, associated
    scripts, programs, etc.).
  • Identifying the attributes of digital images that are at risk
    during format migration, including the effects of migration on
    metadata, and various scripts and programs that support retrieval
    and management of the collection.
  • Investigating the existing and emerging bitmap image file formats
    with a focus on their longevity and other archival attributes.
  • Exploring vulnerabilities associated with file format migration
    and identifying risks associated with “migrating” or “not
    migrating” these files, with a focus on TIFF files.
  • Analyzing the factors involved in decision making in migration
    projects, such as reformatting a collection of images from TIFF
    4.0 to TIFF 5.0 format.
  • Examining and comparing the TIFF file format specifications for
    Versions 4.0, 5.0, and 6.0.
  • Exploring the future of TIFF as a file format, with a focus on
    the characteristics of the TIFF 7.0 format under development.
  • Investigating the issues introduced by storing structuring metadata
    in Xerox RDO format.
  • Identifying the risks involved in converting RDO files to the
    CDL format (http://www2.hunter.com/docs/rfc/rfc1691.html).

An outcome of this research process is summarized in Table 1, which
categorizes the risks associated with file-format-based migration.

Risk categories, each followed by examples:

Content fixity (bit configuration, including bit stream, form, and
structure)

  • Bits/bit streams are corrupted by software bugs or mishandling
    of storage media, mechanical failure of devices, etc.
  • File format is accompanied by new compression that alters
    the bit configuration.
  • File header information does not migrate or is partially or
    incorrectly migrated.
  • Image quality (e.g., resolution, dynamic range, color spaces)
    is affected by alterations to the bit configuration.
  • New file format specifications change byte order.

Security

  • Format migration affects watermark, digital stamp, or other
    cryptographic techniques for “fixity.”

Context and integrity (the relationship and interaction with other
related files or other elements of the digital environment, including
hardware/software dependencies)

  • Because of different hardware and software dependencies, reading
    and processing the new file format require a new configuration.
  • Linkages to other files (e.g., metadata files, scripts, derivatives
    such as marked-up or text versions or on-the-fly conversion
    programs) are altered during migration.
  • New file format reduces the file size (because of file format
    organization or new compression) and causes denser storage
    and potential directory-structuring problems if one tries to
    consolidate files to use extra storage space.
  • Media become more dense, affecting labels and file structuring.
    (This might also be caused by file organization protocols of
    the new storage medium or operating system.)

References (the ability to locate images definitively and reliably
over time among other digital objects)

  • File extensions change because of file format upgrade and
    its effect on URLs.
  • Migration activity is not well documented, causing provenance
    information to be incomplete or inaccurate (a potential problem
    for future migration activities).

Cost

  • Long-term costs associated with migration are unpredictable
    because each migration cycle may involve different procedures,
    depending on the nature of the migration (routine migration
    vs. paradigm shift).
  • The value of the collection may be insufficiently determined,
    making it impossible to set priorities for migration.
  • Costs may be unscalable unless there is a standard architecture
    (e.g., centralized storage, metadata standards, file format/compression
    standards) that encompasses the image collections so that the
    same migration strategy can be easily implemented for other
    similar collections.

Staffing

  • Staff turnover and lack of continuity in migration decisions
    can hurt long-term planning, especially if insufficient preservation
    metadata is captured and the migration path is not well documented.
  • Decisions must be made whether to hire full-time, permanent
    staff or use temporary workers for rescue operations.
  • Staff may have insufficient technical expertise.
  • The unpredictability of migration cycles makes it difficult
    to plan for staffing requirements (e.g., skills, time, funding).

Functionality

  • Features introduced by the new file format may affect derivative
    creation, such as printing.
  • If the master copy is also used for access, changes may cause
    decreased or increased functionality and require interface
    modifications (e.g., static vs. multiresolution image, inability
    of the Web to support the new format).
  • Unique features that are not supported in other file formats
    may be lost (e.g., the progressive display functionality when
    Graphics Interchange Format [GIF] files are migrated to another
    format).
  • The artifactual value (original use context) may be lost because
    of changes introduced during migration; as a result, the “experience”
    may not be preserved.

Legal

  • Copyright regulations may limit the use of new derivatives
    that can be created from the new format (e.g., the institution
    is allowed to provide images only at a certain resolution so
    as not to compete with the original).

Table 1. Risks associated with file-format-based migration for
image collections

Conclusions of the Source and Target File Format Analysis

Because most of the specifications are publicly available on the
Adobe FTP site, the project staff was able to gather a substantial
amount of information about the different versions of TIFF. TIFF
was developed by Aldus and Microsoft, and the specification was owned
by Aldus, which later merged with Adobe Systems, Incorporated.
Consequently, Adobe now holds the copyright for the TIFF specifications.
TIFF is a highly flexible and platform-independent file format. It
is supported by numerous image-processing applications. A great strength
of the TIFF file format is its file header option, which enables
recording within the file itself of a wide variety of metadata (descriptive,
administrative, and structural). The set of fields or “tags” in
TIFF is extensive, making it the format of choice for most archival
reformatting. However, a large number of TIFF fields are not defined
by the standard. Therefore, while TIFF offers the advantage of being
open and usable, there is the danger that different institutions
will define these fields in different ways, leading to problems of
compatibility. Another source of confusion is TIFF's flexibility in
byte order: the format permits data to be stored in either
most-significant-byte-first (“Motorola”) or least-significant-byte-first
(“Intel”) order, with a header field indicating which order is used.
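
The byte-order indicator described above can be inspected directly. The following is a minimal sketch in Python, not a tool used in the project; the function name read_tiff_header is an assumption made for illustration. It relies only on the published TIFF header layout: two bytes holding "II" (Intel) or "MM" (Motorola), the number 42, and the offset of the first image file directory.

```python
import struct


def read_tiff_header(data: bytes):
    """Return (byte_order, first_ifd_offset) from the first 8 bytes of a TIFF file.

    read_tiff_header is a hypothetical helper name used for illustration.
    """
    if len(data) < 8:
        raise ValueError("a TIFF header is 8 bytes long")

    marker = data[:2]
    if marker == b"II":        # "Intel": least significant byte first
        prefix, order = "<", "little"
    elif marker == b"MM":      # "Motorola": most significant byte first
        prefix, order = ">", "big"
    else:
        raise ValueError("unknown byte-order marker; not a TIFF file")

    # The remaining six bytes are a 16-bit magic number (42) and a
    # 32-bit offset to the first image file directory, both stored
    # in the byte order announced by the marker.
    magic, ifd_offset = struct.unpack(prefix + "HI", data[2:8])
    if magic != 42:
        raise ValueError("magic number is not 42; not a TIFF file")
    return order, ifd_offset
```

For example, read_tiff_header(b"II*\x00\x08\x00\x00\x00") returns ("little", 8), and read_tiff_header(b"MM\x00*\x00\x00\x00\x08") returns ("big", 8). Software that ignores the marker and assumes one order will misread every multi-byte field in files written in the other.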

Tracking the TIFF 7.0 development turned out to be a challenging
task. The project team’s attempts to contact TIFF 7.0 developers,
Adobe, and even TIFF listserv subscribers were fruitless. The TIFF
7.0 development group seems to be determined not to release any information
regarding their work. Therefore, the project team was unable to make
any comparisons between TIFF 7.0 and the earlier versions. After
conducting an extensive evaluation and comparison of TIFF 5.0 and
TIFF 6.0 specifications, the team ran several tests to compare the
quality and utility of a subset of TIFF 5.0 images before and after
conversion to TIFF 6.0. This exploration revealed no major differences
between the versions. The project team concluded that there were
no risks involved at this point in leaving the testbed images in
TIFF 5.0 format. After reaching this conclusion, the team shifted
its focus for the risk-assessment study for image files to an examination
of storing structural metadata in the proprietary Xerox RDO format.
The team will continue to monitor the development of TIFF 7.0.

Raster Document Object Files

An RDO file contains information about the structure of an image
document as well as a file location pointer for each page image in
that document. A single TIFF file represents each page in the document.
The TIFF files each contain the digital data from the scanned page
and a header that describes the characteristics of the image file.
Because the Xerox Documents on Demand (XDOD) system is proprietary,
the structure of image documents can be displayed only by using the
appropriate Xerox software.
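
The relationship described above, a structure file that points to one TIFF file per page, can be modeled as follows. This is only an illustrative sketch in Python: the RDO specification is proprietary, so the class names, field names, and the dangling_pointers check are all assumptions, not the actual RDO layout.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical fields; the source mentions page and folder numbers
    # and a file location pointer, but not how RDO encodes them.
    page_number: int
    folder_number: int
    tiff_path: str  # file location pointer to the page image


@dataclass
class ImageDocument:
    title: str
    pages: list = field(default_factory=list)


def dangling_pointers(doc, existing_files):
    """Return the page-image pointers that do not resolve to a known file."""
    return [p.tiff_path for p in doc.pages if p.tiff_path not in existing_files]
```

A check of this kind illustrates why the pointers are a migration risk: if page images are renamed or moved while the structure file is left untouched, the document can no longer be assembled even though every image is intact.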

2. Selection and Evaluation of Conversion Software

Since a decision was made to maintain the files in TIFF 5.0 format,
evaluation of the TIFF conversion software was unnecessary. There
are several conversion programs on the market for converting TIFF
files to various TIFF versions and other file formats (e.g., TIFF
to GIF, TIFF to PNG). TIFF 5.0 to TIFF 6.0 conversion could be interpreted
as an update rather than as a migration process.

In 1994, Cornell undertook a project to convert the proprietary
RDO files to an open CDL format. The specification for the CDL,
which was released in August 1994 as Request for Comments
(RFC) 1691, defines an architecture for the storage and retrieval of
Cornell University Library’s image collection. Similar to RDO files,
the CDL document structure provides direct access to the components
of image collections (e.g., pages, sections, and chapters).

While the project team’s main interest was exploring the export
of files created on XDOD 3.0, its immediate concern was with the
older RDOs, especially in light of the Y2K compliance issues (i.e.,
concern that the XDODs would no longer work unless an expensive upgrade
were implemented).

The conversion from XDOD RDO to CDL format involved two steps. First,
Cornell used a Xerox-supplied tool (XDOD Export Tool) to convert the
RDO files into a series of ASCII metadata files. This tool is old and
can run only in Windows 3.1, and its dissemination is authorized “only
pursuant to a valid written license from Xerox.” Second, a Perl script
written by the Cornell University Library information technology staff
converted the ASCII metadata files to the CDL format. These
CDL-formatted structural metadata files are used for navigating
through a document (http://moa.cit.cornell.edu/MOA/EZRA.html).

RDO-to-CDL conversion cannot be achieved through a single software
tool since Xerox has not released any RDO specifications.

3. Development of Tools for Assessing the Source-To-Target Format Transfer

No specific software tool was developed to analyze the effects of
migration from RDO to CDL format, because all files created using
the XDOD scanning system possess identical information fields.

4. Comparison and Analysis after Conversion to Source File Format

The comparison was done manually: the team checked the list of
structural metadata elements captured in the RDO files during scanning
against the CDL structuring requirements. All the structural elements
mapped to the CDL structure, and there was no loss. Even if there had
been a loss, the project team judged that leaving the structuring
information in an unsupported proprietary format would have been far
riskier.
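
The team made this comparison by hand, but the same check can be expressed as a short routine. The sketch below is in Python and is purely illustrative: the function name unmapped_elements and the element and field names in the usage note are assumptions, not the actual RDO or CDL vocabulary.

```python
def unmapped_elements(source_elements, mapping, target_fields):
    """Return source elements that would be lost in conversion.

    An element is reported when it has no entry in the mapping, or when
    it maps to a field that the target structure does not define.
    All names are hypothetical and chosen for illustration.
    """
    lost = []
    for element in source_elements:
        target = mapping.get(element)
        if target is None or target not in target_fields:
            lost.append(element)
    return lost
```

For instance, with source elements ["page_number", "folder_number"], a mapping of {"page_number": "page", "folder_number": "section"}, and target fields {"page", "section"}, the function returns an empty list, which corresponds to the no-loss result the team reported.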

5. Releasing the Export Tool to Other Institutions

As part of this project, Cornell investigated the possibility of
further developing the Export Tool and making it available to other
institutions that have legacy collections in the proprietary Xerox
RDO format. This investigation was spurred by two concerns. First,
several institutions had requested access to the tool over the past
few years, but only Yale University had secured permission from Xerox
to use it. Second, in early summer 1999, Xerox informed Cornell that
the XDOD 2.x scanning workstations would not be Y2K-compliant without
an expensive upgrade. Because Cornell had begun to phase out use
of the XDOD systems and had converted all RDO files to the CDL format,
our concerns over the millennium focused on our sister institutions’
collections.

We initially considered developing the Export Tool into more generic
software for external use, but quickly concluded that this would
be both expensive and time-consuming. Cornell did not receive any
specifications from Xerox for the proprietary tool, and the software
developer at Xerox indicated that he doubted that the company still
had the tools and specifications to make the system work. We decided
to focus on securing permission to release the current version of
the Export Tool. A two-year effort to obtain a blanket permission
from Xerox to make the tool broadly accessible had stalled, so we
turned to documenting the extent of the problem, concluding that
Xerox might be more amenable to a very limited release.

In late April 1999, Cornell posted the following announcement on
11 listservs.

Export Tool to Convert Xerox RDO Files to Open Digital Library
Format

Has your institution created digital image files using the proprietary
Xerox Documents on Demand software that generates Raster Document
Objects (RDOs) to store structural metadata? Cornell University is
seeking feedback from these institutions to determine what demand
there would be for freeware to convert those RDOs for use in other
metadata applications. Cornell has used the RDO2CDL export tool to
migrate RDOs to ASCII metadata files that recreate the logical and
physical structure format of the RDO (called CDL). If your institution
is interested in utilizing such an Export Tool, please send contact
information and a brief description of your needs to: Anne R. Kenney
(ark3@cornell.edu).

By early June, surprisingly few responses had been received. Universities
with files created on XDOD 2.5 or older versions included Harvard,
Penn State, the University of Tennessee-Knoxville, and Yale.
Those responding with files created using XDOD 3.01 or DigiPath included
the Hein Publishing Company, Illinois State Library, the National
Document Center (Athens), Indiana University, the University of Toronto,
and the National Oceanic and Atmospheric Administration (NOAA) Miami
Regional Library (which was considering using the technology).

Inquiries to Xerox about releasing the tool to this group resulted
in further clarification that the RDO Export Tool software would
work only as configured on XDOD Version 2.x systems. The format of
the RDO changed slightly from version 2 to version 3, and the Export
Tool would not convert the structural data on version 3 or higher
systems. The Hein Publishing Company had used the tool with version
3 files through a collaborative project with Cornell, but only page
labels, not structuring information, were exported. William Anderson,
the Xerox software engineer who created the tool, suggested that
it would be possible to get the structure information out of the
version 3 RDO files, but it would take a programmer with knowledge
of the Office Document Architecture (of which RDO is a variant),
fair knowledge of Unix tools, and a copy of the RDO Version 3 specification,
which Xerox seemed unwilling or unable to make available publicly.
Anderson suggested that, “If customers are looking to buy DigiPath
today, and they need that facility, they should ask for it.” Xerox
decided to grant access to this software only to XDOD 2.x customers
who were not migrating to DigiPath.

From June to early September, efforts continued to reach legal agreement
with Xerox over the release of the Export Tool software to XDOD 2.x
users. Cornell received a copy of a proposed Software License Agreement
on August 26, 1999. The agreement granted the institution a nonexclusive,
perpetual, royalty-free license to use the software and the right
to provide a sublicense only to those institutions that had reported
using the XDOD Version 2.x systems, collectively referred to as “Authorized
Educational Institutions” (AEI). Lee Cartmill, the chief financial
officer at Cornell University Library, expressed concern about the
indemnity clause in the agreement, which required Cornell to “defend,
indemnify and hold Xerox harmless from and against any and all third
party claims that arise from or relate to the Software and their
respective use of the Software.” Cornell attempted to have this
clause modified. When Xerox remained adamant, Cartmill drafted a
Software Sublicense Agreement that would require the AEIs to extend
the indemnity and limitation of liability to Cornell University.
As of this writing, the four institutions have been notified of these
stipulations, and their legal advisers are reviewing copies of the
agreements. It remains to be seen whether any or all of these institutions
will agree to these license stipulations, but Cornell will not sign
the agreement with Xerox unless they do so.

References

International Organization for Standardization. 1998. ISO Reference Model
for an Open Archival Information System (OAIS). Available from http://ssdoo.gsfc.nasa.gov/nost/isoas/us/overview.html.

Heminger, Alan R., and Steven B. Robertson. 1998. Digital Rosetta
Stone: A Conceptual Model for Maintaining Long-Term Access to Digital
Documents. Available from http://crack.inesc.pt/events/ercim/delos6/papers/rosetta.doc.

Shepard, Thom, and Dave MacCarn. 1999. UPF: Universal Preservation
Format. Available from http://info.wgbh.org/upf/.


Footnotes

1 ODA, which became an ISO standard in 1988, was developed to represent
and allow the interchange of office documents. It contains facilities
that allow both the structure and content of complex multimedia documents
to be represented. Although ODA is an open standard, specifications
for the RDO architecture are proprietary.