The steady growth of digital information as a component of major
research collections has significant implications for college and
research libraries. Many institutions, including Cornell University
Library (CUL), have been creating or collecting digital information
produced in a wide variety of standard and proprietary formats, including
ASCII, common image formats, word processing, spreadsheet, and database
documents. Each of these formats continues to evolve, becoming more
complex as revised software versions add new features or functionality.
It is not uncommon for software enhancements to "orphan," or
leave unreadable, files generated by earlier versions. The threat
to aging digital information has surpassed the danger of unstable
media or obsolete hardware. The most pressing problems confronting
managers of digital collections are data format and software obsolescence.
There is a tacit assumption that digital libraries will preserve
the electronic information they create or the information that is
entrusted to their care. To preserve this information, institutions
must manage collections in a consistent and decisive manner. It is
important to decide what should be preserved, in what priority, and
with what techniques. Unfortunately, there is little guidance in
this area. Leading organizations such as the National Archives and
Records Administration have been cautious in adopting standards for
document formats other than ASCII; specialized reports prepared by
national committees have focused either on broad recommendations (Task
Force on Archiving of Digital Information 1996) or on organizational
and legal issues (Uhlir 1997). On the basis of its experience in
managing electronic collections, the CUL chose to develop a method
of "risk management" to replace "heroic rescue" as
a means of preserving digital information. The concept of an information
life cycle is emerging as a major theme in digital preservation,
and as a model it provides some guidance on where risk-management
efforts should be directed. In the abstract, a digital life cycle
plans for the creation and stages of use of information and, ultimately,
for whether the file will remain in a terminal, unchanging state
or be transformed into another format for reuse. The choice of how
or when to assess risk in the digital life cycle depends on circumstances,
the state of the digital information, and the general preservation
strategy adopted.
Currently, there are two radically different strategies for managing
the later period of a digital life cycle: migration and emulation. Preserving
Digital Information defines migration broadly, as "the periodic
transfer of digital materials from one hardware/software configuration
to another, or from one generation of computer technology to a subsequent
generation" (Task Force on Archiving of Digital Information
1996). A more specific definition would indicate that migration changes
the structure of the original data file. With the exception of files
that are simple data streams, most files contain two basic components:
structural elements and data elements. A file format represents the
arrangement of the structural and data elements in a unique and specific
manner. In this context, migration is the process of rearranging
the original sequence of structural and data elements (the source
format) to conform to another configuration (the target format).
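One way to picture this definition is to treat each format as an inventory of structural elements and to ask which source elements have no counterpart in the target. The sketch below is illustrative only; the element names and both inventories are hypothetical and are not taken from any format specification.

```python
# Hypothetical element inventories; real formats define many more elements.
SOURCE_ELEMENTS = {"INTEGER", "NUMBER", "LABEL", "FORMULA"}
TARGET_ELEMENTS = {"INTEGER", "NUMBER", "LABEL"}


def unmatched_elements(source, target):
    """Return source-format elements that have no counterpart in the target format."""
    return sorted(set(source) - set(target))


# Elements listed here have nowhere to go during migration.
at_risk = unmatched_elements(SOURCE_ELEMENTS, TARGET_ELEMENTS)
print(at_risk)
```

Any element reported by such a comparison marks a point where the migration would have to drop or approximate information.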
In practice, migration is prone to generating obvious and subtle
errors. An obvious error occurs when the set of structural elements
in the source format does not fully match the structural elements
of the target format. For instance, in a spreadsheet file a structural
element defines a cell containing a numeric value. If a comparable
element is missing from the format specifications of the target format,
data will be lost. A subtle error might occur if the data themselves
do not convert properly. Floating point numbers (numbers with fractions)
are found in many numeric files. Some formats might allow a floating-point
number of 16 digits (e.g., 26.0012670099819070) while others might
allow only 8 digits (e.g., 26.00126701). For some applications, such
as vector calculations in geographic information system (GIS) programs,
small but significant errors could creep into calculations. In other
situations, migration might preserve the content of the file but
lose the internal relationships or context of the information. For
example, a spreadsheet file migrated to ASCII may save the current
values of all the cells but lose any formulas embedded within the
cells that are used to create those values.
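The subtle error described above can be reproduced in a few lines. This is a minimal sketch that assumes the target format keeps only eight decimal places; the variable names are ours, and the rounding rule of any real conversion program may differ.

```python
# The 16-digit example value from the text.
value_16 = 26.0012670099819070

# Assumption: the target format stores only 8 decimal places,
# so the value is rounded when it is written out.
value_8 = float(f"{value_16:.8f}")

# The difference is tiny, but it is not zero.
error = abs(value_16 - value_8)
print(value_8)
print(error > 0)
```

A single value shifts only slightly, but a calculation that combines many such values, as in GIS vector work, can compound the differences.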
An alternative preservation approach, emulation, is concerned with
preserving the original software environment. Emulators are programs
that mimic computer hardware. Strategies adopting this approach store
copies of the initial software and descriptions of how to emulate
the initial hardware to run the software along with the data files
(Rothenberg 1999; 1995). Emulation has been practiced for many years,
and there are several commercial and public domain emulators for
a variety of hardware/operating system configurations. A good example
is MS-DOS emulation in the Windows 95/98/NT operating systems.
Emulation as a strategy has some limitations. Emulation assumes
future access to the following multiple data objects in a cluster
or package:
- the data file to be preserved and reused,
- the application software that generated the data file,
- the operating system in which the application functioned, and
- the hardware environment emulated in software using detailed
information about the attributes of that hardware.
This complex environment would most likely fail if one or more components
were missing. Moreover, emulation is a patchwork effort, with contributions
from commercial vendors and private individuals. There is no system
for coordinating or maintaining these emulators, and maintaining
obsolete emulators may prove to be as problematic as migrating obsolete
file formats.
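The dependency among the four objects listed above can be sketched as a simple completeness check. The class and field names below are hypothetical; the sketch only illustrates that a package with any missing component cannot be expected to run.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EmulationPackage:
    """The four objects an emulation strategy must keep together (names are illustrative)."""
    data_file: Optional[str] = None
    application_software: Optional[str] = None
    operating_system: Optional[str] = None
    hardware_description: Optional[str] = None

    def missing_components(self):
        """List the components that have not been preserved."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


package = EmulationPackage(data_file="budget.wk1",
                           application_software="Lotus 1-2-3")
print(package.missing_components())
```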
With two complex and very different strategies, it would be difficult
to examine both options simultaneously. Our decision to select migration
was partially based on the resources at our disposal. With locally
developed and commercial off-the-shelf data migration software, migration
could be tested, measured, and evaluated on the basis of certain
common criteria from which we could design a suite of risk-assessment
tools. File migration was also appealing because it could encompass
the following different preservation scenarios:
- the routine refreshing of digital files;
- varying changes in digital formats when files are converted from
one application to another;
- radical changes in digital formats, such as the conversion of
numeric files from proprietary formats to ASCII; and
- the migration of derivative access copy systems; for instance,
system software might convert Tagged Image File Format (TIFF),
a master storage format for scanned images, into a Portable Document
Format (PDF) derivative designed for easy use by the reader.
For the reasons described above, Cornell concentrated exclusively
on developing aids to assess the safety of a migration strategy for
its digital information.
Literature Search
We reviewed the literature for information concerning digital preservation,
digital migration, risk assessment, and file formats.
Digital Preservation and Migration
An extensive survey of the library literature identified many papers
that provided in-depth analyses of issues associated with different
aspects of digital preservation. The Task Force on Archiving of Digital
Information (1996) documents these issues most effectively, and they
will not be repeated here. Most of the remaining literature discussed
digital reformatting or file copying from one medium to another.
We identified four papers that directly related to our project. The
first is the work of John Bennett (1997). His study evaluates preservation
requirements by genre, format, media, and platform and uses a rudimentary
risk-assessment scoring system. Displayed in a two-dimensional matrix,
these requirements effectively communicate the complexity and interdependence
of digital materials. Haynes et al. (1997) reported on an in-depth
investigation into the responsibilities associated with maintaining
digital archives. This paper summarizes numerous interviews with
focus groups and individuals and effectively communicates the range
of opinions and expectations associated with different stakeholders.
The third work is the Reference Model for an Open Archival Information
System (OAIS) (CCSDS 1999). The report is remarkable for its breadth
and depth. In the authors' words, the model they describe "provides
a framework for the understanding and increased awareness of archival
concepts needed for long-term digital information preservation and
access, and for describing and comparing architectures and operations
of existing and future archives." The last item is a report
written by Ann Green, JoAnn Dionne, and Martin Dennis (1999). Their
study describes a project at Yale to convert data from column binary
to spread ASCII format. The nine-step data migration process is well
documented, and the findings and recommendations clarify important
preservation issues.
Risk Assessment
Our search of the library literature for information concerning
risk assessment was not fruitful. We then examined the literature
of computer science. In the last 50 years, computer science has
witnessed numerous cycles of software development migration, and
the literature contains many studies, case reports, and models. Several
publications were very useful in developing our understanding of
risk assessment of digital information. Rapid Development (McConnell
1996) is a monograph on the general problems associated with software
development. In many respects, software development exhibits several
of the same problems associated with basic digital preservation.
Chapter 5 of McConnell's book, which concerns risk management, provides
an excellent theoretical and practical introduction to controlling
risk in software development. It is a good primer for risk studies
in digital preservation. Van Scoy (1992) examines a similar topic
in a study funded by the U.S. Department of Defense. His study identifies
risk-management participants and their activities. A later study
(Sisti and Joseph 1994), also for the Department of Defense, expands
on the work of Van Scoy and offers a highly detailed software risk
evaluation method. All three studies pay particular attention to
the organizational issues in risk management.
While researching risk assessments, we were struck by the vast differences
in basic definitions used by different disciplines. (For example,
see Reinert, Bartell, and Biddinger [1994], Warren-Hicks and Moore
[1995], McNamee [1996], Wilson and Crouch [1987], Starr [1969], and
Lagadec [1982].) Numerous professions measure risk, and each assigns
risks a unique vocabulary and context. The degree and type of risk
associated with any data archive may be understood differently by
administrators, operational staff members, and data users, depending
upon their individual training and experience. The measurement of
risk was equally problematic. One paper correlated risk level with
the nonlinear relative probability of risk occurring (Kansala 1997).
Another publication introduced an algebraic formula (McConnell 1996).
In a third instance, a research group felt that cases where one could
accurately assess the probability of a future event were rare because
the information technology environment for software changes so rapidly. They
preferred simple estimates, such as high, medium, and low,
which they believed facilitated decision making (Williams, Walker,
and Dorofee 1997). Risk-measurement scales, like risk definitions,
are as distinctive as their developers.
File Format
File format information was located from format specification files
available on the Internet and from descriptions of file formats appearing
in several monographs. Specifications for TIFF and .wk1 files were
located at the following Internet sites:
Murray and vanRyper (1996) describe TIFF with numerous illustrations
and a detailed narrative about TIFF structure. Brown and Shepherd
(1995) provide an effective description of the low-level data stream
organization of the TIFF format. Lotus Development Corporation (1986)
has prepared the definitive work for Lotus 1-2-3 .wk1 files. More
than just a reference about file structure, the work explains why
Lotus moved away from simple ASCII representation of spreadsheet
data and documents its early attempts to use a general file format
for worksheet, database, word processing, and graphics activities.
The Lotus book is the best source for information about the .wk1
format. Related .wks format information, released into the public
domain in 1984 and found at File Transfer Protocol (FTP) sites, or
published by Walden (1986), should be used cautiously.
Risk Assessment as a Migration Analysis Method
In its present state, migration as a digital preservation strategy
can be characterized as an uncertain process generating uncertain
outcomes. One way to minimize the risk associated with such uncertainty
is to develop a risk-management scheme that deconstructs the migration
process into steps that can be described and quantified. A risk assessment
is simply a means of structuring the process of analyzing risk. If
the risk-assessment methodology is well specified, different individuals,
supplied with the same information about a digital file, should estimate
similar risk values.
We believe that three major categories of risk must be measured
when considering migration as a digital strategy:
- Risks associated with the general collection. These risks
include the presence or absence of institutional support, funding,
system hardware and software, and the staff to manage the archive.
These are essential components of a digital archive, which the
Task Force on Archiving of Digital Information (1996) describes
as "deep infrastructure." The collection, and the stakeholders
who use the collection, will be affected to some degree by a migration
of data. Legal and policy issues associated with digital information
will introduce additional risks.
- Risks associated with the data file format. These include
the internal structural elements of the file that are subject to
modification.
- Risks associated with a file format conversion process.
The conversion software may or may not produce the intended result;
conversion errors may be gross or subtle.
Analysis of these three categories can be illuminating. Table 1
presents information from the image file case study that illustrates
the risks specific to image files in migration. The findings are
based on research, discussions with digital preservation specialists,
and our own experience.
| RISK CATEGORY | EXAMPLES |
|---|---|
| Content fixity (bit configuration, including bit stream, form, and structure) | Bits/bit streams are corrupted by software bugs or mishandling of storage media, mechanical failure of devices, etc. |
| | File format is accompanied by new compression that alters the bit configuration. |
| | File header information does not migrate or is partially or incorrectly migrated. |
| | Image quality (e.g., resolution, dynamic range, color spaces) is affected by alterations to the bit configuration. |
| | New file format specifications change byte order. |
| Security | Format migration affects watermark, digital stamp, or other cryptographic techniques for "fixity." |
| Context and integrity (the relationship and interaction with other related files or other elements of the digital environment, including hardware/software dependencies) | Because of different hardware and software dependencies, reading and processing the new file format require a new configuration. |
| | Linkages to other files (e.g., metadata files, scripts, derivatives such as marked-up or text versions or on-the-fly conversion programs) are altered during migration. |
| | New file format reduces the file size (because of file format organization or new compression) and causes denser storage and potential directory-structuring problems if one tries to consolidate files to use extra storage space. |
| | Media become more dense, affecting labels and file structuring. (This might also be caused by file organization protocols of the new storage medium or operating system.) |
| References (the ability to locate images definitively and reliably over time among other digital objects) | File extensions change because of file format upgrade and its effect on URLs. |
| | Migration activity is not well documented, causing provenance information to be incomplete or inaccurate (a potential problem for future migration activities). |
| Cost | Long-term costs associated with migration are unpredictable because each migration cycle may involve different procedures, depending on the nature of the migration (routine migration vs. paradigm shift). |
| | The value of the collection may be insufficiently determined, making it impossible to set priorities for migration. |
| | Costs may be unscalable unless there is a standard architecture (e.g., centralized storage, metadata standards, file format/compression standards) that encompasses the image collections so that the same migration strategy can be easily implemented for other similar collections. |
| Staffing | Staff turnover and lack of continuity in migration decisions can hurt long-term planning, especially if insufficient preservation metadata is captured and the migration path is not well documented. |
| | Decisions must be made whether to hire full-time, permanent staff or use temporary workers for rescue operations. |
| | Staff may have insufficient technical expertise. |
| | The unpredictability of migration cycles makes it difficult to plan for staffing requirements (e.g., skills, time, funding). |
| Functionality | Features introduced by the new file format may affect derivative creation, such as printing. |
| | If the master copy is also used for access, changes may cause decreased or increased functionality and require interface modifications (e.g., static vs. multiresolution image, inability of the Web to support the new format). |
| | Unique features that are not supported in other file formats may be lost (e.g., the progressive display functionality when Graphics Interchange Format [GIF] files are migrated to another format). |
| | The artifactual value (original use context) may be lost because of changes introduced during migration; as a result, the "experience" may not be preserved. |
| Legal | Copyright regulations may limit the use of new derivatives that can be created from the new format (e.g., the institution is allowed to provide images only at a certain resolution so as not to compete with the original). |
Table 1. Risks associated with file-format-based migration for
image collections
As each risk category was explored, we recognized that we needed
to develop different methods, or tools, to sample each situation
and to help quantify risk probability and impact. Over the course
of the project, we developed three assessment tools:
- A risk-assessment workbook for the general collection. The workbook
provides a general review of risks associated with migration at
the collection level.
- Reader software that examines specific files, or collections of
files, for high-risk format elements.
- A test file in the .wk1 format, containing known structural and data
elements, to test, or exercise, conversion software.
Individually, these three tools provide useful information. Together,
they offer a means to gauge the readiness of any archive to migrate
information successfully from one format to another.
Risk Assessment of General Collections
In an ideal situation, risk assessments would be performed by a
team of experts; each member would be a specialist in a specific
area and would have general knowledge of digital preservation. In
reality, access to expert advice is costly and not always timely.
In place of a human adviser, a workbook can provide a systematic
approach to assessing risks and problems. If the questions or exercises
are sufficiently developed, the workbook can help the user not only
identify potential risks but also measure risk in terms of impact.
When used as a common method of analysis, a workbook should identify
and describe problems in a concise, uniform, and easily understood
manner that could be shared by administrators and archivists in a
given setting.
For the risk-assessment workbook developed in this study, we prepared
two risk-assessment scales: one to measure the probability a hazard
would occur, and another to measure the impact of that occurrence.
These scales were prepared for a risk-assessment case study of a
numeric file collection, the test bed for much of our project. Admittedly,
the scales lack scientific precision, and at the end one does not
simply sum the results and decide to migrate on the basis of a single
number. On the other hand, assessment scales can more precisely convey
meaningful assessments of risk, and this can help set priorities
in preparing for a migration project (Beatty 1999).
The complete workbook is presented in Appendix A.
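The two-scale idea can be sketched as follows. This is not the workbook's actual scoring; the 1-to-5 scales, the thresholds, and the sample hazards are assumptions chosen for illustration, and the product of probability and impact is used only to order hazards, not to produce a single migrate-or-not number.

```python
from dataclasses import dataclass


@dataclass
class Hazard:
    description: str
    probability: int  # hypothetical scale: 1 (unlikely) to 5 (almost certain)
    impact: int       # hypothetical scale: 1 (negligible) to 5 (severe)


def rating(hazard):
    """Map the two scores to a coarse high/medium/low label (thresholds are assumptions)."""
    exposure = hazard.probability * hazard.impact
    if exposure >= 15:
        return "high"
    if exposure >= 6:
        return "medium"
    return "low"


def prioritize(hazards):
    """Order hazards so the most pressing ones are reviewed first."""
    return sorted(hazards,
                  key=lambda h: (h.probability * h.impact, h.impact),
                  reverse=True)


# Sample hazards, loosely modeled on the categories in Table 1.
hazards = [
    Hazard("Staff turnover during migration", 3, 3),
    Hazard("Formulas lost in conversion", 4, 5),
    Hazard("File extensions change, breaking URLs", 2, 2),
]
for h in prioritize(hazards):
    print(rating(h), h.description)
```

Keeping the two scores separate preserves the distinction between an unlikely but severe hazard and a likely but trivial one, which a single summed figure would hide.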
Risk Assessment of File Formats
As noted earlier, file migration is the process of altering structural
and data elements in one file format to conform to a new configuration
in another format. In our project, we label the original format the "source" format
and the new format the "target" format. Software programs
that convert source formats into target formats are grouped into
three general categories:
- Translation programs for a specific project written by a company,
by the owner of the information, or by a third-party vendor. Data
archives often write these programs at considerable cost. The CUL
experience with locally developed software is described in the
TIFF image file case study.
- A commercial translation program written for a specific purpose.
For example, some products extract data fields from numerous files
with different formats and create a new data product with a different
format. Programs such as DataJunction are written specifically
for this purpose.
- A general-purpose commercial translation program. Conversions
Plus by DataViz is a good example of this growing genre of software.
Each of these approaches to conversion has its benefits and liabilities.
Many conversion programs developed by archives can incorporate extensive
knowledge about the functions of the translation software, but require
lengthy development cycles and are expensive to prepare. Off-the-shelf
commercial programs provide little information about the translation
process but offer many features at a low cost.
A format risk assessment has to explore two distinct areas of risk:
the risk introduced by the conversion program and the magnitude of
recurring risk inherent in a large collection. In addition, the features
and usability of the conversion software should be considered as
well as the impact on the metadata associated with the files.
Assessing Risk in Conversion Software
Assessing risk inherent in conversion programs can be accomplished
by examining a file before and after migration. A test file can be
passed through the conversion software, migrating from source to
target format. If, following the format conversion, the fields and
field values of the original source file are properly reproduced
in the target file, the risks incurred in migration are significantly
reduced. On the other hand, if the fields or their values are not
properly converted, the risks of migration are significantly increased.
If the field tags and values in the test file are known, data changes
associated with file conversion can be independently verified.
In the numeric file case study, a test file for the Lotus 1-2-3
.wk1 format was created. With the use of public domain specifications
and reference manuals published with the original application software,
a large file was generated that exercised all the field tags and
field values. A simple conversion test might determine how well a
conversion program preserves the following known values when they are
compared with those generated by a formula:
Fig. 1. Sample test values for assessing conversion
accuracy (Lotus 1-2-3 file)
In the example shown above, the "average" function (@AVG)
operates on a range of cells (H293..DC293). The precomputed correct
result (495) is compared with the computed result derived from the
expression, and any differences between the two are recorded. In
a similar manner, other complex formulas and functions can be compared
before and after conversion.
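The manual comparison described above could be automated along the following lines. This is a minimal sketch, not the procedure used in the study: it assumes the cell values have already been extracted from the source and target files into dictionaries, and the cell names and tolerance parameter are hypothetical.

```python
import math


def compare_values(expected, converted, tolerance=0.0):
    """Return {cell: (expected, converted)} for every cell whose value differs.

    A cell missing from the converted file is reported with None.
    """
    differences = {}
    for cell, want in expected.items():
        got = converted.get(cell)
        if got is None or not math.isclose(want, got,
                                           rel_tol=0.0, abs_tol=tolerance):
            differences[cell] = (want, got)
    return differences


# Hypothetical extracted values: A1 holds a precomputed average,
# A2 a floating-point number that the converter shortened.
expected = {"A1": 495.0, "A2": 26.001267009981907}
converted = {"A1": 495.0, "A2": 26.00126701}

print(compare_values(expected, converted))                  # exact comparison
print(compare_values(expected, converted, tolerance=1e-6))  # lenient comparison
```

An exact comparison flags the shortened floating-point value, while a lenient tolerance lets it pass; choosing the tolerance is itself a judgment about acceptable risk.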
It took us about three hours to compare our test files manually
before and after conversion. Although this method is somewhat laborious,
it is quite accurate for the formats we tested. Conversion of different
structural elements and data elements is not always a matter of "hit
or miss." We were able to identify conversions that were almost,
but not quite perfect. Testing these problematic conversions, we
were able to develop a rough scale of conversion risk (1=minor risk,
5=high risk). Documentation for the test file can be found in Appendix B.
Assessing Recurring Risk Inherent in a Large Heterogeneous File Collection
Manual identification of risk associated with file structures is
possible for a small number of files. For large digital collections
that have thousands or millions of files that may contain one or
more of these at-risk elements, manual methods are expensive and
inefficient. One way to measure the collection for files that contain
at-risk elements would be to prepare a file reader programmed to
examine each file for these items. If one or more risk items are
found, the program could be written to produce a report that identifies
the file, its location in the collection, and the type and number
of at-risk elements associated with that file. Good design would
make the program flexible enough to read most, if not all, files
with defined structural elements.
A program was developed for the project that can read structured
ASCII and binary files. Named Examiner, the program reads a file
and detects the presence and frequency of specific file format elements.
It does not read or evaluate the data value, although this feature
could be implemented. The following example shows a few lines from
a report generated during a scan of .wk1 files in the USDA Economics
and Statistics System, hosted at Mann Library.
/usda/ftp/usda/data-sets/crops/94018/budget.wk1: Risk Level 5
Tag 14: NUMBER: Floating point number Qty: 584
-----
/usda/ftp/usda/data-sets/crops/94018/charactr.wk1: Risk Level 5
There are no tags in this file at this level
-----
/usda/ftp/usda/data-sets/crops/94018/conf_int.wk1: Risk Level 5
Tag 14: NUMBER: Floating point number Qty: 59
In the output just listed, Examiner has examined a series of .wk1
files in a single subdirectory with the absolute path /usda/ftp/usda/data-sets/crops/94018.
In two of the three files, it located a structural element, or Tag.
The program writes to a report file the structural element number
(14), the name of the structural element given in the format specifications
(NUMBER:), a short description of the structural element (Floating-point
number), and the total count of floating-point numbers discovered
in that specific file (Qty:). The program also describes the risk
level for the structural element. The risk level was determined during
the initial source-to-target analysis described previously. The
program can be set to report at-risk tags only if the risk value
equals or exceeds a certain threshold.
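The kind of scan described above can be sketched in a few lines. This is not Examiner's code (Examiner is written in Java); it is an illustrative sketch that assumes the commonly documented .wk1 record layout of a 2-byte tag, a 2-byte body length, and the body, all little-endian. The risk table, function names, and sample bytes are hypothetical.

```python
import struct
from collections import Counter

# Hypothetical risk table: tag number -> (name, description, risk level 1-5).
RISK_TABLE = {14: ("NUMBER", "Floating point number", 5)}


def count_tags(data):
    """Count record tags in bytes laid out as [2-byte tag][2-byte length][body]."""
    counts = Counter()
    offset = 0
    while offset + 4 <= len(data):
        tag, length = struct.unpack_from("<HH", data, offset)
        counts[tag] += 1
        offset += 4 + length  # skip the body without interpreting its data
    return counts


def report(path, data, threshold=5):
    """Build report lines for tags whose risk level meets the threshold."""
    lines = [f"{path}: Risk Level {threshold}"]
    hits = []
    for tag, qty in sorted(count_tags(data).items()):
        if tag in RISK_TABLE:
            name, description, level = RISK_TABLE[tag]
            if level >= threshold:
                hits.append(f"Tag {tag}: {name}: {description} Qty: {qty}")
    return lines + (hits or ["There are no tags in this file at this level"])


# Synthetic sample: a start record, two NUMBER records, and an end record.
bof = struct.pack("<HHH", 0, 2, 0x0406)
number = struct.pack("<HHBHHd", 14, 13, 0xFF, 0, 0, 26.5)
eof = struct.pack("<HH", 1, 0)
sample = bof + number + number + eof

print("\n".join(report("sample.wk1", sample)))
```

Like the program described in the text, the sketch only reads the bytes and never writes to the file; a real scanner would open each file in binary mode and pass its contents to `report`.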
One strong feature of the Examiner program is that it is nondestructive.
It simply reads a file from beginning to end and declares what is
found. Also, Examiner can be set to read a single file, all the files
in a directory, or all the files on a drive. The program is reasonably
efficient and scans approximately 10,000 .wk1 files per hour. Finally,
Examiner is written in Java, a modern programming language designed
so that compiled programs run on different operating systems. The program
has been fully tested in the Unix and Windows 95/NT environments.
General documentation for Examiner is provided in Appendix C. The
source code and full documentation are available on the Web site
of the Council on Library and Information Resources.
Assessing Risk Associated with the File Conversion Process
Finally, there are risks associated with the features of different
conversion software. The project examined two commercial off-the-shelf
programs and quickly scanned the advertisements or published reviews
of six others. In any mix of conversion programs available, each
will provide some or all "core" functions as well as optional
features. General performance benchmarks, which can be tailored for
specific migration scenarios, provide some uniformity of measurement
and highlight obvious defects. For example, we examined DataJunction
as a general-purpose conversion program for spreadsheet and database
formats. Conversion of .wk1 formats was trouble-free, except for
one major flaw: DataJunction was difficult to program to work in
batch mode. We did not recognize this flaw until the evaluation was
nearly complete. Obviously, a project timetable could be seriously
jeopardized by such a limitation. Although not an intended product
of the project, we recorded software assessment questions that we
should have asked at the start of the project. From these, we developed
a short functionality assessment that is now available on the Web
site of the Council on Library and Information Resources.
Identification of Metadata-Related Risk
We frequently think of disk files as the sole object of migration
because, at first glance, the information they contain is what we
have to move from one format to another. The individual files in
a collection, however, are frequently useless without other information
describing how the files are to be used or how they relate to one
another. In other words, any group of files that constitute a cohesive
unit can be considered a digital object, and what makes the digital
object intelligible is metadata describing the contents and providing
structure for the group. When such digital objects exist, the metadata,
as well as the individual files containing the raw data, must be
successfully migrated.
Metadata at the digital-object level can take various forms. For
example, in the collection of TIFF images in one of our case studies,
a file in a proprietary format, Raster Document Object (RDO), contains
metadata that provides structure to the multiple TIFF files. The
RDO file relates the page image stored in each TIFF file to the others
that compose the document; in this case, the navigable and searchable
digital object represents a paper document containing pages and chapters
and other logical constructs. A second example, from our case study
of a collection of numeric files in the .wk1 format, shows another
way of structuring and describing digital objects. Each digital object
(a set of related binary data files) has three metadata components:
one that contains information about the structure of the object,
one that describes the content of the object, and one that creates
a link between the two. The structural metadata is contained in an
HTML file whose links point to the individual files that constitute
the digital object. The content metadata is in an English-language
ASCII file. Its purpose is to provide searchable text so that the
object can be located in a search across the larger collection of
objects. The third component is a record in a database that creates
a relationship between the content file and the structural file.
In a successful migration to another data format, the structural
metadata in the HTML file would have to be changed if the name or
location of the individual files in the digital object were changed.
The content description and the database record would not have to
be touched.
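The link update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name retarget_links, the sample file names, and the .csv target format are hypothetical and are not taken from the case study. The sketch rewrites the link targets in the structural HTML file from a table of old and new file names and leaves everything else untouched.

```python
import re

def retarget_links(html_text, renamed):
    """Return html_text with each href target replaced according to `renamed`.

    `renamed` maps an old file name or path to its new one. Targets that
    are not in the mapping are left as they are.
    """
    def swap(match):
        attr, quote, target = match.group(1), match.group(2), match.group(3)
        return attr + quote + renamed.get(target, target) + quote

    # Matches href="..." or href='...' regardless of letter case.
    return re.sub(r'(href=)(["\'])(.*?)\2', swap, html_text, flags=re.IGNORECASE)

# Hypothetical structural metadata for a two-file digital object.
structural = '<a href="data/table1.wk1">Table 1</a> <a href="readme.txt">Notes</a>'
renamed = {"data/table1.wk1": "data/table1.csv"}
updated = retarget_links(structural, renamed)
print(updated)
```

The content description and the database record are not inputs to this step, which matches the point made above: only the structural file depends on the names and locations of the individual files.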
Case Studies
The risk-assessment tools developed were tested on two digital collections
at the Cornell University Library: the Ezra Cornell Papers and the
USDA Economics and Statistics System. Each collection contains a
dominant file format: TIFF or .wk1. The assessments of these two
collections are presented in Appendixes D and E.
Findings and Recommendations
Migration Risk Can Be Quantified
Migration, or the conversion of data from one format to another,
has measurable risk. The amount of risk will vary, sometimes significantly,
given the context of the migration project. One form of risk depends
on the nature of the source and target formats. We have shown that
it is possible to compare formats in a number of ways and to identify
the level of risk for different format attributes. The format analysis
techniques and software may be technical, but the results can be
described in general terms. Since basic file structure concepts are
common to many file formats, experience with one format can be used
to understand other formats.
We draw a similar conclusion concerning organizational, hardware,
software, and metadata risks. Information delivery systems must sustain
a certain level of organization simply to function. Consistent components
of these systems, such as personnel, funding, and metadata, can be
evaluated, and rough but quantifiable measures of risk can be established
for them.
The greatest challenge is interpreting the risk, that is,
determining when a risk is acceptable. Risk-assessment tools cannot
replace experience and good judgment. The tools can be compared with
navigation aids used on the high seas. Following five centuries of
intensive effort to develop risk-reducing technologies, ships' helms
are still manned, and collisions between ships at sea still occur.
In this study, we provide examples to illustrate the evaluation
process. In practice, the risk-assessment tools are not fully developed.
We recommend the further refinement of these tools to provide results
that are more reliable. We must recognize, however, that this will
take some time, during which we will lose some data.
Conversion Software
This study is unable to recommend a cost-effective, off-the-shelf
commercial software program to implement a migration strategy. From
our analysis, we believe that migration software should perform the
following functions:
- Read the source file and analyze the differences between it and
the target format.
- Identify and report the degree of risk if a mismatch occurs.
- Accurately convert the source file(s) to target specifications.
- Work on single files and large collections.
- Provide a record of its conversions for inclusion in the migration
project documentation.
Neither of the two programs analyzed in this case study met all
these criteria, although our results suggest that commercial conversion
programs, with further development, have the potential to meet them.
Considering the cost of writing conversion software for a wide range
of file formats, we believe a commercially developed solution for
migration software will ultimately be cheaper and more flexible than
locally developed conversion software. We recommend further work
with vendors, such as DataJunction and DataViz, to educate them about
our needs and help them develop products that promote safer file
migration.
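The five functions listed above can be sketched as a small Python outline. It is not a description of any program examined in this study. The feature names, the three-step risk scale, and the convert callback are all hypothetical placeholders, and a real tool would derive the feature sets by parsing the files themselves.

```python
from dataclasses import dataclass

@dataclass
class MigrationRecord:
    source: str          # file that was examined
    lost_features: list  # features the target format cannot hold
    risk: str            # "low", "medium", or "high" (hypothetical scale)
    converted: bool      # whether the conversion was attempted

RISK_ORDER = ["low", "medium", "high"]

def assess(source_features, target_features):
    """Features used by the source that the target cannot represent."""
    return sorted(set(source_features) - set(target_features))

def risk_level(lost):
    # Hypothetical scale, not the one used in the study's workbook.
    if not lost:
        return "low"
    return "medium" if len(lost) <= 2 else "high"

def migrate(files, target_features, convert, max_risk="medium"):
    """Assess every file, convert those within max_risk, and log all of them."""
    log = []
    for name, features in files.items():
        lost = assess(features, target_features)
        risk = risk_level(lost)
        acceptable = RISK_ORDER.index(risk) <= RISK_ORDER.index(max_risk)
        if acceptable:
            convert(name)  # caller-supplied conversion routine
        log.append(MigrationRecord(name, lost, risk, acceptable))
    return log

# Hypothetical collection: feature sets would come from a format analysis.
files = {
    "a.wk1": {"formulas", "cell_formats"},
    "b.wk1": {"formulas", "macros", "graphs", "named_ranges"},
}
converted = []
log = migrate(files, {"formulas", "cell_formats"}, converted.append)
for record in log:
    print(record)
```

The same loop handles one file or a large collection, and the returned log is the record that would go into the migration project documentation.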
Access to Format Data
The most difficult aspect of this project was the acquisition of
complete and reliable file format specifications. Throughout the
project, format-specific information was difficult to acquire from
a single source. Ultimately, format information for this study was
acquired from the following four general sources:
- software developers
- public FTP archives
- monographs
- Internet discussion lists
Developers of software applications who use a specific proprietary
file format should be the best source for file format information.
At the start of our search for Lotus .wk1 format information, this
was not the case. Lotus, like other large software companies, treats
file format information as a business product to sell to software
developers. Lotus business products evolved, responding to revisions
in 1-2-3 as well as to changes in the DOS/Windows operating system.
With the introduction of Windows 3.1, developer interest in earlier
DOS specifications disappeared. Since the specifications for the
.wk1 format were integrated into the format specifications for later
releases (i.e., .wk3, .wk4), the specifications and documentation
for the earlier .wk1 format quietly disappeared. Lotus as a company
also evolved, and key members of the early development staff, often
the corporate memory in software companies, moved on to establish
their own companies. In the last months of this project, we were
able to contact an individual at Lotus who had been with the company
since the mid-1980s. This individual helped us acquire a copy of
Lotus File Formats for 1-2-3, Symphony, and Jazz. This work, authored
by Lotus, is the only surviving documentation from the company for
that period. Fortunately, it describes the .wk1 format in complete
detail.
Throughout the year, Lotus staff repeatedly referred us to their
FTP archive that contains 1-2-3 .wk1 format specifications. These
specifications were indirectly certified by Walden (1986), who describes
the specification in detail and provides a sample .wk1 file analyzed
byte by byte. Unfortunately, these specifications are incomplete
and describe the .wks file format, the format of 1-2-3 release 1A,
rather than the .wk1 format.
We were surprised that Walden made such an oversight, but Wotsit's
Format Web site (Oliver 1999) and the comp.apps.spreadsheets FAQ
(1999) repeat the error. It is clear that neither the professionals
nor the amateurs recognized the mistake.
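A confusion of this kind can be checked against the files themselves. The Python sketch below reads the first record of a worksheet file and reports which format its version code indicates. It rests on assumptions that should be confirmed against the Lotus documentation: that the file opens with a beginning-of-file record made of a two-byte opcode 0, a two-byte length 2, and a two-byte little-endian version code, and that the codes 0x0404 and 0x0406 identify .wks and .wk1 respectively. These are commonly reported values, not values taken from this study.

```python
import struct

# Version codes as commonly reported for the first record of a Lotus
# worksheet file. Treat them as assumptions, not as verified facts.
BOF_VERSIONS = {
    0x0404: ".wks (1-2-3 release 1A)",
    0x0406: ".wk1 (1-2-3 release 2)",
}

def identify_worksheet(data):
    """Guess the worksheet format from the first six bytes of a file."""
    if len(data) < 6:
        return "unknown"
    # Three little-endian 16-bit values: opcode, body length, version code.
    opcode, length, version = struct.unpack_from("<HHH", data, 0)
    if opcode != 0x0000 or length != 2:
        return "unknown"
    return BOF_VERSIONS.get(version, "unknown")

# Hand-built six-byte headers standing in for real files.
sample_wk1 = bytes([0x00, 0x00, 0x02, 0x00, 0x06, 0x04])
sample_wks = bytes([0x00, 0x00, 0x02, 0x00, 0x04, 0x04])
print(identify_worksheet(sample_wk1))
print(identify_worksheet(sample_wks))
```

A check like this only identifies the declared version. It does not show whether a given specification document covers every record type that the file contains.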
TIFF specifications are accessible from two Internet locations.
The official specifications for TIFF 6.0 are available from the Adobe
developers' support site. Adobe's site does not list the specifications
for TIFF 4.0 and 5.0. These can be located at the Unofficial TIFF
Home Page. Our manual examination of the specifications showed them
to be consistent with each other, but they are incomplete. For years,
developers have been adding their own proprietary tags, which they
register with Adobe, to the TIFF format. These special tags do not
appear in either the official or unofficial specifications. Several
books that survey many file formats cover the TIFF specification.
However, no single work presents
a clear, comprehensive description of the TIFF file format specification
or of information about proprietary tags.
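Although the meaning of a proprietary tag may be undocumented, its presence in a file can be detected. The Python sketch below lists the tag numbers in the first image file directory of a TIFF file and picks out those numbered 32768 or higher, the range that the TIFF 6.0 specification sets aside for private tags. It is a minimal illustration: it reads only the first directory, it does not interpret tag values, and the sample file built at the end, including tag number 40000, is made up for the example.

```python
import struct

PRIVATE_TAG_START = 32768  # TIFF 6.0 reserves this range for private tags

def list_tiff_tags(data):
    """Return the tag numbers found in the first directory of a TIFF file."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack_from(order + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = []
    for i in range(count):
        # Each directory entry is 12 bytes and starts with the tag number.
        (tag,) = struct.unpack_from(order + "H", data, ifd_offset + 2 + 12 * i)
        tags.append(tag)
    return tags

def private_tags(tags):
    """Tags in the private range, whose meaning the public spec omits."""
    return [t for t in tags if t >= PRIVATE_TAG_START]

# Hand-built little-endian file: header, then one directory of two entries.
header = b"II" + struct.pack("<HI", 42, 8)
entries = (struct.pack("<HHII", 256, 3, 1, 100)       # 256 = ImageWidth
           + struct.pack("<HHII", 40000, 3, 1, 7))    # made-up private tag
sample = header + struct.pack("<H", 2) + entries + struct.pack("<I", 0)

tags = list_tiff_tags(sample)
print(tags, private_tags(tags))
```

Run over a whole collection, a scan of this kind would show which private tags are actually in use and therefore which undocumented elements a migration would have to account for.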
We expect these difficulties to be repeated when other formats are
explored. Conceptually, the solution is to adopt "open" format
specifications, where complete, authoritative specifications are
available for anyone to access and analyze. Our experience with TIFF
and .wk1 suggests that with file formats, there are two specifications
at work. One is the public document, which describes the basic or
core elements of any format. The other is a private, nonstandard
set of file elements, usually developed to extend the functionality
of a file format. These private file elements provide the competitive
edge for third-party software and rarely are openly circulated. Over
time, new format elements are often integrated into format revisions.
For example, TIFF grew from 37 tags in version 4 to 74 tags in version
6.0. New proprietary tags for TIFF version 6.0 are registered with
Adobe, which does not make them public. It is uncertain whether all
or some of these difficult-to-identify tags will be integrated into
the anticipated TIFF version 7.0. We endorse the concept of open
specifications and recommend that more thought be directed at coordinating
access to both the relatively static, public domain specifications
and the dynamic, nonstandard elements.
Public Access Archives of Format Information
If we measured the risk associated with public domain archives on
the Internet, we would assess all these sites as high-risk operations.
Sites such as Wotsit's represent the public service efforts of individuals.
They lack any vision or plan to sustain the information. This limitation,
combined with the unreliable nature of the information contained
within these sites, makes it unlikely that these sites will contribute
meaningfully to digital preservation efforts. There is a pressing
need to establish reliable, sustained repositories of file format
specifications, documentation, and related software. We recommend
the establishment of such repositories for format-specific materials
related to migration as a preservation strategy. It is a concern,
as well, for emulation programs and their documentation.
References
Beatty, J. Kelly. 1999. The Torino Scale: Gauging the Impact Threat. Sky & Telescope 98(4):32-3.
Bennett, John C. 1997. A Framework of Data Types and Formats,
and Issues Affecting the Long Term Preservation of Digital Material.
British Library Research and Innovation Report, No. 50. West Yorkshire,
U.K.: British Library Research and Innovation Centre. Available
from http://www.ukoln.ac.uk/services/elib/papers/supporting/#blric.
Brown, C. Wayne, and Barry J. Shepherd. 1995. Graphic File Formats.
Greenwich, Conn.: Manning Press.
comp.apps.spreadsheets. 1999. comp.apps.spreadsheets FAQ. Available
from http://www.faqs.org/faqs//spreadsheets/faq.
Consultative Committee for Space Data Systems. 1999. Reference Model
for an Open Archival Information System, Red Book, Issue 1 (CCSDS
650.0-R-1). Available from http://wwwdev.ccsds.org/documents/pdf/CCSDS-650.0-R-1.pdf.
Euhlir, Paul. 1997. Framework for the Preservation of and Public
Access to USDA Digital Publications. Available from http://preserve.nal.usda.gov:8300/npp/frameprt.html.
Green, Ann, JoAnn Dionne, and Martin Dennis. 1999. Preserving
the Whole: A Two-Track Approach to Rescuing Social Science Data
and Metadata. Washington, D.C.: Digital Library Federation.
Available from http://www.clir.org/pubs/reports/pub83/contents.html.
Haynes, David, et al. 1997. Responsibility for Digital Archiving
and Long Term Access to Digital Data. JISC/NPO Studies on the
Preservation of Electronic Materials. British Library Research
and Innovation Report, No. 67. West Yorkshire, U.K.: British Library
Research and Innovation Centre. Available from http://www.ukoln.ac.uk/services/elib/papers/supporting/#blric.
Kansala, Kari. 1997. Integrating Risk Assessment with Cost Estimation. IEEE
Software (May/June):61-7.
Lagadec, Patrick. 1982. Major Technological Risk: An Assessment
of Industrial Disaster. Oxford, U.K.: Pergamon Press.
Lotus Development Corporation. 1986. Lotus File Formats for 1-2-3,
Symphony and Jazz: File Structure Descriptions for Developers.
Cambridge, Mass.: Lotus Books, and Reading, Mass.:
McConnell, Steve. 1996. Rapid Development: Taming Wild Software
Schedules. Redmond, Wash.: Microsoft Press.
McNamee, David. 1996. Assessing Risk Assessment. Available from
http://www.mc2consulting.com/riskart2.htm.
Murray, James D., and William vanRyper. 1996. Encyclopedia of
Graphics File Formats, second edition. Cambridge, Mass.: O'Reilly & Associates,
Inc.
Oliver, Paul. 1999. Wotsit's Format: the Programmer's Resource.
Available from http://www.wotsit.org/.
Reinert, Kevin H., Steven M. Bartell, and Gregory R. Biddinger,
eds. 1994. Ecological Risk Assessment Decision-support System:
A Conceptual Design. Pensacola, Fla.: SETAC Press.
Rothenberg, Jeff. 1999. Avoiding Technological Quicksand: Finding
a Viable Technical Foundation for Digital Preservation. Washington,
D.C.: Council on Library and Information Resources. Available from
http://www.clir.org/pubs/reports/rothenberg/contents.html.
Rothenberg, Jeff. 1995. Ensuring the Longevity of Digital Documents. Scientific
American 272(1):42-7.
Sisti, Frank J. and Sujoe Joseph. 1994. Software Risk Evaluation
Method. Version 1.0. Technical Report CMU/SEI-94-TR-19. ECS-TR-94-019.
Pittsburgh, Penn.: Software Engineering Institute, Carnegie Mellon
University.
Starr, Chauncey. 1969. Social Benefits versus Technological Risk:
What is Our Society Willing to Pay for Safety? Science 165:1232-8.
Task Force on Archiving of Digital Information. 1996. Preserving
Digital Information. Report to the Commission on Preservation and
Access and the Research Libraries Group. Washington, D.C.: Commission
on Preservation and Access. Available from http://www.rlg.org/ArchTF/.
Van Scoy, Roger L. 1992. Software Development Risk: Opportunity,
Not Problem. Technical Report CMU/SEI-92-TR-30/ESC-TR-93-030. Pittsburgh,
Penn.: Software Engineering Institute, Carnegie Mellon University.
Available from http://www.sei.cmu.edu/publications/documents/92.reports/92.tr.030.html.
Warren-Hicks, William J., and Dwayne R. J. Moore. 1995. Uncertainty
Analysis in Ecological Risk Assessment. Pensacola, Fla.: SETAC Press.
Walden, Jeff. 1986. File Formats for Popular PC Software: A Programmer's
Reference. New York, N.Y.: John Wiley and Sons, Inc.
Williams, Ray C., Julie A. Walker, and Audrey J. Dorofee. 1997.
Putting Risk Management into Practice. IEEE Software (May/June):75-82.
Wilson, Richard, and E. A. C. Crouch. 1987. Risk Assessment and
Comparisons: An Introduction. Science 236:267-70.
Web sites noted in report:
Adobe developers' support site: http://partners.adobe.com/asn/developer/technotes.html.
Council on Library and Information Resources: www.clir.org.
The Unofficial TIFF Home Page: http://home.earthlink.net/~ritter/tiff/.
Links to other parts of this report:
Table of Contents
Appendix A: Risk-Assessment Workbook
Appendix B: Documentation for Format Migration Test File, Lotus 1-2-3, Release 2.2
Appendix C: Documentation: Examiner and RiskEditor
Appendix D: Case Study for Image File Format
Appendix E: Case Study for Lotus 1-2-3 .wk1 Format
Appendix F: Migration Software Analysis, Software Assessment Sheet
Appendix G: Specifications for the Cornell Digital Library Format