Case Study for Lotus 1-2-3 .wk1 Format

rule

1. Introduction

The United States Department of Agriculture (USDA) Economics and
Statistics System is a heavily used collection of agricultural economic
information that is evenly divided between time-sensitive economic
reports and data series of numeric files in spreadsheet format. Some
of the data series are 14 years old—ancient artifacts in the
context of personal computers. Although dated, these files have historic
and current value to a well-defined group of data users. The goals
of this case study were to:

  • evaluate file format migration as a strategy to maintain access
    to these numeric files,
  • identify file format components at risk during migration,
  • identify related risk attributes associated with migration, and
  • evaluate data migration software.

The case study was conducted with several risk-assessment tools
developed during this project. These include a specially prepared
test file, file reader software, and a risk-assessment workbook.

2. Description of Archive and Data User Base

The USDA Economics and Statistics System is a joint venture between
the Albert R. Mann Library of Cornell University and three USDA economic
agencies: Economic Research Service (ERS), National Agricultural
Statistics Service (NASS), and the World Agricultural Outlook Board
(WAOB). These three agencies measure the production and health of
domestic and international agricultural activities. ERS publications
analyze current agriculture market activity and forecast future market
conditions. Other ERS publications offer economic analyses in the
areas of trade, production, rural development, farm inputs, and other
economic topics. The NASS publishes estimates of production, stocks,
inventories, deposition, utilization, prices of agricultural commodities,
and such other items as labor and numbers of farms. The WAOB issues
regular forecasts of United States and world supply-and-demand prospects
for major agricultural commodities.

During the federal fiscal year 1999, more than 368,000 distinct
hosts accessed the USDA Economics System. The system disseminates
more than 500 MB in 7,000 file downloads daily. Many system users
compare current and historic statistical series to detect trends.
In a 5,000-user survey conducted between January and April 1999,
64 percent of respondents used the service for monitoring price trends,
47 percent for forecasting or obtaining market predictions, and 38
percent for research. Most of the more than 250 data series
in the Economics System are published in a DOS binary format, usually
Lotus 1-2-3 .wk1. The collection contains nearly 9,000 .wk1-formatted
files stored online on SCSI disks, and offline on CDs and on floppy
magnetic diskettes.

3. The 1-2-3 Format

Lotus 1-2-3 applications have undergone continuous revisions. Early
versions of Lotus 1-2-3 created stand-alone .wk1 files. Functions
and macro languages were associated with the application, not the
data files. Special formatting instructions were saved in separate
.fmt files. Since the Windows 3.1 release of Lotus, data and format
files have been merged into a .wk4 format. Later releases have integrated
scripting language and data objects into the files. The most recent
release of 1-2-3, the Millennium edition, allows the user to embed
hyperlinks within the spreadsheet, manipulate 1-2-3 files within
Active Document containers such as Internet Explorer or Lotus Notes,
embed ActiveX controls within documents, import real-time information
into a spreadsheet, and so on. The current Lotus file is no longer
just a spreadsheet; it is better described as an interactive data
container.

As the format has evolved, all features have been maintained and
supported through backward compatibility, and representatives at
Lotus Corporation have underlined the company’s continued support
for all 1-2-3 file formats. Limited backward compatibility for 1-2-3
files is found in Excel. Microsoft provides extensive documentation
for Excel and identifies the functions associated with 1-2-3 that
do not convert properly to Excel. None of these problems appears
to affect functions associated with .wk1 formatted files, but the
extent of change is hard to measure fully.

Both companies keep information about some features of their software
privileged. 1-2-3 files use a proprietary format closely controlled
by Lotus (now IBM). The file specifications for release 1.0 were
published in 1984 and revised for release 2.0 in 1985. Specifications
for releases 3.0 to present have been controlled by agreements with
business partners and software developers. No one has researched
or documented the frequency and scope of these changes, or how well
non-Lotus developers integrate these revisions into their products.
An unfortunate side effect of proprietary restrictions on this information
was the apparent loss of early 1-2-3 specifications within the Lotus
Company. As the company revised its product, old specifications were
fully integrated into the new releases, and the need for the old
documentation disappeared. Further effort is required to assemble
an authoritative set of specifications for all 1-2-3 and other major
spreadsheet software releases.

Finally, early 1-2-3 files are bereft of descriptive data. Until
Lotus 97, the Windows 95-compatible release, file names adhered to
the 8.3 DOS format. Embedded descriptive data are often typed into
cell A1. File names for USDA products are simplistic. Files labeled “table1.wk1” are
quite common among the more than 250 data sets. None of the .wk1
files in our collection has any imbedded links to its respective
documentation files. The main method of identifying a file is through
its relative position in a hierarchical file structure.

4. Development of Tools for Assessing the
Source-to-target Format Risk

File Migration and Black-Box Converters

A major risk in migrating collections of files is the conversion
software used to translate the files from the original format to
the target format. A migration project begins with a file whose content
should be translated, without corruption, to another format. The
file is passed through conversion software, essentially a “black
box,” with the intent that the integrity of the content be preserved
in the new file. We can presume success if we know that the conversion
software faithfully maps every property of the source format to corresponding
features in the target format (assuming, of course, that the target
format has a feature set that is rich enough to store the properties
and data of the source). For example, if the source format we are
converting has a way to indicate bold text, assuming that the target
format can also indicate bold text, we want to know that the conversion
software correctly maps the source bold attribute to the target bold
attribute. More important, in most conceivable cases, data values—whether
numeric, image, or text—should also move from one format to
the other intact.

Two Ways of Evaluating the Black Box

If we can examine the mapping process and the data-moving techniques
of the conversion software, we can evaluate the correctness of both.
This method must be repeated for every combination of source and
target formats with which we are working, because each combination
has a unique mapping. Moreover, to attempt this method, we must have
access to the source code of the converter and possess the expertise
to evaluate that code. Our experience with approaching commercial
software vendors for the code to their programs has not been fruitful.
Even if it were, the cost of evaluating each combination of source
and target formats in the program algorithm makes this method impractical
for controlling risk involved in migrating collections.

Another method is to compare a converted file with the original
file. If the result meets our standard of success, whatever that
standard may be, we can say that the conversion software has performed
adequately. However, this method is limited to the particular source
and target file formats under consideration. The ideal file for the
test would be one that exercised all the capabilities of the source
format and contained every possible feature and data value at the
minimum and maximum of every range possible. If that file were run
through the converter, the resulting file could be compared at every
point with the original. This is the approach we chose for our case
study.

Developing Risk-Assessment Tools

For our own collection of Lotus 1-2-3 files, we created a test file
in the .wk1 format. The features documented in the user manuals provided
with the 1-2-3 release 2.2 software, along with the published file
format specifications, determined the content of the test file. The
test file was generated with 1-2-3 release 2.2 software, which, according
to our estimates, was the software that generated the oldest .wk1
files in our collection. With the test file, we can evaluate potential
conversion software by running the software on the test file and
then comparing the converted file with the test file. Visual inspection
and comparison of all the properties and values is necessary to identify
differences; this process took about three hours in our example.
Proprietary software code and knowledge of an uncertain number of
format-to-format mappings in the program are unnecessary for visual
inspection. Another benefit of the test file is that it gives us
a baseline against which we can evaluate and compare multiple conversion
applications.

Whichever method is used to evaluate the conversion software, if
any of the properties or data values are not the same in the source
and target files, then we know that the conversion software has introduced
one or more points of risk. Thinking about the whole collection of
files to be migrated, we will want to know whether some of the files
in the collection have any at-risk properties that will combine with
the conversion software to create problems in the converted file(s).
We can then decide whether to find another converter, to refrain
from migrating those files, or, perhaps, to consider some or all
of the loss acceptable.

We wrote the Examiner software application to test a collection
of files for the presence of particular properties. The list of properties
consists of the structural element tags that define the .wk1 format.
Following the results of the test file comparisons with different
conversion software, we were able to identify the tags that would
not properly translate into the target format. Using a companion
configuration utility, RiskEditor, we provided the Examiner program
with a list of the properties that are at risk, optionally ordering
them by the degree of importance or impact. Then, as a test, the
Examiner program was run against parts of the collection. After further
evaluation, Examiner was run against the entire collection, a hierarchy
of 30,000 files. This requires a little more than two hours to complete.
The program produces a report showing which files contain the properties
in question. Examiner provides a quantitative assessment of the risks
that could be introduced by the conversion software.

The Examiner application is written in Java, and both its user documentation
and technical documentation are available as HTML files. It has been
designed to be extendable to any file format that indicates properties
as numbered tags, including Lotus 1-2-3 and TIFF, the formats of
our case-study collections. A requirement for running the application
is a Java interpreter on the computer holding the collection. We
wrote a command-line version of the program to be used on our Unix
servers, but the program could be easily extended to have a graphical
user interface.

5. Selection and Evaluation of Conversion
Software

The migration software examined for this project was a commercially
available, off-the-shelf (COTS) product. Locally developed conversion
software was avoided for two reasons: development costs and immediate
obsolescence. Software development is labor intensive, and long-term
maintenance is expensive. The costs to develop single, one-project
programs cannot be justified for mainstreamed software formats. We
were interested in examining the alternatives that commercial software
developers might offer. We examined two products, DataJunction and
Conversions Plus, using the following criteria:

  • source and target formats,
  • accuracy of conversion,
  • file decompression,
  • batch processing, and
  • error reporting.We evaluated other features, but these five criteria describe
    the critical features we thought essential when evaluating software
    for a migration process.

    DataJunction

    A product of Data Junction Corporation, DataJunction 7.0 is conversion
    software that appears capable of integrating and transforming data
    among hundreds of applications and structured data formats in both
    UNIX and Windows 95/98/NT environments. We counted 150 source and
    155 target format options. Image-file formats—GIF, JPEG, PDF,
    TIFF—are not included in the list of supported file formats.
    For our case study, we narrowed file format options to Lotus 123
    r2 and Excel 97.

    We found DataJunction works as specified. In single-file tests,
    DataJunction quickly and accurately converted Lotus .wk1 files
    into Excel .xls files. Tests were not conducted to determine whether
    DataJunction could convert .wk1 to ASCII. Setting up DataJunction,
    however, was somewhat difficult. DataJunction is designed to work
    with large legacy database files, from which it extracts data and
    converts it into the target format of choice, using a complex set
    of rules. The graphical user interface is not intuitive to use
    and was mastered only after detailed review of the online documentation
    and considerable trial and error. We did not investigate the possibility
    of designing transferable conversion templates. DataJunction was
    very difficult to set up for batch-mode processing, a major problem
    if a migration project must process more than a few files. In addition,
    DataJunction does not have the ability to decompress archived or
    zipped data files. DataJunction can be configured to provide different
    error messages, including fatal and general errors, warnings, and
    information messages. DataJunction is handicapped by its batch-job
    restrictions. An upgrade in this feature would make this program
    suitable for conversion of a variety of standard file formats.

    Conversions Plus 4.5

    Conversions Plus 4.5 (CP 4.5) is a product of DataViz Corporation.
    It is a stand-alone program that complements several other software
    products in a suite of tools designed to read and/or write to a
    variety of file formats. We counted 74 source and 110 target formats
    available in four general file categories: word processing, spreadsheets,
    database files, and image files. For our case study, we narrowed
    file format options to Lotus 123 (.wk1) and Excel 97.

    File conversion in Conversions Plus is implemented by pairs of
    file readers and writers. Each reader is written to read and decode
    a specific file format. The file reader identifies file format
    components and stores them within the program in a standard way.
    From this data template, each file writer program can extract each
    specific information object and restructure the data into the new
    file format. In addition, CP 4.5 can detect and uncompress files
    using the following compression algorithms: gzip, zip, tar, and
    Z. Conversions Plus can work on single files or in batch mode.
    CP 4.5 works in the Windows 95/98/NT environment.

    We found that Conversions Plus provided accurate translations
    from .wk1 to .xls formats with the exception of a subset of floating-point
    numbers. We used our standard test file in a conversion test and
    discovered CP 4.5 read and displayed the source file properly,
    but would embed an incorrect display code for certain floating-point
    numbers in the .xls target file. Comparison tests with Excel indicated
    display problems with fractions that were represented with exponential
    notation. All other basic format attributes—text strings,
    integers, formula, and equations—converted properly.

    The graphical user interface is intuitive and easy to use. One
    selects the file(s) or directory (directories) by clicking on them.
    Setting up target format choices or directories for converted files
    is easily done using pull-down menus. CP 4.5 works less smoothly
    when a directory contains numerous files of mixed format, a common
    situation on many servers. Batch preferences are limited to a single
    file format type for each of the four general categories of file
    type. In these situations, we would anticipate significant user
    oversight of the conversion operation. Conversion and error statistics
    are displayed at the end of a batch job, and the information can
    be written to a log file.

    Comparison and Analysis After Conversion to Source File Format

    During our tests of DataJunction and Conversions Plus, we manually
    compared the standard test file and other sample files in two states:
    before and after conversion. The comparison was conducted on two
    Windows NT workstations using two monitors of similar size and
    features. The comparison entailed a line-by-line examination of
    structural and data elements in each file. Conversion errors occurred
    in Conversions Plus only on data that contained floating-point
    numbers. We modified the Examiner program to identify .wk1 files
    in our archive that contained a structural element for floating-point
    numbers (tag 0Eh) and ran it against our collection. The program
    indicated that 8,619 files, or 96 percent of the collection, contained
    floating-point numbers. Because this is a significant portion of
    the collection, conversion using Conversions Plus was not attempted.

    6. Migration Risk Analysis for 1-2-3 .wk1
    Files

    We examined three migration options for 1-2-3 files: backward
    compatibility, and file migration to Excel .xls and to ASCII characters.

    Backward Compatibility

    Backward compatibility of 1-2-3 files provides a baseline for
    comparisons with Excel and ASCII. Data captured in older 1-2-3
    files are still readable in more recent 1-2-3 software. This strong
    backward compatibility support indicates that old file formats
    are superseded, but not obsolete; older files are not “orphaned” by
    major revisions in Lotus software. In addition, earlier application
    software can operate either in DOS operating systems or in the
    DOS emulator in Windows 95/98/NT. Although 1-2-3 has a reduced
    share of the spreadsheet market, Lotus is still providing strong
    support for this product. Evaluating these factors, it appears
    that avoiding migration and relying on the backward compatibility
    to sustain the .wk1 format can be considered a low-risk option.

    Migration to Excel .xls Format

    Excel is currently the market leader for spreadsheet software
    in both Windows and Macintosh operating systems and it has established
    a large corporate and private user base. If we can make predictions
    on the basis of the documented history of 1-2-3, Excel .xls should
    be a heavily used format for the next 10 years. We also believe
    that if Excel is superseded by another spreadsheet program, Microsoft
    will provide reliable migration software from Excel to the new
    target format. Examination of our standard test file and a random
    sampling of files from the archive indicate that the latest version,
    Excel for Windows 97, provides an accurate conversion of 1-2-3
    .wk1 files. Four major components of a spreadsheet filetext strings,
    integers, floating-point numbers, and embedded formulas and functionswere
    properly converted; they retained content, context, and a reasonable
    reproduction of the “functional experience.” Our evaluation
    indicates that migration of .wk1 to .xls format is a low-risk option.
    Unfortunately, Excel itself is a poor choice for conversion software
    because it cannot perform batch conversions of files.

    Migration to ASCII

    ASCII is the format of choice for large numeric file archives.
    ASCII files are easily scanned manually and can be imported into
    most software programs. ASCII is perceived as a low-maintenance
    format. On the basis of three decades of experience, most digital
    archivists predict ASCII will still be a common file format in
    50 years. Given the proper circumstances, a single migration to
    ASCII should be more cost-effective than repeated migrations through
    other evolving file formats. At its most fundamental level, migration
    of 1-2-3 files to ASCII converts the content of spreadsheet cells
    to values located in a matrix of x­y coordinates. The
    actual values of embedded functions, equations, or pointers to
    other cells will be retained, but the functions, equations, or
    pointers in those cells will be lost. Long text strings, essentially
    embedded metadata, are truncated at different lengths, depending
    on the conversion software. The formula, functions, pointers, and
    text strings could be recorded in an external conversion record,
    but for large collections, this might be impractical.

    Our assessment suggests that ASCII is a low-risk preservation
    option for the .wk1 format and could be adopted if dependence on
    backward compatibility or conversion to Excel .xls format were
    impossible.

    7. Metadata Risk

    There are six sources of metadata to coordinate:

    1. MARC record. Each of the 250+ data sets is cataloged and has
      a full MARC record in the campus online catalog.
    2. Gateway record. Each data set also has a MARC-like record for
      the Cornell University Library gateway to online resources.
    3. USDA Economics and Statistics System Database. Each data set
      also has a detailed record in a searchable database designed
      specifically for USDA System users.
    4. README documentation. Each data set has a separate ASCII text
      README file. The README documentation is a mix of descriptive,
      content, and administrative metadata.
    5. File hierarchy. Placement and selection of individual 1-2-3
      files depend on descriptive data associated with directory names.
    6. File names. 1-2-3 files use the DOS 8.3 naming convention.
      A common name for a .wk1 file in the collection is table01.wk1.
      The file extension (.wk1 or .xls) is a necessary feature for
      the file to be read by a spreadsheet application.

    Migration of the 1-2-3 files to another format would require modification
    of two of these six metadata sources: the file name and the README
    files.

    • The file name modification would change the file name extension.
      For example: table01.wk1 > table01.xls
      The conversion software should implement the file name modification.
      A possible risk could be introduced if the file name were manually
      changed without processing by the conversion software.
    • README documentation would be modified to reflect the new file
      name extension. For instance, the following table example was
      extracted from a sample USDA data set, Feed Yearbook:
      Table12.wk1 12. Farm programs and participation, 1975–1998
      Table13.wk1 13. Average prices received by farmers, United States,
      by month, and loan rate, 1975/76–1998/99
      The file name extension in the documentation would be converted
      from
      Table12.wk1 > Table12.xls
      Table13.wk1 > Table13.xls

    If this conversion is required, the file extension conversion
    can be modified in a word processor using a find/replace feature.
    Using powerful search-and-replace functions on the documentation
    introduces a low-level risk. Modification of documentation will
    need thorough review before the migration process is considered
    complete. In addition, new administrative information describing
    the migration process will need to be added to the record. This
    would include, but not be restricted to:

    • date of migration or modification,
    • data set involved,
    • description of conversion process,
    • identification of conversion software,
    • identified risks and actions taken to manage them, and
    • person supervising the migration or modification.

    Because of the limited modifications required to file names and
    documentation and the lack of system scripts, the migration risk
    introduced to metadata is low.

    8. Summary

    From these findings, we believe we can draw the following conclusions:

    1. 1-2-3 format-specific migration risks associated with specific
      file conversion software can be identified and described. The
      use of a standard test file allows side-by-side comparison of
      file structure and content before and after conversion. Errors
      or significant changes to the information can be isolated and
      examined for their persistence or effect, or for both.
    2. Some measures of risk can be quantified. The Examiner program
      can measure the number and frequency of problematic file tags
      within a collection. Although the measurement is crude, it provides
      a general measure of risk for a collection, a step that has been
      lacking in digital file management. Further refinements to the
      program could provide greater resolution of problem file attributes.
    3. Migration paths for the .wk1 format can be mapped. We can describe
      the relationship between the source and target formats and the
      conversion software with reasonable certainty. For example:

      Source Black Box Target
      Excel
      DataJunction
      Conversion Plus
    4. Cost-effective COTS migration software was not identified.
      The two programs analyzed in this case study were considered
      the best candidates for a COTS migration program. Both were found
      deficient in basic, but significant, tasks. Minor modifications
      to the programs would alleviate our reservations. We intend to
      bring our results to the attention of these developers.