Case Study for Lotus 1-2-3 .wk1 Format
The United States Department of Agriculture (USDA) Economics and
Statistics System is a heavily used collection of agricultural economic
information that is evenly divided between time-sensitive economic
reports and data series of numeric files in spreadsheet format. Some
of the data series are 14 years oldancient artifacts in the
context of personal computers. Although dated, these files have historic
and current value to a well-defined group of data users. The goals
of this case study were to:
- evaluate file format migration as a strategy to maintain access
to these numeric files,
- identify file format components at risk during migration,
- identify related risk attributes associated with migration, and
- evaluate data migration software.
The case study was conducted with several risk-assessment tools
developed during this project. These include a specially prepared
test file, file reader software, and a risk-assessment workbook.
2. Description of Archive and Data User Base
The USDA Economics and Statistics System is a joint venture between
the Albert R. Mann Library of Cornell University and three USDA economic
agencies: Economic Research Service (ERS), National Agricultural
Statistics Service (NASS), and the World Agricultural Outlook Board
(WAOB). These three agencies measure the production and health of
domestic and international agricultural activities. ERS publications
analyze current agriculture market activity and forecast future market
conditions. Other ERS publications offer economic analyses in the
areas of trade, production, rural development, farm inputs, and other
economic topics. The NASS publishes estimates of production, stocks,
inventories, deposition, utilization, prices of agricultural commodities,
and such other items as labor and numbers of farms. The WAOB issues
regular forecasts of United States and world supply-and-demand prospects
for major agricultural commodities.
During the federal fiscal year 1999, more than 368,000 distinct
hosts accessed the USDA Economics System. The system disseminates
more than 500 MB in 7,000 file downloads daily. Many system users
compare current and historic statistical series to detect trends.
In a 5,000-user survey conducted between January and April 1999,
64 percent of respondents used the service for monitoring price trends,
47 percent for forecasting or obtaining market predictions, and 38
percent for research. Most of the more than 250 data series
in the Economics System are published in a DOS binary format, usually
Lotus 1-2-3 .wk1. The collection contains nearly 9,000 .wk1-formatted
files stored online on SCSI disks, and offline on CDs and on floppy
3. The 1-2-3 Format
Lotus 1-2-3 applications have undergone continuous revisions. Early
versions of Lotus 1-2-3 created stand-alone .wk1 files. Functions
and macro languages were associated with the application, not the
data files. Special formatting instructions were saved in separate
.fmt files. Since the Windows 3.1 release of Lotus, data and format
files have been merged into a .wk4 format. Later releases have integrated
scripting language and data objects into the files. The most recent
release of 1-2-3, the Millennium edition, allows the user to embed
hyperlinks within the spreadsheet, manipulate 1-2-3 files within
Active Document containers such as Internet Explorer or Lotus Notes,
embed ActiveX controls within documents, import real-time information
into a spreadsheet, and so on. The current Lotus file is no longer
just a spreadsheet; it is better described as an interactive data
As the format has evolved, all features have been maintained and
supported through backward compatibility, and representatives at
Lotus Corporation have underlined the company’s continued support
for all 1-2-3 file formats. Limited backward compatibility for 1-2-3
files is found in Excel. Microsoft provides extensive documentation
for Excel and identifies the functions associated with 1-2-3 that
do not convert properly to Excel. None of these problems appears
to affect functions associated with .wk1 formatted files, but the
extent of change is hard to measure fully.
Both companies keep information about some features of their software
privileged. 1-2-3 files use a proprietary format closely controlled
by Lotus (now IBM). The file specifications for release 1.0 were
published in 1984 and revised for release 2.0 in 1985. Specifications
for releases 3.0 to present have been controlled by agreements with
business partners and software developers. No one has researched
or documented the frequency and scope of these changes, or how well
non-Lotus developers integrate these revisions into their products.
An unfortunate side effect of proprietary restrictions on this information
was the apparent loss of early 1-2-3 specifications within the Lotus
Company. As the company revised its product, old specifications were
fully integrated into the new releases, and the need for the old
documentation disappeared. Further effort is required to assemble
an authoritative set of specifications for all 1-2-3 and other major
spreadsheet software releases.
Finally, early 1-2-3 files are bereft of descriptive data. Until
Lotus 97, the Windows 95-compatible release, file names adhered to
the 8.3 DOS format. Embedded descriptive data are often typed into
cell A1. File names for USDA products are simplistic. Files labeled “table1.wk1” are
quite common among the more than 250 data sets. None of the .wk1
files in our collection has any imbedded links to its respective
documentation files. The main method of identifying a file is through
its relative position in a hierarchical file structure.
4. Development of Tools for Assessing the
Source-to-target Format Risk
File Migration and Black-Box Converters
A major risk in migrating collections of files is the conversion
software used to translate the files from the original format to
the target format. A migration project begins with a file whose content
should be translated, without corruption, to another format. The
file is passed through conversion software, essentially a “black
box,” with the intent that the integrity of the content be preserved
in the new file. We can presume success if we know that the conversion
software faithfully maps every property of the source format to corresponding
features in the target format (assuming, of course, that the target
format has a feature set that is rich enough to store the properties
and data of the source). For example, if the source format we are
converting has a way to indicate bold text, assuming that the target
format can also indicate bold text, we want to know that the conversion
software correctly maps the source bold attribute to the target bold
attribute. More important, in most conceivable cases, data valueswhether
numeric, image, or textshould also move from one format to
the other intact.
Two Ways of Evaluating the Black Box
If we can examine the mapping process and the data-moving techniques
of the conversion software, we can evaluate the correctness of both.
This method must be repeated for every combination of source and
target formats with which we are working, because each combination
has a unique mapping. Moreover, to attempt this method, we must have
access to the source code of the converter and possess the expertise
to evaluate that code. Our experience with approaching commercial
software vendors for the code to their programs has not been fruitful.
Even if it were, the cost of evaluating each combination of source
and target formats in the program algorithm makes this method impractical
for controlling risk involved in migrating collections.
Another method is to compare a converted file with the original
file. If the result meets our standard of success, whatever that
standard may be, we can say that the conversion software has performed
adequately. However, this method is limited to the particular source
and target file formats under consideration. The ideal file for the
test would be one that exercised all the capabilities of the source
format and contained every possible feature and data value at the
minimum and maximum of every range possible. If that file were run
through the converter, the resulting file could be compared at every
point with the original. This is the approach we chose for our case
Developing Risk-Assessment Tools
For our own collection of Lotus 1-2-3 files, we created a test file
in the .wk1 format. The features documented in the user manuals provided
with the 1-2-3 release 2.2 software, along with the published file
format specifications, determined the content of the test file. The
test file was generated with 1-2-3 release 2.2 software, which, according
to our estimates, was the software that generated the oldest .wk1
files in our collection. With the test file, we can evaluate potential
conversion software by running the software on the test file and
then comparing the converted file with the test file. Visual inspection
and comparison of all the properties and values is necessary to identify
differences; this process took about three hours in our example.
Proprietary software code and knowledge of an uncertain number of
format-to-format mappings in the program are unnecessary for visual
inspection. Another benefit of the test file is that it gives us
a baseline against which we can evaluate and compare multiple conversion
Whichever method is used to evaluate the conversion software, if
any of the properties or data values are not the same in the source
and target files, then we know that the conversion software has introduced
one or more points of risk. Thinking about the whole collection of
files to be migrated, we will want to know whether some of the files
in the collection have any at-risk properties that will combine with
the conversion software to create problems in the converted file(s).
We can then decide whether to find another converter, to refrain
from migrating those files, or, perhaps, to consider some or all
of the loss acceptable.
We wrote the Examiner software application to test a collection
of files for the presence of particular properties. The list of properties
consists of the structural element tags that define the .wk1 format.
Following the results of the test file comparisons with different
conversion software, we were able to identify the tags that would
not properly translate into the target format. Using a companion
configuration utility, RiskEditor, we provided the Examiner program
with a list of the properties that are at risk, optionally ordering
them by the degree of importance or impact. Then, as a test, the
Examiner program was run against parts of the collection. After further
evaluation, Examiner was run against the entire collection, a hierarchy
of 30,000 files. This requires a little more than two hours to complete.
The program produces a report showing which files contain the properties
in question. Examiner provides a quantitative assessment of the risks
that could be introduced by the conversion software.
The Examiner application is written in Java, and both its user documentation
and technical documentation are available as HTML files. It has been
designed to be extendable to any file format that indicates properties
as numbered tags, including Lotus 1-2-3 and TIFF, the formats of
our case-study collections. A requirement for running the application
is a Java interpreter on the computer holding the collection. We
wrote a command-line version of the program to be used on our Unix
servers, but the program could be easily extended to have a graphical
5. Selection and Evaluation of Conversion
The migration software examined for this project was a commercially
available, off-the-shelf (COTS) product. Locally developed conversion
software was avoided for two reasons: development costs and immediate
obsolescence. Software development is labor intensive, and long-term
maintenance is expensive. The costs to develop single, one-project
programs cannot be justified for mainstreamed software formats. We
were interested in examining the alternatives that commercial software
developers might offer. We examined two products, DataJunction and
Conversions Plus, using the following criteria:
- source and target formats,
- accuracy of conversion,
- file decompression,
- batch processing, and
- error reporting.We evaluated other features, but these five criteria describe
the critical features we thought essential when evaluating software
for a migration process.
A product of Data Junction Corporation, DataJunction 7.0 is conversion
software that appears capable of integrating and transforming data
among hundreds of applications and structured data formats in both
UNIX and Windows 95/98/NT environments. We counted 150 source and
155 target format options. Image-file formatsGIF, JPEG, PDF,
TIFFare not included in the list of supported file formats.
For our case study, we narrowed file format options to Lotus 123
r2 and Excel 97.
We found DataJunction works as specified. In single-file tests,
DataJunction quickly and accurately converted Lotus .wk1 files
into Excel .xls files. Tests were not conducted to determine whether
DataJunction could convert .wk1 to ASCII. Setting up DataJunction,
however, was somewhat difficult. DataJunction is designed to work
with large legacy database files, from which it extracts data and
converts it into the target format of choice, using a complex set
of rules. The graphical user interface is not intuitive to use
and was mastered only after detailed review of the online documentation
and considerable trial and error. We did not investigate the possibility
of designing transferable conversion templates. DataJunction was
very difficult to set up for batch-mode processing, a major problem
if a migration project must process more than a few files. In addition,
DataJunction does not have the ability to decompress archived or
zipped data files. DataJunction can be configured to provide different
error messages, including fatal and general errors, warnings, and
information messages. DataJunction is handicapped by its batch-job
restrictions. An upgrade in this feature would make this program
suitable for conversion of a variety of standard file formats.
Conversions Plus 4.5
Conversions Plus 4.5 (CP 4.5) is a product of DataViz Corporation.
It is a stand-alone program that complements several other software
products in a suite of tools designed to read and/or write to a
variety of file formats. We counted 74 source and 110 target formats
available in four general file categories: word processing, spreadsheets,
database files, and image files. For our case study, we narrowed
file format options to Lotus 123 (.wk1) and Excel 97.
File conversion in Conversions Plus is implemented by pairs of
file readers and writers. Each reader is written to read and decode
a specific file format. The file reader identifies file format
components and stores them within the program in a standard way.
From this data template, each file writer program can extract each
specific information object and restructure the data into the new
file format. In addition, CP 4.5 can detect and uncompress files
using the following compression algorithms: gzip, zip, tar, and
Z. Conversions Plus can work on single files or in batch mode.
CP 4.5 works in the Windows 95/98/NT environment.
We found that Conversions Plus provided accurate translations
from .wk1 to .xls formats with the exception of a subset of floating-point
numbers. We used our standard test file in a conversion test and
discovered CP 4.5 read and displayed the source file properly,
but would embed an incorrect display code for certain floating-point
numbers in the .xls target file. Comparison tests with Excel indicated
display problems with fractions that were represented with exponential
notation. All other basic format attributestext strings,
integers, formula, and equationsconverted properly.
The graphical user interface is intuitive and easy to use. One
selects the file(s) or directory (directories) by clicking on them.
Setting up target format choices or directories for converted files
is easily done using pull-down menus. CP 4.5 works less smoothly
when a directory contains numerous files of mixed format, a common
situation on many servers. Batch preferences are limited to a single
file format type for each of the four general categories of file
type. In these situations, we would anticipate significant user
oversight of the conversion operation. Conversion and error statistics
are displayed at the end of a batch job, and the information can
be written to a log file.
Comparison and Analysis After Conversion to Source File Format
During our tests of DataJunction and Conversions Plus, we manually
compared the standard test file and other sample files in two states:
before and after conversion. The comparison was conducted on two
Windows NT workstations using two monitors of similar size and
features. The comparison entailed a line-by-line examination of
structural and data elements in each file. Conversion errors occurred
in Conversions Plus only on data that contained floating-point
numbers. We modified the Examiner program to identify .wk1 files
in our archive that contained a structural element for floating-point
numbers (tag 0Eh) and ran it against our collection. The program
indicated that 8,619 files, or 96 percent of the collection, contained
floating-point numbers. Because this is a significant portion of
the collection, conversion using Conversions Plus was not attempted.
6. Migration Risk Analysis for 1-2-3 .wk1
We examined three migration options for 1-2-3 files: backward
compatibility, and file migration to Excel .xls and to ASCII characters.
Backward compatibility of 1-2-3 files provides a baseline for
comparisons with Excel and ASCII. Data captured in older 1-2-3
files are still readable in more recent 1-2-3 software. This strong
backward compatibility support indicates that old file formats
are superseded, but not obsolete; older files are not “orphaned” by
major revisions in Lotus software. In addition, earlier application
software can operate either in DOS operating systems or in the
DOS emulator in Windows 95/98/NT. Although 1-2-3 has a reduced
share of the spreadsheet market, Lotus is still providing strong
support for this product. Evaluating these factors, it appears
that avoiding migration and relying on the backward compatibility
to sustain the .wk1 format can be considered a low-risk option.
Migration to Excel .xls Format
Excel is currently the market leader for spreadsheet software
in both Windows and Macintosh operating systems and it has established
a large corporate and private user base. If we can make predictions
on the basis of the documented history of 1-2-3, Excel .xls should
be a heavily used format for the next 10 years. We also believe
that if Excel is superseded by another spreadsheet program, Microsoft
will provide reliable migration software from Excel to the new
target format. Examination of our standard test file and a random
sampling of files from the archive indicate that the latest version,
Excel for Windows 97, provides an accurate conversion of 1-2-3
.wk1 files. Four major components of a spreadsheet filetext strings,
integers, floating-point numbers, and embedded formulas and functionswere
properly converted; they retained content, context, and a reasonable
reproduction of the “functional experience.” Our evaluation
indicates that migration of .wk1 to .xls format is a low-risk option.
Unfortunately, Excel itself is a poor choice for conversion software
because it cannot perform batch conversions of files.
Migration to ASCII
ASCII is the format of choice for large numeric file archives.
ASCII files are easily scanned manually and can be imported into
most software programs. ASCII is perceived as a low-maintenance
format. On the basis of three decades of experience, most digital
archivists predict ASCII will still be a common file format in
50 years. Given the proper circumstances, a single migration to
ASCII should be more cost-effective than repeated migrations through
other evolving file formats. At its most fundamental level, migration
of 1-2-3 files to ASCII converts the content of spreadsheet cells
to values located in a matrix of xy coordinates. The
actual values of embedded functions, equations, or pointers to
other cells will be retained, but the functions, equations, or
pointers in those cells will be lost. Long text strings, essentially
embedded metadata, are truncated at different lengths, depending
on the conversion software. The formula, functions, pointers, and
text strings could be recorded in an external conversion record,
but for large collections, this might be impractical.
Our assessment suggests that ASCII is a low-risk preservation
option for the .wk1 format and could be adopted if dependence on
backward compatibility or conversion to Excel .xls format were
7. Metadata Risk
There are six sources of metadata to coordinate:
- MARC record. Each of the 250+ data sets is cataloged and has
a full MARC record in the campus online catalog.
- Gateway record. Each data set also has a MARC-like record for
the Cornell University Library gateway to online resources.
- USDA Economics and Statistics System Database. Each data set
also has a detailed record in a searchable database designed
specifically for USDA System users.
- README documentation. Each data set has a separate ASCII text
README file. The README documentation is a mix of descriptive,
content, and administrative metadata.
- File hierarchy. Placement and selection of individual 1-2-3
files depend on descriptive data associated with directory names.
- File names. 1-2-3 files use the DOS 8.3 naming convention.
A common name for a .wk1 file in the collection is table01.wk1.
The file extension (.wk1 or .xls) is a necessary feature for
the file to be read by a spreadsheet application.
Migration of the 1-2-3 files to another format would require modification
of two of these six metadata sources: the file name and the README
- The file name modification would change the file name extension.
For example: table01.wk1 > table01.xls
The conversion software should implement the file name modification.
A possible risk could be introduced if the file name were manually
changed without processing by the conversion software.
- README documentation would be modified to reflect the new file
name extension. For instance, the following table example was
extracted from a sample USDA data set, Feed Yearbook:
Table12.wk1 12. Farm programs and participation, 19751998
Table13.wk1 13. Average prices received by farmers, United States,
by month, and loan rate, 1975/761998/99
The file name extension in the documentation would be converted
Table12.wk1 > Table12.xls
Table13.wk1 > Table13.xls
If this conversion is required, the file extension conversion
can be modified in a word processor using a find/replace feature.
Using powerful search-and-replace functions on the documentation
introduces a low-level risk. Modification of documentation will
need thorough review before the migration process is considered
complete. In addition, new administrative information describing
the migration process will need to be added to the record. This
would include, but not be restricted to:
- date of migration or modification,
- data set involved,
- description of conversion process,
- identification of conversion software,
- identified risks and actions taken to manage them, and
- person supervising the migration or modification.
Because of the limited modifications required to file names and
documentation and the lack of system scripts, the migration risk
introduced to metadata is low.
From these findings, we believe we can draw the following conclusions:
- 1-2-3 format-specific migration risks associated with specific
file conversion software can be identified and described. The
use of a standard test file allows side-by-side comparison of
file structure and content before and after conversion. Errors
or significant changes to the information can be isolated and
examined for their persistence or effect, or for both.
- Some measures of risk can be quantified. The Examiner program
can measure the number and frequency of problematic file tags
within a collection. Although the measurement is crude, it provides
a general measure of risk for a collection, a step that has been
lacking in digital file management. Further refinements to the
program could provide greater resolution of problem file attributes.
- Migration paths for the .wk1 format can be mapped. We can describe
the relationship between the source and target formats and the
conversion software with reasonable certainty. For example:
Source Black Box Target Excel DataJunction Conversion Plus
- Cost-effective COTS migration software was not identified.
The two programs analyzed in this case study were considered
the best candidates for a COTS migration program. Both were found
deficient in basic, but significant, tasks. Minor modifications
to the programs would alleviate our reservations. We intend to
bring our results to the attention of these developers.
- MARC record. Each of the 250+ data sets is cataloged and has