Council on Library and Information Resources

Username (email)

Password

Appendix E

dot

Appendix E

Case Study for Lotus 1-2-3 .wk1 Format

rule

1. Introduction

The United States Department of Agriculture (USDA) Economics and Statistics System is a heavily used collection of agricultural economic information that is evenly divided between time-sensitive economic reports and data series of numeric files in spreadsheet format. Some of the data series are 14 years old—ancient artifacts in the context of personal computers. Although dated, these files have historic and current value to a well-defined group of data users. The goals of this case study were to:

  • evaluate file format migration as a strategy to maintain access to these numeric files,
  • identify file format components at risk during migration,
  • identify related risk attributes associated with migration, and
  • evaluate data migration software.

The case study was conducted with several risk-assessment tools developed during this project. These include a specially prepared test file, file reader software, and a risk-assessment workbook.

2. Description of Archive and Data User Base

The USDA Economics and Statistics System is a joint venture between the Albert R. Mann Library of Cornell University and three USDA economic agencies: Economic Research Service (ERS), National Agricultural Statistics Service (NASS), and the World Agricultural Outlook Board (WAOB). These three agencies measure the production and health of domestic and international agricultural activities. ERS publications analyze current agriculture market activity and forecast future market conditions. Other ERS publications offer economic analyses in the areas of trade, production, rural development, farm inputs, and other economic topics. The NASS publishes estimates of production, stocks, inventories, deposition, utilization, prices of agricultural commodities, and such other items as labor and numbers of farms. The WAOB issues regular forecasts of United States and world supply-and-demand prospects for major agricultural commodities.

During the federal fiscal year 1999, more than 368,000 distinct hosts accessed the USDA Economics System. The system disseminates more than 500 MB in 7,000 file downloads daily. Many system users compare current and historic statistical series to detect trends. In a 5,000-user survey conducted between January and April 1999, 64 percent of respondents used the service for monitoring price trends, 47 percent for forecasting or obtaining market predictions, and 38 percent for research. Most of the more than 250 data series in the Economics System are published in a DOS binary format, usually Lotus 1-2-3 .wk1. The collection contains nearly 9,000 .wk1-formatted files stored online on SCSI disks, and offline on CDs and on floppy magnetic diskettes.

3. The 1-2-3 Format

Lotus 1-2-3 applications have undergone continuous revisions. Early versions of Lotus 1-2-3 created stand-alone .wk1 files. Functions and macro languages were associated with the application, not the data files. Special formatting instructions were saved in separate .fmt files. Since the Windows 3.1 release of Lotus, data and format files have been merged into a .wk4 format. Later releases have integrated scripting language and data objects into the files. The most recent release of 1-2-3, the Millennium edition, allows the user to embed hyperlinks within the spreadsheet, manipulate 1-2-3 files within Active Document containers such as Internet Explorer or Lotus Notes, embed ActiveX controls within documents, import real-time information into a spreadsheet, and so on. The current Lotus file is no longer just a spreadsheet; it is better described as an interactive data container.

As the format has evolved, all features have been maintained and supported through backward compatibility, and representatives at Lotus Corporation have underlined the company's continued support for all 1-2-3 file formats. Limited backward compatibility for 1-2-3 files is found in Excel. Microsoft provides extensive documentation for Excel and identifies the functions associated with 1-2-3 that do not convert properly to Excel. None of these problems appears to affect functions associated with .wk1 formatted files, but the extent of change is hard to measure fully.

Both companies keep information about some features of their software privileged. 1-2-3 files use a proprietary format closely controlled by Lotus (now IBM). The file specifications for release 1.0 were published in 1984 and revised for release 2.0 in 1985. Specifications for releases 3.0 to present have been controlled by agreements with business partners and software developers. No one has researched or documented the frequency and scope of these changes, or how well non-Lotus developers integrate these revisions into their products. An unfortunate side effect of proprietary restrictions on this information was the apparent loss of early 1-2-3 specifications within the Lotus Company. As the company revised its product, old specifications were fully integrated into the new releases, and the need for the old documentation disappeared. Further effort is required to assemble an authoritative set of specifications for all 1-2-3 and other major spreadsheet software releases.

Finally, early 1-2-3 files are bereft of descriptive data. Until Lotus 97, the Windows 95-compatible release, file names adhered to the 8.3 DOS format. Embedded descriptive data are often typed into cell A1. File names for USDA products are simplistic. Files labeled "table1.wk1" are quite common among the more than 250 data sets. None of the .wk1 files in our collection has any imbedded links to its respective documentation files. The main method of identifying a file is through its relative position in a hierarchical file structure.

4. Development of Tools for Assessing the Source-to-target Format Risk

File Migration and Black-Box Converters

A major risk in migrating collections of files is the conversion software used to translate the files from the original format to the target format. A migration project begins with a file whose content should be translated, without corruption, to another format. The file is passed through conversion software, essentially a "black box," with the intent that the integrity of the content be preserved in the new file. We can presume success if we know that the conversion software faithfully maps every property of the source format to corresponding features in the target format (assuming, of course, that the target format has a feature set that is rich enough to store the properties and data of the source). For example, if the source format we are converting has a way to indicate bold text, assuming that the target format can also indicate bold text, we want to know that the conversion software correctly maps the source bold attribute to the target bold attribute. More important, in most conceivable cases, data values—whether numeric, image, or text—should also move from one format to the other intact.

Two Ways of Evaluating the Black Box

If we can examine the mapping process and the data-moving techniques of the conversion software, we can evaluate the correctness of both. This method must be repeated for every combination of source and target formats with which we are working, because each combination has a unique mapping. Moreover, to attempt this method, we must have access to the source code of the converter and possess the expertise to evaluate that code. Our experience with approaching commercial software vendors for the code to their programs has not been fruitful. Even if it were, the cost of evaluating each combination of source and target formats in the program algorithm makes this method impractical for controlling risk involved in migrating collections.

Another method is to compare a converted file with the original file. If the result meets our standard of success, whatever that standard may be, we can say that the conversion software has performed adequately. However, this method is limited to the particular source and target file formats under consideration. The ideal file for the test would be one that exercised all the capabilities of the source format and contained every possible feature and data value at the minimum and maximum of every range possible. If that file were run through the converter, the resulting file could be compared at every point with the original. This is the approach we chose for our case study.

Developing Risk-Assessment Tools

For our own collection of Lotus 1-2-3 files, we created a test file in the .wk1 format. The features documented in the user manuals provided with the 1-2-3 release 2.2 software, along with the published file format specifications, determined the content of the test file. The test file was generated with 1-2-3 release 2.2 software, which, according to our estimates, was the software that generated the oldest .wk1 files in our collection. With the test file, we can evaluate potential conversion software by running the software on the test file and then comparing the converted file with the test file. Visual inspection and comparison of all the properties and values is necessary to identify differences; this process took about three hours in our example. Proprietary software code and knowledge of an uncertain number of format-to-format mappings in the program are unnecessary for visual inspection. Another benefit of the test file is that it gives us a baseline against which we can evaluate and compare multiple conversion applications.

Whichever method is used to evaluate the conversion software, if any of the properties or data values are not the same in the source and target files, then we know that the conversion software has introduced one or more points of risk. Thinking about the whole collection of files to be migrated, we will want to know whether some of the files in the collection have any at-risk properties that will combine with the conversion software to create problems in the converted file(s). We can then decide whether to find another converter, to refrain from migrating those files, or, perhaps, to consider some or all of the loss acceptable.

We wrote the Examiner software application to test a collection of files for the presence of particular properties. The list of properties consists of the structural element tags that define the .wk1 format. Following the results of the test file comparisons with different conversion software, we were able to identify the tags that would not properly translate into the target format. Using a companion configuration utility, RiskEditor, we provided the Examiner program with a list of the properties that are at risk, optionally ordering them by the degree of importance or impact. Then, as a test, the Examiner program was run against parts of the collection. After further evaluation, Examiner was run against the entire collection, a hierarchy of 30,000 files. This requires a little more than two hours to complete. The program produces a report showing which files contain the properties in question. Examiner provides a quantitative assessment of the risks that could be introduced by the conversion software.

The Examiner application is written in Java, and both its user documentation and technical documentation are available as HTML files. It has been designed to be extendable to any file format that indicates properties as numbered tags, including Lotus 1-2-3 and TIFF, the formats of our case-study collections. A requirement for running the application is a Java interpreter on the computer holding the collection. We wrote a command-line version of the program to be used on our Unix servers, but the program could be easily extended to have a graphical user interface.

5. Selection and Evaluation of Conversion Software

The migration software examined for this project was a commercially available, off-the-shelf (COTS) product. Locally developed conversion software was avoided for two reasons: development costs and immediate obsolescence. Software development is labor intensive, and long-term maintenance is expensive. The costs to develop single, one-project programs cannot be justified for mainstreamed software formats. We were interested in examining the alternatives that commercial software developers might offer. We examined two products, DataJunction and Conversions Plus, using the following criteria:

  • source and target formats,
  • accuracy of conversion,
  • file decompression,
  • batch processing, and
  • error reporting.

    We evaluated other features, but these five criteria describe the critical features we thought essential when evaluating software for a migration process.

    DataJunction

    A product of Data Junction Corporation, DataJunction 7.0 is conversion software that appears capable of integrating and transforming data among hundreds of applications and structured data formats in both UNIX and Windows 95/98/NT environments. We counted 150 source and 155 target format options. Image-file formats—GIF, JPEG, PDF, TIFF—are not included in the list of supported file formats. For our case study, we narrowed file format options to Lotus 123 r2 and Excel 97.

    We found DataJunction works as specified. In single-file tests, DataJunction quickly and accurately converted Lotus .wk1 files into Excel .xls files. Tests were not conducted to determine whether DataJunction could convert .wk1 to ASCII. Setting up DataJunction, however, was somewhat difficult. DataJunction is designed to work with large legacy database files, from which it extracts data and converts it into the target format of choice, using a complex set of rules. The graphical user interface is not intuitive to use and was mastered only after detailed review of the online documentation and considerable trial and error. We did not investigate the possibility of designing transferable conversion templates. DataJunction was very difficult to set up for batch-mode processing, a major problem if a migration project must process more than a few files. In addition, DataJunction does not have the ability to decompress archived or zipped data files. DataJunction can be configured to provide different error messages, including fatal and general errors, warnings, and information messages. DataJunction is handicapped by its batch-job restrictions. An upgrade in this feature would make this program suitable for conversion of a variety of standard file formats.

    Conversions Plus 4.5

    Conversions Plus 4.5 (CP 4.5) is a product of DataViz Corporation. It is a stand-alone program that complements several other software products in a suite of tools designed to read and/or write to a variety of file formats. We counted 74 source and 110 target formats available in four general file categories: word processing, spreadsheets, database files, and image files. For our case study, we narrowed file format options to Lotus 123 (.wk1) and Excel 97.

    File conversion in Conversions Plus is implemented by pairs of file readers and writers. Each reader is written to read and decode a specific file format. The file reader identifies file format components and stores them within the program in a standard way. From this data template, each file writer program can extract each specific information object and restructure the data into the new file format. In addition, CP 4.5 can detect and uncompress files using the following compression algorithms: gzip, zip, tar, and Z. Conversions Plus can work on single files or in batch mode. CP 4.5 works in the Windows 95/98/NT environment.

    We found that Conversions Plus provided accurate translations from .wk1 to .xls formats with the exception of a subset of floating-point numbers. We used our standard test file in a conversion test and discovered CP 4.5 read and displayed the source file properly, but would embed an incorrect display code for certain floating-point numbers in the .xls target file. Comparison tests with Excel indicated display problems with fractions that were represented with exponential notation. All other basic format attributes—text strings, integers, formula, and equations—converted properly.

    The graphical user interface is intuitive and easy to use. One selects the file(s) or directory (directories) by clicking on them. Setting up target format choices or directories for converted files is easily done using pull-down menus. CP 4.5 works less smoothly when a directory contains numerous files of mixed format, a common situation on many servers. Batch preferences are limited to a single file format type for each of the four general categories of file type. In these situations, we would anticipate significant user oversight of the conversion operation. Conversion and error statistics are displayed at the end of a batch job, and the information can be written to a log file.

    Comparison and Analysis After Conversion to Source File Format

    During our tests of DataJunction and Conversions Plus, we manually compared the standard test file and other sample files in two states: before and after conversion. The comparison was conducted on two Windows NT workstations using two monitors of similar size and features. The comparison entailed a line-by-line examination of structural and data elements in each file. Conversion errors occurred in Conversions Plus only on data that contained floating-point numbers. We modified the Examiner program to identify .wk1 files in our archive that contained a structural element for floating-point numbers (tag 0Eh) and ran it against our collection. The program indicated that 8,619 files, or 96 percent of the collection, contained floating-point numbers. Because this is a significant portion of the collection, conversion using Conversions Plus was not attempted.

    6. Migration Risk Analysis for 1-2-3 .wk1 Files

    We examined three migration options for 1-2-3 files: backward compatibility, and file migration to Excel .xls and to ASCII characters.

    Backward Compatibility

    Backward compatibility of 1-2-3 files provides a baseline for comparisons with Excel and ASCII. Data captured in older 1-2-3 files are still readable in more recent 1-2-3 software. This strong backward compatibility support indicates that old file formats are superseded, but not obsolete; older files are not "orphaned" by major revisions in Lotus software. In addition, earlier application software can operate either in DOS operating systems or in the DOS emulator in Windows 95/98/NT. Although 1-2-3 has a reduced share of the spreadsheet market, Lotus is still providing strong support for this product. Evaluating these factors, it appears that avoiding migration and relying on the backward compatibility to sustain the .wk1 format can be considered a low-risk option.

    Migration to Excel .xls Format

    Excel is currently the market leader for spreadsheet software in both Windows and Macintosh operating systems and it has established a large corporate and private user base. If we can make predictions on the basis of the documented history of 1-2-3, Excel .xls should be a heavily used format for the next 10 years. We also believe that if Excel is superseded by another spreadsheet program, Microsoft will provide reliable migration software from Excel to the new target format. Examination of our standard test file and a random sampling of files from the archive indicate that the latest version, Excel for Windows 97, provides an accurate conversion of 1-2-3 .wk1 files. Four major components of a spreadsheet filetext strings, integers, floating-point numbers, and embedded formulas and functionswere properly converted; they retained content, context, and a reasonable reproduction of the "functional experience." Our evaluation indicates that migration of .wk1 to .xls format is a low-risk option. Unfortunately, Excel itself is a poor choice for conversion software because it cannot perform batch conversions of files.

    Migration to ASCII

    ASCII is the format of choice for large numeric file archives. ASCII files are easily scanned manually and can be imported into most software programs. ASCII is perceived as a low-maintenance format. On the basis of three decades of experience, most digital archivists predict ASCII will still be a common file format in 50 years. Given the proper circumstances, a single migration to ASCII should be more cost-effective than repeated migrations through other evolving file formats. At its most fundamental level, migration of 1-2-3 files to ASCII converts the content of spreadsheet cells to values located in a matrix of x­y coordinates. The actual values of embedded functions, equations, or pointers to other cells will be retained, but the functions, equations, or pointers in those cells will be lost. Long text strings, essentially embedded metadata, are truncated at different lengths, depending on the conversion software. The formula, functions, pointers, and text strings could be recorded in an external conversion record, but for large collections, this might be impractical.

    Our assessment suggests that ASCII is a low-risk preservation option for the .wk1 format and could be adopted if dependence on backward compatibility or conversion to Excel .xls format were impossible.

    7. Metadata Risk

    There are six sources of metadata to coordinate:

    1. MARC record. Each of the 250+ data sets is cataloged and has a full MARC record in the campus online catalog.
    2. Gateway record. Each data set also has a MARC-like record for the Cornell University Library gateway to online resources.
    3. USDA Economics and Statistics System Database. Each data set also has a detailed record in a searchable database designed specifically for USDA System users.
    4. README documentation. Each data set has a separate ASCII text README file. The README documentation is a mix of descriptive, content, and administrative metadata.
    5. File hierarchy. Placement and selection of individual 1-2-3 files depend on descriptive data associated with directory names.
    6. File names. 1-2-3 files use the DOS 8.3 naming convention. A common name for a .wk1 file in the collection is table01.wk1. The file extension (.wk1 or .xls) is a necessary feature for the file to be read by a spreadsheet application.

    Migration of the 1-2-3 files to another format would require modification of two of these six metadata sources: the file name and the README files.

    • The file name modification would change the file name extension.
      For example: table01.wk1 > table01.xls
      The conversion software should implement the file name modification. A possible risk could be introduced if the file name were manually changed without processing by the conversion software.
    • README documentation would be modified to reflect the new file name extension. For instance, the following table example was extracted from a sample USDA data set, Feed Yearbook:
      Table12.wk1 12. Farm programs and participation, 1975–1998
      Table13.wk1 13. Average prices received by farmers, United States, by month, and loan rate, 1975/76–1998/99
      The file name extension in the documentation would be converted from
      Table12.wk1 > Table12.xls
      Table13.wk1 > Table13.xls

    If this conversion is required, the file extension conversion can be modified in a word processor using a find/replace feature. Using powerful search-and-replace functions on the documentation introduces a low-level risk. Modification of documentation will need thorough review before the migration process is considered complete. In addition, new administrative information describing the migration process will need to be added to the record. This would include, but not be restricted to:

    • date of migration or modification,
    • data set involved,
    • description of conversion process,
    • identification of conversion software,
    • identified risks and actions taken to manage them, and
    • person supervising the migration or modification.

    Because of the limited modifications required to file names and documentation and the lack of system scripts, the migration risk introduced to metadata is low.

    8. Summary

    From these findings, we believe we can draw the following conclusions:

    1. 1-2-3 format-specific migration risks associated with specific file conversion software can be identified and described. The use of a standard test file allows side-by-side comparison of file structure and content before and after conversion. Errors or significant changes to the information can be isolated and examined for their persistence or effect, or for both.
    2. Some measures of risk can be quantified. The Examiner program can measure the number and frequency of problematic file tags within a collection. Although the measurement is crude, it provides a general measure of risk for a collection, a step that has been lacking in digital file management. Further refinements to the program could provide greater resolution of problem file attributes.
    3. Migration paths for the .wk1 format can be mapped. We can describe the relationship between the source and target formats and the conversion software with reasonable certainty. For example:

      Source

      Black Box

      Target

      Risk

             

      .wk1

      Excel

      .xls

      Low

      .wk1

      DataJunction

      .xls

      Low

      .wk1

      Conversion Plus

      .xls

      High

    4. Cost-effective COTS migration software was not identified. The two programs analyzed in this case study were considered the best candidates for a COTS migration program. Both were found deficient in basic, but significant, tasks. Minor modifications to the programs would alleviate our reservations. We intend to bring our results to the attention of these developers.

    rule

    Links to other parts of this report:

    Table of Contents

    Risk Management of Digital Information

    Appendix A: Risk-Assessment Workbook

    Appendix B: Documentation for Format Migration Test File, Lotus 1-2-3, Release 2.2

    Appendix C: Documentation: Examiner and RiskEditor

    Appendix D: Case Study for Image File Format

    Appendix F: Migration Software Analysis, Software Assessment Sheet

    Appendix G: Specifications for the Cornell Digital Library Format

    Return to CLIR Home Page >>