Documentation: Examiner and RiskEditor

What Is This Software for?

File migration and “black-box” converters

A major risk in migrating collections of files is the conversion software used to translate the files from the original format to our chosen target format. We start with a file whose content we hope to translate without corruption. We send it through a “black box” and hope that the integrity of the content will be preserved. We can presume success if we know that the conversion software faithfully maps every property of the source format to corresponding features in the target format (assuming, of course, that the target format has a feature set that is rich enough to store the properties and data of the source). For example, if the document format we are converting has a way to indicate bold text, and the target format can also indicate bold text, we want to know that the conversion software correctly maps bold to bold. More important, in most cases, data values, whether numeric, image, or text, should also move from one format to the other intact.

Two ways of evaluating the black box

If we can examine the mapping process and the data-moving techniques of the conversion software, we can evaluate the correctness of both functions. This examination must be repeated for every combination of source and target formats with which we are working, because each combination has a unique mapping. Moreover, to attempt this method, we must have access to the source code of the converter and possess the expertise to evaluate the code. Our attempts to obtain source code from commercial software vendors have not been fruitful. Even if they had been, the resources necessary for evaluating a specific mapping for every combination of source and target formats make this an impractical method for creating a general and expandable technique for assessing the risk involved in migrating collections.

Another method is to compare a converted file with the original file. If the result meets our standard of success, whatever that standard may be, we can say that the conversion software has performed adequately. However, we can make that statement solely about the particular source file we converted. The ideal file for the test would be one that tested all the features of the source format and tested data values at the minimum and maximum of every range possible. If that file were run through the converter, the resulting file could be compared at every point with the original.

Our approach

For our own collection of Lotus 1-2-3 files, we created a test file in the .wk1 format. With it, we can evaluate potential conversion software by running the software on the test file and then comparing the converted file with the test file. Visual inspection and comparison of all the properties and values are necessary to identify differences; in our case, this took about two hours. The visual inspection method requires neither proprietary source code nor knowledge of an uncertain number of format-to-format mappings. Another benefit of the test file is that it provides a baseline against which to evaluate and compare multiple conversion applications.

Regardless of the method used to evaluate the conversion software, if any of the properties or data values are not the same in the source and target files, then we know that the conversion software has introduced one or more points of risk. Thinking about the whole collection of files to be migrated, we will want to know whether some of the files in the collection have any at-risk properties. We can then decide whether to find another converter, to refrain from migrating those files, or to consider some or all of the loss acceptable.

We wrote the Examiner software application to test a collection of files for the presence of particular properties. Using the RiskEditor application, we indicate the properties that are at risk. If desired, we can order them by the degree of importance or impact. Then we run the Examiner application on a part or all of the collection. Examiner produces a report that lists which files contain the properties in question. With this knowledge, we can make an informed decision about the technical risks introduced by the conversion software.

The Examiner application is written in Java, and both its user documentation and technical documentation are available as HTML files. Examiner is designed to be extendable to any file format that indicates properties as numbered tags, including Lotus 1-2-3 and TIFF, the formats of our case-study collections. A requirement for running the application is a Java interpreter on the computer holding the collection. We wrote a command-line version of the program to be used on our Unix servers, but the program could be easily extended to have a graphical user interface.
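
To make the idea of “properties as numbered tags” concrete, here is a minimal sketch of how tags in such a file might be counted. It is not Examiner's actual source. It assumes that each record begins with a 2-byte little-endian tag number followed by a 2-byte little-endian body length, which is how the Lotus worksheet record layout is commonly described; the class name `TagCounter` and the method `countTags` are hypothetical. It also uses current Java library classes rather than the JDK 1.1 API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not taken from the Examiner source code.
class TagCounter {

    // Counts how often each tag number occurs in the given file contents.
    // Assumed record layout: [tag: u16 LE][length: u16 LE][body: length bytes].
    // Fewer than 4 trailing bytes cannot form a record header and are ignored.
    static Map<Integer, Integer> countTags(byte[] fileBytes) {
        ByteBuffer buf = ByteBuffer.wrap(fileBytes).order(ByteOrder.LITTLE_ENDIAN);
        Map<Integer, Integer> counts = new TreeMap<>();
        while (buf.remaining() >= 4) {
            int tag = buf.getShort() & 0xFFFF;     // treat as unsigned
            int length = buf.getShort() & 0xFFFF;  // treat as unsigned
            if (length > buf.remaining()) {
                throw new IllegalArgumentException("record body is truncated for tag " + tag);
            }
            buf.position(buf.position() + length); // skip the body; only the tag matters here
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample: two records with tag 14 (4-byte and 0-byte bodies), one with tag 1.
        byte[] sample = {
            14, 0, 4, 0, 1, 2, 3, 4,
            14, 0, 0, 0,
            1, 0, 0, 0
        };
        System.out.println(countTags(sample)); // prints {1=1, 14=2}
    }
}
```

Because the sketch reads only the tag and length of each record, it never needs to understand a record's body. That is one reason a tag-based scan can be extended to other tagged formats more easily than a full parser.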

Installation

Requirements

  1. A JDK 1.1-compliant Java virtual machine installed on the same computer where your wk* files are stored.
  2. Adequate Unix or Windows privileges to create a directory and grant write permission on files within it.

Installing

Unix
  1. Unzip and untar “examiner.tar.gz”.
  2. Give the user permission to run the files in “examiner/bin”.
  3. In the same directory, give the user permission to write to the files “defaultProperties”, “appProperties”, and any files ending with “.rsk”.
  4. Add the /examiner/bin directory’s path to the CLASSPATH environment variable in the user’s profile, or edit the “examiner” and “riskEdit” scripts to point to the appropriate path. Comments in the scripts explain what must be done. You may want to put them in a directory in the user’s executable PATH.
Windows
  1. Unzip “examiner.zip”.
  2. The users should have permission to write to files by default. If that is not the case with a particular user, give the user permission to write to the files “defaultProperties”, “appProperties”, and any files ending with “.rsk”.
  3. Add the “\examiner\bin” directory’s path to the CLASSPATH environment variable in the user’s profile, or edit “examiner.bat” and “riskEdit.bat” to point to the appropriate path. Comments in the batch files explain what must be done. Users may want to put them in a directory in their executable PATH.

Running Examiner and RiskEditor

If the PATH and CLASSPATH environment variables are set so that both the Java executables and the Examiner class files can be found, change to the directory containing the Examiner class files and type “java Examiner” or “java RiskEditor” on the command line. Then answer the prompts.

If the Unix scripts, examiner and riskEditor, or the DOS batch files examiner.bat and riskEditor.bat have been edited to include local directory information, type “examiner” or “riskEditor” on the command line. Then answer the prompts.

Using RiskEditor

For the Examiner program to selectively identify risk or impact associated with individual tags, the user must first assign a value to the risk/impact of the presence of a particular tag in the files. RiskEditor enables users to mark tags with a value between 1 (low) and 5 (high). After having converted a test file into another format and having compared the data and functions of the two files, the user knows what attributes have not been converted successfully. Some failures may be more important than others. By comparing the features to a list of the tags in the source format, users can identify the tags they want to look for in their collection.

Here is an example of a RiskEditor session, with comments.

 

What file type would you like to edit? [wk1, wks]
[Users are given a choice from among the file types for which there are .rsk files in the program’s working directory.]
Do you want to “browse” (move through the tags in sequence) or “specify” (edit specific tags)?
[The “browse” mode moves through the tags sequentially, while the “specify” mode simply asks for the number of a tag to be changed.]
Enter decimal number of tag to be changed, or “quit”: 14
Tag number: 14 Value: 5
[This is the “specify mode”. “Browse” mode shows only the tag number/value line.]
Enter a new value (1-5), “ClearAll” to reset every value to 1, “save” to save your changes, or “quit”
[“1” represents the lowest priority, “5” the highest. “save” always writes over the appropriate .rsk file. “quit” prompts you to save if you have changed some information.]

There is no need to assign values to all the tags. In practice, we have not had to mark more than three tags at one time. One of the tags was more important than the others because it represented a feature we felt we could not allow to be corrupted during migration; we marked it as a 5. The other two features represented risks we could accept; we gave each a risk/impact level of 4.
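
The value rules shown in the session above (levels from 1 to 5, unmarked tags at 1, and “ClearAll” resetting everything to 1) can be sketched as a small in-memory table. This is only an illustration under stated assumptions: the class `RiskTable` and its methods are hypothetical, the .rsk file format is not described here and so is not modeled, and apart from tag 14 the tag numbers in `main` are made up.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of RiskEditor's value rules; not the actual implementation.
class RiskTable {
    static final int MIN_LEVEL = 1; // lowest priority, also the default
    static final int MAX_LEVEL = 5; // highest priority

    // Only tags that were explicitly marked are stored.
    private final Map<Integer, Integer> levels = new TreeMap<>();

    // Marks a tag with a risk/impact level, rejecting values outside 1-5.
    void set(int tag, int level) {
        if (level < MIN_LEVEL || level > MAX_LEVEL) {
            throw new IllegalArgumentException("level must be between 1 and 5, got " + level);
        }
        levels.put(tag, level);
    }

    // Returns the level of a tag; tags that were never marked count as level 1.
    int get(int tag) {
        return levels.getOrDefault(tag, MIN_LEVEL);
    }

    // Equivalent of "ClearAll": every tag falls back to level 1.
    void clearAll() {
        levels.clear();
    }

    public static void main(String[] args) {
        RiskTable table = new RiskTable();
        table.set(14, 5); // the feature that must not be corrupted
        table.set(20, 4); // hypothetical tag numbers for two lesser risks
        table.set(21, 4);
        System.out.println("Tag number: 14 Value: " + table.get(14));
        System.out.println("Tag number: 15 Value: " + table.get(15)); // unmarked tag
    }
}
```

Storing only the marked tags keeps the default implicit, which matches the observation that only a handful of tags usually need a value.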

Using Examiner

If you have not yet used RiskEditor to assign risk/impact levels to the tags you are interested in, see the instructions for that application first. Then run the Examiner program on the collection of files you want to examine.

A sample session

Here is an example of a session, with comments. User input is in bold.

 

$ examiner
[This session was run from a Unix shell script.]
Tue, Oct 12, 1999 02:45:39 PM
[Start time.]
Examiner….
Please enter the file type to be examined [wk1, wks]: wk1
[File types for which there are tag descriptions are shown in brackets, for example, .wk1, .wks.]
What is the starting directory? [/usr/local/Examiner] /usda/ftp/usda/data-sets
[The default is the directory from where the program is running.]
What is the minimum risk/impact level to be displayed? 5
[We are interested only in the highest level in this session.]
In which file should the report be stored?:
/usda/testdir/wk1.run.5.report
[The user must have permission to write a file.]
Working… [Lots of dots deleted]
[Dots march across the screen to indicate that something is happening and the program hasn’t died.]
Number of files in the file hierarchy = 31268
Number of wk1 files examined = 8979
[These numbers appear in the report, not only on the screen.]
Tue, Oct 12, 1999 04:54:12 PM [Time when the program finished its work]
[Elapsed time for this run: about 2 hours, 8 minutes.]
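
The two counts printed at the end of the session can be reproduced in outline by walking a directory tree and selecting files by extension. The following is a minimal sketch, not Examiner's code: the class `CollectionWalker` and its methods are hypothetical, it relies on `java.nio.file`, which is much newer than JDK 1.1, and it assumes that only regular files are counted and that the extension match ignores case. Neither assumption is confirmed by the session output.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the directory walk; not the actual Examiner source.
class CollectionWalker {

    // Returns every regular file below the starting directory (directories are not counted).
    static List<Path> regularFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if the file name ends with ".<extension>", ignoring case (an assumption).
    static boolean hasExtension(Path file, String extension) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith("." + extension.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // The starting directory defaults to the current directory, as in the session.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> all = regularFiles(root);
        long matching = all.stream().filter(f -> hasExtension(f, "wk1")).count();
        System.out.println("Number of files in the file hierarchy = " + all.size());
        System.out.println("Number of wk1 files examined = " + matching);
    }
}
```

Selecting by extension alone is not enough to guarantee that a file is a real worksheet, so a scan like this still needs to check each file's contents before reading its tags.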

A sample report

Here is a heavily edited version of the report for this run; the original had almost 38,000 lines.

The converter we were evaluating does not create a file that displays floating-point numbers consistently. Since we were interested only in one tag, we marked it as level 5 in RiskEditor. All the other tags were set to 1.

 

Examiner
—–
/usda/ftp/usda/data-sets/crops/94018/budget.wk1:
Risk Level 5
Tag 14: NUMBER: Floating point number -Qty: 584
[There are 584 cells with floating point numbers.]
—–
/usda/ftp/usda/data-sets/crops/94018/charactr.wk1:
Risk Level 5 There are no tags in this file at this level
[In this file, there are no floating-point numbers. We can trust the converter we are evaluating to convert this file successfully.]
—–
/usda/ftp/usda/data-sets/crops/94018/conf_int.wk1:
Risk Level 5
Tag 14: NUMBER: Floating point number – Qty: 59
—–
[…Deleted lines…]
ERROR:/usda/ftp/usda/datasets/crops/.district/.finderinfo /parsline.wk1 not a supported file type
[This file was a text file with information about the wk1 files in a directory.]
—–

[…Deleted lines…]

—–
/usda/ftp/usda/data-sets/crops/.district/parsline.wk1:
Risk Level 5 There are no tags in this file at this level
—–
/usda/ftp/usda/data-sets/livestock/89032/acheesu.wk1:
Risk Level 5
Tag 14: NUMBER: Floating point number -Qty: 182
—–
Number of files in the file hierarchy = 31268
Number of wk1 files examined = 8979
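
The way the minimum risk/impact level shapes each file's entry in the report can be sketched as follows. This is an approximation under stated assumptions, not Examiner's reporting code: the class `ReportSketch` and the method `fileSection` are hypothetical, the layout is inferred from the sample above, and the handling of several levels at once (one “Risk Level” block per level, from 5 down to the minimum) is a guess, because the sample run shows only level 5.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of per-file report formatting; not the actual Examiner source.
class ReportSketch {

    // Builds the report lines for one file. tagCounts maps tag number to occurrences,
    // riskLevels maps tag number to its level (unmarked tags count as 1),
    // and tagNames maps tag number to a description.
    static List<String> fileSection(String path, Map<Integer, Integer> tagCounts,
            Map<Integer, Integer> riskLevels, Map<Integer, String> tagNames, int minLevel) {
        List<String> lines = new ArrayList<>();
        lines.add(path + ":");
        for (int level = 5; level >= minLevel; level--) {
            lines.add("Risk Level " + level);
            boolean found = false;
            for (Map.Entry<Integer, Integer> entry : tagCounts.entrySet()) {
                int tag = entry.getKey();
                if (riskLevels.getOrDefault(tag, 1) == level) {
                    lines.add("Tag " + tag + ": " + tagNames.getOrDefault(tag, "UNKNOWN")
                            + " - Qty: " + entry.getValue());
                    found = true;
                }
            }
            if (!found) {
                lines.add("There are no tags in this file at this level");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> tagCounts = new TreeMap<>();
        tagCounts.put(14, 584); // count taken from the sample report
        tagCounts.put(15, 20);  // made-up second tag, left at the default level
        Map<Integer, Integer> riskLevels = new TreeMap<>();
        riskLevels.put(14, 5);
        Map<Integer, String> tagNames = new TreeMap<>();
        tagNames.put(14, "NUMBER: Floating point number");
        for (String line : fileSection("/usda/ftp/usda/data-sets/crops/94018/budget.wk1",
                tagCounts, riskLevels, tagNames, 5)) {
            System.out.println(line);
        }
    }
}
```

With a minimum level of 5, the made-up tag 15 is left out of the output because it stays at the default level of 1, which mirrors how the sample run reported only floating-point cells.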

Summary of our approach

  1. We created a spreadsheet that exercises all of the .wk1 file’s attributes.
  2. To test a file conversion application’s capabilities, we converted the file from .wk1 to .xls.
  3. We visually compared the files, point by point, to uncover any inconsistencies between the two versions.
  4. We examined the specifications for the .wk1 file format to identify the internal tags governing the at-risk attributes.
  5. We used the RiskEditor program to configure the Examiner program, marking tags at risk.
  6. We examined the collection of files with the Examiner program, which returned a report detailing the files containing the at-risk tags.