Appendix C
Documentation: Examiner and RiskEditor
What Is This Software for?
File migration and "black-box" converters
A major risk in migrating collections of files is the conversion
software used to translate the files from the original format to
our chosen target format. We start with a file whose content we hope
to translate without corruption. We send it through a "black
box" and hope that the integrity of the content will be preserved.
We can presume success if we know that the conversion software faithfully
maps every property of the source format to corresponding features
in the target format (assuming, of course, that the target format
has a feature set that is rich enough to store the properties and
data of the source). For example, if the document format we are converting
has a way to indicate bold text, and the target format can also indicate
bold text, we want to know that the conversion software correctly
maps bold to bold. More important, in most cases, data values, whether
numeric, image, or text, should also move from one format to the
other intact.
Two ways of evaluating the black box
If we can examine the mapping process and the data-moving techniques
of the conversion software, we can evaluate the correctness of both
functions. This examination must be repeated for every combination
of source and target formats with which we are working, because each
combination has a unique mapping. Moreover, to attempt this method,
we must have access to the source code of the converter and possess
the expertise to evaluate the code. Our attempts to obtain source
code from commercial software vendors have not been fruitful. Even
if it were, the resources necessary for evaluating a specific mapping
for every combination of source and target formats make this an impractical
method for creating a general and expandable technique for assessing
the risk involved in migrating collections.
Another method is to compare a converted file with the original
file. If the result meets our standard of success, whatever that
standard may be, we can say that the conversion software has performed
adequately. However, we can make that statement solely about the
particular source file we converted. The ideal file for the test
would be one that tested all the features of the source format and
tested data values at the minimum and maximum of every range possible.
If that file were run through the converter, the resulting file could
be compared at every point with the original.
Our approach
For our own collection of Lotus 1-2-3 files, we created a test file
in the .wk1 format. With it, we can evaluate potential conversion
software by running the software on the test file and then comparing
the converted file with the test file. Visual inspection and comparison
of all the properties and values are necessary to identify differences;
this took about two hours. The visual inspection method requires neither
proprietary source code nor knowledge of an uncertain number of
format-to-format mappings. Another benefit of the test file
is that it provides a baseline against which to evaluate and compare
multiple conversion applications.
Regardless of the method used to evaluate the conversion software,
if any of the properties or data values are not the same in the source
and target files, then we know that the conversion software has introduced
one or more points of risk. Thinking about the whole collection of
files to be migrated, we will want to know whether some of the files
in the collection have any at-risk properties. We can then decide
whether to find another converter, to refrain from migrating those
files, or to consider some or all of the loss acceptable.
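The reasoning above can be sketched in code. The following is a minimal illustration only, assuming the properties and values of the test file and of the converted file have already been written down as name/value pairs (in our work this comparison was done by visual inspection). The class name, method name, and property names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: lists properties that did not survive a conversion. */
final class ConversionDiff {

    /** Returns the names of source properties that are missing or changed in the converted file. */
    static Set<String> atRisk(Map<String, String> source, Map<String, String> converted) {
        Set<String> risky = new TreeSet<>();
        for (Map.Entry<String, String> entry : source.entrySet()) {
            // A property is at risk when its converted value is absent or different.
            if (!Objects.equals(entry.getValue(), converted.get(entry.getKey()))) {
                risky.add(entry.getKey());
            }
        }
        return risky;
    }

    public static void main(String[] args) {
        // Invented property names and values, for illustration only.
        Map<String, String> source = new LinkedHashMap<>();
        source.put("A1.bold", "true");
        source.put("B2.value", "0.1");
        Map<String, String> converted = new LinkedHashMap<>();
        converted.put("A1.bold", "true");
        converted.put("B2.value", "0.10000000149");
        System.out.println(atRisk(source, converted));
    }
}
```

Any name returned by such a comparison marks a point of risk to look for across the whole collection.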
We wrote the Examiner software application to test a collection
of files for the presence of particular properties. Using the RiskEditor
application, we indicate the properties that are at risk. If desired,
we can order them by the degree of importance or impact. Then we
run the Examiner application on a part or all of the collection.
Examiner produces a report that lists which files contain the properties
in question. With this knowledge, we can make an informed decision
about the technical risks introduced by the conversion software.
The Examiner application is written in Java, and both its user documentation
and technical documentation are available as HTML files. Examiner
is designed to be extendable to any file format that indicates properties
as numbered tags, including Lotus 1-2-3 and TIFF, the formats of
our case-study collections. A requirement for running the application
is a Java interpreter on the computer holding the collection. We
wrote a command-line version of the program to be used on our Unix
servers, but the program could be easily extended to have a graphical
user interface.
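To illustrate what "properties as numbered tags" means in practice, here is a minimal sketch that counts the tags in one file. It is not Examiner's source code. It assumes the record layout commonly documented for .wk1 files (a 2-byte little-endian tag number, a 2-byte little-endian body length, then the body), and the class and method names are hypothetical. It also uses Java APIs newer than JDK 1.1.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Examiner: counts numbered tags in a tagged file. */
final class TagCounter {

    /** Returns tag number -> number of records carrying that tag. */
    static Map<Integer, Integer> countTags(byte[] fileBytes) {
        ByteBuffer buf = ByteBuffer.wrap(fileBytes).order(ByteOrder.LITTLE_ENDIAN);
        Map<Integer, Integer> counts = new TreeMap<>();
        // Each record needs at least a 4-byte header (tag + length).
        while (buf.remaining() >= 4) {
            int tag = buf.getShort() & 0xFFFF;
            int length = buf.getShort() & 0xFFFF;
            if (length > buf.remaining()) {
                throw new IllegalArgumentException("Truncated record for tag " + tag);
            }
            buf.position(buf.position() + length); // skip the record body
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, Integer> counts = countTags(Files.readAllBytes(Paths.get(args[0])));
        counts.forEach((tag, qty) -> System.out.println("Tag " + tag + " Qty: " + qty));
    }
}
```

Because only the header of each record is interpreted, the same loop would apply to any format that follows this tag-and-length layout; a format such as TIFF, which organizes its tags differently, would need its own reader.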
Installation
Requirements
- A JDK 1.1-compliant Java virtual machine installed on the same
computer where your wk* files are stored.
- Adequate Unix or Windows privileges to create a directory and
grant write permission on the files within it.
Installing
Unix
- Unzip and untar "examiner.tar.gz".
- Give the user permission to run the files in "examiner/bin".
- In the same directory, give the user permission to write to the
files "defaultProperties", "appProperties",
and any files ending with ".rsk".
- Add the /examiner/bin directory's path to the CLASSPATH environment
variable in the user's profile, or edit the "examiner" and "riskEdit" scripts
to point to the appropriate path. Comments in the scripts explain
what must be done. You may want to put them in a directory in the
user's executable PATH.
Windows
- Unzip "examiner.zip".
- The users should have permission to write to files by default.
If that is not the case with a particular user, give the user permission
to write to the files "defaultProperties", "appProperties",
and any files ending with ".rsk".
- Add the "\examiner\bin" directory's path to the CLASSPATH
environment variable in the user's profile, or edit "examiner.bat" and "riskEdit.bat" to
point to the appropriate path. Comments in the batch files explain
what must be done. Users may want to put them in a directory in
their executable PATH.
Running Examiner and RiskEditor
If the PATH environment variable includes the Java executables and
the CLASSPATH variable includes the Examiner class files, change to the directory
with the Examiner class files, and type "java Examiner" or "java
RiskEditor" on the command line. Then answer the prompts.
If the Unix scripts, examiner and riskEditor, or the DOS batch files
examiner.bat and riskEditor.bat have been edited to include local
directory information, type "examiner" or "riskEditor" on
the command line. Then answer the prompts.
Using RiskEditor
For the Examiner program to selectively identify risk or impact
associated with individual tags, the user must first assign a value
to the risk/impact of the presence of a particular tag in the files.
RiskEditor enables users to mark tags with a value between 1 (low)
and 5 (high). After having converted a test file into another format
and having compared the data and functions of the two files, the
user knows what attributes have not been converted successfully.
Some failures may be more important than others. By comparing the
features to a list of the tags in the source format, users can identify
the tags they want to look for in their collection.
Here is an example of a RiskEditor session, with comments.
- What file type would you like to edit? [wk1, wks]
[Users are given a choice from among the file types for which
there are .rsk files in the program's working directory.]
- Do you want to "browse" (move through the tags in sequence)
or "specify" (edit specific tags)?
[The "browse" mode moves through the tags sequentially,
while the "specify" mode simply asks for the number of
a tag to be changed.]
- Enter decimal number of tag to be changed, or "quit":
14
- Tag number: 14 Value: 5
[This is the "specify mode". "Browse" mode
shows only the tag number/value line.]
- Enter a new value (1-5), "ClearAll" to reset every
value to 1, "save" to save your changes, or "quit"
["1" represents the lowest priority, "5" the
highest. "save" always writes over the appropriate
.rsk file. "quit" prompts you to save if you have
changed some information.]
There is no need to assign values to all the tags. In practice,
we have not had to mark more than three tags at one time. One of
the tags was more important than the others because it represented
a feature we felt we could not allow to be corrupted during migration;
we marked it as a 5. The other two features represented risks we
could accept; we gave each a risk/impact level of 4.
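The marking scheme can be modeled as a small table from tag number to level. The sketch below is an in-memory illustration only; it is not RiskEditor's code and says nothing about how .rsk files are stored. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of per-tag risk/impact levels. */
final class RiskTable {
    static final int LOWEST = 1;
    static final int HIGHEST = 5;

    private final Map<Integer, Integer> levels = new TreeMap<>();

    /** Tags that were never marked are treated as level 1. */
    int levelOf(int tag) {
        return levels.getOrDefault(tag, LOWEST);
    }

    /** Assigns a level between 1 (low) and 5 (high) to a tag. */
    void mark(int tag, int level) {
        if (level < LOWEST || level > HIGHEST) {
            throw new IllegalArgumentException("Level must be 1-5, got " + level);
        }
        levels.put(tag, level);
    }

    /** Mirrors the "ClearAll" command: every tag returns to level 1. */
    void clearAll() {
        levels.clear();
    }

    public static void main(String[] args) {
        RiskTable table = new RiskTable();
        table.mark(14, 5); // the tag we could not allow to be corrupted
        System.out.println("Tag number: 14 Value: " + table.levelOf(14));
    }
}
```

Defaulting unmarked tags to 1 matches the practice described above: only the few tags of interest need to be marked.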
Using Examiner
If you have not yet used RiskEditor to assign risk/impact levels to
the tags you are interested in, see the instructions for that application
first. Then run the Examiner program on the collection of files you
want to examine.
A sample session
Here is an example of a session, with comments in square brackets. User input follows each prompt.
- $ examiner [this session was run from a Unix shell script]
Tue, Oct 12, 1999 02:45:39 PM [start time]
Examiner....
- Please enter the file type to be examined [wk1, wks]: wk1
[File types for which there are tag descriptions are in brackets, for
example, .wk1, .wks.]
- What is the starting directory? [/usr/local/Examiner]
/usda/ftp/usda/data-sets
[The default is the directory from where the program is running.]
- What is the minimum risk/impact level to be displayed? 5
[We are interested only in the highest level in this session.]
- In which file should the report be stored?:
/usda/testdir/wk1.run.5.report
[The user must have permission to write a file.]
- Working... [Lots of dots deleted]...
[Dots march across the screen to indicate that something is
happening; the program hasn't died.]
- Number of files in the file hierarchy = 31268
Number of wk1 files examined = 8979
[These numbers appear in the report, not only on the screen.]
Tue, Oct 12, 1999 04:54:12 PM [Time when the program finished
its work]
[Elapsed time for this run: about 2 hours, 8 minutes.]
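The two totals printed at the end of the session come from walking the directory tree below the starting directory. A minimal sketch of such a walk follows. It assumes files are selected by file-name extension and that only regular files are counted; neither detail is confirmed for Examiner. The class and method names are hypothetical, and the code uses Java APIs newer than JDK 1.1.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: counts files in a hierarchy and those of one type. */
final class CollectionWalker {

    /** True when the name ends with "." + type, ignoring case (for example "BUDGET.WK1"). */
    static boolean hasType(String fileName, String type) {
        return fileName.toLowerCase(Locale.ROOT)
                .endsWith("." + type.toLowerCase(Locale.ROOT));
    }

    /** Returns every regular file at or below the starting directory. */
    static List<Path> regularFiles(Path start) {
        try (Stream<Path> paths = Files.walk(start)) {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path start = Paths.get(args.length > 0 ? args[0] : ".");
        String type = args.length > 1 ? args[1] : "wk1";
        List<Path> files = regularFiles(start);
        long matching = files.stream()
                .filter(p -> hasType(p.getFileName().toString(), type))
                .count();
        System.out.println("Number of files in the file hierarchy = " + files.size());
        System.out.println("Number of " + type + " files examined = " + matching);
    }
}
```

Selecting by extension alone can pick up files that are not really of the expected type, so the contents of each selected file still need to be checked when it is read.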
A sample report
Here is a heavily edited version of the report for this run; the
original had almost 38,000 lines.
The converter we were evaluating does not create a file that displays
floating-point numbers consistently. Since we were interested only
in one tag, we marked it as level 5 in RiskEditor. All the other
tags were set to 1.
- Examiner
-----
/usda/ftp/usda/data-sets/crops/94018/budget.wk1:
Risk Level 5
Tag 14: NUMBER: Floating point number Qty: 584
[There are 584 cells with floating point numbers.]
-----
/usda/ftp/usda/data-sets/crops/94018/charactr.wk1:
Risk Level 5 There are no tags in this file at this level
[In this file, there are no floating-point numbers. We can trust
the converter we are evaluating to convert this file successfully.]
-----
/usda/ftp/usda/data-sets/crops/94018/conf_int.wk1:
Risk Level 5
Tag 14: NUMBER: Floating point number Qty: 59
-----
[...Deleted lines...]
ERROR:
/usda/ftp/usda/datasets/crops/.district/.finderinfo /parsline.wk1
not a supported file type
[This file was a text file with information about the wk1
files in a directory.]
-----
[...Deleted lines...]
-----
/usda/ftp/usda/data-sets/crops/.district/parsline.wk1:
Risk Level 5 There are no tags in this file at this level
-----
/usda/ftp/usda/data-sets/livestock/89032/acheesu.wk1:
Risk Level 5
Tag 14: NUMBER: Floating point number Qty: 182
-----
Number of files in the file hierarchy = 31268
Number of wk1 files examined = 8979
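The per-file entries in the report combine two pieces of information: how often each tag occurs in the file, and the level assigned to each tag. The sketch below shows one way such an entry could be assembled for a single level. It is an illustration, not Examiner's code; the class and method names are hypothetical, and the tag descriptions shown in the real report (such as "NUMBER: Floating point number") are omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: builds one report entry for one file and one level. */
final class ReportSketch {

    static List<String> section(String path, Map<Integer, Integer> tagCounts,
                                Map<Integer, Integer> riskLevels, int level) {
        List<String> hits = new ArrayList<>();
        // Sorting by tag number keeps the output stable from run to run.
        for (Map.Entry<Integer, Integer> entry : new TreeMap<>(tagCounts).entrySet()) {
            // Tags without an assigned level are treated as level 1.
            if (riskLevels.getOrDefault(entry.getKey(), 1) == level) {
                hits.add("Tag " + entry.getKey() + " Qty: " + entry.getValue());
            }
        }
        List<String> lines = new ArrayList<>();
        lines.add(path + ":");
        if (hits.isEmpty()) {
            lines.add("Risk Level " + level + " There are no tags in this file at this level");
        } else {
            lines.add("Risk Level " + level);
            lines.addAll(hits);
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> riskLevels = new HashMap<>();
        riskLevels.put(14, 5);
        Map<Integer, Integer> tagCounts = new HashMap<>();
        tagCounts.put(14, 584);
        section("budget.wk1", tagCounts, riskLevels, 5).forEach(System.out::println);
    }
}
```

A file whose entry reports no tags at the chosen level contains none of the at-risk properties marked at that level.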
Summary of our approach
- We created a spreadsheet that exercises all of the .wk1 format's
attributes.
- To test a file conversion application's capabilities, we converted
the file from .wk1 to .xls.
- We visually compared the files, point by point, to uncover any
inconsistencies between the two versions.
- We examined the specifications for the .wk1 file format to identify
the internal tags governing the at-risk attributes.
- We used the RiskEditor program to configure the Examiner program,
marking tags at risk.
- We examined the collection of files with the Examiner program,
which returned a report detailing the files containing the at-risk
tags.