Findings and Recommendations
Upon completing the steps defined in both the data and documentation
tracks, we carefully examined the processes of migrating hardware-
and software-dependent formats to independent formats. We also evaluated
the formats in relation to their utility and ease of use over time.
In both the data migration and the documentation migration processes,
it was imperative that the original content be preserved. Migration
that included recoding the content of the data files (changing
the character or numeric equivalents in a data file) proved to be
labor-intensive and error-prone, and produced unacceptable changes
to the original content of the data. Editing the text output from
the scanning process proved to be the same: error-prone, time-consuming,
and incomplete. Therefore, recoding as a part of migration is not
recommended. However, simply copying a file in its original format
from medium to medium (refreshing) is not enough.
Software-dependent data file formats, such as the original column
binary files examined in the project, cannot be read without specific
software routines. If standard software packages do not offer those
specific routines in the future, translation programs that emulate
the software's reading of the column binary format could provide
a solution. However, these emulation programs will themselves require
migration strategies over time. We offer another alternative for
the column binary format: convert the data out of the column
binary format into ASCII without changing the coded values of the
files. The spread ASCII format meets the criterion of software independence
while simultaneously preserving the original content of the data
set. It does, however, require a file-by-file migration strategy
that would be time-consuming for a large collection of files.
Finding a parallel solution for the documentation files is not possible
at this time. We cannot accurately generate character-by-character
equivalents of the paper records. We can, however, scan the paper
into digital representations that could be used in future character
recognition technologies. The Adobe PDF image+text format does provide
an interim solution by producing digital versions of the image and
limited ASCII representation of the text. However, the types of documentation
files produced by the Adobe process are software dependent. If software
packages move away from the format used to store the image+text files,
translators will be necessary to search, print, and display the files.
We therefore recommend archiving the images of the printed pages
in both nonproprietary TIFF format and PDF image+text format.
User Evaluation
We asked faculty and graduate students to make an informal review
of the findings and sample output from the project. All our evaluators
had previous experience using data files from the Yale Roper Collection.
Regarding the data conversions, they expressed relief at not having
to use column binary input statements to read the data files. They
had no difficulty in using the chart that mapped column binary to
ASCII in order to locate variables in the spread ASCII version of
the data files, once it was explained that each variable mapped directly
to a 12-column equivalent and instructions were given for finding
single- and multiple-punch locations.
As for the documentation track, faculty and student reviewers found
viewing and browsing PDF format files acceptable. Since most users
had accessed PDF files on the Web, they seemed comfortable moving
from the Internet browser to the Adobe Reader to locate question
text and variable location information. This may not be the case
with inexperienced users. The reviewers were eager to have more questionnaire
texts available for browsing and searching. The lack of a large sample
collection of questionnaires did not allow evaluation of question
text searching on a large scale.
Findings about Data Conversion
Column binary into SAS and SPSS. As long as software packages
can read the SAS and SPSS export formats, recoding the column binary
format into SAS and SPSS export files is an attractive option. These
file formats can be used easily, are transportable to multiple operating
systems and equipment configurations, and can be transformed into
other software-specific formats. They do, however, have a number
of drawbacks.
First, the original data file must be recoded, a process that is
lengthy and potentially error-prone and one that places great reliance
on the person doing the translation. If that person does not adequately
check for errors, annotate the documentation for irregular variables,
or properly recode the original patterns of punches, the translated
data set becomes inconsistent with the original. Also, some irregularities
in the original data set, which may be meaningful in the analysis
of the data, can become lost when the data set is cleaned up. For
instance, the original documentation might indicate that a question
has four possible responses: PUNCH.1 through PUNCH.3 for a rating
scale, and PUNCH.12 for a "don't know" response. On examination
of the xray, though, it is discovered that about 200 people had their
responses coded as PUNCH.11, not PUNCH.12. A decision must be made:
will those PUNCH.11 observations be given the same special missing
value code as that for PUNCH.12? This solution will put the responses
in the data set into accordance with the documentation by assuming
that the strange punches were simply due to errors in data entry.
On the other hand, the strange punches could have been intentionally
entered to mark out those observations for special reasons that are
not listed in the documentation. In this case, it would be better
to give the PUNCH.11 observations a special missing value different
from that for PUNCH.12. Once the recoding is done, future researchers
will be unable to re-create the original data set with its irregularities.
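The decision described above can be made explicit in code. The following is a minimal Python sketch, not the SAS program used in the project. It assumes that the punches for one respondent in one column are available as a set of row numbers. The function name, the merge_undocumented switch, and the two missing-value codes are hypothetical and chosen only for illustration.

```python
# Hypothetical special-missing codes; the documentation lists only PUNCH.12.
DONT_KNOW = ".D"      # documented "don't know" response (PUNCH.12)
UNDOCUMENTED = ".U"   # undocumented PUNCH.11, kept distinct by default


def recode_rating(punches, merge_undocumented=False):
    """Recode the punched rows of one column into an analysis value.

    punches: set of row numbers punched for one respondent, e.g. {2}.
    merge_undocumented: if True, treat PUNCH.11 as a data entry error
    and give it the same code as PUNCH.12.
    """
    if len(punches) != 1:
        raise ValueError(f"expected a single punch, got {sorted(punches)}")
    (row,) = punches
    if row in (1, 2, 3):
        return row                      # rating scale value
    if row == 12:
        return DONT_KNOW
    if row == 11:
        return DONT_KNOW if merge_undocumented else UNDOCUMENTED
    raise ValueError(f"unexpected punch {row}")
```

Keeping the two codes distinct preserves the difference between the groups at the cost of departing from the documentation. Merging them discards that difference in the recoded file.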
Second, this process of reading, recoding, and cleaning data files
to produce SAS (or SPSS) system files and export files is very time-consuming.
For example, it took 20 hours of work to write, debug, and double-check
a SAS program to recode the data set for Roper Report 9209, which
includes a split sample and a variety of variable types. Assuming
a wage of $15 an hour for an experienced SAS programmer, we would
expect a cost of approximately $300 per data set for a complete
job of data recoding. This estimate, of course, does not include
the cost of consultations with data archivists about the recoding
of particular variables or the cost of rewriting documentation to
reflect the new format of the data set.
Third, data files stored in SAS and SPSS (and other statistical
software) formats require proprietary software to read the information.
Although there has been an increase in programs that can read and
transfer data files from one program, and one version of a program,
to another, there is no guarantee that programs for specific versions
of software will be available in the future. U.S. Census data from
the 1970s were produced in a compressed format (called Dualabs) that
relied on custom programs and can no longer be read on most of today's
computing platforms. Such system- and software-dependent formats
require expensive migration strategies to move them to future computing
technologies.
Spread ASCII format. If an archival standard is defined as
a non-column binary, nonproprietary format that faithfully reproduces
the content of the original files, only the spread ASCII format meets
these conditions. This spread format, however, is at least 600 percent
larger than the original file and requires converting the original
column binary structure. It also requires producing additional documentation,
since each punch listed in the original documentation must be assigned
a new column location in the spread ASCII data. We recommend that
each file be converted to a standard spread ASCII format so that
a single conversion map may be used for all the data files (see Appendix
5 for an example from this project). Producing such an ASCII data
set has several advantages. Information is not lost from the original
data set, because the pattern of 0s and 1s remains conceptually the
same across data files. It is unnecessary for a data translator to
interpose herself between the original data and the final user.
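The conversion itself is mechanical, which is why no interpretation by a translator is needed. The Python sketch below illustrates the idea under stated assumptions and is not the program used in the project. It assumes a common card-image layout in which each of the 80 card columns is stored as two bytes, the first holding rows 12, 11, 0, 1, 2, 3 and the second holding rows 4 through 9, each in the low six bits. The byte layout, the row order, and the function names are assumptions.

```python
COLUMNS_PER_CARD = 80  # assumed card-image width


def spread_column(high, low):
    """Turn one card column, stored as two 6-bit bytes, into 12 characters.

    Assumes `high` holds rows 12, 11, 0, 1, 2, 3 and `low` holds rows 4-9,
    each in the low six bits with the topmost row in the most significant bit.
    The result lists the rows in the order 12, 11, 0, 1, ..., 9.
    """
    bits = ((high & 0x3F) << 6) | (low & 0x3F)
    return format(bits, "012b")


def spread_card(card):
    """Convert a 160-byte column binary card image to a 960-character record."""
    if len(card) != 2 * COLUMNS_PER_CARD:
        raise ValueError(f"expected 160 bytes, got {len(card)}")
    return "".join(
        spread_column(card[i], card[i + 1]) for i in range(0, len(card), 2)
    )


# A column with only row 12 punched becomes a 1 followed by eleven 0s.
example = spread_column(0b100000, 0)  # "100000000000"
```

Because every bit is copied to a character without interpretation, the coded values are unchanged and the same routine applies to every file that shares the assumed layout.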
However, the spread ASCII format is not perfect. The storage requirements
for spread ASCII data are on a par with the size requirements for
a SAS data set containing both recoded and intermediate variables.
For the overhead in storage cost, the SAS data set at least provides
internally referenced variable names and meaningful variable values.
The size of the spread ASCII data becomes even more apparent when
it is contrasted with the size of a recoded ASCII data set. In a
recoded data set, each unused bit can be left out of the final data;
particular bits in a column need not even be input if they contain
no values for the variable, and columns without variables may also
be skipped. In contrast, in the spread ASCII data, each bit is input
and translated to a character, whether it is used or not.
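The size difference follows from simple arithmetic. The Python sketch below assumes that column binary stores each 12-row card column in two bytes and that the spread format uses one ASCII character per punch position. Both the two-byte figure and the function name are assumptions made for illustration.

```python
def record_sizes(columns=80, bytes_per_binary_column=2, rows_per_column=12):
    """Return (column binary bytes, spread ASCII bytes, ratio) for one card."""
    binary = columns * bytes_per_binary_column
    spread = columns * rows_per_column  # one character per punch position
    return binary, spread, spread / binary


print(record_sizes())  # (160, 960, 6.0)
```

Under these assumptions a spread record is six times the size of its column binary source, regardless of how many punch positions actually carry data.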
The spread ASCII format requires that users know how the variables,
particularly the multiple-response variables, relate to the 12-column
equivalent. A spread ASCII data set is not as easily used as a fully
recoded one; after all, recoding the 0s and 1s into usable variable
values will fall to the end user with ASCII data, just as it currently
does with column binary data. Finally, each punch listed in the original
documentation must be assigned a new column location in the spread
ASCII data. Users must refer to an additional piece of documentation,
the ASCII data map, to locate data of interest, and this extra step
inevitably creates some initial confusion.
Hybrid spread ASCII. The hybrid spread ASCII format, distributed
by the Roper Center and the Institute for Research in Social Science
at University of North Carolina, offers another alternative. The
original data are stored in column binary format in the first horizontal
layer of the file. This format preserves the original structure of
the data file in the first part of a new record. A second horizontal
layer of converted data, in spread ASCII format, is added to the
new record. The primary advantage of this hybrid spread ASCII data
file format is that users can access the nonbinary portions of the
original data file in their original column locations as indicated
on the questionnaires. However, users have to know whether the question
and variables of interest were coded in the binary format in order
to determine whether to read the first part of the record or the
second. The ASCII codes in the first horizontal layer are readable,
but any binary coding in that first horizontal layer is not, since
the binary coding is converted to ASCII to avoid problems while reading
the data with statistical software. If users want to read data that
were coded in binary in the original file, they can read the spread
ASCII version of multiple-punched equivalents in the second horizontal
layer without having to learn the column binary input statements
to read the data.
We chose to produce converted ASCII files that did not have this
two-part structure. To use the spread ASCII data files we constructed,
users first determine the original column location of a particular
variable from the questionnaire, and then use a simple data map to
locate the column location in the new spread ASCII file (see Appendix
5).
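The lookup a user performs with such a data map can be sketched in a few lines of Python. This is an illustration only and does not reproduce the map in Appendix 5. It assumes that each original column occupies 12 consecutive positions in the spread record and that the punches within a column are ordered 12, 11, 0, 1 through 9. The row order and the function names are assumptions.

```python
ROW_ORDER = (12, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9)  # assumed punch order


def spread_range(column):
    """Return the 1-based (first, last) positions of an original column."""
    if not 1 <= column <= 80:
        raise ValueError(f"column out of range: {column}")
    first = (column - 1) * 12 + 1
    return first, first + 11


def spread_position(column, punch):
    """Return the 1-based spread position of one punch in an original column."""
    first, _ = spread_range(column)
    return first + ROW_ORDER.index(punch)


print(spread_range(5))        # (49, 60)
print(spread_position(5, 1))  # 52
```

Because the mapping depends only on the column and punch numbers, a single map of this kind can serve every file converted to the same spread layout.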
Original column binary format. The column binary format itself
turned out to be more attractive as a long-term archival standard
than we had anticipated. It conserves space, it preserves the original
coding of data and matches the column location information in the
documentation, it can be transferred among computers in standard
binary, and it can be read by standard statistical packages on all
of the platforms we used in testing. Unfortunately, it is difficult
to locate and decipher information about how to read column binary
data with SAS and SPSS, as the latest manuals (for PC versions of
the software) no longer contain supporting information about this
format. This lack of documentation support indicates the possibility
that the input formats will not be offered in subsequent software
versions.
On the other hand, as long as the format exists, there seems to
be some level of commitment to support it. As stated in the SAS
Language: Reference text, "because multipunched decks
and card-image data sets remain in existence the SAS System provides
informats for reading column-binary data" (SAS 1990, 38-9).
If the column binary format is refreshed onto new media and preserved
only in its original form, we recommend that sample programs for
reading the data with standard statistical packages, or a stand-alone
translation program, be included in the collection of supporting
files. But even with these supplemental programs, accessing and converting
the files will continue to present significant challenges to researchers.
Given these considerations, we do not recommend this format over
the spread ASCII alternative.
Findings about Documentation Conversion
The OCR output files from TextBridge Pro did not provide us with
an adequate means for archiving the textual material. The questionnaires
we scanned had an unacceptable rate of character recognition, including
incorrect location information necessary for manipulating the accompanying
data files. Handwritten notes were completely lost. Although the
format allowed for searching of a particular text that was successfully
recognized, the amount of editing required to produce a legible version
of the original, review the output, and correct all errors was found
to be prohibitive. Subsequent viewing was poor without the formatting
capabilities of proprietary word processing software.
The PDF format provided solutions to some of the documentation distribution
and preservation problems we faced, but it did not meet all of our
needs. For one thing, the format does not go far enough in providing
internal structure for the manipulation, output, and analysis of
the metadata. Like a tagged MARC record in an online public access
catalog, full-text documentation for numeric data requires specific
content tagging to allow search, retrieval, manipulation, and reformatting
of individual sections of the information. Another drawback is that
PDF files are produced and stored in a format that may be difficult
to read and search in the future. The PDF format, although a published
standard, depends on proprietary software that may not be available
in future computing environments. (A similar problem can be seen
with dBase data files that are rapidly becoming outmoded, causing
major problems with large collections of CD-ROM products distributed
by the U.S. government.) We see increasing numbers of PDF documents
distributed on the Internet and the format will be used by ICPSR
for the distribution of machine-readable documentation to its member
institutions. So, given the large number of PDF files in distribution,
software for conversion will most likely be developed over time.
However, the current popularity of the PDF format does not guarantee
that software to read it will continue to be available throughout
the future of technological evolution.
The TIFF graphic image file format is useful for viewing and for
distribution on the Web and as an intermediate archival format, allowing
storage of files until they can be processed in the future using
more advanced OCR technology. Even though this format does not allow
text searching, tagging/mark up, or editing, it moves the endangered
material into digital format. ICPSR has decided to retain such digital
images of its printed documentation collection for reprocessing as
OCR technology evolves.
It is our recommendation that both PDF/Adobe Capture edited output
and scanned image files be produced and archived. The PDF/Adobe image+text
files allow searching and viewing of text in original formatting.
The scanned image files can be archived for future character recognition
and enhancement. We also want to emphasize the importance of the
long-term development of tagged documentation using the DDI format.
This is by far the most desirable format, albeit one that is difficult
to produce from printed documentation, given the inadequacy of OCR
technology and the costs of subsequent editing. We urge future producers
and distributors of numeric data to help develop and adhere to standard
system-independent documentation.
Recommendations to Data Producers
Design affects maintenance costs and long-term preservation. Producers
of statistical data files need to be cognizant of preservation strategies
and the importance of system-independent formats that can be migrated
through generations of media and technological applications. In this
project, we had significant problems with the column binary format
of the data. Had long-term maintenance plans been considered, and
the costs of migration been taken into consideration, the creators
of the data format might not have chosen the column binary format.
At the time, however, it was the most compact format to use and standard
software could then be adapted to the format.
One thing is very clear: data producers would be well advised, and should
be persuaded, to take long-term maintenance and preservation considerations
into account as they create data files and as they design value-added
systems. Our experience shows that the simplest format is the
best long-term format: the flat ASCII data file. We would urge producers
to provide column-delimited ASCII files, accompanied by complete
machine-readable documentation in nonproprietary and nonplatform-specific
formats. Programming statements (also known as control cards) for
SAS and SPSS are also highly recommended so that users can convert
the raw data into system files. The control card files can be modified
for later versions of statistical software and used for other programming
applications and indexing.
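Because control cards are plain text derived from the column locations, they can be generated from a simple variable list. The Python sketch below builds a SAS DATA step that uses column input. The data set name, file name, variable names, and record length are hypothetical, and the generated program is a starting point that would still need to be checked against the documentation.

```python
def sas_input_program(dataset, path, variables, lrecl):
    """Build a SAS DATA step that reads fixed-column ASCII data.

    variables: list of (name, start, end) tuples giving 1-based,
    inclusive column locations for each numeric variable.
    """
    lines = [
        f"DATA {dataset};",
        f"  INFILE '{path}' LRECL={lrecl};",
        "  INPUT",
    ]
    for name, start, end in variables:
        cols = str(start) if start == end else f"{start}-{end}"
        lines.append(f"    {name} {cols}")
    lines.append("  ;")
    lines.append("RUN;")
    return "\n".join(lines)


# Hypothetical two-variable layout for illustration.
program = sas_input_program(
    "study", "study.dat", [("caseid", 1, 4), ("q1", 5, 5)], lrecl=80
)
print(program)
```

Keeping the variable list separate from the generated statements makes it easier to produce equivalent control cards for another package or a later software version.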
In accordance with emerging standards for resource discovery, data
files should contain a standard electronic header or be accompanied
by machine-readable metadata identification information. This information
should include complete citations to all the parts of a particular
study (data files, documentation, control card files, and so forth)
and serve as a study-level record of contents and structure.
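One way to picture such a study-level record is as a small, system-independent text file that lists every part of the study. The Python sketch below writes the record as JSON purely for illustration. The field names, the list of required roles, and the file names are hypothetical and do not represent any published header standard.

```python
import json

REQUIRED_ROLES = {"data", "documentation"}  # assumed minimum for a study


def study_record(title, producer, parts):
    """Return a JSON study-level record listing every file in the study.

    parts: list of dicts with the hypothetical keys "role", "filename",
    and "format".
    """
    roles = {part["role"] for part in parts}
    missing = REQUIRED_ROLES - roles
    if missing:
        raise ValueError(f"study record lacks: {sorted(missing)}")
    record = {"title": title, "producer": producer, "parts": parts}
    return json.dumps(record, indent=2, sort_keys=True)


example = study_record(
    "Example Survey",
    "Example Producer",
    [
        {"role": "data", "filename": "study.dat", "format": "ASCII"},
        {"role": "documentation", "filename": "study.xml", "format": "XML"},
        {"role": "control cards", "filename": "study.sas", "format": "ASCII"},
    ],
)
print(example)
```

The check for required roles reflects the point made above: the record is only useful for preservation if it accounts for all the parts of a study.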
Metadata standards. Not only must standards be considered
for the structure of the numeric data files, but the metadata (information
describing the data) must also conform to content and format standards
for current use and for long-term preservation applications. Content
standards require common elements, or a common set of "containers" that
hold specific types of information. Standards for coding variables
should be followed so that linkages and cross-study analysis are
enhanced, and common thesauri for searching and mining data collections
also need to be produced. As for format standards, metadata should
be produced in system-independent formats that provide standard structures
for the common elements and coding schemes. These format standards
should provide consistent tagging of elements that can be mapped
to resource discovery and viewing software and to statistical analysis
and database systems.
For most social science data files, machine-readable documentation
should be supplied in ASCII format that conforms to standard guidelines
for both content and format. With some very large surveys using complex
survey instruments, files in this ASCII format may be so big that
complex structures in nonproprietary format need to be developed
to reduce storage requirements. If paper documentation is distributed,
it should be produced using high-quality duplication techniques with
simple fonts, no underlining, no handwritten notations, and plenty
of white space.
Of particular interest is the Data Documentation Initiative (DDI)
DTD (Document Type Definition in XML), which is a developing standard
for documentation describing numeric databases. It provides both
content and format standards for creating digitized documentation
in XML. Data producers in the United States, Canada, and Europe will
be testing the DTD as part of their documentation production process.
Data archives will be converting their digitized documentation into
this format, and some will be scanning paper documentation and tagging
the content with the DDI.
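To give a sense of what tagged documentation looks like, the Python sketch below builds an XML description of a single variable with the standard library. The element and attribute names (var, location, labl, qstn, qstnLit, catgry, catValu) follow our understanding of the DDI codebook structure, but the fragment is simplified, has not been validated against the DTD, and uses hypothetical variable content.

```python
import xml.etree.ElementTree as ET


def ddi_variable(name, label, question, start, width, categories):
    """Build a simplified DDI-style <var> element for one variable.

    categories: list of (value, label) pairs for the coded responses.
    """
    var = ET.Element("var", name=name)
    ET.SubElement(
        var,
        "location",
        StartPos=str(start),
        EndPos=str(start + width - 1),
        width=str(width),
    )
    ET.SubElement(var, "labl").text = label
    qstn = ET.SubElement(var, "qstn")
    ET.SubElement(qstn, "qstnLit").text = question
    for value, text in categories:
        catgry = ET.SubElement(var, "catgry")
        ET.SubElement(catgry, "catValu").text = str(value)
        ET.SubElement(catgry, "labl").text = text
    return var


# Hypothetical variable used only to show the shape of the output.
element = ddi_variable(
    "q1", "Example rating", "How would you rate this example?",
    start=5, width=1, categories=[(1, "Good"), (2, "Fair"), (3, "Poor")],
)
print(ET.tostring(element, encoding="unicode"))
```

Because question text, column location, and category labels each sit in their own element, software can search, extract, or reformat any one of them without parsing free text.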
Complex statistical systems. We must also be concerned about
the long-term preservation plans for complex systems. As we see more
efforts to integrate data and documentation within linked systems,
we see a growing tension between access and preservation. Complex
database management systems such as ORACLE or other SQL-based systems present us with
more complex questions: What parts of the system need to be preserved
to save the content of the information as well as its integrity?
Will snapshots of the system provide future users with enough information
to simulate access as we see it today? Is the content usable outside
the context of the system? These are the database preservation challenges
of the future.