Upon completing the steps defined in both the data and documentation tracks, we carefully examined the processes of migrating hardware- and software-dependent formats to independent formats. We also evaluated the formats in relation to their utility and ease of use over time. In both the data migration and the documentation migration processes, it was imperative that the original content be preserved. Migration that included recoding the content of the data files (changing the character or numeric equivalents in a data file) proved to be labor-intensive and error-prone, and produced unacceptable changes to the original content of the data. Editing the text output from the scanning process proved similarly problematic: it was error-prone, time-consuming, and incomplete. Therefore, recoding as a part of migration is not recommended. However, simply copying a file in its original format from medium to medium (refreshing) is not enough either.
Software-dependent data file formats, such as the original column binary files examined in the project, cannot be read without specific software routines. If standard software packages do not offer those specific routines in the future, translation programs that emulate the software’s reading of the column binary format could provide a solution. However, these emulation programs will themselves require migration strategies over time. We offer another alternative for the column binary format: convert the data out of the column binary format into ASCII without changing the coded values of the files. The spread ASCII format meets the criterion of software independence while simultaneously preserving the original content of the data set. It does, however, require a file-by-file migration strategy that would be time-consuming for a large collection of files.
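As an illustration, the spread conversion can be sketched in a few lines of Python. The two-byte-per-column layout and the row ordering (12, 11, 0 through 9) are assumptions made for the sketch; actual card-image encodings vary by machine and by archive.

```python
# Sketch of the spread ASCII conversion, assuming each card column is held
# in two bytes whose low 12 bits are the punch rows 12, 11, 0, 1, ..., 9.
# The byte layout is an illustrative assumption, not the project's format.

def spread_column(two_bytes: bytes) -> str:
    """Turn one 12-bit card column into twelve '0'/'1' characters."""
    value = (two_bytes[0] << 8) | two_bytes[1]      # 16 bits; high 4 unused
    return ''.join('1' if value & (1 << (11 - i)) else '0' for i in range(12))

def spread_record(record: bytes, columns: int = 80) -> str:
    """Convert one 80-column card image (160 bytes) into 960 ASCII characters."""
    return ''.join(spread_column(record[2 * i:2 * i + 2]) for i in range(columns))

# Punches in rows 12 and 3 (bits 11 and 6) spread to ones at positions 1 and 6:
# spread_column(b'\x08\x40') -> '100001000000'
```

Note that the coded values themselves are untouched: the conversion only re-expresses each punch bit as a character, which is what keeps the original content intact.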
Finding a parallel solution for the documentation files is not possible at this time. We cannot accurately generate character-by-character equivalents of the paper records. We can, however, scan the paper into digital representations that could be used with future character recognition technologies. The Adobe PDF image+text format does provide an interim solution by producing digital versions of the image and limited ASCII representation of the text. However, the types of documentation files produced by the Adobe process are software dependent. If software packages move away from the format used to store the image+text files, translators will be necessary to search, print, and display the files. We therefore recommend archiving the images of the printed pages in both nonproprietary TIFF format and PDF image+text format.
We asked faculty and graduate students to make an informal review of the findings and sample output from the project. All our evaluators had previous experience using data files from the Yale Roper Collection. Regarding the data conversions, they expressed relief at not having to use column binary input statements to read the data files. They had no difficulty in using the chart that mapped column binary to ASCII in order to locate variables in the spread ASCII version of the data files, once it was explained that each variable mapped directly to a 12-column equivalent and instructions were given for finding single- and multiple-punch locations.
As for the documentation track, faculty and student reviewers found viewing and browsing PDF format files acceptable. Since most had already accessed PDF files on the Web, they seemed comfortable moving from the Internet browser to the Adobe Reader to locate question text and variable location information; this may not be the case for inexperienced users. The reviewers were eager to have more questionnaire texts available for browsing and searching, but the lack of a large sample collection of questionnaires did not allow us to evaluate question text searching on a large scale.
Findings about Data Conversion
Column binary into SAS and SPSS. As long as software packages can read the SAS and SPSS export formats, recoding the column binary format into SAS and SPSS export files is an attractive option. These file formats can be used easily, are transportable to multiple operating systems and equipment configurations, and can be transformed into other software-specific formats. They do, however, have a number of drawbacks.
First, the original data file must be recoded, a lengthy and potentially error-prone process that places great reliance on the person doing the translation. If that person does not adequately check for errors, annotate the documentation for irregular variables, or properly recode the original patterns of punches, the translated data set becomes inconsistent with the original. Also, some irregularities in the original data set, which may be meaningful in the analysis of the data, can be lost when the data set is cleaned up. For instance, the original documentation might indicate that a question has four possible responses: PUNCH.1 through PUNCH.3 for a rating scale, and PUNCH.12 for a “don’t know” response. On examination of the x-ray, though, it is discovered that about 200 people had their responses coded as PUNCH.11, not PUNCH.12. A decision must be made: will those PUNCH.11 observations be given the same special missing value code as that for PUNCH.12? This solution brings the data set into accordance with the documentation by assuming that the strange punches were simply errors in data entry. On the other hand, the strange punches could have been entered intentionally to flag those observations for reasons not listed in the documentation. In that case, it would be better to give the PUNCH.11 observations a special missing value different from that for PUNCH.12. Once the recoding is done, future researchers will be unable to re-create the original data set with its irregularities.
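The decision can be made explicit in code. The Python sketch below uses hypothetical missing-value codes (-8 and -9) to show how a recoder might preserve the undocumented PUNCH.11 pattern rather than fold it into the documented “don’t know” code; none of the names or codes come from the Roper documentation.

```python
# Hypothetical recoding of the four-response example above. The missing-value
# codes and the decision to keep PUNCH.11 distinct are illustrative choices.

RATING = {1: 1, 2: 2, 3: 3}   # PUNCH.1-PUNCH.3: the documented rating scale
DONT_KNOW = -8                # documented "don't know" response (PUNCH.12)
UNDOCUMENTED = -9             # distinct code preserving the PUNCH.11 anomaly

def recode_rating(punches: set) -> int:
    """Recode one respondent's set of punched rows into a numeric value."""
    if punches == {12}:
        return DONT_KNOW
    if punches == {11}:
        # Keeping a separate code means the irregularity survives recoding;
        # collapsing it into DONT_KNOW would erase it permanently.
        return UNDOCUMENTED
    if len(punches) == 1:
        (row,) = punches
        if row in RATING:
            return RATING[row]
    raise ValueError(f"unexpected punch pattern: {punches}")
```

Even with a careful scheme like this, the mapping itself is a judgment call that must be documented, which is precisely the burden the spread ASCII approach avoids.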
Second, this process of reading, recoding, and cleaning data files to produce SAS (or SPSS) system files and export files is very time-consuming. For example, it took 20 hours of work to write, debug, and double-check a SAS program to recode the data set for Roper Report 9209, which includes a split sample and a variety of variable types. Assuming a wage of $15 an hour for an experienced SAS programmer, we would expect a cost of approximately $300 per data set for a complete job of data recoding. This estimate, of course, does not include the cost of consultations with data archivists about the recoding of particular variables or the cost of rewriting documentation to reflect the new format of the data set.
Third, data files stored in SAS and SPSS (and other statistical software) formats require proprietary software to read the information. Although there has been an increase in programs that can read and transfer data files from one program, and one version of a program, to another, there is no guarantee that programs for specific versions of software will be available in the future. U.S. Census data from the 1970s were produced in a compressed format (called Dualabs) that relied on custom programs and can no longer be read on most of today’s computing platforms. Such system- and software-dependent formats require expensive migration strategies to move them to future computing technologies.
Spread ASCII format. If an archival standard is defined as a non-column binary, nonproprietary format that faithfully reproduces the content of the original files, only the spread ASCII format meets these conditions. This spread format, however, is at least 600 percent larger than the original file and requires converting the original column binary structure. It also requires producing additional documentation, since each punch listed in the original documentation must be assigned a new column location in the spread ASCII data. We recommend that each file be converted to a standard spread ASCII format so that a single conversion map may be used for all the data files (see Appendix 5 for an example from this project). Producing such an ASCII data set has several advantages. Information is not lost from the original data set, because the pattern of 0s and 1s remains conceptually the same across data files. It is unnecessary for a data translator to interpose herself between the original data and the final user.
However, the spread ASCII format is not perfect. The storage requirements for spread ASCII data are on a par with those for a SAS data set containing both recoded and intermediate variables, and for that storage overhead the SAS data set at least provides internally referenced variable names and meaningful variable values. The size penalty of the spread ASCII data becomes even more apparent when it is contrasted with the size of a recoded ASCII data set. In a recoded data set, each unused bit can be left out of the final data; particular bits in a column need not even be input if they contain no values for the variable, and columns without variables may also be skipped. In contrast, in the spread ASCII data, each bit is input and translated to a character, whether it is used or not.
The spread ASCII format requires that users know how the variables, particularly the multiple-response variables, relate to the 12-column equivalent. A spread ASCII data set is not as easily used as a fully recoded one; after all, recoding the 0s and 1s into usable variable values falls to the end user with spread ASCII data, just as it currently does with column binary data. Finally, each punch listed in the original documentation must be assigned a new column location in the spread ASCII data, so users must refer to an additional piece of documentation, the ASCII data map, to locate data of interest; this extra step inevitably creates some initial confusion.
Hybrid spread ASCII. The hybrid spread ASCII format, distributed by the Roper Center and the Institute for Research in Social Science at the University of North Carolina, offers another alternative. The original data are stored in column binary format in the first horizontal layer of the file; this preserves the original structure of the data file in the first part of a new record. A second horizontal layer of converted data, in spread ASCII format, is added to the new record. The primary advantage of this hybrid format is that users can access the nonbinary portions of the original data file in their original column locations as indicated on the questionnaires. However, users have to know whether the question and variables of interest were coded in the binary format in order to determine whether to read the first part of the record or the second. The ASCII codes in the first horizontal layer are readable, but any binary coding in that layer is not, since the binary coding is converted to ASCII to avoid problems when reading the data with statistical software. If users want to read data that were coded in binary in the original file, they can read the spread ASCII version of the multiple-punched equivalents in the second horizontal layer without having to learn the column binary input statements.
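Reading such a two-layer record might look like the following Python sketch. The record layout assumed here (80 characters of first layer followed by a 12-characters-per-column spread layer) is an illustrative assumption; actual Roper Center layouts differ by study.

```python
# Hedged sketch of pulling a value from a hybrid spread ASCII record,
# assuming the first 80 characters hold the original card image and the
# remainder holds the 12-characters-per-column spread equivalents.

def read_variable(record: str, card_column: int, multipunched: bool) -> str:
    """Return single-punch data from the first layer, or the 12-character
    spread pattern from the second layer for binary-coded columns."""
    if not multipunched:
        return record[card_column - 1]          # original column location
    start = 80 + 12 * (card_column - 1)         # second horizontal layer
    return record[start:start + 12]

record = 'Y' + ' ' * 79 + '100000000000' + '0' * 948
read_variable(record, 1, False)   # first layer: 'Y'
read_variable(record, 1, True)    # second layer: '100000000000'
```

The branch on `multipunched` makes the format's main drawback concrete: the user must already know how each question was coded before deciding which layer to read.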
We chose to produce converted ASCII files that did not have this two-part structure. To use the spread ASCII data files we constructed, users first determine the original column location of a particular variable from the questionnaire, and then use a simple data map to locate the column location in the new spread ASCII file (see Appendix 5).
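The data map amounts to a simple arithmetic rule: original card column c occupies spread columns 12(c-1)+1 through 12c, with each punch row at a fixed offset within that span. A Python sketch follows; the row ordering is an assumption for illustration, and the project's actual map appears in Appendix 5.

```python
# Illustrative lookup for the spread ASCII data map. The row ordering
# (12, 11, 0-9) is assumed here; the project's actual convention is the
# one documented in its conversion map (Appendix 5).

ROWS = ['12', '11', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

def spread_location(card_column: int, punch: str) -> int:
    """1-based column in the spread ASCII file for a punch of a card column."""
    return 12 * (card_column - 1) + ROWS.index(punch) + 1

spread_location(1, '12')   # first punch position of the first card column
spread_location(36, '0')   # card column 36 spans spread columns 421-432
```

Because the rule is uniform, one map serves every file in the collection, which is what makes the single-layer approach workable without study-by-study documentation of column locations.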
Original column binary format. The column binary format itself turned out to be more attractive as a long-term archival standard than we had anticipated. It conserves space, it preserves the original coding of data and matches the column location information in the documentation, it can be transferred among computers in standard binary, and it can be read by standard statistical packages on all of the platforms we used in testing. Unfortunately, it is difficult to locate and decipher information about how to read column binary data with SAS and SPSS, as the latest manuals (for PC versions of the software) no longer contain supporting information about this format. This lack of documentation support indicates the possibility that the input formats will not be offered in subsequent software versions.
On the other hand, as long as the format exists, there seems to be some level of commitment to support it. As stated in the SAS Language: Reference text, “because multipunched decks and card-image data sets remain in existence the SAS System provides informats for reading column-binary data.” (SAS 1990, 38-9). If the column binary format is refreshed onto new media and preserved only in its original form, we recommend that sample programs for reading the data with standard statistical packages, or a stand-alone translation program, be included in the collection of supporting files. But even with these supplemental programs, accessing and converting the files will continue to present significant challenges to researchers. Given these considerations, we do not recommend this format over the spread ASCII alternative.
Findings about Documentation Conversion
The OCR output files from TextBridge Pro did not provide us with an adequate means of archiving the textual material. The questionnaires we scanned had an unacceptably high rate of character recognition errors, including errors in the column location information necessary for manipulating the accompanying data files. Handwritten notes were lost entirely. Although the format allowed searching of text that had been successfully recognized, the amount of editing required to review the output, correct all errors, and produce a legible version of the original proved prohibitive. Subsequent viewing was also poor without the formatting capabilities of proprietary word processing software.
The PDF format provided solutions to some of the documentation distribution and preservation problems we faced, but it did not meet all of our needs. For one thing, the format does not go far enough in providing internal structure for the manipulation, output, and analysis of the metadata. Like a tagged MARC record in an online public access catalog, full-text documentation for numeric data requires specific content tagging to allow search, retrieval, manipulation, and reformatting of individual sections of the information. Another drawback is that PDF files are produced and stored in a format that may be difficult to read and search in the future. The PDF format, although a published standard, depends on proprietary software that may not be available in future computing environments. (A similar problem can be seen with dBase data files, which are rapidly becoming outmoded and causing major problems with large collections of CD-ROM products distributed by the U.S. government.) We see increasing numbers of PDF documents distributed on the Internet, and ICPSR will use the format to distribute machine-readable documentation to its member institutions. Given the large number of PDF files in distribution, conversion software will most likely be developed over time. However, the current popularity of the PDF format does not guarantee that software to read it will remain available as technology continues to evolve.
The TIFF graphic image file format is useful for viewing, for distribution on the Web, and as an intermediate archival format, allowing storage of files until they can be processed in the future with more advanced OCR technology. Even though this format does not allow text searching, tagging/markup, or editing, it moves the endangered material into digital form. ICPSR has decided to retain such digital images of its printed documentation collection for reprocessing as OCR technology evolves.
It is our recommendation that both PDF/Adobe Capture edited output and scanned image files be produced and archived. The PDF/Adobe image+text files allow searching and viewing of the text in its original formatting. The scanned image files can be archived for future character recognition and enhancement. We also want to emphasize the importance of the long-term development of tagged documentation using the DDI format. This is by far the most desirable format, albeit one that is difficult to produce from printed documentation, given the inadequacy of OCR technology and the costs of subsequent editing. We urge future producers and distributors of numeric data to help develop and adhere to standard system-independent documentation formats.
Recommendations to Data Producers
Design affects maintenance costs and long-term preservation. Producers of statistical data files need to be cognizant of preservation strategies and the importance of system-independent formats that can be migrated through generations of media and technological applications. In this project, we had significant problems with the column binary format of the data. Had long-term maintenance plans and the costs of migration been taken into account, the creators might not have chosen the column binary format. At the time, however, it was the most compact format available, and standard software could be adapted to it.
One thing is very clear: data producers should be persuaded to take long-term maintenance and preservation considerations into account as they create data files and design value-added systems. Our experience shows that the simplest format is the best long-term format: the flat ASCII data file. We urge producers to provide column-delimited ASCII files, accompanied by complete machine-readable documentation in nonproprietary, nonplatform-specific formats. Programming statements (also known as control cards) for SAS and SPSS are also highly recommended so that users can convert the raw data into system files. The control card files can be modified for later versions of statistical software and used for other programming applications and indexing.
In accordance with emerging standards for resource discovery, data files should contain a standard electronic header or be accompanied by machine-readable metadata identification information. This information should include complete citations to all the parts of a particular study (data files, documentation, control card files, and so forth) and serve as a study-level record of contents and structure.
Metadata standards. Not only must standards be considered for the structure of the numeric data files, but the metadata (the information describing the data) must also conform to content and format standards for current use and for long-term preservation applications. Content standards require common elements, or a common set of “containers” that hold specific types of information. Standards for coding variables should be followed so that linkages and cross-study analysis are enhanced, and common thesauri for searching and mining data collections also need to be produced. As for format standards, metadata should be produced in system-independent formats that provide standard structures for the common elements and coding schemes. These format standards should provide consistent tagging of elements that can be mapped to resource discovery and viewing software and to statistical analysis and database systems.
For most social science data files, machine-readable documentation should be supplied in ASCII format that conforms to standard guidelines for both content and format. With some very large surveys using complex survey instruments, files in this ASCII format may be so big that complex structures in nonproprietary format need to be developed to reduce storage requirements. If paper documentation is distributed, it should be produced using high-quality duplication techniques with simple fonts, no underlining, no handwritten notations, and plenty of white space.
Of particular interest is the Data Documentation Initiative (DDI) DTD (Document Type Definition in XML), which is a developing standard for documentation describing numeric databases. It provides both content and format standards for creating digitized documentation in XML. Data producers in the United States, Canada, and Europe will be testing the DTD as part of their documentation production process. Data archives will be converting their digitized documentation into this format, and some will be scanning paper documentation and tagging the content with the DDI.
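A DDI-tagged codebook entry has roughly the following shape. The fragment is illustrative only: the study title, variable name, question text, and column locations are hypothetical, and it is not a complete or validated instance of the DTD.

```xml
<!-- Illustrative fragment; element names follow the general shape of the
     DDI codebook DTD, but the content is hypothetical and incomplete. -->
<codeBook>
  <stdyDscr>
    <citation>
      <titlStmt><titl>Example Survey, 1992</titl></titlStmt>
    </citation>
  </stdyDscr>
  <dataDscr>
    <var name="Q1">
      <location StartPos="13" EndPos="24"/>
      <qstn><qstnLit>How would you rate your local schools?</qstnLit></qstn>
    </var>
  </dataDscr>
</codeBook>
```

Tagging of this kind is what supplies the internal structure that the PDF format lacks: each question text and column location becomes an addressable element rather than a run of undifferentiated text.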
Complex statistical systems. We must also be concerned about long-term preservation plans for complex systems. As we see more efforts to integrate data and documentation within linked systems, we see a growing tension between access and preservation. Complex database management systems such as Oracle and other SQL-based servers present us with more difficult questions: What parts of the system need to be preserved to save the content of the information as well as its integrity? Will snapshots of the system provide future users with enough information to simulate access as we see it today? Is the content usable outside the context of the system? These are the database preservation challenges of the future.