Background and Project Description • CLIR

In December 1994, the Commission on Preservation and Access and the Research Libraries Group created the Task Force on Archiving of Digital Information. The purpose of the task force was “to investigate the means of ensuring continued access indefinitely into the future of records stored in digital electronic form.” Digital media are more fragile than paper and become unreadable more quickly because of changes in operating systems and applications software and the deterioration of physical media, and because no organization has accepted responsibility for preservation. In its definitive 1996 report, Preserving Digital Information, the task force warned that “owners or custodians who can no longer bear the expense and difficulty of migration will deliberately or inadvertently, through a simple failure to act, destroy the objects without regard for future use.”

The task force’s warning echoed the growing realization by researchers who were using social science statistical data in digital form and specialists who were archiving these data that major rescue efforts to identify, locate, and preserve computer files produced with rapidly outmoded technology could not be postponed. Because access to social science numeric data requires metadata (accompanying paper or machine-readable records), the loss of the metadata can also mean the loss of the data file.

The two approaches to preservation of digital files under evaluation in the early 1990s were refreshing and migration. Refreshing refers to the copying of information from one medium to another without changing the format or internal structure of the records in the files. Refreshing digital information will suffice as long as software exists to manipulate the format of the files. Since digital information is produced in varying degrees of dependence upon particular hardware and software, refreshing cannot serve as a general solution for preserving digital information. The task force emphasized migration of digital information, “designed to achieve the periodic transfer of digital materials from one hardware/software configuration to another, or from one generation of computer technology to a subsequent generation.” Migration includes refreshing the media but also addresses the internal structure of the files so that the information within can be read on subsequent computer platforms, operating systems, and software.
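Refreshing, as defined above, can be sketched as a byte-for-byte copy followed by a check that nothing changed. The sketch below is illustrative only and is not the procedure used at Yale: the function names `sha256_of` and `refresh` are hypothetical, and the choice of a SHA-256 checksum as the test of an exact copy is an assumption.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy a file to new storage without altering its format.

    Hypothetical helper: the bytes are copied unchanged, then the copy
    is compared with the original by checksum. Returns the shared digest.
    """
    shutil.copyfile(source, destination)
    source_digest = sha256_of(source)
    if sha256_of(destination) != source_digest:
        raise IOError(f"copy of {source} does not match the original")
    return source_digest
```

A copy that passes this check is identical to the original, which is exactly why refreshing alone does not help once the software needed to read the format is gone. Migration has to change the bytes, so it cannot be verified this way.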

The Yale Social Science Data Preservation Project

In 1996, the Commission on Preservation and Access commissioned the Social Science Library and the Social Science Statistical Laboratory at Yale (Statlab) to identify and evaluate the formats that would most likely provide the ability to migrate social science statistical data and accompanying documentation¹ into future technical environments. The Yale University Library, one of the first academic libraries to form a collection of machine-readable data, began acquiring social science numeric data in 1972. Over the years, Yale has copied its data from one form of digital storage to another as mainframe computer technology has dictated. The copying of data, while labor-intensive, was straightforward in creating exact logical copies from out-of-date media in newer data storage formats. In the mid-1990s, as data use was moving from the mainframe to distributed computing systems and from one hardware/software configuration to another, digital formats began to require not just simple duplication, but restructuring. Files produced by standard statistical software on mainframes had to be converted into platform-independent formats before moving to personal computers. In addition, data stored on magnetic tapes had to be moved to new media as access to and support in using the Yale mainframe were discontinued.

[¹ At its first occurrence, a word defined in the Glossary is shown in bold.]

Our social science data preservation project team was headed by Ann Green (director, Statlab) and JoAnn Dionne (data librarian, Social Science Library) with the assistance of Martin Dennis (consultant, Statlab, and graduate student, psychology). We began our work in June 1996, during a time when many academic institutions were in the process of transferring numeric social science data sets from mainframe environments to PC- and UNIX-based networks. Large collections of numeric data had been successfully moved across these platforms. Considerably less attention had been directed toward the greater problem of developing system-independent archival formats, while also preserving and digitizing the accompanying paper records (metadata) that must be available to analyze the data sets.

We were thus faced with a two-track preservation approach: converting deteriorating paper (the documentation) to digital form, and migrating digitized numeric data to an archival format that can be read by future operating systems and applications software. The Yale University Library had taken a lead in digitizing for preservation (Conway 1996), and we built on that base in digitizing the paper records accompanying the data file. The Statlab had taken a lead in migrating data collections from mainframe-dependent tape storage to networked online storage, and we built on that base in restructuring and migrating the numeric files.

On the documentation track, we scanned printed textual material for 10 surveys selected from the Yale Roper Collection and evaluated the outcomes of applying optical character recognition (OCR), creating image files, and producing Adobe Portable Document Format (PDF) files. On the data track, we investigated in detail the implications of preserving data in their original format vs. migrating to restructured formats. We evaluated the alternative formats for migrating the original data files from tape and focused upon the benefits and drawbacks of each alternative. Details are covered in the Findings and Recommendations section of this report.

While evaluations of computer storage media should not be ignored in an overall strategy for planning the future costs and viability of data collections, we did not include media evaluations in this project. Nor did we research the intellectual property issues involved in conversion, leaving that to a later discussion. However, there is a long-established ethic in the social science data community that data documentation should be shared freely. For example, the Inter-university Consortium for Political and Social Research (ICPSR) recently began making all its machine-readable documentation freely accessible on the Internet.

At the end of the project in the fall of 1997, we developed a collection of information, including sample programs and documents, relevant to the project and made it available at the Statlab Web page of the Yale Web site. The collection of information has since moved to the Council on Library and Information Resources’ Web site at https://www.clir.org/pubs/reports/pub83/statlab. Included in the materials accessible at this site are:

a link to the Interim Report to the Commission on Preservation and Access
programs to create spread ASCII data files
spread ASCII data file example
sample data map for spread ASCII data file
SAS programs for recoding data and producing ASCII data from SAS data files
link for downloading Adobe Acrobat Reader
multiple examples of Adobe PDF files

The Roper Collection at Yale

The Yale Roper Collection contains materials from the Roper Center for Public Opinion Research (the Roper Center), whose data sets comprise a rich resource for research in political psychology and sociology. They provide a record of public opinion research in the United States from 1935 to the present, along with surveys conducted abroad since the 1940s. In addition to the data files, the Yale Roper Collection includes paper records such as questionnaires, information on sample sizes, and other notes necessary for use of the data files. Many of the paper records are brittle, have handwritten notes, and were produced through unstable copying technologies such as mimeography.

The first step in the project was to select a representative group of documents and accompanying data files from the collection. Our initial discussions led us to select the Roper Reports, a significant, heavily used part of the Yale Roper Collection. The Roper Reports have been produced since 1973 by the Roper Organization, a commercial polling company now known as Roper Starch Worldwide, Inc. The Roper Reports have 1,500-2,000 respondents and 200-300 variables, and polling for the reports is conducted 10 times per year in the United States. Data files contain demographic information such as age, sex, race, economic level, education, marital status, union membership, and religious and political affiliation, as well as responses to questions on a broad array of issues facing society such as energy, politics, media, health and medical care, consumer behavior, education, and foreign policy.

The Roper Reports in the Yale Roper Collection do not have machine-readable documentation supplied with the data files. The documentation consists of paper photocopies of questionnaires and computer output. Some parts of the documentation are poorly duplicated copies with blurred text on a gray background and some questionnaires have handwritten notes in the margins. Most of the questionnaires are printed in multiple columns on a page with no standard format or layout. The Roper Reports documentation collection thus represents the problems inherent in the rest of the Yale Roper Collection. Of the 200 Roper Reports in the Yale Collection at the time of the project, 10 studies were selected across the full span of years to include any differences in format or documentation.

Our selection of the Roper Report data files was particularly important in the context of migrating data files. The files were stored in column binary format with portions of the files coded in an archaic format based upon the IBM punch card. The responses of a single case or individual interview were represented on one or more punch cards. Each punch card had 80 columns and 12 rows. The non-column binary format allowed a maximum of one character per column and a maximum of 80 variables per card. The column binary format, however, made it possible to store more than one variable in the same column. With punches allowed in each of the 12 rows, the maximum number of items was increased by up to a factor of 12. This column binary format was especially popular in the 1960s and 1970s, when information was stored almost exclusively on computer cards and it was desirable to compress the data into as small a space as possible. It also provided space for multiple answers to a single question.
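The column and row structure described above can be illustrated with a short sketch. It is not the project's conversion program. The representation of a column as a 12-bit integer with the top row in the highest bit, and the function names, are assumptions made for illustration. The row labels 12, 11, and 0 through 9 follow the usual punch card convention.

```python
# Illustrative model of one card: 80 columns, each with 12 punch positions.
COLUMNS_PER_CARD = 80
ROW_LABELS = ["12", "11", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

# Every punch position can carry its own item, so a card offers
# 80 * 12 positions instead of 80 single-character columns.
PUNCH_POSITIONS_PER_CARD = COLUMNS_PER_CARD * len(ROW_LABELS)


def punched_rows(column_bits: int) -> list[str]:
    """Return the labels of the rows punched in one column.

    Assumed layout (hypothetical): the most significant of the 12 bits
    is row 12 at the top of the card, the least significant is row 9.
    """
    if not 0 <= column_bits < 1 << 12:
        raise ValueError("a card column holds exactly 12 punch positions")
    return [
        label
        for position, label in enumerate(ROW_LABELS)
        if column_bits & (1 << (11 - position))
    ]


def punches_as_text(column_bits: int) -> str:
    """Write one column as 12 characters, '1' for a punch and '0' otherwise.

    This is one possible platform-independent rendering, not necessarily
    the layout of the project's spread ASCII files.
    """
    if not 0 <= column_bits < 1 << 12:
        raise ValueError("a card column holds exactly 12 punch positions")
    return format(column_bits, "012b")
```

For example, `punched_rows(0b000100000010)` reports punches in rows 1 and 8. In a multiple-punched column these could record two answers to the same question, which a one-character-per-column reading of the card would not recover.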

Special instructions must be given in software programs to define this unique column and row structure. Since the format is based upon old technology, knowledge about its use and software input formats to read it are increasingly rare. Our challenge was to find a new format that preserved the full intellectual content of the binary coding while allowing current and future technology to read the data and convert the computer card punches into meaningful values.

Literature Search

We reviewed the library literature for this project and conducted searches of the Internet. Discussion of the issues involved in archiving digital information had been well detailed in Preserving Digital Information, so we limited our search to topics specific to the preservation of social science numeric data and documentation. The literature search revealed much information on imaging as a preservation technique for books but little on preserving documentation for data files (see Reference List). We uncovered no previously published material on methods of preservation of electronic materials, other than duplicate copies moved from one storage medium to another. We found little information on the subject of copying data files and changing the way they are coded. We searched for reports on the conversion of multiple-punched data to other formats but found nothing. Nor did we find any discussion of standards for such conversions or of the validity of various numeric data storage formats as archival media.

In addition, we inquired of the Center for Electronic Records of the National Archives and Records Administration (NARA), Archival Research and Evaluation Staff, to identify any standards they follow internally. NARA retains numeric data in the format in which they are received but will transform them on request (Adams 1996). There had been discussion among members of the data archive community about whether column binary was an acceptable archival format, but we found no published discussion of this issue. We also searched for reports on the use of proprietary formats in archiving electronic records. Again, we found almost no mention of numeric data in the published literature.

On the Web site of the ICPSR, we found one discussion of the conversion of questionnaire-type information from paper to electronic formats using OCR as opposed to imaging. This type of information may also be found in the business records management literature, which we did not review. JSTOR, the Journal Storage Project, was making images of journal pages available to subscribers via the Web and using OCR to index the pages (JSTOR 1996). This approach seemed to overcome the limitations of using either imaging or OCR technology alone.

We concluded that we needed to extrapolate from the more general literature on archiving textual data, which emphasized the desirability of storing information in formats independent of hardware and software (NARA 1990). The perils of using formats that depend on hardware and software in the case of textual data had been described by Jeff Rothenberg (1995). We had no reason to expect that numeric data would be any different.

During 1995 and 1996, we followed discussions on the informal list for ICPSR Official Representatives and the listserv for members of the International Association for Social Science Information Service and Technology, especially on the use of PDF for storage and distribution of codebooks. The discussions focused on the concern that PDF was not an acceptable archival format and would require reformatting during the lifetime of the documents. ICPSR also published a discussion of this issue (1996).