Contents

Preserving the Whole:

A Two-Track Approach to Rescuing Social Science Data and Metadata

by Ann Green, JoAnn Dionne, and Martin Dennis
June 1999

Copyright 1999 by the Council on Library and Information Resources. No part of this publication may be reproduced or transcribed in any form without permission of the publisher. Requests for reproduction for noncommercial purposes, including educational advancement, private study, or research, will be granted. Full credit must be given to both the author and the Council on Library and Information Resources.

About the Authors

Acknowledgments

Preface

Background and Project Description

The Data Track

1. Identify Equipment

2. Copy Files

3. Examine Documentation

4. Define Format

5. Develop Standard Classifications

6. Read in Data

7. Identify Migration Formats

8. Recode Data Files

9. Create Spread ASCII Data Files

The Documentation Track

Software and Equipment

TextBridge Pro Optical Character Recognition

PDF Files from Adobe Capture

HTML and SGML/XML Marked-up Files

Findings and Recommendations

User Evaluation

Findings about Data Conversion

Findings about Documentation Conversion

Recommendations to Data Producers

Glossary

Reference List

Appendixes

1. Roper Report documentation page 3W: Questions 7-9 (photocopy)

2. Sample SAS input and recode statements

3. Data conversion formats and storage requirements

4. Programs to create spread ASCII datasets

5. Data map for column binary spread data

6a. Roper Report documentation page 4 W/Y: Question 10: photocopy

6b. Roper Report documentation page 4 W/Y: Question 10: TextBridge Pro

6c. Roper Report documentation page 4 W/Y: Question 10: PDF in Acrobat Exchange

Preserving the Whole: Supplementary Materials

The Digital Library Federation

On May 1, 1995, 16 institutions created the Digital Library Federation (additional partners have since joined the original 16). The DLF partners have committed themselves to “bring together, from across the nation and beyond, digitized materials that will be made accessible to students, scholars, and citizens everywhere.” If they are to succeed in reaching their goals, all DLF participants realize that they must act quickly to build the infrastructure and the institutional capacity to sustain digital libraries. In support of DLF participants’ efforts to these ends, DLF launched this publication series in 1999 to highlight and disseminate critical work.

DONALD J. WATERS
Director
Digital Library Federation

About the Authors

Ann Green is director of the Social Science Statistical Laboratory at Yale University, where she oversees social science research and instructional technologies, facilities, and support services. She has participated in the development of standards for social science metadata through the Data Documentation Initiative. She is vice president of the International Association for Social Science Information Service and Technology (IASSIST). From 1989 to 1996, she was consultant and technical manager of the Social Science Data Archive at Yale. She was data archivist from 1985 to 1989 at the Survey Research Center, University of California at Berkeley.

JoAnn Dionne is the social science data librarian at the University of Michigan Library, where she is developing and providing data services for the campus. From 1977 to 1998, she was the social science data librarian at Yale University, where the Yale Roper Collection was an integral part of her responsibilities.

Martin Dennis is a Ph.D. candidate in psychology at Yale University. His main area of research is in human reasoning in general and causal induction in particular. In addition to his studies, he works as a part-time statistics and computer consultant at the Yale Social Science Statistical Laboratory, where his responsibilities include helping users to access Yale’s collection of public use data sets.

Acknowledgments

We wish to thank Scott Redinius of the Yale Economics Department for his work on the TextBridge Pro portion of the project and his advice on OCR software, our scanning workstation, and developing evaluation procedures. Soo Yeon Kim of the Yale Political Science Department provided welcome editorial comments. Thanks also go to David Sheaves at the University of North Carolina’s Institute for Social Science Research for helping us evaluate column binary data options and spread ASCII formats and for providing very useful SAS programs. We also wish to acknowledge the help of Marilyn Potter and Marc Maynard, from the Roper Center for Public Opinion Research, for answering questions, rushing us replacement copies of data sets, and sharing their xray program. We also acknowledge Kathleen Eisenbeis, former director of the Yale Social Science Libraries and Information Services, for her contributions to the early stages of the project. And lastly, to Donald Waters for his encouragement, advice, and support: thank you.

Preface

Quantitative data, including social survey results, test measurements, economic and financial series, and government statistics, are vital resources for research and education in a variety of disciplines concerned with advancing the study of individuals and society. For decades, these data have been encoded, stored, and used primarily in digital form. Custodians who have collected, maintained, and provided access to numeric data resources thus have been building and managing digital libraries, and scholars and students have been effectively using them in the pursuit of historical, social, and scientific studies, long before the term digital library came into wide currency.

Those who are grappling with an explosion of digital information in a dizzying range of formats have much to learn from social science data librarians and users who have relatively long experience in managing and working with digital resources. Data producers, librarians, and scholarly users have come to invest in very sophisticated mechanisms for storing and distributing social science data. They have achieved valuable economies of scale in data storage and delivery through consortial developments, such as the data archives held by the Inter-university Consortium for Political and Social Research (ICPSR). Through years of experience with repeated changes in storage technologies and in the software for encoding and using the data, they have become particularly adept at the long-term maintenance of information in digital form.

In 1996, the Task Force on Archiving of Digital Information highlighted the difficulties of preserving digital information over long periods of time. As a way of addressing these difficulties, the task force recommended in part that its sponsors, the Commission on Preservation and Access and the Research Libraries Group, seek to document the experiences of communities already well practiced in the preservation of digital information. Responding to this recommendation, the Commission, which has since merged with the Council on Library Resources to become the Council on Library and Information Resources (CLIR), sought out the expertise of those managing university-based data archives. It contracted for the development of this paper with the authors, who at the time worked together in managing the Social Science Data Archives at Yale University, one of the oldest data archives in American universities.

Preserving the Whole appears as the second publication of the Digital Library Federation and reflects the Federation’s interests both in advancing the state of the art of social science data archives and in building the infrastructure necessary for the long-term maintenance of digital information. The paper is especially valuable as a meticulously detailed case study of migration as a preservation strategy. It explores the options available for migrating both data stored in a technically obsolete format and their associated documentation stored on paper, which may itself be rapidly deteriorating. The obsolete data format known as column binary was born in the same era of creatively parsimonious coding techniques that have given rise to the widely publicized Year 2000 (Y2K) computer problems.

Beyond its contributions to our understanding of migration as a particular strategy for the long-term maintenance of digital information, Preserving the Whole also provides more general lessons. It is a remarkable finding of this study that the column binary format, although technically obsolete, is so well documented that numerous options exist not just for migrating column binary files to other formats, but also for reading them in their native format. Moreover, the authors make the important observation that data sets will be indecipherable and cannot survive at all, regardless of the file format in which they are stored, if there is no effort made also to preserve their codebooks. A codebook is essential documentation that relates the numeric data to meaningful fields and values of information.

From more theoretical perspectives, Jeff Rothenberg (1999) and David Bearman (1999) both emphasize the critical importance of documentation, or metadata, for preserving digital information. The value of Preserving the Whole is that it makes a similar argument, but concretely and from the long experience of the data community in effectively managing digital information.

Donald J. Waters