The Data Track • CLIR

The preservation project’s goals for the data track were to develop and evaluate a process of migrating digital numeric information from computer tape to hardware- and software-independent formats and to evaluate the utility of the resultant formats. The process of migrating was broken down into a series of nine steps.

Identify equipment
Copy files from mainframe-based media to local hard disks
Examine the documentation
Define the column binary format
Develop standard variable-naming classifications
Read in the data files with SAS and SPSS
Identify migration formats
Recode data files with SAS
Create spread ASCII data files without recoding

1. Identify Equipment

The computing environment used for the project consisted of IBM mainframes (all commands were submitted as JCL batch jobs issued from a CMS-based IBM mainframe to an MVS mainframe with tape access), and PC/Intel (Pentium 90) computers running Microsoft Windows for Workgroups on a Novell network.

2. Copy Files

We copied the column binary data files from old round reel tapes to new 3480 IBM cartridges on the Yale mainframe. This was the first step in refreshing the data from the old tape medium to a more stable magnetic medium. Next, the data were copied from the cartridges to mainframe disk and moved via standard file transfer protocols (in binary mode) to the Statlab Novell server, the home for the SAS program writing, record keeping, and output storage.

Once transferred to disks connected to the server, each data set was checked with a simple program that read in one variablethe deck or card numberfor each observation in the original data file. The number of observations for each deck number was then compared with information in the documentation. Any discrepancies were noted and data files reordered from the Roper Center if errors were confirmed.

3. Examine Documentation

The primary document for a particular Roper Report was simply a copy of the original survey questionnaire containing the questions asked in the survey. The card number and column number (or numbers) for locating the question in the record for each respondent were given for each question. Furthermore, the punch location was listed for each response option in the question. All of this informationcard number, column location, and punch locationwas necessary for reading the original column binary data.

The questionnaires varied in length and, in some cases, multiple versions of a questionnaire were provided. For example, some of the surveys used a split sample, either to ask different questions or to ask the same questions in different sequence. Often, the questions asked of the two groups in the split sample were not identical. In some cases, the column location and format of the different variables were identical. The samples could differ in the order of items in a multiple-response question, or in using slightly different sets of items. In other cases, the format of the questions (along with their column and row locations) changed radically between samples.

The xray, a special form of printed output supplied by the Roper Center, provided a response frequency used to check the data during the migration processes. Organized by card, column, and row, the xray gave the total number of punched bits across observations for each variable in the data set. The xray also provided useful information about the types of questions being answered and the complexity of reading the column binary format. Because each question was usually encoded in its own column, the sum of all of the responses (plus any blank observations), for most types of questions, should add up to the total number of people in the survey. If it did not, then the question allowed multiple responses and would need to be read in a different manner from single response variables.

4. Define Format

The structure of the column binary format is illustrated in table 1. The sample question shown in the table uses two columns (47 and 48) to cover all the possible answers. The full text of this question may also be seen in Appendix 1. Each card contains 12 rows numbered from top to bottom, beginning with 1 at the top. Punch numbers are assigned in a different way: a 12 punch goes in the top row, an 11 punch in the second row, and the third through twelfth rows are reserved for 0 through 9 punches. The 11 and 12 punches are often used for “don’t know,” “no answer,” or “not applicable” responses. In this example, a respondent chose the values 1, 2, and 4 (“living in poverty,” “being abused as a child,” and “drug abuse”), so punches were made in rows 4, 5, and 7, as shown in the last column of the table.

In this question, both columns 47 and 48 may have multiple punches representing the choices people were allowed to select from the list of causes of violent crime. Alternatively, in the non-column binary scheme allowing only one punch per column, each possible selection would have to be coded as a separate variable and therefore as a separate column, so coding of this question would take up 18 columns rather than two.

Table 1. Question Coding Example

Roper Report 9309, Question 8W:”…which three or four things do you think are the main causes of people committing violent crimes?”

Responses

Column location

Punch number

Punched

a. Living in poverty

b. Parents not teaching right from wrong

c. Being abused as a child

d. Drug abuse

e. What people see in TV programs

f. A lack of morals

g. A lack of education

h. A person not seeing any harm in it

i. A person being irresponsible

j. Influence of friends

k. Alcohol abuse

l. What people see in movies

11 (or X)

12 (or Y)

m. The advertising and marketing of toy guns

n. Guns being too easy to get

o. Low chance of being punished

p. Seeing pornography

None of these

Dont know

12 (orY)

5. Develop Standard Classifications

A standard classification of types of variables, based upon the types of questions asked in the surveys, was used in the construction of data set translation. The classification provided a scheme for the creation of standard templates and the logic behind the variable types used in recoding the data. In addition, the classification made it easier to construct appropriate variable names and to ensure consistent naming across data sets. Each of the variable types we defined required different variable-naming, formatting, and recoding procedures.

The four basic types of variables are regular numeric variables, numeric variables with special missing values, multiple-response questions, and single-response questions. Each variable type was associated with a different template for code creation, a specific form of variable names, and certain projected difficulties in recoding. For instance, a multiple-response variable was read in with a simple list of single-punch variables, was labeled simply with letter suffixes, and was not recoded. In contrast, an aggregated single-response variable was read in with a list of single-punch intermediate variables, was labeled with intermediate number suffixes, and was recoded into a final variable for the entire question. Retrieving the variable type directly from the documentation provided a guide to the particular piece of program code that was necessary for inputting and recoding the variable.

Regular numeric variables. Regular numeric variables range from 0 to 9. The values are computed by filling in each digit with the numbers in the appropriate columns. For example, Roper Report 9309, Question 68Y, asked “Do you think government lotteries produce an unwholesome gambling spirit in this country? Yes=1, No=2″

Numeric variables with special missing values. These variables are identical to regular numeric variables, except that there can be one or two special missing values (such as “don’t know”), recorded at punch 11 and/or punch 12. For example, Roper Report 9309, Question 2X, asked, “Do you feel things in this country are generally going in the right direction today, or do you feel that things have pretty seriously gotten off on the wrong track? Right track=1, Wrong track=2, Don’t know=Y”

Multiple-response questions. Multiple-response questions generally ask respondents to choose more than one option from a list of items, as in Roper Report 9309, Question 8W illustrated in table 1. If an item was checked, it was coded as 1 in the recoded data set; if it was not checked, it was coded as 0. Therefore, for any such question, there would be multiple binary variables corresponding to all of the possible responses (including special missing values), so that each possible response became a variable in a final recoded data set.

Single-response questions. Single-response questions allow one, and only one, response per item. They frequently include a special missing value at punch number 12, with several answer options between punch numbers 1 and 9. These questions require appropriate recoding for the final variable. For example, Roper 9309, Question 7Y, asked, “And thinking about crime in the United States, what one type of crime do you feel presents the biggest threat for you and your family today?” An example of multiple-column storage of a single-response question appears in table 2.

Table 2. Example of multiple-column storage of single-response question

Response

Punch number

Column location

a. Burglary, robbery, auto theft

b. Vandalism/hooliganism

c. Official corruption, government bribe-taking

d. Murder

e. Rape

f. Assault

g. Racketeering, extortion

h. Drugs

i. Prostitution

j. Speculation and swindling

k. Other

None of the above

Dont know

6. Read in Data

The first step in the SAS programming process was reading in the data using one of the three SAS column binary informats: PUNCH.d, CBw.d, and ROWw.d. The informat PUNCH.d reads a specific bit (with a value of either 0 or 1) in the original data set. The d value indicates which row in a column of data is to be read. The informat CBw.d, on the other hand, looks to see which one of the 10 numeric bits (0 through 9) has been punched, and returns that number as a value for the column. ROWw.d begins reading at a user-specified bit, looks to see which one of a user-specified number of bits (after the first) has been punched, and returns a number equal to the relative location of the punched bit. For instance, a ROW5.5 informat would start reading at PUNCH.5 and continue for four more bits through PUNCH.9; if bit 8 was punched, then the ROW5.5 informat would return a 4. The PUNCH.d informat was the most appropriate for this project. For clarification of its use, refer to the sample SAS program in Appendix 2. Depending on the final form of the question, the pattern of punches in a column usually had to be logically recoded later in the program from an intermediate variable to a final variable that matched the response options in the original documentation.

The Statistical Package for the Social Sciences (SPSS) is also able to read data, including multiple-response question responses, from a column binary data file by using the keywords MODE=MULTI-PUNCH on the FILE HANDLE command. An example of an SPSS program that reads a single variable from Roper Report 9309 data with SPSS is:

file handle test / name=’h:/roper/rprr9309.bin’ /recform=fixed / mode=multipunch/lrecl=160.

data list file=test records=9 /1 feddep 24 (A) /9.

execute.

In this example, a variable called FEDDEP is read in column 24. It has a possible Y coded as “don’t know”, requiring that SPSS read this variable as a character string (see also SPSS 1988, 84-86).

7. Identify Migration Formats

We selected the following formats to test the migration process:

SAS system files of recoded column binary data, with and without intermediate variables
SAS system files with shortened integer byte lengths
SAS export files of recoded column binary data
ASCII files produced from recoded column binary data
ASCII files of the binary data patterns in the original file, called spread ASCII

These formats were selected on the basis of the software’s ability to read the column binary format, the availability of project staff programming expertise, the transportability of output formats, storage requirements (size of output data sets), and long-term archival implications. While both SAS and SPSS software are able to read column binary data, the staff members working on the project had more experience with SAS, so we chose to work with that statistical package. With the exception of spread ASCII data files, each format we selected for testing contained completely recoded and renamed variables. Storage requirements were compared on the different platforms for many of the different types of SAS data files, as shown in Appendix 3.

8. Recode Data Files

Create intermediate variables and code missing values. After reading in the data with the SAS informat statements, the second step in the SAS programming process was to produce a set of if-then statements for recoding the individual punch data (the intermediate variables) into final variables. Each intermediate variable needed to have a column location, row location, variable name, and SAS informat instructions specified in the input statement, as seen in the example shown in table 3. The only things that changed while inputting this variable Q14 were the suffix and punch location for each intermediate variable. This redundancy increased when there were more intermediate variables, and especially with aggregated single-response variables. The logical statements used for recoding the intermediate variables also contained many repetitions of the same if-then commands.

Table 3. Example of variable location, name, and SAS informat statement

Column location	Variable name	Row location and informat
	Q14A_1	PUNCH.1
	Q14A_2	PUNCH.2
	Q14A_3	PUNCH.3
	Q14A_Y	PUNCH.12
	Q14B_1	PUNCH.1
	Q14B_2	PUNCH.2
	Q14B_3	PUNCH.3
	Q14B_Y	PUNCH.12

Since the creation of these translation programs in SAS involved a large amount of repetitive typing, we created templates for both the input and recoding statements. Templates were created for single-response variables (simple and aggregated), multiple-response variables, and numeric variables with special missing codes. Each template included generic characters for column locations, variable names, and SAS informats, as well as pre-formed logical recoding statements. Furthermore, the templates contained repeated lines of code for different numbers of punches or aggregate variables so that it would not be necessary to enter the redundant information for the different variables.

These templates considerably sped up translation code creation. Once the type of the current variable had been determined, the appropriate input and recode statements were copied from the template file, pasted into the program code, and then the key characters (such as column location and variable name) were replaced using a find/replace function in the text editor. For example, to create the code for variable Q14, a piece of the template would have been copied from the template file and pasted into the program, as illustrated in table 4. The aggregated single-response variable template included input and recode statements for sub-questions from A through Z; only A and B are shown in table 4.

Table 4. Examples of SAS input statements and recoding statements

/* Three options + NR */
(for INPUT command)
@COL1 Q000A_1 PUNCH.1
@COL1 Q000A_2 PUNCH.2
@COL1 Q000A_3 PUNCH.3
@COL1 Q000A_Y PUNCH.12
@COL2 Q000B_1 PUNCH.1
@COL2 Q000B_2 PUNCH.2
@COL2 Q000B_3 PUNCH.3
@COL2 Q000B_Y PUNCH.12(for recoding statements)
IF Q000A_1=1 AND Q000A_2=0 AND Q000A_3=0 AND Q000A_Y=0
THEN Q000A=1;
IF Q000A_1=0 AND Q000A_2=1 AND Q000A_3=0 AND Q000A_Y=0
THEN Q000A=2;
IF Q000A_1=0 AND Q000A_2=0 AND Q000A_3=1 AND Q000A_Y=0
THEN Q000A=3;
IF Q000A_1=0 AND Q000A_2=0 AND Q000A_3=0 AND Q000A_Y=1
THEN Q000A=9;
IF Q000B_1=1 AND Q000B_2=0 AND Q000B_3=0 AND Q000B_Y=0
THEN Q000B=1;
IF Q000B_1=0 AND Q000B_2=1 AND Q000B_3=0 AND Q000B_Y=0
THEN Q000B=2;
IF Q000B_1=0 AND Q000B_2=0 AND Q000B_3=1 AND Q000B_Y=0
THEN Q000B=3;
IF Q000B_1=0 AND Q000B_2=0 AND Q000B_3=0 AND Q000B_Y=1
THEN Q000B=9;

Then the string Q000 would be replaced with Q14, and COL1 and COL2 would be replaced with 30 and 31, respectively, resulting in the final code for inputting and recoding the variables Q14A and Q14B.

Create macro programs to recode data files. Given that we could define fairly well the different types of variables in a data set, and that we could create input and recode template statements for these different variable types, it was tempting to think that we would be able to write programs to automate the whole procedure. That is, we might imagine a program that could perform all of the operations described above with minimal effort from the human operators. Although such automation may be possible in the future, the irregularity of the data files presented a major obstacle. Not only would an automatic translation program have to deal with some of the complexities of, for example, split samples with different variable types in the same column, but it would also have to handle the different types of errors that occur in the original data files.

For example, one fairly straightforward solution (though quite a lot of work) would have been to create a program that received some sort of variable list from a human operator, set up logical conditions to create code around split samples, and then write out a SAS program to translate and recode the data file. More sophisticated error-checking would have been necessary, however, to handle simple events such as a variable being incorrectly marked as a single-response type in the documentation, but actually having multiple responses coded in the data. Because the recoded data files needed to be of archival quality, this error-checking would have had to be quite rigorous. Furthermore, judgment calls would have sometimes arisen when dealing with irregularities (that is, should the variable be recoded differently or should the irregularity be ignored), so that leaving the decision solely to a computer program was not advisable. In short, creating an automatic translation program to recode the data would have involved several compromises that we would not recommend.

Debug programs and check data. Several types of errors occurred in the SAS programming: typographic errors, invalid informats, and unexpected changes in variable types (where the questionnaire did not match the data). The key indicators for these problems were INVALID DATA warnings that appeared in the SAS log.

Once the programs ran without INVALID DATA messages, the accuracy of the translation needed to be checked. Complete frequencies for all of the variables in the final data set had to be compared to the frequencies in the xray.

For example, if variable Q9 (deck 1, column 30) had the following frequencies (in this case, PUNCH.12 was recoded to the missing value of 9):

Value

Frequency

then the xray for deck 1, column 30 should look like this:

Punch location

It was necessary to check all of the variables in the final data set, because one badly coded variable in the conversion job could compromise the archival integrity of the entire data set. Unfortunately, a random subset of variables may miss the errors. If there was a mismatch between the xray and the frequencies, the difference needed to be tracked down and the SAS job edited. This frequency check was also the last chance to resolve ambiguities between the type of variable listed in the documentation and that shown in the xray. One final point to note during this check was the maximum number of possible digits in a recoded variable. If the data set was to be output in ASCII format, then the number of digits in the special missing value should be equal to the number of digits in the regular values. That is, if a variable had values 1 through 3, then the special missing value should be 9; if it had values 1 through 12, then the special missing should be 99; and so on.

Save recoded data. The final step in the reading and recoding process was to write to disk the resultant SAS data sets. For recoded SAS to SAS export files, we saved the files as SAS system files and export files with all intermediate and recoded variables, and as SAS system files and export files with just the recoded variables (see Appendix 3).

For recoded SAS to ASCII, once the SAS data set had been completely debugged and double-checked, PUT statements could be written to create an ASCII version of the recoded data (see Appendix 2). The simplest way to write these PUT statements was to copy the INPUT statements from the original job (or from the recoding statements), strip off the column binary informat information, and manually type in the new column locations for each variable. These PUT statements had two advantages: unlike automatic translation programs, they created ASCII data files with minimum wasted space and the PUT statement itself could be distributed as a data dictionary.

For the first data set translated, an ASCII version of the recoded SAS data set was created using an automatic translation program called DBMSCOPY, version 5.10. However, this ASCII data set was not satisfactory. Although the translation program automatically created a data dictionary (based on the variable names in the SAS data set), the ASCII data set was not compact. The translation program seemed to be incapable of concatenating one variable’s width of columns against the previous variable’s columns, so that there would be no wasted space within the flat data set. Instead, the program inserted multiple spaces between each variable, which allowed a differing number of characters within each variable, but also inserted approximately 10 ASCII space characters between each variable. Because of the enormously increased storage requirements this insertion causes, this approach to creating ASCII data files was abandoned and program code for creating compact ASCII data files was written in SAS.

9. Create Spread ASCII Data Files

The original column binary data files did not necessarily have to be recoded during the translation process. Another possible method of translation was simply to convert, or spread, the column binary into ASCII data. Spread ASCII data files keep the binary structure of the column binary data, but encode each 0 or 1 as an ASCII character. For example, the following column in the original data (each 0 or 1 is one bit)

0
0
0
0
1
0
0
0
0
0
0
0

would become 000010000000 in the spread version. While this example uses a single-response variable, multiple-response questions may have more than one bit with a value of 1 in the column. To create a spread version of the data, each bit (or punch) in the column binary file could be read (via SAS in our case) as an individual variable; each variable was then written to a new ASCII data set. Because the punches in the column binary data were rectangular, and because the variables being written out did not have to be meaningfully recoded, the SAS code itself was largely a simple iteration, over columns and rows, of INPUT and PUT statements.

The iterative nature of the SAS program suggested that a macro program could be written to automate the creation of SAS code. A simple C program was written based on this iteration over columns and rows. After prompting the user for unique informationthe name of the SAS program file to be created, the name of the data set to be created, the name and path of the column binary file, and the number of decks and record length (LRECL) of the original datathe C program simply looped over columns and rows. The program passed through the loop twice. In the first pass, the program created an INPUT statement that would read in each punch (looping over columns and rows) as a new variable. The second pass created a PUT statement in which each new variable would be written to an ASCII file. The C program was not itself reading or writing the data files; instead, it created many repetitive lines of SAS code to read and write the data files (see Appendix 4).

Each card of column binary data was translated to a line of ASCII data. Since each card contained 960 data points (80 columns by 12 rows), each line of ASCII data contained 960 characters. The spread ASCII data set expanded about 600 percent from the original size. It took about 15 minutes to create the spread ASCII data on the IBM PC network, including all steps from running the C program to writing out the ASCII data.

We investigated the formats currently distributed by the Roper Center and the Institute for Research in the Social Sciences at the University of North Carolina at Chapel Hill (IRSS) to determine their utility over time, and to evaluate them in light of our findings. Both distributors were producing what we call the hybrid spread ASCII data files from the original column binary files. These files are produced by horizontally concatenating the data into a new format. The first 80 columns of the data set contained the 80 columns of the original file; however, any binary encoding of variables with multiple-punch coding was converted to ASCII. The entire file was then converted to spread ASCII and the resultant data were appended to the end of the record. Users could access the non-binary data in the original column locations; if they needed to access data that were originally punched in the binary mode, they could read those data from the spread ASCII portion of the record. Data dictionaries were produced that mapped the location of variables from both the original 80 columns and the spread ASCII portion of the record.

We obtained sample programs from IRSS to help us evaluate these hybrid spread file formats. We also acquired sample data files from IRSS and the Roper Center, with their enhanced data dictionaries, to evaluate their utility. We then adapted a SAS program from IRSS to produce the data map showing the new column locations of the data items. This data map allowed for quick translation from the column-by-row information necessary to read column binary data, to the column-only information necessary to read the spread ASCII data. A note at the top of the data map showed the order of punches in the spread ASCII data. For example, for a response located at column 56, row 1 in the original data set, the data map showed the same response’s location as column 661 in the ASCII data set; a response at column 56, row 5 would be located at column 665; and a response at column 56, row Y(12) would be located at column 672 (see Appendix 5).