 |
The Data Track
The preservation project's goals for the data track were to develop
and evaluate a process of migrating digital numeric information
from computer tape to hardware- and software-independent formats and
to evaluate the utility of the resultant formats. The process
of migrating was broken down into a series of nine steps.
- Identify equipment
- Copy files from mainframe-based media to local hard disks
- Examine the documentation
- Define the column binary format
- Develop standard variable-naming classifications
- Read in the data files with SAS and SPSS
- Identify migration formats
- Recode data files with SAS
- Create spread ASCII data files without recoding
1. Identify Equipment
The computing environment used for the project consisted of IBM mainframes
(all commands were submitted as JCL batch jobs issued from a CMS-based
IBM mainframe to an MVS mainframe with tape access), and PC/Intel (Pentium
90) computers running Microsoft Windows for Workgroups on a Novell
network.
2. Copy Files
We copied the column binary data files from old round reel tapes to
new 3480 IBM cartridges on the Yale mainframe. This was the first step
in refreshing the data from the old tape medium to a more stable magnetic
medium. Next, the data were copied from the cartridges to mainframe
disk and moved via standard file transfer protocols (in binary mode)
to the Statlab Novell server, the home for the SAS program writing,
record keeping, and output storage.
Once transferred to disks connected to the server, each data set was
checked with a simple program that read in one variablethe deck or card numberfor
each observation in the original data file. The number of observations
for each deck number was then compared with information in the documentation.
Any discrepancies were noted and data files reordered from the Roper
Center if errors were confirmed.
3. Examine Documentation
The primary document for a particular Roper Report was simply a copy
of the original survey questionnaire containing the questions asked
in the survey. The card number and column number (or numbers) for locating
the question in the record for each respondent were given for each
question. Furthermore, the punch location was listed for each response
option in the question. All of this informationcard number, column
location, and punch locationwas necessary for reading the original
column binary data.
The questionnaires varied in length and, in some cases, multiple versions
of a questionnaire were provided. For example, some of the surveys
used a split sample, either to ask different questions or to
ask the same questions in different sequence. Often, the questions
asked of the two groups in the split sample were not identical. In
some cases, the column location and format of the different variables
were identical. The samples could differ in the order of items in a
multiple-response question, or in using slightly different sets of
items. In other cases, the format of the questions (along with their
column and row locations) changed radically between samples.
The xray, a special form of printed output supplied by the
Roper Center, provided a response frequency used to check the data
during the migration processes. Organized by card, column, and row,
the xray gave the total number of punched bits across observations
for each variable in the data set. The xray also provided useful information
about the types of questions being answered and the complexity of reading
the column binary format. Because each question was usually encoded
in its own column, the sum of all of the responses (plus any blank
observations), for most types of questions, should add up to the total
number of people in the survey. If it did not, then the question allowed
multiple responses and would need to be read in a different manner
from single response variables.
4. Define Format
The structure of the column binary format is illustrated in table
1. The sample question shown in the table uses two columns (47 and
48) to cover all the possible answers. The full text of this question
may also be seen in Appendix 1. Each card contains 12 rows numbered
from top to bottom, beginning with 1 at the top. Punch numbers are
assigned in a different way: a 12 punch goes in the top row, an 11
punch in the second row, and the third through twelfth rows are reserved
for 0 through 9 punches. The 11 and 12 punches are often used for "don't
know," "no answer," or "not applicable" responses.
In this example, a respondent chose the values 1, 2, and 4 ("living
in poverty," "being abused as a child," and "drug
abuse"), so punches were made in rows 4, 5, and 7, as shown in
the last column of the table.
In this question, both columns 47 and 48 may have multiple punches
representing the choices people were allowed to select from the list
of causes of violent crime. Alternatively, in the non-column binary
scheme allowing only one punch per column, each possible selection
would have to be coded as a separate variable and therefore as a separate
column, so coding of this question would take up 18 columns rather
than two.
Table 1. Question Coding Example
Roper Report 9309, Question 8W:"...which
three or four things do you think are the main causes of people
committing violent crimes?"
|
Responses
|
Column location
|
Row
|
Punch number
|
Punched
|
|
a. Living in poverty
b. Parents not teaching right from wrong
c. Being abused as a child
d. Drug abuse
e. What people see in TV programs
f. A lack of morals
g. A lack of education
h. A person not seeing any harm in it
i. A person being irresponsible
j. Influence of friends
k. Alcohol abuse
l. What people see in movies
|
47
|
4
5
6
7
8
9
20
11
12
3
2
1
|
1
2
3
4
5
6
7
8
9
0
11 (or X)
12 (or Y)
|
X
X
X
|
|
m. The advertising and marketing of toy guns
n. Guns being too easy to get
o. Low chance of being punished
p. Seeing pornography
None of these
Dont know
|
48
|
4
5
6
7
8
1
|
1
2
3
4
5
12 (orY)
|
|
5. Develop Standard Classifications
A standard classification of types of variables, based upon the types
of questions asked in the surveys, was used in the construction of
data set translation. The classification provided a scheme for the
creation of standard templates and the logic behind the variable types
used in recoding the data. In addition, the classification made it
easier to construct appropriate variable names and to ensure consistent
naming across data sets. Each of the variable types we defined required
different variable-naming, formatting, and recoding procedures.
The four basic types of variables are regular numeric variables, numeric
variables with special missing values, multiple-response questions,
and single-response questions. Each variable type was associated with
a different template for code creation, a specific form of variable
names, and certain projected difficulties in recoding. For instance,
a multiple-response variable was read in with a simple list of single-punch variables,
was labeled simply with letter suffixes, and was not recoded. In contrast,
an aggregated single-response variable was read in with a list of single-punch intermediate
variables, was labeled with intermediate number suffixes, and was
recoded into a final variable for the entire question. Retrieving the
variable type directly from the documentation provided a guide to the
particular piece of program code that was necessary for inputting and
recoding the variable.
Regular numeric variables. Regular numeric variables range
from 0 to 9. The values are computed by filling in each digit with
the numbers in the appropriate columns. For example, Roper Report 9309,
Question 68Y, asked "Do you think government lotteries produce
an unwholesome gambling spirit in this country? Yes=1, No=2"
Numeric variables with special missing values. These variables
are identical to regular numeric variables, except that there can be
one or two special missing values (such as "don't know"),
recorded at punch 11 and/or punch 12. For example, Roper Report 9309,
Question 2X, asked, "Do you feel things in this country are
generally going in the right direction today, or do you feel that things
have pretty seriously gotten off on the wrong track? Right track=1,
Wrong track=2, Don't know=Y"
Multiple-response questions. Multiple-response questions generally
ask respondents to choose more than one option from a list of items,
as in Roper Report 9309, Question 8W illustrated in table 1. If an
item was checked, it was coded as 1 in the recoded data set; if it
was not checked, it was coded as 0. Therefore, for any such question,
there would be multiple binary variables corresponding to all of the
possible responses (including special missing values), so that each
possible response became a variable in a final recoded data set.
Single-response questions. Single-response questions allow
one, and only one, response per item. They frequently include a special
missing value at punch number 12, with several answer options between
punch numbers 1 and 9. These questions require appropriate recoding
for the final variable. For example, Roper 9309, Question 7Y, asked, "And
thinking about crime in the United States, what one type of crime do
you feel presents the biggest threat for you and your family today?" An
example of multiple-column storage of a single-response question appears
in table 2.
Table 2. Example of multiple-column storage
of single-response question
|
Response
|
Punch number
|
Column location
|
|
a. Burglary, robbery, auto theft
b. Vandalism/hooliganism
c. Official corruption, government bribe-taking
d. Murder
e. Rape
f. Assault
g. Racketeering, extortion
h. Drugs
i. Prostitution
j. Speculation and swindling
k. Other
None of the above
Dont know
|
1
2
3
4
5
6
7
8
9
0
1
2
Y
|
45
46
|
6. Read in Data
The first step in the SAS programming process was reading in the data
using one of the three SAS column binary informats: PUNCH.d, CBw.d,
and ROWw.d. The informat PUNCH.d reads a specific bit (with
a value of either 0 or 1) in the original data set. The d value
indicates which row in a column of data is to be read. The informat
CBw.d, on the other hand, looks to see which one of the 10 numeric
bits (0 through 9) has been punched, and returns that number as a value
for the column. ROWw.d begins reading at a user-specified bit, looks
to see which one of a user-specified number of bits (after the first)
has been punched, and returns a number equal to the relative location
of the punched bit. For instance, a ROW5.5 informat would start reading
at PUNCH.5 and continue for four more bits through PUNCH.9; if bit
8 was punched, then the ROW5.5 informat would return a 4. The PUNCH.d
informat was the most appropriate for this project. For clarification
of its use, refer to the sample SAS program in Appendix 2. Depending
on the final form of the question, the pattern of punches in a column
usually had to be logically recoded later in the program from an intermediate
variable to a final variable that matched the response options in the
original documentation.
The Statistical Package for the Social Sciences (SPSS) is also able
to read data, including multiple-response question responses, from
a column binary data file by using the keywords MODE=MULTI-PUNCH on
the FILE HANDLE command. An example of an SPSS program that reads a
single variable from Roper Report 9309 data with SPSS is:
file handle test / name='h:/roper/rprr9309.bin' /recform=fixed /
mode=multipunch/lrecl=160.
data list file=test records=9 /1 feddep 24 (A) /9.
execute.
In this example, a variable called FEDDEP is read in column 24. It
has a possible Y coded as "don't know", requiring that SPSS
read this variable as a character string (see also SPSS 1988, 84-86).
7. Identify Migration Formats
We selected the following formats to test the migration process:
- SAS system files of recoded column binary data, with and
without intermediate variables
- SAS system files with shortened integer byte lengths
- SAS export files of recoded column binary data
- ASCII files produced from recoded column binary data
- ASCII files of the binary data patterns in the original file, called
spread ASCII
These formats were selected on the basis of the software's ability
to read the column binary format, the availability of project staff
programming expertise, the transportability of output formats, storage
requirements (size of output data sets), and long-term archival implications.
While both SAS and SPSS software are able to read column binary data,
the staff members working on the project had more experience with SAS,
so we chose to work with that statistical package. With the exception
of spread ASCII data files, each format we selected for testing contained
completely recoded and renamed variables. Storage requirements were
compared on the different platforms for many of the different types
of SAS data files, as shown in Appendix 3.
8. Recode Data Files
Create intermediate variables and code missing values. After
reading in the data with the SAS informat statements, the second step
in the SAS programming process was to produce a set of if-then statements
for recoding the individual punch data (the intermediate variables)
into final variables. Each intermediate variable needed to have a column
location, row location, variable name, and SAS informat instructions
specified in the input statement, as seen in the example shown in table
3. The only things that changed while inputting this variable Q14 were
the suffix and punch location for each intermediate variable. This
redundancy increased when there were more intermediate variables, and
especially with aggregated single-response variables. The logical statements
used for recoding the intermediate variables also contained many repetitions
of the same if-then commands.
Table 3. Example of variable location, name,
and SAS informat statement
|
Column location
|
Variable name
|
Row location and informat
|
|
@30
|
Q14A_1
|
PUNCH.1
|
|
@30
|
Q14A_2
|
PUNCH.2
|
|
@30
|
Q14A_3
|
PUNCH.3
|
|
@30
|
Q14A_Y
|
PUNCH.12
|
|
@31
|
Q14B_1
|
PUNCH.1
|
|
@31
|
Q14B_2
|
PUNCH.2
|
|
@31
|
Q14B_3
|
PUNCH.3
|
|
@31
|
Q14B_Y
|
PUNCH.12
|
Since the creation of these translation programs in SAS involved a
large amount of repetitive typing, we created templates for both the
input and recoding statements. Templates were created for single-response
variables (simple and aggregated), multiple-response variables, and
numeric variables with special missing codes. Each template included
generic characters for column locations, variable names, and SAS informats,
as well as pre-formed logical recoding statements. Furthermore, the
templates contained repeated lines of code for different numbers of
punches or aggregate variables so that it would not be necessary to
enter the redundant information for the different variables.
These templates considerably sped up translation code creation. Once
the type of the current variable had been determined, the appropriate
input and recode statements were copied from the template file, pasted
into the program code, and then the key characters (such as column
location and variable name) were replaced using a find/replace function
in the text editor. For example, to create the code for variable Q14,
a piece of the template would have been copied from the template file
and pasted into the program, as illustrated in table 4. The aggregated
single-response variable template included input and recode statements
for sub-questions from A through Z; only A and B are shown in table
4.
Table 4. Examples of SAS input statements
and recoding statements
/* Three options + NR */
(for INPUT command)
@COL1 Q000A_1 PUNCH.1
@COL1 Q000A_2 PUNCH.2
@COL1 Q000A_3 PUNCH.3
@COL1 Q000A_Y PUNCH.12
@COL2 Q000B_1 PUNCH.1
@COL2 Q000B_2 PUNCH.2
@COL2 Q000B_3 PUNCH.3
@COL2 Q000B_Y PUNCH.12
(for recoding statements)
IF Q000A_1=1 AND Q000A_2=0 AND Q000A_3=0 AND Q000A_Y=0
THEN Q000A=1;
IF Q000A_1=0 AND Q000A_2=1 AND Q000A_3=0 AND Q000A_Y=0
THEN Q000A=2;
IF Q000A_1=0 AND Q000A_2=0 AND Q000A_3=1 AND Q000A_Y=0
THEN Q000A=3;
IF Q000A_1=0 AND Q000A_2=0 AND Q000A_3=0 AND Q000A_Y=1
THEN Q000A=9;
IF Q000B_1=1 AND Q000B_2=0 AND Q000B_3=0 AND Q000B_Y=0
THEN Q000B=1;
IF Q000B_1=0 AND Q000B_2=1 AND Q000B_3=0 AND Q000B_Y=0
THEN Q000B=2;
IF Q000B_1=0 AND Q000B_2=0 AND Q000B_3=1 AND Q000B_Y=0
THEN Q000B=3;
IF Q000B_1=0 AND Q000B_2=0 AND Q000B_3=0 AND Q000B_Y=1
THEN Q000B=9;
|
Then the string Q000 would be replaced with Q14, and COL1 and COL2
would be replaced with 30 and 31, respectively, resulting in the final
code for inputting and recoding the variables Q14A and Q14B.
Create macro programs to recode data files. Given that we could
define fairly well the different types of variables in a data set,
and that we could create input and recode template statements for these
different variable types, it was tempting to think that we would be
able to write programs to automate the whole procedure. That is, we
might imagine a program that could perform all of the operations described
above with minimal effort from the human operators. Although such automation
may be possible in the future, the irregularity of the data files presented
a major obstacle. Not only would an automatic translation program have
to deal with some of the complexities of, for example, split samples
with different variable types in the same column, but it would also
have to handle the different types of errors that occur in the original
data files.
For example, one fairly straightforward solution (though quite a lot
of work) would have been to create a program that received some sort
of variable list from a human operator, set up logical conditions to
create code around split samples, and then write out a SAS program
to translate and recode the data file. More sophisticated error-checking
would have been necessary, however, to handle simple events such as
a variable being incorrectly marked as a single-response type in the
documentation, but actually having multiple responses coded in the
data. Because the recoded data files needed to be of archival quality,
this error-checking would have had to be quite rigorous. Furthermore,
judgment calls would have sometimes arisen when dealing with irregularities
(that is, should the variable be recoded differently or should the
irregularity be ignored), so that leaving the decision solely to a
computer program was not advisable. In short, creating an automatic
translation program to recode the data would have involved several
compromises that we would not recommend.
Debug programs and check data. Several types of errors occurred
in the SAS programming: typographic errors, invalid informats, and
unexpected changes in variable types (where the questionnaire did not
match the data). The key indicators for these problems were INVALID
DATA warnings that appeared in the SAS log.
Once the programs ran without INVALID DATA messages, the accuracy
of the translation needed to be checked. Complete frequencies for all of
the variables in the final data set had to be compared to the frequencies
in the xray.
For example, if variable Q9 (deck 1, column 30) had the following
frequencies (in this case, PUNCH.12 was recoded to the missing value
of 9):
|
Value
|
Frequency
|
|
1
|
290
|
|
2
|
1300
|
|
9
|
403
|
then the xray for deck 1, column 30 should look like this:
|
Punch location
|
12
|
11
|
0
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
8
|
9
|
|
Sum
|
403
|
0
|
0
|
290
|
1300
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
It was necessary to check all of the variables in the final data set,
because one badly coded variable in the conversion job could compromise
the archival integrity of the entire data set. Unfortunately, a random
subset of variables may miss the errors. If there was a mismatch between
the xray and the frequencies, the difference needed to be tracked down
and the SAS job edited. This frequency check was also the last chance
to resolve ambiguities between the type of variable listed in the documentation
and that shown in the xray. One final point to note during this check
was the maximum number of possible digits in a recoded variable. If
the data set was to be output in ASCII format, then the number of digits
in the special missing value should be equal to the number of digits
in the regular values. That is, if a variable had values 1 through
3, then the special missing value should be 9; if it had values 1 through
12, then the special missing should be 99; and so on.
Save recoded data. The final step in the reading and recoding
process was to write to disk the resultant SAS data sets. For recoded
SAS to SAS export files, we saved the files as SAS system files and
export files with all intermediate and recoded variables, and as SAS
system files and export files with just the recoded variables (see
Appendix 3).
For recoded SAS to ASCII, once the SAS data set had been completely
debugged and double-checked, PUT statements could be written to create
an ASCII version of the recoded data (see Appendix 2). The simplest
way to write these PUT statements was to copy the INPUT statements
from the original job (or from the recoding statements), strip off
the column binary informat information, and manually type in the new
column locations for each variable. These PUT statements had two advantages:
unlike automatic translation programs, they created ASCII data files
with minimum wasted space and the PUT statement itself could be distributed
as a data dictionary.
For the first data set translated, an ASCII version of the recoded
SAS data set was created using an automatic translation program called
DBMSCOPY, version 5.10. However, this ASCII data set was not satisfactory.
Although the translation program automatically created a data dictionary
(based on the variable names in the SAS data set), the ASCII data set
was not compact. The translation program seemed to be incapable of
concatenating one variable's width of columns against the previous
variable's columns, so that there would be no wasted space within the
flat data set. Instead, the program inserted multiple spaces between
each variable, which allowed a differing number of characters within
each variable, but also inserted approximately 10 ASCII space characters
between each variable. Because of the enormously increased storage
requirements this insertion causes, this approach to creating ASCII
data files was abandoned and program code for creating compact ASCII
data files was written in SAS.
9. Create Spread ASCII Data Files
The original column binary data files did not necessarily have to
be recoded during the translation process. Another possible method
of translation was simply to convert, or spread, the column binary
into ASCII data. Spread ASCII data files keep the binary structure
of the column binary data, but encode each 0 or 1 as an ASCII character.
For example, the following column in the original data (each 0 or 1
is one bit)
0
0
0
0
1
0
0
0
0
0
0
0
would become 000010000000 in the spread version. While this example
uses a single-response variable, multiple-response questions may have
more than one bit with a value of 1 in the column. To create a spread
version of the data, each bit (or punch) in the column binary file
could be read (via SAS in our case) as an individual variable; each
variable was then written to a new ASCII data set. Because the punches
in the column binary data were rectangular, and because the variables
being written out did not have to be meaningfully recoded, the SAS
code itself was largely a simple iteration, over columns and rows,
of INPUT and PUT statements.
The iterative nature of the SAS program suggested that a macro program
could be written to automate the creation of SAS code. A simple C program
was written based on this iteration over columns and rows. After prompting
the user for unique informationthe name of the SAS program file to
be created, the name of the data set to be created, the name and path
of the column binary file, and the number of decks and record length
(LRECL) of the original datathe C program simply looped over columns
and rows. The program passed through the loop twice. In the first pass,
the program created an INPUT statement that would read in each punch
(looping over columns and rows) as a new variable. The second pass
created a PUT statement in which each new variable would be written
to an ASCII file. The C program was not itself reading or writing the
data files; instead, it created many repetitive lines of SAS code to
read and write the data files (see Appendix 4).
Each card of column binary data was translated to a line of ASCII
data. Since each card contained 960 data points (80 columns by 12 rows),
each line of ASCII data contained 960 characters. The spread ASCII
data set expanded about 600 percent from the original size. It took
about 15 minutes to create the spread ASCII data on the IBM PC network,
including all steps from running the C program to writing out the ASCII
data.
We investigated the formats currently distributed by the Roper Center
and the Institute for Research in the Social Sciences at the University
of North Carolina at Chapel Hill (IRSS) to determine their utility
over time, and to evaluate them in light of our findings. Both distributors
were producing what we call the hybrid spread ASCII data files from
the original column binary files. These files are produced by horizontally
concatenating the data into a new format. The first 80 columns
of the data set contained the 80 columns of the original file; however,
any binary encoding of variables with multiple-punch coding was converted
to ASCII. The entire file was then converted to spread ASCII and the
resultant data were appended to the end of the record. Users could
access the non-binary data in the original column locations; if they
needed to access data that were originally punched in the binary mode,
they could read those data from the spread ASCII portion of the record.
Data dictionaries were produced that mapped the location of variables
from both the original 80 columns and the spread ASCII portion of the
record.
We obtained sample programs from IRSS to help us evaluate these hybrid
spread file formats. We also acquired sample data files from IRSS and
the Roper Center, with their enhanced data dictionaries, to evaluate
their utility. We then adapted a SAS program from IRSS to produce the
data map showing the new column locations of the data items. This data
map allowed for quick translation from the column-by-row information
necessary to read column binary data, to the column-only information
necessary to read the spread ASCII data. A note at the top of the data
map showed the order of punches in the spread ASCII data. For example,
for a response located at column 56, row 1 in the original data set,
the data map showed the same response's location as column 661 in the
ASCII data set; a response at column 56, row 5 would be located at
column 665; and a response at column 56, row Y(12) would be located
at column 672 (see Appendix 5).
Next Previous
Return to CLIR Home Page >>
|