The Documentation Track
Several options were available for digitizing the documentation,
including image scanning, OCR, combinations of image and text (PDF
format), and text encoding. The initial part of the project included
identifying formats to test and setting up the scanning workstation.
Complete documentation for each file was scanned, including the questionnaires,
xrays, frequency counts, data maps, and other metadata whenever available.
Software and Equipment
The scanner used was a Hewlett-Packard ScanJet 4c with a document
feeder and SCSI connection to an IBM 750-P90 PC. All files were stored
on an external SCSI Iomega Jaz drive. For a major scanning project,
a fast machine with 32 MB of memory was required. Also, a PCI bus
SCSI card speeded up transfer rates from the scanner to the computer.
An automatic document feeder reduced the labor by automating the
page turning. Such labor-saving devices are cost-effective because
scanning operations tend to absorb a lot of resources and to constrain
work on other major tasks while the scanning is in progress.
Scanners differ in speed, and, for a given scanner, speed varies
with the desired scanning resolution. Speed could be an important
factor in making a purchasing decision for a major project, as it
can have a considerable impact on labor costs. For the HP 4c, the
time it took to scan a page varied with the desired resolution, as
indicated below.
Desired Resolution | Scanner Speed (time per page)
600 dpi | 30 seconds
300 dpi | 7.5 seconds
200 dpi | 3.3 seconds
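To show how these per-page timings add up over a whole document, the following minimal Python sketch multiplies them by a page count. The function name is ours, and the figures cover raw scanner time only; they assume no allowance for setup, feeder handling, or rescans, so actual elapsed time will be longer.

```python
# Per-page scan times (seconds) for the HP ScanJet 4c, from the table above.
SECONDS_PER_PAGE = {600: 30.0, 300: 7.5, 200: 3.3}


def scan_minutes(pages: int, dpi: int) -> float:
    """Return raw scanner time in minutes for `pages` pages at `dpi`.

    Illustrative helper only: it ignores setup, paper handling, and rescans.
    """
    return pages * SECONDS_PER_PAGE[dpi] / 60.0


if __name__ == "__main__":
    # Compare the three resolutions for a 39-page document.
    for dpi in sorted(SECONDS_PER_PAGE, reverse=True):
        print(f"{dpi} dpi: {scan_minutes(39, dpi):.1f} minutes")
```

Under these assumptions, moving from 300 dpi to 600 dpi quadruples the scanner time per document, which is why resolution matters for labor costs.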
TextBridge Pro Optical Character Recognition
We first scanned the documentation for one Roper Report using the
OCR software TextBridge Pro. We reviewed alternative OCR software
products and, finding no significant benefits to using one over another,
chose a package with which the staff had experience. Initial evaluation
of the OCR output showed that there were significant numbers of errors
in the resultant ASCII text. We tested various resolutions and documented
time taken for setup and for scanning, optimum settings, range of
file sizes, quality of proofing summaries, and procedures to follow.
The questionnaires we scanned with the TextBridge Pro software had
an unacceptable rate of character recognition errors, including errors in the
location information necessary for manipulating the accompanying
data files. Handwritten notes were completely lost, and the editing
costs of reviewing the output and changing all errors would have
been prohibitive. This format did not present us with an adequate
archival solution to preserving the textual material, so no further
documentation was scanned using this process. TextBridge Pro does
not work well with poor originals; determining optimum scanning settings
was very time-consuming (and sometimes impossible); compression formats
did not give good results; and raw ASCII format required time-consuming
reformatting (see Appendix 6 for a photocopy of a page from Roper
Report 9309, Question 10, and for the TextBridge Pro sample output
of the same question).
Document condition. When used with printed clean originals,
OCR is very accurate even when the font size is small, and can replicate
the formatting of the original document. For example, TextBridge
Pro is capable of producing low-resolution images and exporting both
images and text to a word processing document that retains the format
of the original printed page. However, there are far more problems
with this process when the quality of the original is anything less
than perfect, as in our case. TextBridge has a particular problem
with italics and underlining, even with good quality originals.
TextBridge Pro also has some difficulty recognizing tables and
columns even in clean originals, let alone
poor photocopies. This problem gets much worse when some of the entries
in some of the table cells are blank because the columns get shifted;
cleaning up the resulting output files becomes a major undertaking.
Scanner settings. TextBridge Pro allows considerable control
over the way in which the document is scanned in as an image. For
some settings, this task can be delegated to TextBridge Pro by setting
options to "automatic"; in other words, TextBridge Pro
tries to figure out what works as it scans the page. But TextBridge
Pro does not always make these determinations successfully. Nor are
photocopies ideal for scanning, particularly if not all of the characters
are completely clear and if not all of the pages are of the same
lightness/darkness. Successful recognition requires changing the
settings periodically to account for the varying quality of the photocopies.
Two outcomes are possible if the settings are not optimal: in some
cases, the program is unsuccessful in recognizing text that is legible
in the original; in others, it gives the frustratingly cryptic error, "page
too complex for selected mode."
We finally became convinced that there was no simple system for
setting these options. In the worst case, it was a frustrating process
of trial and error. Too dark a setting meant that the program tried
to decipher each small dot on the page as though it were part of
a character. As a result, it was not possible to use 600x600 resolution
in cases where originals were speckled. Likewise, selecting the "text
only" option for the original page format forced the program
to try to convert everything on the page into text, including imperfections
in the image or dark binding patches. On the other hand, scans made
with the automatic brightness setting were sometimes so light that no text was
recognized on the page. In some cases, we spent hours trying to correct
the settings manually.
Proofing text. TextBridge Pro provides an optional feature
that facilitates the correction of recognition errors. When text
is scanned, it is possible to save "proofing text." This
information is used by special modules that are installed into WordPerfect
or Word and are implemented as macros. If the relevant module is
installed and the document opened in the word processor, all words
that TextBridge Pro was unsure it had recognized are highlighted. The
color of the highlighted word indicates the confidence TextBridge
Pro has in its accuracy.
TextBridge Pro can be taught to recognize particular fonts with
a higher degree of accuracy through a period of training during which
it asks the user to help it recognize ambiguous characters. We did
not use this feature extensively. Our limited experience showed that it
does increase the likelihood of successful recognition when originals
are poor but not truly awful. The effort is only worthwhile if the
same types of fonts will be encountered often, which was not the
case with the Roper Reports documentation.
Final format. The available formats into which the output
text can be saved depend on the options selected. There are a very
large number of possible word processor, text, and spreadsheet formats.
However, if "save proofing text" is selected, then the
file can be saved in only Word or WordPerfect format. Similarly,
if "reconstitute text" is selected, only file formats that
support fairly complex formatting are available. When formatting
text with the "reconstitute text" option, TextBridge Pro
will use some of the new "style" features of WordPerfect
or Word in the new document. This can make subsequent editing cumbersome.
Though the styles themselves can be edited, an alternative is to
save in a file format that supports the text features that are being "reconstituted" but
does not support "styles." In this way, the formatting
will appear as regular tabs, font changes, and so forth, that can
be directly edited. The editing of ASCII text to recreate the format
of the original is a major undertaking and the time required to reformat
each document is extensive. We found that getting pagination to match
the original is particularly difficult.
The scanned images produced by TextBridge Pro can be stored in CCITT-3,
a compression standard, for later processing, but the results from
the subsequent processing of these images were not as good as those
obtained from processing the images directly from the scanner. We
decided that using these compression formats would not give usable
results.
PDF Files from Adobe Capture
The next step in the documentation portion of the project was to
produce documents in the portable document format (PDF) used by Adobe
Acrobat, a widely accepted de facto standard for encoding electronic
documents. The viewing software provided by Adobe allows for reading
and searching the text, viewing the structure of a document, and
printing high-quality hard copy. PDF documents provided clear, accurate
reproductions of the questionnaires. The Adobe Capture software produced
an interim ASCII text file that could be edited to improve text searching.
An example of a viewing screen may be found at the end of Appendix
6.
Basic structure of Acrobat files. Adobe Acrobat files (distinguished
by the PDF suffix on the file names) can contain both text and image
information from the original document. There are different types
of PDF files containing different kinds of information: normal PDF,
image-only PDF, and image+text PDF.
Normal PDF files, by default, display the text information derived
from the OCR process. Where the text information is unknown (when
there is a nontext picture on the page or there were difficulties
in the OCR process), a normal PDF file will insert the original image
into that space on the display page. Image-only PDF files are, in
effect, paginated pictures of the original pages. As with tagged
image file format (TIFF) files, the text in these images
is not searchable. Like image-only files, image+text PDF files show
the image of the original pages, but also contain searchable text.
The image+text files were chosen as the most appropriate for this
project. The user would see a faithful reproduction of the original
documentation (complete with handwritten notes) with the PDF browser,
but could also search for specific text within the document. If text
returned by a search looked suspicious, a user could view the original
image. In comparison, files produced by OCR programs contain only
the text information, with no way to double-check the text against
the image of the original.
Adobe Acrobat Capture procedures. When scanning the document
with the Capture software, a set of pages was scanned in sequence.
Each page was stored as a separate TIFF file, with the filenames
numbered sequentially. For example, a 40-page document produced 40
TIFF files, named page01, page02, through page40. These original
image files in TIFF format were stored separately from the final
reformatted PDF documents, providing a set of image files for digital
storage.
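The sequential naming scheme described above can be sketched in a few lines of Python. This is an illustration, not the Capture software's own logic: the function name, the `.tif` extension, and the rule for widening the zero padding beyond 99 pages are our assumptions.

```python
from __future__ import annotations


def page_filenames(page_count: int, prefix: str = "page", ext: str = ".tif") -> list[str]:
    """Return zero-padded sequential names such as page01.tif ... page40.tif.

    The padding is at least two digits (matching page01 in the text) and
    widens for longer documents so that names still sort in page order.
    The extension is an assumption; the source gives only the base names.
    """
    width = max(2, len(str(page_count)))
    return [f"{prefix}{n:0{width}d}{ext}" for n in range(1, page_count + 1)]


if __name__ == "__main__":
    names = page_filenames(40)
    print(names[0], names[1], "...", names[-1])
```

Zero padding matters because the image files are later concatenated in sequence; without it, a plain alphabetical sort would place page10 before page2.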
When translating the images into editable text, the TIFF files were
concatenated into a single document during the OCR scanning process.
Acrobat Capture analyzed the page layout and grouped text into regions.
It then identified characters and grouped them into words. The words
were looked up in the Acrobat dictionary (which can be customized)
and spelling suspects noted. Fonts were analyzed and font suspects
identified. The interim text layer of the final PDF file contains
no image data and can be edited with the Capture Reviewer so that
the text matches the original document as closely as possible. During
the OCR process, each word was assigned a confidence rating, representing
the software's estimate of its OCR accuracy.
For the searchable text of the final PDF file to be as accurate
as possible, many OCR errors were corrected by editing the interim
files. Fortunately, the program used to edit interim files, Acrobat
Capture Reviewer, would highlight words whose confidence levels fell
below a certain threshold, or that were not included in a dictionary
file. The majority of unrecognized words could be easily spotted
and corrected to match the original document. Although the Reviewer
software allows one to change fonts and other formatting options,
the only editing necessary for the project was in the content of
the words used for searching purposes (words used to locate terms
in question text and variable coding). Since the user would see only
the image reproduction of the original, the underlying ASCII text
did not need to be reformatted as a visual reproduction of the original.
Once the document was edited, it was then saved in the image+text
PDF format.
Time and storage requirements of PDF files. The total time
required to process a 39-page document was approximately four hours,
from scanning to saving the final PDF. The scanning itself took 30
minutes; the OCR process took 20 minutes; and editing the resulting
file took 3 hours 15 minutes. The storage space required for this
document is shown in table 5.
Table 5. Example of storage space required
for Acrobat PDF files
File Type | Storage Space
39 TIFF files | 1.19 MB
collated image file | 1.42 MB
final PDF file, normal PDF (image) | 924 KB
final PDF file, image+text PDF | 1.49 MB
ASCII output file | 100 KB
Documentation for the other nine Roper Report data files was also
scanned and edited. Some data files with split samples had dual documentation.
The time taken for the scanning process using a document feeder ranged
from 5 to 30 minutes. The OCR software took between 15 and 35 minutes
to process each document. The time taken to edit each document varied
widely, from one to eight-and-a-half hours. The time it took to complete
a single document depended largely on the quality of the original.
Features such as background shadows, crooked lines of text, compressed
fonts, jagged edges on letters, and handwriting increased both the
time it took the OCR software to process the page, and, more importantly,
the time it took to edit or insert accurate, searchable text. In
some cases, blocks of text were so unrecognizable that whole questions
needed to be typed in as hidden search text. Additional time was
required for error-checking.
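The timings reported in this section can be tallied with a short Python sketch. The stage names and helper functions are ours. The range calculation assumes the shortest (or longest) times for every stage fall on the same document, which the figures above do not establish, so it gives outer bounds rather than observed totals. Error-checking time is not included.

```python
# Minutes per stage for the 39-page example document (3 h 15 min = 195 min).
EXAMPLE = {"scanning": 30, "ocr": 20, "editing": 195}

# (minimum, maximum) minutes per stage across the other documents.
RANGES = {"scanning": (5, 30), "ocr": (15, 35), "editing": (60, 510)}


def total_minutes(stages: dict) -> int:
    """Sum the minutes spent across all stages."""
    return sum(stages.values())


def editing_share(stages: dict) -> float:
    """Return editing time as a fraction of total processing time."""
    return stages["editing"] / total_minutes(stages)


def total_range(ranges: dict) -> tuple:
    """Return outer bounds on total minutes, summing stage minima and maxima."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


if __name__ == "__main__":
    print(f"example total: {total_minutes(EXAMPLE)} minutes")
    print(f"editing share: {editing_share(EXAMPLE):.0%}")
    print(f"bounds for other documents: {total_range(RANGES)} minutes")
```

For the example document, editing accounts for roughly four-fifths of the total, which is consistent with the observation that the quality of the original drives the overall cost.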
Problems encountered in text recognition and editing of PDF files. Although
Adobe Acrobat Preview highlights most words with low confidence levels,
some forms of errors are not so easily detected during the editing
phase.
- A word was not recognized as a block of text by Capture software
during the OCR process. This means that instead of a text form
of the word, the document included simply a bitmapped image of
the word from the original page. Such an image, of course, would
not be searchable. Fortunately, these image blocks had some telltale
signs. First, they appeared to be in a font that was usually quite
different from the fonts assigned to the recognized words. Also,
they were usually surrounded by a fine gray or blue box. After
some experience, an editor could quickly spot the image boxes as
the page was being read. Although these image boxes could not be
changed to text, a Reviewer command can be used to insert hidden
search text underneath them. A later search would highlight the
image as if it were a normal word.
- A word was recognized as another, valid English word.
In this case, the confidence level for the word might be quite
high and the word would not normally be highlighted. For instance,
if a falsely recognized word was assigned a confidence level of
97 percent, and Preview was set to highlight words at 95 percent or
below, then the wrong word would not be highlighted. These
words can be highlighted by raising the threshold setting, although
at high levels (98 percent or 99 percent), virtually every word
in the document would be highlighted. The only practical way to
discover these errors was for an editor to read through the document
carefully. Once errors were spotted, the correct words would be
inserted.
- In rare cases, a line of text could be skipped in the OCR process.
Again, nothing would be highlighted, but fortunately there would
be obvious gaps in the text block where the line was skipped. A
careful reading of the document would reveal these gaps. As a corrective,
a new block of text could be created to overlay the gap. The correct
text could then be typed in for the purpose of future searches.
In each of these cases, an attentive editor must catch the OCR error
and make appropriate changes to the text to ensure accurate search
and retrieval. We did not edit enough documents to estimate the average
time needed for cleaning a complete document. Future projects will
need to budget extensive editing costs.
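The second problem in the list above, a misread that yields another valid word, can be illustrated with a small Python sketch of threshold-based highlighting. This is a hypothetical model, not the Capture software's interface: the `Word` class, the `flagged` function, and the sample words and confidence values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float          # OCR confidence in percent (0-100)
    in_dictionary: bool = True


def flagged(words, threshold=95.0):
    """Return words an editor would see highlighted.

    A word is flagged when its confidence is at or below the threshold,
    or when it is missing from the dictionary.
    """
    return [w for w in words if w.confidence <= threshold or not w.in_dictionary]


# Invented sample: "horse" stands for a misread of "house" that is still
# a valid English word and therefore scores a high confidence.
SAMPLE = [
    Word("income", 99.0),
    Word("horse", 97.0),
    Word("qvestion", 80.0, in_dictionary=False),
]

if __name__ == "__main__":
    for threshold in (95.0, 98.0):
        print(threshold, [w.text for w in flagged(SAMPLE, threshold)])
```

At a 95 percent threshold only the garbled word is flagged and the valid-word misread slips through. Raising the threshold catches it, but in a real document it would also flag a large share of correctly recognized words, so careful reading remains the only practical check.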
HTML and SGML/XML Marked-up Files
Conversion of scanned text to hypertext markup language (HTML)
format would provide a more readily accessible browsing format. However,
the text of each document would need to be fully edited and formatted.
As indicated above, the ASCII output from the OCR technology we used
could not provide us with text clean enough to use in HTML. Moving
text into documents adhering to the Standard Generalized Markup Language/Extensible
Markup Language (SGML/XML) is the most labor-intensive
but also the most dynamic alternative for text applications. SGML/XML
tagging allows customized and robust access to specific pieces of
the documentation (such as question text and variable location information).
SGML/XML Document Type Definitions maintain the integrity of document
content and structure and also define the relationships among the
elements. The emerging social science documentation standard for
both formatting and content, the Data Documentation Initiative (DDI)
Document Type Definition (DTD), provides standard elements developed
specifically for social science numeric resources. The standard adheres
to our requirement that text be stored in a system-independent, nonproprietary
format. Furthermore, this standard, developed by representatives
from the international social science research community, is intended
to fill the need for a structured codebook standard that will serve
as an interchange format and permit the development of new Web applications.
However, this format requires that the text be fully edited and the
components of the documentation tagged, and funding for this work
was not included in the budget for this project.
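To give a sense of what tagging question text and variable location information involves, the following Python sketch builds one marked-up variable with the standard library's ElementTree module. The element and attribute names (`var`, `location`, `StartPos`, `width`, `labl`, `qstn`, `qstnLit`) reflect our understanding of the DDI codebook vocabulary and have not been validated against the DTD. All values are placeholders, not content from a Roper Report.

```python
import xml.etree.ElementTree as ET


def build_variable(name, label, question_text, start_pos, width):
    """Return a <var> element holding question text and column location.

    Element names are assumed from the DDI codebook vocabulary; a real
    project should validate the output against the DTD itself.
    """
    var = ET.Element("var", {"name": name})
    ET.SubElement(var, "location", {"StartPos": str(start_pos), "width": str(width)})
    ET.SubElement(var, "labl").text = label
    qstn = ET.SubElement(var, "qstn")
    ET.SubElement(qstn, "qstnLit").text = question_text
    return var


# Placeholder values only; none of these come from an actual questionnaire.
example = build_variable("Q10", "Placeholder label", "Placeholder question text?", 25, 1)
xml_text = ET.tostring(example, encoding="unicode")

if __name__ == "__main__":
    print(xml_text)
```

Because each piece of the documentation sits in its own element, an application could retrieve only the question wording or only the column location. Producing that structure requires every question to be edited and tagged by hand or by script, which is the labor cost noted above.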