Several options were available for digitizing the documentation, including image scanning, OCR, combinations of image and text (PDF format), and text encoding. The initial part of the project included identifying formats to test and set up the scanning workstation. Complete documentation for each file was scanned, including the questionnaires, xrays, frequency counts, data maps, and other metadata whenever available.
Software and Equipment
The scanner used was a Hewlett-Packard ScanJet 4c with a document feeder and SCSI connection to an IBM 750-P90 PC. All files were stored on an external SCSI Iomega Jaz drive. For a major scanning project, a fast machine with 32 MB of memory was required. Also, a PCI bus SCSI card speeded up transfer rates from the scanner to the computer. An automatic document feeder reduced the labor by automating the page turning. Such labor-saving devices are cost-effective because scanning operations tend to absorb a lot of resources and to constrain work on other major tasks while the scanning is in progress.
Scanners differ in speed, and, for a given scanner, speed varies with the desired scanning resolution. Speed could be an important factor in making a purchasing decision for a major project, as it can have a considerable impact on labor costs. For the HP 4c, the time it takes to scan a page varied with the desired resolution as indicated below.
TextBridge Pro Optical Character Recognition
We first scanned the documentation for one Roper Report using the OCR software TextBridge Pro. We reviewed alternative OCR software products and, finding no significant benefits to using one over another, chose a package with which the staff had experience. Initial evaluation of the OCR output showed that there were significant numbers of errors in the resultant ASCII text. We tested various resolutions and documented time taken for setup and for scanning, optimum settings, range of file sizes, quality of proofing summaries, and procedures to follow.
The questionnaires we scanned with the TextBridge Pro software had an unacceptable rate of character recognition, including incorrect location information necessary for manipulating the accompanying data files. Handwritten notes were completely lost and the editing costs of reviewing the output and changing all errors would have been prohibitive. This format did not present us with an adequate archival solution to preserving the textual material, so no further documentation was scanned using this process. TextBridge Pro does not work well with poor originals, determining optimum scanning settings was very time-consuming (and sometimes impossible), compression formats did not give good results, and raw ASCII format required time-consuming reformatting (see Appendix 6 for a photocopy of a page from Roper Report 9309, Question 10, and for the TextBridge Pro sample output of the same question).
Document condition. When used with printed clean originals, OCR is very accurate even when the font size is small, and can replicate the formatting of the original document. For example, TextBridge Pro is capable of producing low-resolution images and exporting both images and text to a word processing document that retains the format of the original printed page. However, there are far more problems with this process when the quality of the original is anything less than perfect, as in our case. TextBridge has a particular problem with italics and underlining, even with good quality originals.
TextBridge Pro also does not work well with columns and has some difficulty recognizing tables and columns in originals, let alone poor photocopies. This problem gets much worse when some of the entries in some of the table cells are blank because the columns get shifted; cleaning up the resulting output files becomes a major undertaking.
Scanner settings. TextBridge Pro allows considerable control over the way in which the document is scanned in as an image. For some settings, this task can be delegated to TextBridge Pro by setting options to “automatic”; in other words, TextBridge Pro tries to figure out what works as it scans the page. But TextBridge Pro does not always make these determinations successfully. Nor are photocopies ideal for scanning, particularly if not all of the characters are completely clear and if not all of the pages are of the same lightness/darkness. Successful recognition requires changing the settings periodically to account for the varying quality of the photocopies. Two outcomes are possible if the settings are not optimal: in some cases, the program is unsuccessful in recognizing text that is legible in the original; in others, it gives the frustratingly cryptic error, “page too complex for selected mode.”
We finally became convinced that there was no simple system for setting these options. In the worst case, it was a frustrating process of trial and error. Too dark a setting meant that the program tried to decipher each small dot on the page as though it were part of a character. As a result, it was not possible to use 600×600 resolution in cases where originals were speckled. Likewise, selecting the “text only” option for the original page format forced the program to try to convert everything on the page into text, including imperfections in the image or dark binding patches. On the other hand, sometimes the auto brightness setting scans were so light that no text was recognized on the page. In some cases, we spent hours trying to correct the settings manually.
Proofing text. TextBridge Pro provides an optional feature that facilitates the correction of recognition errors. When text is scanned, it is possible to save “proofing text.” This information is used by special modules that are installed into WordPerfect or Word and are implemented as macros. If the relevant module is installed and the document opened in the word processor, all words are highlighted that TextBridge Pro was unsure it recognized. The color of the highlighted word indicates the confidence TextBridge Pro has in its accuracy.
TextBridge Pro can be taught to recognize particular fonts with a higher degree of accuracy through a period of training during which it asks the user to help it recognize ambiguous characters. We did not use this feature extensively. Our limited experience showed it does increase the likelihood of successful recognition if originals were poor but not truly awful. The effort is only worthwhile if the same types of fonts will be encountered often, which was not the case with the Roper Reports documentation.
Final format. The available formats into which the output text can be saved depends on the options selected. There are a very large number of possible word processor, text, and spreadsheet formats. However, if “save proofing text” is selected, then the file can be saved in only Word or WordPerfect format. Similarly, if “reconstitute text” is selected, only file formats that support fairly complex formatting are available. When formatting text with the “reconstitute text” option, TextBridge Pro will use some of the new “style” features of WordPerfect or Word in the new document. This can make subsequent editing cumbersome. Though the styles themselves can be edited, an alternative is to save in a file format that supports the text features that are being “reconstituted” but does not support “styles.” In this way, the formatting will appear as regular tabs, font changes, and so forth, that can be directly edited. The editing of ASCII text to recreate the format of the original is a major undertaking and the time required to reformat each document is extensive. We found that getting pagination to match the original is particularly difficult.
The scanned images produced by TextBridge Pro can be stored in CCITT-3, a compression standard, for later processing, but the results from the subsequent processing of these images were not as good as those obtained from processing the images directly from the scanner. We decided that using these compression formats would not give usable results.
PDF Files from Adobe Capture
The next step in the documentation portion of the project was to produce documents in the portable document format (PDF) used by Adobe Acrobat, a widely accepted de facto standard for encoding electronic documents. The viewing software provided by Adobe allows for reading and searching the text, viewing the structure of a document, and printing high-quality hard copy. PDF documents provided clear, accurate reproductions of the questionnaires. The Adobe Capture software produced an interim ASCII text file that could be edited to improve text searching. An example of a viewing screen may be found at the end of Appendix 6.
Basic structure of Acrobat files. Adobe Acrobat files (distinguished by the PDF suffix on the file names) can contain both text and image information from the original document. There are different types of PDF files containing different kinds of information: normal PDF, image-only PDF, and image+text PDF.
Normal PDF files, by default, display the text information derived from the OCR process. Where the text information is unknown (when there is a nontext picture on the page or there were difficulties in the OCR process), a normal PDF file will insert the original image into that space on the display page. Image-only PDF files are, in effect, paginated pictures of the original pages. Like tagged image file format (TIFF) files, the text in these images is not searchable. Like image-only files, image+text PDF files show the image of the original pages, but also contain searchable text.
The image+text files were chosen as the most appropriate for this project. The user would see a faithful reproduction of the original documentation (complete with handwritten notes) with the PDF browser, but could also search for specific text within the document. If text in the search function looked suspicious, a user could view the original image. In comparison, files produced by OCR programs contain only the text information, with no way to double-check the text against the image of the original.
Adobe Acrobat Capture procedures. When scanning the document with the Capture software, a set of pages was scanned in sequence. Each page was stored as a separate TIFF file, with the filenames numbered sequentially. For example, a 40-page document produced 40 TIFF files, named page01, page02, through page40. These original image files in TIFF format were stored separately from the final reformatted PDF documents, providing a set of image files for digital storage.
When translating the images into editable text, the TIFF files were concatenated into a single document during the OCR scanning process. Acrobat Capture analyzed the page layout and grouped text into regions. It then identified characters and grouped them into words. The words were looked up in the Acrobat dictionary (which can be customized) and spelling suspects noted. Fonts were analyzed and font suspects identified. The interim text layer of the final PDF file contains no image data and can be edited with the Capture Reviewer so that the text matches the original document as closely as possible. During the OCR process, each word was assigned a confidence rating, representing the software’s estimate of its OCR accuracy.
For the searchable text of the final PDF file to be as accurate as possible, many OCR errors were corrected by editing the interim files. Fortunately, the program used to edit interim files, Acrobat Capture Reviewer, would highlight words whose confidence levels fell below a certain threshold, or that were not included in a dictionary file. The majority of unrecognized words could be easily spotted and corrected to match the original document. Although the Reviewer software allows one to change fonts and other formatting options, the only editing necessary for the project was in the content of the words used for searching purposes (words used to locate terms in question text and variable coding). Since the user would see only the image reproduction of the original, the underlying ASCII text need to be reformatted as a visual reproduction of the original. Once the document was edited, it was then saved in the image+text PDF format.
Time and storage requirements of PDF files. The total time required to process a 39-page document was approximately four hours, from scanning to saving the final PDF. The scanning itself took 30 minutes; the OCR process took 20 minutes; and editing the resulting file took 3 hours 15 minutes. The storage space required for this document is shown in table 5.
Table 5. Example of storage space required for Acrobat PDF files
|39 TIFF files
|collated image file
|final PDF files:
normal PDF (image)
|ASCII output file
Documentation for the other nine Roper Report data files was also scanned and edited. Some data files with split samples had dual documentation. The time taken for the scanning process using a document feeder ranged from 5 to 30 minutes. The OCR software took between 15 and 35 minutes to process each document. The time taken to edit each document varied widely, from one to eight-and-a-half hours. The time it took to complete a single document depended largely on the quality of the original. Features such as background shadows, crooked lines of text, compressed fonts, jagged edges on letters, and handwriting increased both the time it took the OCR software to process the page, and, more importantly, the time it took to edit or insert accurate, searchable text. In some cases, blocks of text were so unrecognizable that whole questions needed to be typed in as hidden search text. Additional time was required for error-checking.
Problems encountered in text recognition and editing of PDF files. Although Adobe Acrobat Preview highlights most words with low confidence levels, some forms of errors are not so easily detected during the editing phase.
- A word was not recognized as a block of text by Capture software during the OCR process. This means that instead of a text form of the word, the document included simply a bitmapped image of the word from the original page. Such an image, of course, would not be searchable. Fortunately, these image blocks had some telltale signs. First, they appeared to be in a font that was usually quite different from the fonts assigned to the recognized words. Also, they were usually surrounded by a fine gray or blue box. After some experience, an editor could quickly spot the image boxes as the page was being read. Although these image boxes could not be changed to text, a Reviewer command can be used to insert hidden search text underneath them. A later search would highlight the image as if it were a normal word.
- A word was recognized as another, valid English word. In this case, the confidence level for the word might be quite high and the word would not normally be highlighted. For instance, if a falsely recognized word was assigned a confidence level of 97 percent, and Preview was set to highlight words at 95 percent or below, then the wrong word would not be highlighted. These words can be highlighted by raising the threshold setting, although at high levels (98 percent or 99 percent), virtually every word in the document would be highlighted. The only practical way to discover these errors was for an editor to read through the document carefully. Once errors were spotted, the correct words would be inserted.
- In rare cases, a line of text could be skipped in the OCR process. Again, nothing would be highlighted, but fortunately there would be obvious gaps in the text block where the line was skipped. A careful reading of the document would reveal these gaps. As a corrective, a new block of text could be created to overlay the gap. The correct text could then be typed in for the purpose of future searches.
In each of these cases, an attentive editor must catch the OCR error and make appropriate changes to the text to verify accurate search and retrieval. We did not edit enough documents to estimate the average time needed for cleaning a complete document. Future projects will need to budget extensive editing costs.
HTML and SGML/XML Marked-up Files
Conversion of scanned text to hypertext markup language (HTML) format would provide a more readily accessible browsing format. However, the text of each document would need to be fully edited and formatted. As indicated above, the ASCII output from the OCR technology we used could not provide us with text clean enough to use in HTML. Moving text into documents adhering to the standard general markup language/extensible markup language (SGML/XML) is the most labor-intensive but also the most dynamic alternative for text applications. SGML/XML tagging allows customized and robust access to specific pieces of the documentation (such as question text and variable location information). SGML/XML Document Type Definitions maintain the integrity of document content and structure and also define the relationships among the elements. The emerging social science documentation standard for both formatting and content, the Data Documentation Initiative (DDI) Document Type Definition (DTD), provides standard elements developed specifically for social science numeric resources. The standard adheres to our requirement that text be stored in a system-independent, nonproprietary format. Furthermore, this standard, developed by representatives from the international social science research community, is intended to fill the need for a structured codebook standard that will serve as an interchange format and permit the development of new Web applications. However, this format requires that the text be fully edited and the components of the documentation tagged, and funding for this work was not included in the budget for this project.