The ability to view and consult on-screen many original documents through their digital images stored on optical disk is the most innovative and well-known aspect of the project.25
As noted earlier, the AGI has two priorities: the sound conservation of original documents, and the optimal access to those documents by researchers. In 1992, the Reading Room accommodated some 50 researchers each day. In 1990, 37,303 consultations of original documents were recorded, and in 1991, 37,172.
Since this is a relatively small archive (slightly more than 43,000 document bundles), such heavy use by researchers and photocopiers entails a serious risk of deterioration of the original documents. Between the second half of 1989 and the end of 1992, almost 200 bundles in the AGI were consulted more than 50 times in the Reading Room.
How could the dilemma of conservation and document access be resolved? The best alternative was offered by the systems of document reproduction and the use of copies, rather than originals, for consultation. This leads to an important point that affected subsequent technical decisions: the aim was not to replace the original documents with digital copies but to preserve the original better by avoiding its continual handling during consultation. Thus, the technical specifications for digitization were set to ensure adequate quality for consultation, not to replace the original for all purposes. The level of quality had to satisfy most researchers so that originals would need to be consulted only in exceptional cases.
We can analyze the system of digital image storage by dividing it into two basic processes: the process of digitization and storage on optical medium, and the process of consultation. Digitization in turn can be studied in two further steps: the preparation of documents or prior archival work, and the digitization itself.
Selection of Documents to be Digitized
Since it was not possible to digitize all of the AGI documents in a reasonable amount of time, the first question was how to make a proper selection of the documents to be scanned. The following selection criteria were used:
- Only complete series would be digitized, never individually selected documents. This was the first and mandatory criterion.
- Documents found to be of greatest use for consultation would be selected. A statistical analysis would locate the documentary series that researchers had most often used.
- The documents selected would cover all territories relating to Spanish colonization in the New World. This would draw on the Archivo’s strength as the basic archive for history of the Americas.
- As a practical criterion, the status of document description was also considered, with a view to the work of preliminary preparation.
The final selection yielded a list of documentary series that would satisfy more than 30 percent of users’ requests. The list represented about 10 percent of all AGI documents. It was revised by adding some documents held by two other historical archives relating to Spanish history in the New World, the Archivo Histórico Nacional (AHN) and the Archivo General de Simancas (AGS). These documents, however, were not selected by the same criteria as those for AGI.
The purpose of supplementing the Archivo’s holdings was to meet fully the goal established by King Carlos III when the AGI was founded: that all documents “referring to the Indies” should be deposited in the AGI. A further goal was to incorporate other national archives into the process of technological modernization. More detail about this work appears in Appendix 3.
The documents selected must be properly prepared for forwarding to the digitization room. This entails the traditional work of organizing the documents, bundle by bundle, drafting new descriptions for them or revising present ones, placing them properly in folders, writing the reference number, etc. These steps are fundamental in ensuring a subsequent expeditious and successful retrieval of digital images.
The final step in preparation is entry into the information system of the descriptive data that will eventually enable document access.26 A “digitization guide” is also prepared, a form that includes the minimum information needed to guide the scanner operator in his work and to carry out the subsequent liaison between the textual database forming the information and reference system and the digital images stored on optical disks.
Once the documents have been prepared, they are digitized. From 1989 to 1994, digitization stations were supplied with an IBM AT computer, a Rank Xerox 7650 flatbed scanner, and an optical disk unit (first an IBM 3363 and later a Panasonic).27 The scanner could digitize originals up to A3 size at up to 400 dots per inch (dpi) with 256 grayscale levels, although the AGI uses only 100 dpi and preserves only the most significant 16 grayscale levels.
The same routine is followed for each page: scanning, viewing, compression, and direct recording on optical disk. The entire process takes about one minute per page. The documents, because of their intrinsic value, age, and state of conservation, require careful handling during digitization, so no automatic scanner feeding is allowed. From 1990 to 1992, the work was carried out by 15 digitization stations working double shifts with a variety of equipment to prevent delays in case of breakdown. The number of staff decreased between 1992 and 1994, when the agreement between the three partners ended. In 1995 and 1996, work continued with three full-time and six part-time employees.
At the end of 1996, the AGI got two new digitization stations, each equipped with a Kodak DCS 420 digital camera and Hewlett-Packard CD-Writer 4020i disk recording units. This made it easier and faster to complete the work (only seconds per page), with less risk to the documents. Now, AGI requires only one-third the staff to achieve the same rate of productivity as with the old equipment.
The most important aspects of AGI’s digitization work are image quality, storage support, formats and image compression, and quality control.
The quality of a digital image is determined basically by two parameters: resolution and grayscale. How can the AGI obtain a digital image offering quality guarantees for its stated goals? In recent years, a series of projects have tried to set standards guiding the selection of such parameters, and various criteria have been suggested. Don Willis distinguishes three levels: “archival resolution,” defined as “the resolution required to ensure a faithful replica of the original document regardless of cost” (with archival resolution at 600 dpi and 8 bits of grayscale); “optimal archival resolution,” defined as “the highest resolution economically supported by technology at a given time” (assuming a balance between cost and quality); and “adequate access” resolution, which disregards conservation and focuses on information needs (estimated at 300 dpi in black and white).28
Obviously, many types of documentation can be digitized, each with its own characteristics and requirements.29 Anne R. Kenney and Stephen Chapman divide the documents by category: text or line, halftone, continuous tone, or mixed documents. Each type requires different parameters in principle. While some maintain that “archival resolution” is 600 dpi, NARA proposes 300 dpi, after having used 200 dpi in its ODISS (Optical Digital Image Storage System) project.30
Different projects have different goals that determine digitization parameters. Seeking complete “replacement” of the original by the digital image differs from digitizing solely for on-screen consultation or on-line network access to limit handling of originals.
Document characteristics also determine the selection of parameters. There is a difference between digitizing modern typed documents that are well preserved and documents of the sixteenth or seventeenth centuries with bleedthrough from ferrogallic inks or with inks faded by exposure to humidity. A suitable balance among costs, budget resources, and the stated objectives must also be considered. Ignoring these aspects while pursuing optimum quality could make it impossible to maintain projects because of the implications for storage and even general systems configuration. For example, optimal quality is likely to require more powerful processors, higher-quality monitors, and enhanced capacity networks.
In terms of the AGI project, the aim is to obtain reference quality that will allow users to consult the document on-screen or in hard copy, rather than in its original form. The documents to be digitized are manuscripts from the fifteenth to nineteenth centuries, in various states of conservation. They may show water spots, stains, faded ink, or bleedthrough. They can be classified as “continuous tone” according to the classification drawn up by Anne Kenney and Stephen Chapman.31
The fundamental criterion is the search for a proper balance between image quality on the one hand and storage and processing-capacity needs on the other. Fortunately, the costs of storage and processing capacity have dropped steadily, making it less expensive to store and process good quality images. Technological advances will also lead to the development of reasonably priced equipment that can handle vast quantities of information.
Several tests were conducted to ensure quality consistent with the purpose and type of documents,32 with an eye to minimizing storage requirements. Based on these tests, it was decided to digitize at 100 dpi and 16 grayscale levels (4 bits per pixel). These parameters are quite different from those used in other digitization projects but are well suited to the aims of the Seville project: to digitize for consultation, rather than for replacement of the original. The parameters are also consistent with the minimum values recommended in 1995 by the Technology Subcommittee of the UNESCO Memory of the World Programme.33
Based on these parameters, project staff decided not to improve images during digitization, but to defer that step to the stage of viewing. The viewer decides on the use of image enhancement algorithms at the time of consultation. This procedure also simplifies and shortens the work of digitizing: the operator does not need to make any special decisions regarding image quality, which is assumed to be adequate if the specified parameters are maintained.
The effect of digitization with grayscale on resolution requires further study, as noted by Kenney and Chapman.34 Their comments are made with respect to tests that the Cornell Department of Preservation and Conservation conducted on brittle books. The impact of digitization with grayscale on the cost of storage, time, equipment, and processors is declining steadily owing to the rapid growth of processing capacity.
The project also aimed to digitize AGI’s maps and plans collection, which consists of about 7,000 pieces that have been extracted from their original, provenance-based files to facilitate their conservation. These materials differ from other normal documents in that they use color and are oversized. Because of this, AGI had to use computers with greater capacity and processing speed, and new solutions for storage had to be found. It was also impossible to use the same scanner as for other materials.
It was decided to first make a color microfilm copy, which would subsequently be digitized. Microfilming with Cibachrome, at 200 line pairs, would yield a good-quality color copy. The film could then be digitized with a Nikon LS3500 slide scanner, able to capture 4,096 x 6,144 pixels for 35 mm frames. The final resolution in digital form was similar to the grayscale documents (100 dpi), with 256 colors. For quicker access, a black-and-white copy could be saved.
The color microfilming was completed, but the scanning operation was interrupted and only one hundred digital color images were obtained. There were three reasons for this. First, although the quality of the images appeared basically good, sometimes the resolution was too low for sharp text legibility. Second, it took too long to display the large color document images on the screen with the 486 processors. Third, this work assumed lower priority after the AGI could offer consultation of maps and plans through color microfilm in the Reading Room.
Most of the AGI documents are folio size, measuring slightly over A4. An A4-size image, digitized at 100 dpi and 4 bits/pixel uncompressed, can occupy about half a megabyte. While still considered sizeable in 1997, the storage requirement was much more significant ten years earlier.
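The half-megabyte figure can be verified with a quick back-of-the-envelope calculation (A4 dimensions assumed; illustrative only):

```python
# Uncompressed size of an A4 page digitized at 100 dpi, 4 bits/pixel.
dpi = 100
bits_per_pixel = 4                    # 16 grayscale levels
width_in, height_in = 8.27, 11.69     # A4 dimensions in inches

pixels = round(dpi * width_in) * round(dpi * height_in)
size_bytes = pixels * bits_per_pixel // 8

print(f"{pixels:,} pixels -> {size_bytes / 1e6:.2f} MB uncompressed")
# roughly 0.48 MB, i.e., "about half a megabyte"
```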
Besides the need for image compression algorithms, there were few choices of storage media. The most useful option was the WORM (write once/read many) optical disk because of its capacity, ease of recording and of subsequent use for consultation, and predictable longevity. The fundamental problem with WORM disks was the absence of standards and the variety of trademarks and formats, which increased the risk of obsolescence.
A 200 MB capacity IBM 3363 optical disk was used first. This disk model, on which 1,729 bundles were digitized, was soon replaced by a Panasonic (Reflection Systems RF-5010C) disk with 940 MB capacity. The Panasonic could usually store all the images of a bundle on a single disk, after compression. It was used to record 3,732 bundles of documents. Each new RF-5010C disk cost 15,000 pesetas wholesale (about US $100). Thus, with 5,511 bundles digitized by the end of May 1997, the cost of storage media was very high. Assuming that each disk held the images of one bundle, and each bundle averaged 1,956 pages, the per-page storage cost could be estimated at 7.67 pesetas (approximately five cents).
The process of converting all disks to the new media carrier began with the changeover from the IBM to Panasonic disks. (The conversion was not completed as of 1997 and the consulting system was still using both types of disks at that time.)
In recent years, the use of CD-ROM has become more common. CD-R recording equipment has been developed, and the cost of blank CDs has dropped. Consequently, it was decided to migrate to this new format. At the end of 1996, the AGI acquired new recording equipment with CD-R units and installed six units for converting WORM disks to CD-R (three to convert IBM disks and three for Panasonic disks). Migration has been under way since early 1997 and should take about two years.
For several years, the lack of image backup copies has been a key problem. The AGI postponed a decision on the backup system in anticipation of better options, such as magnetic or optical tape. Meanwhile, the AGI had only a single copy of the images, so the loss or deterioration of a disk would have required redigitization of the original. A backup copy program was finally established in 1995. Five units were installed for copying Panasonic disks onto DAT DDS-1 and DDS-2 magnetic tape (2 GB and 4 GB tapes). Between April 1995 and the end of 1996, 3,205 disks were copied, leaving only 525 IBM disks uncopied onto a new media carrier. These are now being copied directly to CD-R in the new migration process that began early in 1997. Now, when documents are digitized, two CD-R copies of the digitized images are made: one for use and another for backup. The cost of a new CD was $6.50 at the beginning of 1997, but is now much cheaper.
By the end of May 1997, the status of migration was as follows:
IBM 3363 disks:
- Digitized: 1,729 bundles
- Copied on Panasonic: 1,186 bundles
- Copied on CD-R: 110 bundles
- Uncopied: 433 bundles
Panasonic RF-5010C disks:
- Digitized: 3,732 bundles
- Duplicated on Panasonic: 571 bundles
- Copied on magnetic tape: 3,205 bundles
- Copied on CD-R: 141 bundles
CD-R disks:
- Digitized: 50 bundles (two copies)
- Copied from IBM: 110 bundles
- Copied from Panasonic: 141 bundles
Magnetic tape:
- Copied from Panasonic: 3,205 bundles
Formats and Image Compression
The heavy storage requirements for digital document images call for the use of compression formats. The AGI wanted to scan for grayscale, rather than for black-and-white or binary images. But standard algorithms for color or grayscale images, such as JPEG and GIF, were not available in 1988. Consequently, the AGI digitization project developed its own compression algorithm using a DPCM (Differential Pulse Code Modulation) model with statistical compression. This model allows for an approximate reduction of 2 to 1 without loss of quality.
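The AGI's actual algorithm is proprietary and not detailed here, but the DPCM idea it builds on can be sketched minimally: each pixel is predicted from its predecessor and only the difference is stored, which clusters values near zero and makes the subsequent statistical (entropy) coding effective. Function names below are illustrative.

```python
def dpcm_encode(samples):
    """Replace each sample with its difference from the previous one."""
    deltas, prev = [], 0
    for s in samples:
        deltas.append(s - prev)
        prev = s
    return deltas

def dpcm_decode(deltas):
    """Invert the delta coding by cumulative summation."""
    samples, prev = [], 0
    for d in deltas:
        prev += d
        samples.append(prev)
    return samples

row = [12, 12, 13, 15, 15, 14, 12, 12]   # one row of 4-bit gray values
deltas = dpcm_encode(row)                # [12, 0, 1, 2, 0, -1, -2, 0]
assert dpcm_decode(deltas) == row        # lossless round trip
```

Because neighboring pixels in a manuscript scan are highly correlated, most deltas are small and repetitive, which is what allows the roughly 2:1 lossless reduction the project reports.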
The increased use of JPEG made it advisable for the AGI eventually to adopt this compression algorithm. The current migration to CD-R is also changing the compression format: images are decompressed from the AGI format and recompressed in JPEG during the recording process. Since JPEG allows the compression parameters to be adjusted to the quality sought in viewing, the new compression is set to permit no more than a 15 percent loss. The compression factor is similar to that of the AGI format, with images occupying about the same amount of space in either format.
After digitization, quality control can be done in two ways. The first is automatic, by comparing the digitization guide with the resulting optical disk. This process can detect mistakes, such as omissions and reference number errors, that can be resolved before the WORM disk is dispatched for consultation service or before transfer from hard disk to the CD.
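The automatic check can be pictured as a multiset comparison between the page references listed in the digitization guide and those actually recorded on the disk; the function and the reference strings below are hypothetical, not the AGI's software.

```python
from collections import Counter

def check_disk(guide_refs, disk_refs):
    """Compare the digitization guide against what was recorded on disk."""
    guide, disk = Counter(guide_refs), Counter(disk_refs)
    missing = sorted((guide - disk).elements())   # omissions
    extra = sorted((disk - guide).elements())     # repetitions / wrong refs
    return missing, extra

guide = ["p001", "p002", "p003"]   # pages listed in the digitization guide
disk  = ["p001", "p003", "p003"]   # pages actually found on the disk

missing, extra = check_disk(guide, disk)
print("missing:", missing)   # ['p002'] - an omitted page
print("extra:  ", extra)     # ['p003'] - a repeated page
```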
The second means of quality control is manual inspection. This can be done by accessing document images at selected intervals (such as every page, or the first pages of each document or block, one page of every five, or one page of every day’s batch of digitized images). In practice, this type of quality control creates a bottleneck since it must be done by specialized personnel and is time-consuming. It was done for only a brief period at the AGI. The users themselves, both outside researchers and staff members, have been the ones to detect possible errors, such as omissions or repetitions.
Once digitization and quality control with the first method have been completed, the resulting disk may be used for consultation. It may also be duplicated and a backup copy made.
In deciding how to place such a large volume of images in service, several options were considered.
- Decentralized service would allow users to retrieve the disks themselves, but it has several drawbacks. These include security risks and the need to equip each user station with readers for both types of optical disks. There would also be the problem of many researchers searching for their disks simultaneously. This option would be more attractive if fewer disks were involved.
- A centralized strategy called for various possibilities to be studied:
- Use of a jukebox. This is the usual solution for centralized service. But no single unit could handle so many optical disks, and the use of a battery of jukeboxes greatly increased the cost. In addition, a large space was needed to house them.35
- Use of a robot. There were no robots available on the market for optical disks, only for magnetic tape cartridges. One of these could have been adapted, but the cost was very high and the risk of obsolescence significant.
- Centralized service with human intervention. A user’s request for a disk would be handled by a human operator, located in the Optical Disk Room, who would retrieve the disk and place it in the reading unit.
The use of a set of high-capacity magnetic disks was not studied because they were considered impossible to obtain at the time. This option is now beginning to offer possibilities, and its use will probably be justified for urgent requests (over the Internet, for instance).
The AGI chose the option of centralized service with human intervention. It was the least expensive and allows for the introduction of other options in the future. Optical disk servers were installed in the AGI Optical Disk Room, connecting more than a dozen optical disk readers for the two existing formats. All available optical disks are installed and organized in shelves beside the servers and the optical disk readers. An operator at that site handles the requests for images as shown on a monitor displaying disk request messages. There are now different servers for IBM and for Panasonic disks. It will eventually be necessary to install a server, with reading units, for CDs.
When a user at a workstation requests an image, the system generates a message shown on the monitor. The operator receives the message, selects the requested disk from its shelf, and places it in an available disk unit. The document images are sent through the local area network (16 megabits per sec.) to the user’s workstation. When all the document images requested have been delivered to the workstation, the optical disk may be withdrawn by the disk unit operator, who is then ready for the next request. The entire process can be carried out relatively easily and efficiently by a single person handling requests through the monitor and a few disk-reading units.
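The request flow above can be sketched as a simple message queue between workstations and the Optical Disk Room operator; all names and identifiers here are illustrative assumptions, not the AGI's actual software.

```python
import queue
from dataclasses import dataclass

@dataclass
class DiskRequest:
    workstation: str   # which Reading Room station asked for images
    disk_id: str       # which optical disk holds the requested bundle
    bundle_ref: str    # reference number of the requested bundle

pending = queue.Queue()   # messages shown on the operator's monitor

# A workstation posts a request for a document's images.
pending.put(DiskRequest("reading-room-07", "PAN-0412", "Contratación 5536"))

# The operator loop: take the next message, mount the disk, serve images.
req = pending.get()
print(f"Mount disk {req.disk_id} for {req.workstation} ({req.bundle_ref})")
# ...the images then travel over the 16 Mbit/s LAN to the workstation...
pending.task_done()
```

A single queue consumed by one operator matches the text's observation that one person with a monitor and a few disk readers can serve all requests.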
This means that part of the consulting process is not automated and that human errors can occur. The time required for sending the images is brief: in a minute to a minute and a half, the researcher will receive the first image on a monitor and the series of pages constituting the document or file requested will begin to be stored in the workstation. Within a short time, the researcher will have the entire document. From that point on, the researcher can work locally for as long as desired.
When the researcher receives the requested document images on the monitor, he or she can begin to consult them, using a variety of tools for image treatment and enhancement.36
In the original version of the system, the workstation consisted of a PS/2 computer with a 486 processor and OS/2 operating system, and with Dialog Manager and Presentation Manager for user interface. These controlled two monitors: one conventional VGA (IBM 8513) for text and image management, and another high-resolution unit for image display (IBM 8508 for grayscale images or IBM 6091 for color). This initial interface with two screens has been modified; the subsequent versions carry out all functions in a microcomputer with a Pentium processor and a single monitor. The same mix of facilities is retained for document management, browsing, expansion and rotation, printing, and the use of algorithms for historical document treatment (such as elimination of stains and ink bleedthrough, enhancement of faded inks, and improvement of contrast).37
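The AGI's enhancement algorithms themselves are not published here, but one of the simplest techniques in this family, linear contrast stretching, gives a flavor of how faded-ink images become readable on screen; this sketch is illustrative only.

```python
def stretch_contrast(pixels, levels=16):
    """Linearly rescale gray values to span the full range [0, levels-1]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)          # flat image: nothing to stretch
    scale = (levels - 1) / (hi - lo)
    return [round((p - lo) * scale) for p in pixels]

# A faded-ink row: values crowded into a narrow band of the 16-level range.
faded = [9, 10, 10, 11, 9, 12, 10, 9]
print(stretch_contrast(faded))   # [0, 5, 5, 10, 0, 15, 5, 0]
```

Spreading a narrow band of values across the full grayscale range increases the visual contrast between ink and background, which is the basic effect behind the "magical" cleanup that Rütimann and Lynn describe.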
25 Pedro González, “¿Salas de Lectura sin Papel?” [Paperless Reading Rooms?], in Proceedings of the 11th International Congress on Archives (Munich, New York, London, Paris: K.G. Sauer, 1989), 229-33.
26 “Efficient retrieval of scanned document images and graphic data depends on the accurate, up-to-date index data base. Indexing a digital image involves linking descriptive image information with header file information…And accuracy is critical because an erroneous index term may result in non-retrieval of the related image.” National Archives and Records Administration (NARA). Digital Imaging and Optical Digital Data Disk Storage Systems: Long-Term Access Strategies for Federal Agencies. Technical Information Paper no. 12 (NARA, 1994).
29 Anne R. Kenney and Stephen Chapman. Tutorial: Digital Resolution Requirements for Replacing Text-Based Material: Methods for Benchmarking Image Quality (Washington, DC: Commission on Preservation and Access, 1995).
32 See Julián Bescós and Juan Navarro, “La Digitalización como Medio para la Preservación y Acceso a la Información en Archivos y Bibliotecas” [Digitization as a Means of Preserving and Accessing Information in Archives and Libraries], Educación y Bibliotecas 80 (1987): 28-41.
35 Jukeboxes on the market now can handle several hundred CDs, which means they could be practical, especially if the new DVD standard is adopted. This would yield much more storage capacity than the current CD and a sizable reduction in the number of disks needed.
36 For research conducted to develop image enhancement tools for better readability of documents, see:
Julián Bescós Ramón, “Image Processing Algorithms for Readability Enhancement of Old Manuscripts,” in Electronic Imaging 89 (Pasadena, CA, 1989), 1:392-97.
Julián Bescós Ramón, Francisco Jaque, and Luis Montoto, “Reflectance and Optical Contrast of Old Manuscripts: Wavelength Dependence,” Scanning Imaging, vol. 1028 (Society of Photo-optical Instrumentation Engineers [SPIE], 1989): 258-62.
Julián Bescós Ramón, Juan Pedro Secilla, and Juan Navarro, “Filtering and Compression of Old Manuscripts by Adaptive Processing Techniques.” Proceedings of the Society for Information Display International Symposium 1990 (Las Vegas: Society for Information Display, 1990): 384-87.
Julián Bescós Ramón, Juan Navarro, and Carlos Ramón, “Mejora de Legibilidad de Documentos Antiguos Mediante Tratamiento Digital de Imágenes” [Enhancing Readability of Old Documents through Digital Image Treatment]. IV Simposium Nacional de Reconocimiento de Formas y Análisis de Imágenes [Proceedings of Fourth National Symposium on Form Recognition and Image Analysis] (Granada: Sociedad Española de Reconocimiento de Formas y Análisis de Imágenes, 1990): 51-58.
37 In their previously cited report, Hans Rütimann and M. Stuart Lynn note: “The speed and ease of use of these tools are impressive. There is something almost magical in seeing a badly stained section of a 300-year-old manuscript cleaned up before one’s eyes and become legible again.” Rütimann and Lynn, 11.