European Commission on Preservation and Access, Amsterdam October 1997
3.1 Picture quality
3.2 Storage form
Where good-quality microfilm is available as a long-term storage medium, the reproduction quality of the digital conversion form will be determined by its intended purpose. In other words, as a general rule, digitization of microfilm should not aim at the best possible result in the way that is mandatory for direct digitization of endangered original material.
Bitonal digitization on panchromatic ahu microfilm is adequate for the reproduction of printed text, including line drawings, and for modern non-impact typescript (plastic carbon band, and inkjet and laser printers). Gray scale must be used to digitize manuscripts, pencil and crayon drawings, typescript produced with a silk ribbon, color illustrations and drawings, other material with varying shades of gray, and black-and-white and color photographs. A sixteen-level gray scale (4 bit) is usually adequate for digitizing contrast-enhancing ahu film. For digitization from halftone film, a 256-level gray scale (8 bit) should be used. Digitization with gray scale requires far more storage, which has serious implications for cost at all stages of the process. It should thus be undertaken only where such reproduction quality is indispensable.
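The storage implications can be made concrete with a short calculation. The following Python sketch (the page size and resolution are chosen for illustration, not taken from this report) computes the uncompressed size of one scanned page:

```python
def page_size_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Uncompressed size of a scanned page in bytes."""
    mm_per_inch = 25.4
    pixels_wide = width_mm / mm_per_inch * dpi
    pixels_high = height_mm / mm_per_inch * dpi
    return pixels_wide * pixels_high * bits_per_pixel / 8

# An A4 page (210 x 297 mm) scanned at 400 dpi:
bitonal = page_size_bytes(210, 297, 400, 1)   # about 1.9 MB
gray_8  = page_size_bytes(210, 297, 400, 8)   # about 15.5 MB, eight times as much
```

The eightfold difference between bitonal and 8-bit gray scale is what drives the cost warning above.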
In digitizing from film, the necessary resolution is determined by the size of the smallest element that is to be clearly discernible. With printed texts, this is the height of the small “e”; with manuscripts it is the doubled letter width described in paragraph 2.1. In applying the appropriate formulas of the quality index, resolution requirements are determined in relation to the size of these elements. For bitonal digitization, the quality index is calculated according to the following formula: qi = (a × 0.039h)/3, where a is the resolution in dpi and h the height of the small “e” in millimeters. For digitization with gray scale, the formula is: qi = (a × 0.039h)/2.
With bitonal digitization, a resolution of 615 dpi (for 256 gray scale 410 dpi) is necessary to reproduce the small “e” at a height of 1 mm at higher quality. Medium quality is achieved with 385 dpi (256 gray scale 256 dpi). Lower quality results from 277 dpi (256 gray scale 185 dpi).
Given the high quality of the microfilm, it will be sufficient for most purposes to aim for a digital copy of medium quality. The required resolution can then be calculated on the basis of the quality index qi = 5 for medium quality as follows: resolution in dpi a = (3 × 5)/(0.039h), where h is the height of the small “e”. Where the height of the small “e” is 1 mm, this gives a value of about 385. For digitization with gray scale, the formula is a = (2 × 5)/(0.039h), which, for an “e” of the same height, gives a value of 256. Letters of this size (about 7 pt) are often used in footnotes.
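The resolution calculation above can be expressed as a short function. This is a sketch using only the formulas given in this section:

```python
def required_dpi(qi, e_height_mm, mode="bitonal"):
    """Resolution a (in dpi) needed to reach quality index qi
    for a small "e" of height h millimeters.
    From qi = (a * 0.039 * h) / f with f = 3 for bitonal
    and f = 2 for gray scale, we solve for a."""
    f = 3 if mode == "bitonal" else 2
    return f * qi / (0.039 * e_height_mm)

required_dpi(5, 1.0)            # about 385 dpi: bitonal, medium quality
required_dpi(5, 1.0, "gray")    # about 256 dpi with gray scale
required_dpi(8, 1.0)            # about 615 dpi: bitonal, high quality
```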
As an indication, the aim should be 350-400 dpi for bitonal digitization, 250-300 for gray scale. Test runs with typical films should be used to decide the quality required for each purpose.
Transfer of the digitized image data should be by digital audio tapes (dat) or cd-r (recordable). Readability independent of hardware is guaranteed for both media through standardization (din 66211 for dat, iso 9660 for cd-r). The current storage capacity of 650 Mb per cd-r and 2 Gb per dat tape will increase in the near future.
It is important to reach a binding agreement with the company undertaking the digitization that it will store the transferred material for at least as long as it takes for the customer to check and secure the data.
The digital conversion form is reliably secured when loss-free compressed or uncompressed image data have been secured on at least two data carriers, and it has been verified that their contents are identical and readable with no difficulty. In the simplest case, the two data carriers with the same content, the “primary data carrier” and the “working duplicate,” will be created by repeated successive transfer of the image data.
To ensure readability of the primary data carrier, multiple working duplicates should be produced from it. Performing a decompression test for every stored digital copy further enhances data security (see paragraph 5.3).
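Verifying that the primary data carrier and a working duplicate have identical, readable contents can be automated by comparing checksums. The following is a minimal Python sketch of such a check (file paths are illustrative):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Checksum of one image file, read in chunks so that
    large files do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def carriers_identical(primary_path, duplicate_path):
    """True if the two copies of an image are byte-for-byte identical."""
    return sha256_of(primary_path) == sha256_of(duplicate_path)
```

Reading both files in full also serves as a basic readability test of each carrier.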
The image data should be supplied the right way up (readable without rotation) in a common format, suitable for the largest possible number of applications. The Tagged Image File Format (tiff) has established itself widely as a model format for image data. The advantage of this format, in contrast, for example, to Windows-Bitmap, derives from the fact that it is largely platform-independent: tiff files can be read and further processed on differing equipment with differing systems and programs. It should, however, be noted that, despite thoroughgoing standardization, the tiff format allows variations that may not be compatible with the installed software. Here, too, careful discussion and, possibly, experimental runs with test data are recommended. tiff provides for uncompressed and compressed data supply. tiff g4 is available for loss-free compression of black-and-white material. Where loss-free compression is possible, it should be used for data delivery to save storage space. However, since not all programs can work with compressed tiff data, the compatibility of the application must be established in advance. In any case of doubt, uncompressed supply is to be recommended. The Joint Photographic Experts Group (jpeg) format, which is frequently used for the transfer of half-tone and color pictures, offers only variable compression ratios that are all lossy; it is thus not to be recommended.
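The platform independence of tiff rests in part on its header, which declares the byte order of the file explicitly. As a sketch of how delivered files might be checked automatically, the following reads that declaration (the function name and error handling are illustrative):

```python
import struct

def tiff_byte_order(path):
    """Report the byte order declared in a TIFF file header.
    A TIFF file starts with "II" (little-endian) or "MM"
    (big-endian), followed by the magic number 42 encoded
    in that byte order."""
    with open(path, "rb") as f:
        header = f.read(4)
    try:
        order = {b"II": "<", b"MM": ">"}[header[:2]]
    except KeyError:
        raise ValueError("not a TIFF file")
    magic = struct.unpack(order + "H", header[2:4])[0]
    if magic != 42:
        raise ValueError("not a TIFF file")
    return "little-endian" if order == "<" else "big-endian"
```

A reader must honor this declaration for every multi-byte value in the file, which is why tiff data remain legible across otherwise incompatible systems.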
Because image data can be organized in different ways, it is advisable to agree with the service provider on the organization of the material appropriate to each application. As a rule, each picture will be stored in a separate file. Gathering related pictures in one file (multiple tiff) is possible only with documents that consist of no more than a few pages.
For additional use of the data on the Internet, it is advisable to convert data into platform-independent formats that allow inclusion of the widest variety of documents. Such conversions are part of the service offered today by most of the specialist companies. Where appropriate, this format should be added to the contract.
For access to digitized images, various programs for viewing and manipulation are available for pc and unix environments. These include “Viewer” software, obtainable as public-domain software or shareware programs. It is recommended to install at each institution only one specific, standardized software, whose compatibility with the supply of digitized conversion formats can be rigorously tested in advance.
As a rule, viewer software should have the following features: page-turning forward and backward; use of the whole screen for display; magnification of the whole image and of selected parts of the image; reduction of the whole image; option of return to the original image; image rotation; image inversion; display of technical information from the headers, such as picture size, resolution, format, and bit depth; and printing. It is also very useful to have the option of image conversion into other formats and of image compression.
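Two of the operations listed, inversion and rotation, are simple pixel transformations. The following Python sketch models them on raw pixel data (a simplified illustration, not the implementation of any particular viewer):

```python
def invert_gray(pixels: bytes) -> bytes:
    """Image inversion for 8-bit gray scale:
    each pixel value v becomes 255 - v."""
    return bytes(255 - v for v in pixels)

def rotate_90(rows):
    """Rotate a row-major pixel matrix 90 degrees clockwise:
    reverse the row order, then transpose."""
    return [list(col) for col in zip(*rows[::-1])]
```

Inversion is particularly useful when negative film has been digitized and the image must be displayed as a positive.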
For instance, in the unix world xv is available as shareware. Depending on the installed hardware, appropriate viewers are supplied with the operating system (e.g., hp-ux imageview). For pcs, Imaging for Windows is available at no extra charge with Windows 95. Other examples of suitable software are PixView 2.1 from Pixel Translation, ScanMos uvp from ms Electronic Service, or, with limitations, Hijaak Pro 2.0 from North American Software.
Software for the control and display of digitized images and for rapid access should be chosen with a view to its specific applications. The requirements we have outlined serve as performance criteria for the viewer components of this application software.
Hardware installation that meets requirements for inspection and use of digitized images must be provided at each institution. The relatively large quantities of data contained in digitized images, as compared to text files, lead to heavier demands on the data bus and ram if the picture recovery time is to remain within acceptable limits. The minimum requirements are met by pc systems based on processors of type 486 with 66 MHz or Pentium, with Windows 3.11 or higher, 16 Mb ram, and a hard disk in the gigabyte range.
In the context of ergonomic design of the work station, particular importance attaches to screen size (at least 17 inches diagonal), processing speed, the graphics card, and a suitable drive. Normal pc screens of 14 inches are unsuitable for image representation, quite apart from the question of resolution. The resolution capacity of normal pc color screens is about 75 dpi, so the image resolution has to be reduced for display on the screen. Large screens manufactured specially for image work can reach higher resolutions, up to 120 dpi. In principle, the digital conversion form offers a higher resolution, but this becomes apparent only with magnification of selected parts of the image (zooming).
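The gap between scan resolution and screen resolution can be expressed as a simple ratio. The following sketch shows why zooming is needed (the 400 dpi figure is an illustrative assumption within the ranges recommended above):

```python
def reduction_factor(image_dpi, screen_dpi=75):
    """Factor by which a scan must be reduced to appear
    at its original physical size on the screen."""
    return image_dpi / screen_dpi

reduction_factor(400)        # about 5.3 on a normal pc screen (75 dpi)
reduction_factor(400, 120)   # about 3.3 on a special large screen (120 dpi)
```

At full physical size, roughly five scanned pixels collapse onto one screen pixel; the extra detail of the conversion form only becomes visible when a part of the image is magnified.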
Even where a high-quality microform is available alongside the digital conversion form and thus allows, if necessary, for repeated digitization, the converted form must be preserved in the long term: routine re-digitization is out of the question, if only on financial grounds. Given the increasing importance of electronic information systems in research and teaching, the digitized images should remain usable in the future for many possible applications. The complete data should therefore be preserved for the long term with as much of the information as possible retained, i.e., with loss-free compression or uncompressed, in a format that allows every conceivable use. Storage of data that have been compressed and formatted only for one specific application is not sufficient.
The loss-free compressed or uncompressed image data must therefore be migrated to new systems in a tiff format or in a platform-independent tiff successor format. This adaptation must follow a planned concept, in line with technical progress, and must not omit any development steps. The regular adaptation must take into account not only the expected durability of the storage medium, but also the currency of the format and the availability of the hardware and software needed for reading. The rapid succession of innovations in hardware and software, which seldom respect standardization efforts (scarce in this area anyway), can produce problems of compatibility. Migration must be carried out with extreme care, and the results must be checked image by image, as the loss of a single bit in a graphic file can result in serious loss of data, even up to a whole image. Responsible migration calls for organizational and technical measures to be undertaken before systems are replaced. The object of migration is to hold the data on at least two long-lived storage media, secure against interference, in a platform-independent format that is compatible with the edp system being used. The complete contents of the transferred image data can then be checked against the data source of the earlier generation, as long as the edp system that produced it remains available.
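The image-by-image checking required after a migration can be organized with a checksum manifest of the old generation, verified against the new one. This is a minimal sketch (the directory layout and function names are assumptions for illustration):

```python
import hashlib
import os

def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            manifest[os.path.relpath(path, root)] = digest
    return manifest

def verify_migration(old_root, new_root):
    """Return the relative paths whose contents changed or
    disappeared during migration; an empty list means every
    image survived bit for bit."""
    old, new = build_manifest(old_root), build_manifest(new_root)
    return [path for path, digest in old.items() if new.get(path) != digest]
```

Such a manifest should be built while the earlier-generation system is still available, so that any discrepancy can be resolved against the original data source.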
Generally, the digitization of microforms should be done by a service bureau. The costs of digitizing a uniformly produced 35 mm microfilm according to the foregoing recommendations depend essentially on the size of the task, the mode (bitonal or gray scale), and the resolution, but also on the quality of the film and the type and readability of the filmed material. Since digitization costs are also dependent on the market situation, it is not possible to give any general indication of prices that will have long-term validity.
The cost factors we have mentioned take account only of digitization itself. Experience has shown that further costs are incurred by manual turning, splicing images out of the general frame, and marking. Programming costs and the initial cost of programming the film scanner according to the customer’s requirements must also be considered. Finally, there are the costs of downloading the data, operating the cd-r, the carrier medium, and packing and transport. In cases where individual work and image enhancement with special software are necessary to improve quality, such costs must also be included.
The choice between digitization with a general raising of the resolution on the one hand and with gray scale on the other has an indirect bearing on the cost of the conversion. Higher densities of data mean higher costs in data supply, storage, and handling. The consequential costs of any planned migration must also be taken into account. In cases of need, it may be more economical to digitize a second time from the microfilm rather than to constantly migrate the data.
Optical character recognition (ocr) is a machine process that turns visible alphanumeric signs into coded data (codes corresponding to the alphanumeric signs and their context), according to a more or less standard pattern of recognition. There is here a fundamental difference between fully automatic text recognition and trainable recognition that supports pattern recognition with dictionaries, linguistic methods, and features of “artificial intelligence.” Text recognition programs increasingly integrate dictionaries and substitution lists that are adjustable according to degrees of certainty. To prevent incorrectly recognized characters from being silently substituted for the correct ones, such systems work with fuzzy logic and probabilities. Some systems include an interesting further feature known as “mixed mode”: signs or groups of signs that are either not recognized, or not recognized with certainty, are retained as images and remain in that uncoded form, in position in the remaining, correctly recognized, text.
In addition to reliable text recognition, page segmenting is an essential performance feature of text recognition systems—that is, the interpretation of contextual information such as columns, blocks of text, and graphics. Further features are deskew, segmenting of individual units, and recognition of types of handwriting and signatures or of more than one language in the same document.
The economic cut-off point for machine text recognition lies at about 99.5% recognition accuracy. In other words, if there are more than 4 or 5 mistakes per 1,000 characters, processing by hand is more economical.
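The break-even arithmetic can be written out as a small sketch (the 99.5% threshold is the figure stated above; the function names are illustrative):

```python
def errors_per_thousand(recognition_rate):
    """Expected recognition errors per 1,000 characters."""
    return (1.0 - recognition_rate) * 1000

def hand_correction_more_economical(recognition_rate, cutoff=0.995):
    """True when the recognition rate falls below the economic
    cut-off of roughly 4-5 mistakes per 1,000 characters."""
    return recognition_rate < cutoff

errors_per_thousand(0.995)   # 5.0 mistakes per 1,000 characters
```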
Reliability of text recognition depends essentially on the background, the kind and size of the type, and the contrast between text and background. Text recognition is disrupted by dirt on the material and by gaps in the image information caused by incomplete or irregularly printed letters. Reliability also depends on the density of the image information: the greater the amount of image information being processed, the higher the recognition rate. Higher resolution in digitizing can therefore improve the recognition rate, as can digitizing in gray scale.
In principle, the quality criteria we have mentioned also apply to microfilm. The correct standard background density and minimal ground shade are important to achieve high resolution and adequate contrast. Digitizing negative film avoids the disruption caused by dirt and scratches. In practice, there has not yet been enough experience with machine text recognition in conjunction with microfilm to allow the formulation of reliable views.