Carnegie Mellon University: A New Electronic Archives • CLIR

http://www.library.cmu.edu

BACKGROUND

Carnegie Mellon University (CMU) is a private institution serving an undergraduate and graduate student population of 7,500, with a faculty, research, and administrative staff of 3,000, located in the Oakland section of Pittsburgh. The University Libraries, consisting of the Engineering and Science Library, the Hunt Library, and the Mellon Institute Library, have a staff of 32 faculty members, 56 library employees, and 28 students working full-time, and an annual budget of $6.3 million. The libraries’ collections consist of about 800,000 volumes and 250,000 audiovisual materials. Reciprocal borrowing agreements have been established with both the Carnegie Library and the University of Pittsburgh’s Hillman Library, also located in Oakland. Motivated by the limited size of its print collection, the University Libraries have made extensive efforts to provide a wide range of electronic resources.

Institutional goals at Carnegie Mellon University emphasize the following:

delivering distinctive and first-quality education
fostering research, creativity, and discovery
using knowledge developed on campus to serve society

Carnegie Mellon University has had a strong technical focus since its founding in 1900. Throughout its evolution, culminating in its merger with the Mellon Institute in 1967, CMU has honored the words of founder Andrew Carnegie: “My heart is in the work.” Research lies at the heart of this work, and Carnegie Mellon, a leader in using computer technology in research and education, is recognized as being among the top research institutions in the country. Outstanding programs in computer science, robotics, and engineering have further solidified Carnegie Mellon’s technological reputation.

Because of CMU’s traditional emphasis on technology, it seemed natural for the University Libraries to assume a stronger position by providing digital services. The libraries had already participated in the Mercury, TULIP, and UMI Virtual Library projects. The subject of this case study, a project to create the Senator H. John Heinz Archives (referred to hereafter as “the Heinz Archives”), follows the tradition of advancing the libraries’ strategic plan, according to University Librarian Gloriana St. Clair. The plan includes the following:

technical expertise in digital library development issues
demonstration digital library projects
exemplary digital instructional programs
automated reference assistance that students can use remotely
intuitive, easy-to-use systems¹

The Heinz project readily meets three of these five conditions.

THE PROJECT

In 1992, as part of a larger donation to Carnegie Mellon University, Teresa Heinz, widow of the late Senator H. John Heinz III, gave papers from her husband’s congressional service to the University Archives. The Heinz family donated the papers to Carnegie Mellon to encourage exploration of primary source congressional archive documents by a broad group of users. At the same time, they hoped the papers would be a focal point for research by the faculty and students of the H. John Heinz III School of Public Policy and Management. While a traditional paper archives would provide the latter opportunity, the broader group of users could best be reached by making the core documents accessible electronically. According to Gabrielle Michalek, university archivist and manager of the Heinz Archives project, Teresa Heinz was also interested in the development of an innovative means of access to her husband’s papers. The libraries’ staff was called upon to draft a proposal, which received funding of just over one million dollars from the Heinz foundations.

In November 1993, Michalek, along with Charles Lowry, then Carnegie Mellon University librarian, and David Evans, director of the Laboratory for Computational Linguistics, drafted a proposal to develop an electronic archives of the most important of Heinz’s papers. The archives was designated HELIOS, the Heinz Electronic Library Interactive Online System. In creating this electronic archives and related interfaces, the planners drew on lessons from the libraries’ previous information technology projects.

The Process

Staff across the Carnegie Mellon campus collaborated to develop a model for HELIOS. Continuity of staff was essential to the success of the project. Changes in administration occurred in the offices of both the university president and the librarian during the course of the project, but the University Archives and University Libraries’ information technology staff worked to ensure sustained progress. They were supported by university and library administrators.

The technological orientation of Carnegie Mellon has been essential to this project’s success. Equally important is University Librarian St. Clair’s enthusiasm for digital library resources and development. Furthermore, the staff and administration of the University Libraries have embraced the opportunities that technology can bring to their work. The administration has encouraged staff members to look to technology for solutions and alternatives. Collaboration is yet another essential element. The University Libraries work closely with other campus units to find supplemental grants for digital projects. In keeping with their desire to find collaborative solutions to technical problems, the libraries have also become a member of the Digital Library Federation.

The collaborative nature of the development of the Heinz Archives has become the standard for many other projects undertaken at Carnegie Mellon. Three organizations participated in the development: the University Libraries, the Laboratory for Computational Linguistics (LCL), and CLARITECH Corporation, a company that developed and customized software for this project. The latter two groups provided software development expertise, while library staff provided system design leadership and project management.

Heinz Archivist Edward Galloway manually processed 1,200 cartons of materials to identify the structure and composition of the collection. Carton contents included papers, audiovisual materials, and memorabilia. The archivist used previously written guides for developing congressional archive collections. After assessment and selection, the number of cartons to be processed was reduced by half. During this review process, series within the collection were identified, a logical arrangement was developed, and an initial inventory was created. While manual processing of archive materials took place, CLARITECH began work on the graphical interfaces.

Graphical Interfaces

Important issues of preservation, system design, access, and place and time independence were identified. System design focused on the need for different interfaces for scanning, verification, and public access. Access to the collection was enhanced through natural language processing, which was used to develop thesauruses and “find the natural linkages in the large diverse body of the archives to build relational hierarchies among documents.”² Electronic access allowed the materials to be used at any time from any place.

Not all parts of the system were developed in-house. CLARITECH Corporation, in addition to developing the interface, provided commercial software modules for indexing and retrieval of the HELIOS documents. They included the following:

CLARIT-NL for robust and efficient identification of noun phrases in arbitrary texts
CLARIT-LEX for semi-automatic domain lexicon development
CLARIT-THES for automatic domain thesaurus discovery
CLARIT-IR for vector-space document indexing and retrieval
CLARIT-EQ for automatic higher-order term relationship analysis
CLARIT-OCR for domain-targeted OCR post-processing and normalization³
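
The vector-space indexing and retrieval idea named for CLARIT-IR can be illustrated with a minimal sketch. This is not CLARITECH's implementation, which the source does not describe. It assumes plain word tokens in place of noun phrases, a standard TF-IDF weighting, and cosine similarity for ranking; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase words with punctuation stripped (a stand-in for noun-phrase extraction)."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def weight(tokens, idf):
    """Turn a token list into a sparse TF-IDF vector, ignoring terms unknown to the index."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(docs):
    """Return (idf, vectors) for a dict mapping document id to text."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {doc_id: weight(tokens, idf) for doc_id, tokens in tokenized.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, idf, vectors):
    """Return ids of documents with nonzero similarity to the query, best match first."""
    q = weight(tokenize(query), idf)
    scored = [(cosine(q, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]
```

Representing both documents and queries as weighted term vectors is what lets a free-text query be ranked against the whole collection instead of being matched as an exact keyword string.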

University Libraries staff were responsible for system design standards and data formats. These were developed to ensure portability and integration of HELIOS with the libraries’ existing information system. Experience with other university library information technology projects had shown this to be the most important factor for long-term support. While the encoded archival description (EAD) standard was in its early stages of development, both Galloway and Michalek were aware of its significance. To ensure incorporation of the EAD standard over the long term, sufficient data were initially included for mapping into the appropriate tags.
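
As an illustration of what mapping inventory data into EAD tags can look like, the sketch below converts a flat series record into an EAD-style component. The inventory field names and the sample record are hypothetical, since the source does not list the HELIOS data fields. The element names (c01, did, unittitle, unitdate, unitid, physdesc) are EAD elements, but the output is not validated against the EAD schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory field names (left) mapped to EAD element names (right).
FIELD_TO_EAD = {
    "title": "unittitle",
    "dates": "unitdate",
    "identifier": "unitid",
    "extent": "physdesc",
}


def series_to_ead(record):
    """Build a <c01 level="series"> component from a flat inventory record (a dict)."""
    component = ET.Element("c01", {"level": "series"})
    did = ET.SubElement(component, "did")
    for field_name, tag in FIELD_TO_EAD.items():
        value = record.get(field_name)
        if value:  # fields missing from the record are simply left out
            ET.SubElement(did, tag).text = value
    return component


# Made-up example record, for illustration only.
example = {"title": "Example Series", "identifier": "S1"}
print(ET.tostring(series_to_ead(example), encoding="unicode"))
```

Keeping the mapping in one table means that fields captured early can be re-exported later as the standard settles, without touching the records themselves.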

Interfaces developed for HELIOS provide access for scanning and verification as well as for public use. The scanning and verification interfaces incorporate much of the hierarchical framework identified for the collection during manual processing. These interfaces were developed to reflect inherent relationships among the documents in the folders.

A series map for the collection provides metadata for scanning. Scanning involves identifying the unique characteristics of each document, designating document type, assigning folder numbers, and making subgroup/series/subseries designations. Many of these elements were built into the interface as lists from which operators can select. This streamlines the process while also promoting quality control and consistency of input. Electronic folders hold documents, some of which are assigned to bundles, reflecting items stapled or clipped together. Throughout the scanning process, the scanned image appears within the interface to allow visual verification. Image scanning is done in real time, while optical character recognition (OCR) processing of the scanned text is done overnight, which extends the use of the system workstations beyond the working day.
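
The scanning metadata described above can be pictured as a small record type whose fields are restricted to pick lists. The following is a sketch under stated assumptions: the class name, field names, and vocabulary values are all hypothetical, because the source does not give the actual HELIOS series map or pick lists.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical controlled vocabularies standing in for the interface pick lists.
DOCUMENT_TYPES = ("memo", "speech", "correspondence", "report")
SERIES_MAP = {
    "Legislative": ("Bills", "Committee Files"),
    "Press": ("Releases",),
}


@dataclass
class ScannedDocument:
    doc_id: str
    doc_type: str
    series: str
    subseries: str
    folder_number: int
    bundle_id: Optional[str] = None  # set when items were stapled or clipped together

    def __post_init__(self):
        # Reject values outside the pick lists, mirroring selection from a fixed list.
        if self.doc_type not in DOCUMENT_TYPES:
            raise ValueError(f"unknown document type: {self.doc_type}")
        if self.series not in SERIES_MAP:
            raise ValueError(f"unknown series: {self.series}")
        if self.subseries not in SERIES_MAP[self.series]:
            raise ValueError(f"unknown subseries: {self.subseries}")


def bundles_in_folder(documents, folder_number):
    """Group the documents of one folder by bundle id; unbundled documents are skipped."""
    grouped = {}
    for doc in documents:
        if doc.folder_number == folder_number and doc.bundle_id is not None:
            grouped.setdefault(doc.bundle_id, []).append(doc.doc_id)
    return grouped
```

Validating against fixed lists at entry time is one way to get the consistency of input that free-text entry would not guarantee.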

Verification of both image quality and accuracy of the OCR conversion is done for primary materials, which are defined as memos, speeches, and correspondence that are at the heart of the congressional collection. Input areas in the interface allow for text correction, as well as the inclusion of notes regarding handwritten annotations or text not captured. The interface highlights indexed concepts to facilitate verification efforts. Document sequencing is also ensured during this process, with the interface allowing the resequencing of pages as needed.

Given the limitations of keyword searches against large text-based resources, the Laboratory for Computational Linguistics developed a “content-based document-processing ‘engine’”⁴ for the HELIOS system. Along with the CLARIT modules noted previously, the LCL developed methods to interpret natural language queries from the public interface, streamline analysis of OCR output, and automate topical division creation within the archive. This final feature significantly reduces the amount of manual work for the archivist. The use of the CLARIT modules fosters the relationship between noun phrases identified within the document texts and the natural language queries researchers are expected to use. The Laboratory for Computational Linguistics also developed sets of queries from materials extracted from the archives, as well as suggestions for alternative ways of depicting concepts.

Public Access

Introduction of a Web-based public interface at the beginning of 1998 allowed archivist Galloway to make broader assessments of collection use. Detailed analysis of search behavior is not yet supported beyond calculations of session length, specific actions during sessions (display, browse, new search, where, search), and session origin. Most surprising to Galloway is the high percentage of off-campus use, based on IP address analysis.
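
The usage measures mentioned above (session length, actions per session, and session origin by IP address) can be computed from a simple event log. The sketch below is illustrative only: the log format is assumed, and the campus network is a placeholder taken from a reserved documentation address range, not Carnegie Mellon's real address space.

```python
import ipaddress
from collections import Counter
from datetime import datetime

# Assumption: placeholder campus range (reserved for documentation), not a real network.
CAMPUS_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_on_campus(ip):
    """True if the address falls inside any configured campus network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CAMPUS_NETWORKS)


def summarize(events):
    """Summarize (session_id, ip, iso_timestamp, action) tuples into usage measures."""
    sessions = {}
    actions = Counter()
    for session_id, ip, stamp, action in events:
        when = datetime.fromisoformat(stamp)
        s = sessions.setdefault(session_id, {"ip": ip, "start": when, "end": when})
        s["start"] = min(s["start"], when)
        s["end"] = max(s["end"], when)
        actions[action] += 1
    if not sessions:
        return {"sessions": 0, "off_campus_share": 0.0,
                "mean_length_seconds": 0.0, "action_counts": {}}
    lengths = [(s["end"] - s["start"]).total_seconds() for s in sessions.values()]
    off_campus = sum(1 for s in sessions.values() if not is_on_campus(s["ip"]))
    return {
        "sessions": len(sessions),
        "off_campus_share": off_campus / len(sessions),
        "mean_length_seconds": sum(lengths) / len(lengths),
        "action_counts": dict(actions),
    }
```

Classifying origin by network membership only approximates where a user is: an on-campus researcher connecting through an outside provider would count as off-campus.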

RESULTS

The project demonstrated the importance of contractual agreements instead of handshakes. CLARITECH did software development under just such a gentleman’s agreement. When CLARITECH was sold in 1997, it stopped supporting the software used at the Heinz Archives. Fortunately, this development coincided with the library’s negotiations with a third-party vendor for a new library management system, and the vendor agreed to incorporate significant features from the archives interfaces into its archives module.

According to recent analysis by the University Libraries staff, 75 percent of online catalog and database queries are from users at remote sites. To address a corresponding need for remote reference assistance, a project is being framed to develop an automated help mechanism. Components from the Heinz Archives project, such as natural language processing, may be used in this project. In addition, the project will draw on expertise from other technology departments on campus, such as artificial intelligence and robotics.

The HELIOS project is a good example of how technology is being used to reconceptualize and re-engineer traditional library and archival processes for the benefit of users.

FOOTNOTES

¹Carnegie Mellon University Libraries Strategic Plan, Gloriana St. Clair, University Librarian, October 5, 1998, 3.

²The HELIOS Archive: A Proposal for the Preservation and Use of the Professional Papers of the Late Senator H. John Heinz III, David A. Evans, Michael L. Horowitz, and Charles B. Lowry, November 4, 1993, ii.

³Ibid., 11.

⁴Ibid., 8.