Carnegie Mellon University
A New Electronic Archives
http://www.library.cmu.edu
BACKGROUND
Carnegie Mellon University (CMU) is a private institution serving
an undergraduate and graduate student population of 7,500, with a
faculty, research, and administrative staff of 3,000, located in
the Oakland section of Pittsburgh. The University Libraries, consisting
of the Engineering and Science Library, the Hunt Library, and the
Mellon Institute Library, have a staff of 32 faculty members, 56
library employees, and 28 students working full-time, and an annual
budget of $6.3 million. The libraries' collections consist of about
800,000 volumes and 250,000 audiovisual materials. Reciprocal borrowing
agreements have been established with both the Carnegie Library and
the University of Pittsburgh's Hillman Library, also located in Oakland.
Motivated by the limited size of its print collection, the University
Libraries have made extensive efforts to provide a wide range of
electronic resources.
Institutional goals at Carnegie Mellon University emphasize the
following:
- delivering distinctive and first-quality education
- fostering research, creativity, and discovery
- using knowledge developed on campus to serve society
Carnegie Mellon University has had a strong technical focus since
its founding in 1900. Throughout its evolution, culminating in its
merger with the Mellon Institute in 1967, CMU has honored the words
of founder Andrew Carnegie: "My heart is in the work." Research
lies at the heart of this work, and Carnegie Mellon, a leader in
using computer technology in research and education, is recognized
as being among the top research institutions in the country. Outstanding
programs in computer science, robotics, and engineering have further
solidified Carnegie Mellon's technological reputation.
Because of CMU's traditional emphasis on technology, it seemed natural
for the University Libraries to assume a stronger position by providing
digital services. The libraries had already participated in the Mercury,
TULIP, and UMI Virtual Library projects. The subject of this case
study, a project to create the Senator H. John Heinz Archives (referred
to hereafter as "the Heinz Archives"), follows the tradition
of advancing the libraries' strategic plan, according to University
Librarian Gloriana St. Clair. The plan includes the following:
- technical expertise in digital library development issues
- demonstration digital library projects
- exemplary digital instructional programs
- automated reference assistance that students can use remotely
- intuitive, easy to use systems1
The Heinz project readily meets three of these five conditions.
THE PROJECT
In 1992, as part of a larger donation to Carnegie Mellon University,
Teresa Heinz, widow of the late Senator H. John Heinz III, gave papers
from her husband's congressional service to the University Archives.
The Heinz family donated the papers to Carnegie Mellon to encourage
exploration of primary source congressional archive documents by
a broad group of users. At the same time, they hoped the papers would
be a focal point for research by the faculty and students of the
H. John Heinz III School of Public Policy and Management. While a
traditional paper archives would provide the latter opportunity,
the broader group of users could best be reached by making the core
documents accessible electronically. According to Gabrielle Michalek,
university archivist and manager of the Heinz Archives project, Teresa
Heinz was also interested in the development of an innovative means
of access to her husband's papers. The libraries' staff was called
upon to draft a proposal, which received funding of just over one
million dollars from the Heinz foundations.
In November 1993, Michalek, along with Charles Lowry, then Carnegie
Mellon University librarian, and David Evans, director of the Laboratory
for Computational Linguistics, drafted a proposal to develop an electronic
archives of the most important of Heinz's papers. The archives was
designated HELIOS, the Heinz Electronic Library Interactive Online
System. In creating this electronic archives and related interfaces,
the planners drew on lessons from the libraries' previous information
technology projects.
The Process
Staff across the Carnegie Mellon campus collaborated to develop
a model for HELIOS. Continuity of staff was essential to the success
of the project. Changes in administration occurred in the offices
of both the university president and the librarian during the course
of the project, but the University Archives and University Libraries'
information technology staff worked to ensure sustained progress.
They were supported by university and library administrators.
The technological orientation of Carnegie Mellon has been essential
to this project's success. Equally important is University Librarian
St. Clair's enthusiasm for digital library resources and development.
Furthermore, the staff and administration of the University Libraries
have embraced the opportunities that technology can bring to their
work. The administration has encouraged staff members to look to
technology for solutions and alternatives. Collaboration is yet another
essential element. The University Libraries work closely with other
campus units to find supplemental grants for digital projects. In
keeping with their desire to find collaborative solutions to technical
problems, the libraries have also become a member of the Digital
Library Federation.
The collaborative nature of the development of the Heinz Archives
has become the standard for many other projects undertaken at Carnegie
Mellon. Three organizations participated in the development: the
University Libraries, the Laboratory for Computational Linguistics
(LCL), and CLARITECH Corporation, a company that developed and customized
software for this project. The latter two groups provided software
development expertise, while library staff provided system design
leadership and project management.
Heinz Archivist Edward Galloway manually processed 1,200 cartons
of materials to identify the structure and composition of the collection.
Carton contents included papers, audiovisual materials, and memorabilia.
The archivist used previously written guides for developing congressional
archive collections. After assessment and selection, the number of
cartons to be processed was reduced by half. During this review process,
series within the collection were identified, a logical arrangement
was developed, and an initial inventory was created. While manual
processing of archive materials took place, CLARITECH began work
on the graphical interfaces.
Graphical Interfaces
Important issues of preservation, system design, access, and place
and time independence were identified. System design focused on the
need for different interfaces for scanning, verification, and public
access. Access to the collection was enhanced through natural language
processing, which was used to develop thesauruses and "find
the natural linkages in the large diverse body of the archives to
build relational hierarchies among documents."2 Electronic
access allowed the materials to be used at any time from any place.
Not all parts of the system were developed in-house. CLARITECH Corporation,
in addition to developing the interface, provided commercial software
modules for indexing and retrieval of the HELIOS documents. They
included the following:
- CLARIT-NL for robust and efficient identification of noun phrases
in arbitrary texts
- CLARIT-LEX for semi-automatic domain lexicon development
- CLARIT-THES for automatic domain thesaurus discovery
- CLARIT-IR for vector-space document indexing and retrieval
- CLARIT-EQ for automatic higher-order term relationship analysis
- CLARIT-OCR for domain-targeted OCR post-processing and normalization3
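The idea behind vector-space indexing and retrieval, the job assigned
to CLARIT-IR in the list above, can be illustrated with a minimal sketch.
It is not the CLARIT implementation, whose internals are not described
here: it assumes plain single-word terms rather than noun phrases, a
common TF-IDF weighting, and cosine similarity for ranking. The function
names and the sample documents are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Return (vectors, idf): one TF-IDF weight dict per document, plus IDF values."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    # Smoothed IDF, so a term found in every document keeps a small positive weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, vectors, idf):
    """Return ids of documents with a nonzero score, best match first."""
    query_counts = Counter(t for t in tokenize(query) if t in idf)
    query_vec = {term: tf * idf[term] for term, tf in query_counts.items()}
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


# Invented sample documents, for illustration only.
DOCS = {
    "memo-1": "Memo on Medicare funding and health care for older Americans",
    "speech-1": "Speech on steel industry trade policy",
    "letter-1": "Constituent letter about acid rain and clean air legislation",
}
VECTORS, IDF = build_index(DOCS)
print(search("health care funding", VECTORS, IDF))  # ['memo-1']
```

A phrase-based system would replace the tokenizer with a noun-phrase
extractor, but the indexing and ranking steps would keep the same shape.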
University Libraries staff were responsible for system design standards
and data formats. These were developed to ensure portability and
integration of HELIOS with the libraries' existing information system.
Experience with other university library information technology projects
had shown this to be the most important factor for long-term support.
While the encoded archival description (EAD) standard was in its
early stages of development, both Galloway and Michalek were aware
of its significance. To ensure incorporation of the EAD standard
over the long term, sufficient data were initially included for mapping
into the appropriate tags.
Interfaces developed for HELIOS provide access for scanning and
verification as well as for public use. The scanning and verification
interfaces incorporate much of the hierarchical framework identified
for the collection during manual processing. These interfaces were
developed to reflect inherent relationships among the documents in
the folders.
A series map for the collection provides metadata for the scanning process.
Scanning involves identifying the unique characteristics for each
document, designating document type, assigning folder numbers, and
making subgroup/series/subseries designations. Many of these elements
were built into the interface as lists from which operators can select.
This streamlines the process while also promoting quality control
and consistency of input. Electronic folders hold documents, some
of which are designated to bundles, reflecting items stapled or clipped
together. Throughout the scanning process, the scanned image appears
within the interface to allow visual verification. Image scanning
is done in real time, while optical character recognition (OCR) processing
of the scanned text is done overnight, extending the usage of system
workstations.
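The scanning metadata described above can be pictured as a small record
whose fields are checked against fixed lists, much as the interface
limits operators to list selections. The following is a minimal sketch
under stated assumptions: the class name, field names, and list values
are hypothetical, since the actual HELIOS data format and series map are
not reproduced in this study.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical pick lists; the real series map and document types are not given here.
DOCUMENT_TYPES = ("memo", "speech", "correspondence", "report")
SERIES = ("Legislative", "Press", "Constituent Services")


@dataclass
class ScannedDocument:
    doc_id: str
    document_type: str
    series: str
    folder_number: int
    bundle_id: Optional[str] = None  # set when items were stapled or clipped together
    pages: List[str] = field(default_factory=list)  # page image names, in sequence

    def __post_init__(self):
        # Reject values outside the lists, mirroring selection from a fixed list.
        if self.document_type not in DOCUMENT_TYPES:
            raise ValueError(f"unknown document type: {self.document_type!r}")
        if self.series not in SERIES:
            raise ValueError(f"unknown series: {self.series!r}")
        if self.folder_number < 1:
            raise ValueError("folder number must be positive")


def group_by_bundle(documents):
    """Map each bundle id to the ids of its documents, skipping unbundled items."""
    bundles = {}
    for doc in documents:
        if doc.bundle_id is not None:
            bundles.setdefault(doc.bundle_id, []).append(doc.doc_id)
    return bundles


# Invented sample records, for illustration only.
SAMPLE = [
    ScannedDocument("d1", "memo", "Legislative", 12, bundle_id="b1"),
    ScannedDocument("d2", "speech", "Press", 12, bundle_id="b1"),
    ScannedDocument("d3", "memo", "Press", 13),
]
print(group_by_bundle(SAMPLE))  # {'b1': ['d1', 'd2']}
```

Validating at entry time is what gives list-driven input its quality
control benefit: an out-of-list value is refused before it reaches the
archive.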
Verification of both image quality and accuracy of the OCR conversion
is done for primary materials, which are defined as memos, speeches,
and correspondence that are at the heart of the congressional collection.
Input areas in the interface allow for text correction, as well as
the inclusion of notes regarding handwritten annotations or text
not captured. The interface highlights indexed concepts to facilitate
verification efforts. Document sequencing is also ensured during
this process, with the interface allowing the resequencing of pages
as needed.
Given the limitations of keyword searches against large text-based
resources, the Laboratory for Computational Linguistics developed
a "content-based document-processing 'engine'"4 for
the HELIOS system. Along with the CLARIT modules noted previously,
the LCL developed methods to interpret natural language queries from
the public interface, streamline analysis of OCR output, and automate
topical division creation within the archive. This final feature
significantly reduces the amount of manual work for the archivist.
The CLARIT modules help relate noun phrases identified within the document
texts to the natural language queries researchers are expected to use.
The Laboratory for Computational
Linguistics also developed sets of queries from materials extracted
from the archives, as well as suggestions for alternative ways of
depicting concepts.
Public Access
Introduction of a Web-based public interface at the beginning of
1998 allowed archivist Galloway to make broader assessments of collection
use. Detailed analysis of search behavior is not yet supported beyond
calculations of session length, specific actions during sessions
(display, browse, new search, where, search), and session origin.
Most surprising to Galloway is the high percentage of off-campus
use, based on IP address analysis.
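The kind of session summary described above (session length, actions
per session, and origin by IP address) can be sketched as follows. This
is an illustration only, not the archives' actual reporting code: the
log layout, the function names, and the sample events are assumptions,
and the campus network shown is a reserved documentation address range
rather than Carnegie Mellon's real address block.

```python
import ipaddress
from collections import Counter
from datetime import datetime

# Hypothetical campus network (a reserved documentation range), not a real CMU block.
CAMPUS_NETWORK = ipaddress.ip_network("192.0.2.0/24")


def summarize_sessions(events):
    """Summarize events given as (session_id, ip, iso_timestamp, action) tuples."""
    sessions = {}
    for session_id, ip, timestamp, action in events:
        when = datetime.fromisoformat(timestamp)
        entry = sessions.setdefault(
            session_id, {"ip": ip, "start": when, "end": when, "actions": Counter()}
        )
        entry["start"] = min(entry["start"], when)
        entry["end"] = max(entry["end"], when)
        entry["actions"][action] += 1
    summary = {}
    for session_id, entry in sessions.items():
        summary[session_id] = {
            "seconds": (entry["end"] - entry["start"]).total_seconds(),
            "actions": dict(entry["actions"]),
            "off_campus": ipaddress.ip_address(entry["ip"]) not in CAMPUS_NETWORK,
        }
    return summary


def off_campus_share(summary):
    """Fraction of sessions that originated outside the campus network."""
    if not summary:
        return 0.0
    return sum(s["off_campus"] for s in summary.values()) / len(summary)


# Invented sample log entries, for illustration only.
EVENTS = [
    ("s1", "192.0.2.10", "1998-02-01T10:00:00", "search"),
    ("s1", "192.0.2.10", "1998-02-01T10:02:30", "display"),
    ("s2", "198.51.100.7", "1998-02-01T11:00:00", "search"),
    ("s2", "198.51.100.7", "1998-02-01T11:00:45", "browse"),
    ("s2", "198.51.100.7", "1998-02-01T11:01:00", "new search"),
]
SUMMARY = summarize_sessions(EVENTS)
print(off_campus_share(SUMMARY))  # 0.5
```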
RESULTS
The project demonstrated the importance of contractual agreements
instead of handshakes. CLARITECH did software development under just
such a gentleman's agreement. When CLARITECH was sold in 1997, it
stopped supporting the software used at the Heinz Archives. Fortunately,
this development coincided with the library's negotiations with a
third-party vendor for a new library management system, and the vendor
agreed to incorporate significant features from the archives interfaces
into its archives module.
According to recent analysis by the University Libraries staff,
75 percent of online catalog and database queries are from users
at remote sites. To address a corresponding need for remote reference
assistance, a project is being framed to develop an automated help
mechanism. Components from the Heinz Archives project, such as natural
language processing, may be used in this project. In addition, the
project will draw on expertise from other technology departments
on campus, such as artificial intelligence and robotics.
The HELIOS project is a good example of how technology is being
used to reconceptualize and re-engineer traditional library and archival
processes for the benefit of users.
1 Carnegie Mellon University Libraries Strategic Plan,
Gloriana St. Clair, University Librarian, October 5, 1998, 3.
2 The HELIOS Archive: A Proposal for the Preservation
and Use of the Professional Papers of the Late Senator H. John
Heinz III, David A. Evans, Michael L. Horowitz, and Charles
B. Lowry, November 4, 1993, ii.
3 Ibid., 11.
4 Ibid., 8.