2. Linking Digital Library Resources to Related Resources
This section emphasizes the ability of knowledge organization systems
to link digital library resources to other related resources. The
basis for this linking is the identification of information within
a digital resource that can be extracted and used to search and locate
information within a KOS. The KOS may then be used to expand codes
to more explanatory full text, to provide more descriptive records,
or to link entity names to physical specimens.
Expanding Codes to Full Text
Practitioners of a discipline use coding schemes to facilitate communication
within that discipline. It is often helpful to connect these codes
to the full names for which they stand. The examples
provided here include links between databank registration codes and
the biological sequence data, and between industrial codes and the
full name that the code represents.
Linking Sequence Numbers to Biosequence Databanks
The lengthy biochemical and genetic sequences that molecular biologists,
biotechnologists, and geneticists identify each day are kept in databanks.
Several databanks have been developed, for example, to cover protein
sequences, nucleotides, and cell lines. One of the largest databanks
contains information on the mapping of the human genome. As molecular
biologists began to discover these sequences, they reported them
in scientific journals. Difficulties in composing, proofreading,
and printing the text soon arose. Through an ad hoc standards process,
major biomedical publishers agreed to require the inclusion of codes
or databank numbers for these sequences in articles when they are
published. In addition, the sequence itself must be registered in
a databank before the paper can be published.
Some of the most frequently referenced databanks are listed on the
Web site of the National Center for Biotechnology Information. They
include GenBank and the Research Collaboratory for Structural Bioinformatics
Protein Data Bank. Each sequence number is different, but all begin
with a persistent code identifying the databank.
How can the link be made between the literature and the databank?
Through a search profile, a text analysis program, or keyword indexing,
the text can be analyzed and the sequence databank numbers identified.
An active link can then be embedded in the text. The active link consists of a search
strategy (possibly written as a CGI script) to locate that sequence
number in the databank where the actual sequence is stored. When
the user clicks on the active link, the script is generated and launched
from the user's browser. The Web-enabled database is searched, and
the sequence record is returned to the user. Depending on the services
provided by the databank site, the user can analyze the sequence
using a number of tools provided by the databank or download the
sequence for local manipulation.
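The identification and linking steps described above can be sketched as follows. This is a minimal illustration, not the method used by any particular publisher or databank. It assumes that accession numbers follow a cue phrase such as "GenBank" or "accession number" and take the form of one letter plus five digits or two letters plus six digits. The search address `databank.example.org` and the `term` parameter are hypothetical; a real databank publishes its own query syntax.

```python
import re
from urllib.parse import urlencode

# Hypothetical search endpoint; substitute the databank's documented query URL.
SEARCH_URL = "https://databank.example.org/search"

# Assumed cue phrases that introduce an accession number in running text.
CUE = r"(?:GenBank|accession)(?:\s+accession)?(?:\s+(?:no\.|number))?\s+"
# Assumed accession formats: two letters + six digits, or one letter + five digits.
ACCESSION = re.compile(CUE + r"([A-Z]{2}\d{6}|[A-Z]\d{5})\b")


def link_accessions(text):
    """Wrap each recognized accession number in an HTML link to the databank search."""

    def repl(match):
        accession = match.group(1)
        url = SEARCH_URL + "?" + urlencode({"term": accession})
        # Keep the cue phrase as plain text and link only the number itself.
        cue = match.group(0)[: match.start(1) - match.start(0)]
        return f'{cue}<a href="{url}">{accession}</a>'

    return ACCESSION.sub(repl, text)


print(link_accessions("The sequence was deposited under accession number U12345."))
```

Requiring a cue phrase reduces false matches on other letter-and-digit strings, at the cost of missing accession numbers that appear without one.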
This type of connection exists between the National Library of Medicine's
(NLM) search service, PubMed, and GenBank at the National Center
for Biotechnology Information. If a search in PubMed yields records
that have GenBank numbers, the user can automatically search and
display the sequence records from GenBank.
Linking Individual Industrial Codes to the Full Scheme
In business, classification schemes serve to communicate important
facts about a company or product. These codes are generally controlled
by a government, professional, trade, or international standards
organization. They often serve as shorthand for users interested
in material in a particular area of industry or a specific business
sector.
Perhaps the most familiar scheme is the Standard Industrial Classification
(SIC), which was last updated in 1987. SIC codes have been used by the U.S. government,
economists, financial markets, regulators, and procurement offices
to identify manufacturing, agriculture, and service sectors of the
economy. In 1997, a new scheme was approved for use within the United
States. The North American Industry Classification System (NAICS) was developed
with Canada and Mexico as a means of providing an agreed-upon scheme
for the collection, reporting, and analysis of information about
the economy by sector, both within and across borders. Information
about NAICS is available from the Web site of the U.S. Census Bureau
(see references for address).
The digital library can provide related information by linking to the
authority files for the coding schemes.
If a company or economic sector mentioned in the digital library's
collection can be linked to an SIC or NAICS code, the code can be
searched against the official tables of definitions maintained by
the U.S. Census Bureau. These files provide definitions of the codes
and place each code in the classification scheme with other economic
sectors.
The digital library's content can be further enhanced by making
a link between the SIC and NAICS codes. If a digital library resource
carries an SIC code, the code can be extracted and searched against the Census
Bureau's 1997 NAICS and 1987 SIC Correspondence Tables. The
tables return the corresponding code or codes from the alternate scheme.
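A correspondence lookup of this kind might be sketched as shown below. The sketch assumes the correspondence table has been saved locally as comma-separated text with columns named `sic` and `naics`; that layout is an assumption, not the Census Bureau's published format. The sample rows are placeholders for illustration only and have not been checked against the official tables.

```python
import csv
import io
from collections import defaultdict


def load_correspondence(csv_text):
    """Build a SIC -> [NAICS, ...] mapping from CSV text with 'sic' and 'naics' columns."""
    table = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # One SIC code may correspond to several NAICS codes, so collect a list.
        table[row["sic"].strip()].append(row["naics"].strip())
    return dict(table)


def naics_for(sic_code, table):
    """Return every NAICS code recorded for an SIC code, or an empty list."""
    return table.get(sic_code, [])


# Placeholder rows; load the official correspondence tables in practice.
SAMPLE = """sic,naics
2711,511110
7372,511210
7372,334611
"""

table = load_correspondence(SAMPLE)
print(naics_for("7372", table))
```

Returning a list rather than a single value reflects the fact that the two schemes do not always map one to one.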
Linking to Descriptive Records
Linking the name of an entity, such as a personal name, organization,
or location, to additional information about that entity was one
of the first uses of hyperlinking. Knowledge organization systems
such as dictionaries, glossaries, and classification schemes can
be used to link the entities in one resource to richer descriptions
of those entities in another resource. This is particularly helpful
for users who are new to a topic and in cases where the additional
information can make the user's task more efficient.
The examples that follow are from three disciplines. The first example
links organism names to records that not only describe the species
more fully but also put it in the context of the overall classification
scheme for living organisms. The second example links chemical names
to descriptive records and molecular structures. In the third example,
personal names are linked to biographies of the people named.
Linking Organism Names to Taxonomic Records
Genus-species names are the Latin names for organisms (e.g., plants,
animals, and microorganisms). Taxonomists, who study and classify
living organisms, create records for each of these organisms. Generally,
these records are linked relationally to the other organisms in a
hierarchy. Beyond the organism name and the information that it and
its placement in the hierarchy convey, taxonomic records use other
elements to describe the organism. These may include distribution
patterns, the authority for naming and classification, and the date
the organism was identified. Scientists base the information on specimens
that are retained because they serve as the physical evidence of
the description. Natural history museums, private collections, and
individual scientists number, or code, the specimens in their collections.
Sometimes specimens are supported by photographs or line drawings,
which may be digitized.
By using a taxonomic authority file as an intermediate authority
file, one can link a text or an image file containing a name or picture
of an organism to additional related information. By automatically
processing the text or embedding a link from the organism name in
the text or from the image to the taxonomic authority record, one
can extend the knowledge conveyed by the text. The text can include
the descriptive and historical information in the taxonomic record
and, ultimately, link to a photograph, a drawing, or appropriate
video or audio segments.
Because of the ambiguity in organism names, many links of this
type are now created manually. However, depending on the extent of
the files involved, the ambiguity of the Latin and common names for
organisms can be overcome. An example of a taxonomic intermediate
file is the Integrated Taxonomic Information System (ITIS). ITIS
is a partnership of U.S., Canadian, and Mexican government agencies,
private organizations, and taxonomic specialists cooperating to develop
an online, scientifically credible list of biological names of North
American plants and animals. It is used by many U.S. government agencies
for consistent naming of plants and animals for regulatory and monitoring
purposes. To link textual material in a digital library to the ITIS
record, the organism name can be identified manually or automatically
in the text and submitted as a query to the ITIS database. When a
match is found, ITIS presents the ITIS record, which provides essential
information about the organism. The information includes synonymous
names, including some common names, and an indication of the placement
of the organism in the larger taxonomic classification scheme.
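The automatic identification step can be sketched as a two-stage process: find candidate genus-species names by their typographic shape, then keep only the candidates that match the authority file. In the sketch below, the `AUTHORITY` dictionary is a small local stand-in for a taxonomic authority file; it does not reproduce the ITIS record structure or query interface, and its fields are assumptions chosen for illustration.

```python
import re

# Local stand-in for a taxonomic authority file (field names are assumptions).
AUTHORITY = {
    "Ursus arctos": {"common_names": ["brown bear", "grizzly bear"], "family": "Ursidae"},
    "Quercus alba": {"common_names": ["white oak"], "family": "Fagaceae"},
}

# Candidate binomial: a capitalized word followed by an all-lowercase word.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{2,})\b")


def match_organisms(text, authority=AUTHORITY):
    """Return authority records for candidate names that appear in the text."""
    found = {}
    for candidate in BINOMIAL.findall(text):
        record = authority.get(candidate)
        # Most candidates are ordinary phrases ("The range"); only confirmed
        # names survive the authority check.
        if record is not None:
            found[candidate] = record
    return found


sample = "The range of Ursus arctos overlaps stands of Quercus alba in places."
for name, record in match_organisms(sample).items():
    print(name, record["family"], record["common_names"])
```

The pattern deliberately overgenerates; the authority file, not the pattern, decides what counts as an organism name.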
Linking Chemical Names to Molecular Structures
The unique identification for a chemical substance is not its name
but its molecular structure. However, chemical names are commonly
used in research documents, project plans, catalogs, and directories,
all of which may be resources in a digital library. There are competing
systems of nomenclature (those of the Chemical Abstracts Service
[CAS] and the International Union of Pure and Applied Chemistry)
as well as common and commercial synonyms.
The ambiguity is resolved by providing links between the chemical
names in the text and the molecular structure. This is done through
a chemical registry number or code that is connected to a particular
chemical name (using certain nomenclature standards) and an authority
record that provides additional information about the chemical. This
information includes the chemical's synonyms and some of its chemical
and physical properties. Most important in today's research environment
is the link from this authority file to a chemical structure file.
Structure files, used with the appropriate software, graphically
depict the molecular structure. This sophisticated software allows
for three-dimensional visualization, rotation, and substitution of
the chemical bonds.
An example of the use of the chemical registry number to link chemical
names with molecular structures can be seen in the work of BIOSIS,
the world's largest not-for-profit producer of biological and biomedical
databases. In 1993, BIOSIS began processing its bibliographic citations
(titles and keywords) to automatically identify chemical names (Hodge,
Nelson, and Vleduts-Stokolov 1989). BIOSIS assigns CAS Registry Numbers
(RNs) to the chemical names identified in this process. In the STN
International online system, hosted in the United States by CAS,
a user of BIOSIS can select one or more of the records resulting
from a search and extract the RN. The extracted RN can be applied
against the CAS Registry File, which contains more than 21 million
substances, including organics, inorganics, biosequences, metals,
and alloys. The registry file record for the chemical name, including
the link to the synonyms for the chemical name and the structure
file itself, can then be accessed. With special tools developed by
CAS, the structure can be viewed and manipulated. It can be imported
into modeling tools that allow the chemist to manipulate the structure
and thereby envision new chemicals. Alternatively, the user can start
with any database that contains CAS RNs and extract the resulting
RNs to perform a search for complementary bibliographic records in
the BIOSIS database.
Linking chemical names to structures using RNs on a large scale
is neither inexpensive nor easy. There are two approaches to identifying
chemical names in text. Some journal articles include the CAS RN
for the major chemicals discussed. In this case, an analysis of the
text for the terms "RN," "CAS RN," and variations
preceding numerics can identify RNs that can be used as a link. Alternatively,
a program to identify chemical names in text, similar to that developed
by BIOSIS, could be devised. Developing the identification program,
as well as searching chemical databases, is costly; however, if the
digital library has license agreements for chemistry databases, this
type of linkage may be possible. In addition, many organizations
have small chemical files of their own that may include RNs and other
information of particular relevance to the organization's research.
It may be possible to link to these local databases using methods
that are more direct.
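The first approach, scanning text for RNs that follow a cue such as "RN" or "CAS RN," can be sketched as follows. The cue phrases in the pattern are assumptions about how articles introduce the number. The check-digit test is an addition not described above: the last digit of a CAS RN is a checksum over the preceding digits, which lets a program discard many mistyped or coincidental matches.

```python
import re

# Assumed cue phrases: "RN", "CAS RN", "Registry No.", "CAS Registry Number".
RN_PATTERN = re.compile(
    r"\b(?:CAS\s+)?(?:RN|Registry\s+(?:No\.|Number))[:\s]*(\d{2,7}-\d{2}-\d)\b"
)


def is_valid_rn(rn):
    """Verify the check digit of a CAS Registry Number such as '7732-18-5'."""
    digits = rn.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    # Weight the digits 1, 2, 3, ... from right to left, excluding the check digit.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def find_rns(text):
    """Return cue-introduced registry numbers that pass the check-digit test."""
    return [rn for rn in RN_PATTERN.findall(text) if is_valid_rn(rn)]


sample = "Water (CAS RN 7732-18-5) and ethanol (RN: 64-17-5); mistyped RN 64-17-6."
print(find_rns(sample))
```

Each number returned can then serve as the search key against whatever registry or local chemical file the digital library is licensed to use.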
Linking Personal Names to Biographical Information
A common type of authority file is the personal name authority,
which controls variants of personal names. For example, the Library
of Congress Name Authority File (LCNAF) is used to control variant
personal names for authors, editors, artists, and others. The Union
List of Artist Names (ULAN), developed by the Getty Vocabulary Program,
is another example. Name authorities serve as tools for catalogers
and indexers. They ensure that the proper form of the name, rather
than an unapproved variant, is used and bring together all works
by or about the person.
A name authority file can also be used to link a bibliographic record
or document containing the person's name to a variety of other related
materials. If the digital library's resource has a standardized form
of the name, it can be identified and searched against the authority
file to locate variants. The standardized and variant forms can be
joined in a search against a variety of other resources that can
provide related information.
For example, in the case of a digital library of images of artists'
works or biographical or critical text, a name authority file such
as the ULAN or the LCNAF can act as an intermediate file to provide
additional information. The file, which contains integrated variant
names, can be searched by the name appearing in the digital library
collection. When the record is found, the information about the artist
can be displayed, providing a wide range of contextual material for
the user. Citations to significant biographical or critical works
about the artist, some of which may also be available on the Web,
may also be provided in the name authority file.
The variant names from a name authority can also be used to locate
and provide automatic links from the personal name in the text to
a biography, without requiring that the name be presented in the
same fashion in the two resources. One resource that could serve as
a link target for biographical information is Gale's Biography Resource,
which contains more than 142,000 biographies and related citations
from more than 1,000 periodicals.
However, to produce this kind of link, there must be a mechanism
for locating personal names in text. Several programs can do this
type of text analysis; among those that have been developed commercially
are NameFinder from the Carnegie Group and the Intelligent Agent
from IBM. In addition, variant names can be extracted from the name
authority itself, grouped, and run as a search against the text to
locate name occurrences.
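The last technique, grouping the variant names from an authority file and running them as a search against the text, might look like the sketch below. The `NAME_AUTHORITY` entry is an invented sample and is not copied from the ULAN or the LCNAF; the function names are likewise hypothetical.

```python
import re

# Invented sample entry: preferred form -> variant forms.
NAME_AUTHORITY = {
    "Gogh, Vincent van": ["Vincent van Gogh", "Vincent Willem van Gogh", "van Gogh"],
}


def build_matcher(authority):
    """Compile one pattern covering every name form and map each form to its preferred name."""
    lookup = {}
    for preferred, variants in authority.items():
        for form in [preferred, *variants]:
            lookup[form.lower()] = preferred
    # Try longer forms first so "Vincent van Gogh" is not matched as just "van Gogh".
    forms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(form) for form in forms) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def find_names(text, authority=NAME_AUTHORITY):
    """Return (form found in text, preferred form) pairs in order of appearance."""
    pattern, lookup = build_matcher(authority)
    return [(m.group(0), lookup[m.group(0).lower()]) for m in pattern.finditer(text)]


print(find_names("Letters of Vincent van Gogh show that Van Gogh admired Millet."))
```

Because every occurrence resolves to the preferred form, the link to a biography does not depend on how the name happens to be written in the text.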
Linking Entity Names to Physical Specimens
In some cases, it is possible to go another step and connect entity
names in the digital library resources to physical specimens. The
curation of physical specimens or artifacts is critical to the advancement
of many disciplines. Exhibition catalogs describe the art objects
in a particular exhibition. Museum catalogs provide inventories of
the art, natural history, or cultural objects held by a particular
museum. These catalogs, increasingly available as computerized databases,
are knowledge organization systems that not only provide descriptive
records but also point to the location of the object in a museum,
an archive, or another collection.
For example, in biology, a physical specimen is particularly important
when it is the result of the discovery and description of a new organism
or of the reclassification of a known organism. A type specimen is
the example collected from the field by a taxonomist to serve as
the prime example for the description of the organism and the validation
of its taxonomic classification and naming. These specimens are held
by natural history collections, and their deposit is required by
the rules of various taxonomic societies.
As part of the curatorial activity, the collections assign identification
codes. While the primary use of identification codes has been to
organize the physical collections, numerous projects are under way
in the natural history community to digitize photographs of specimens
and create database records for the specimens, including their identifiers,
and thereby make them more readily accessible. The degree of digitization
varies from specialty to specialty. For example, in botany, virtually
all significant research herbaria are digitally cataloging their
type collections instead of maintaining paper records. Many are also
making digital photographs of the type specimens available over the
Web.
The publication of identification codes in the journal literature
is also changing. Historically, identification codes have been presented
in the "Materials Used" sections of journal articles. The
level of specificity of the identification code has varied, depending
on the biological discipline. For example, botanical journals tend
to list only the institution and the catalog, while vertebrate journals
provide the code to the specimen level. The current trend is to require
more detailed lists of specimens. As the lists become longer
and the printing costs increase, journal publishers are beginning
to request links to independent Web sites maintained by the researchers
or their organizations that carry all the specimens used in the study
and provide some level of identification.
If the digital library collection contains resources that include
the identification codes, these codes can be extracted and matched
against the Web-based catalogs or databases. This link can provide
users with location and contact information to allow them to access
the physical object mentioned in the digital library resource.
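As a rough sketch of this extraction and matching, suppose that identification codes consist of an institution acronym followed by a catalog number. That format is an assumption; actual conventions vary by discipline and institution. The `CATALOG` dictionary below stands in for a Web-based catalog, and its acronym, institution, and contact details are hypothetical.

```python
import re

# Hypothetical catalog keyed by (institution acronym, catalog number).
CATALOG = {
    ("EXMH", "10234"): {
        "institution": "Example Museum of Natural History",
        "location": "Type collection, cabinet 12",
        "contact": "collections@museum.example.org",
    },
}

# Assumed code format: 2-6 capital letters, whitespace, 3-7 digits.
SPECIMEN_CODE = re.compile(r"\b([A-Z]{2,6})\s+(\d{3,7})\b")


def resolve_specimens(text, catalog=CATALOG):
    """Return catalog records for identification codes found in the text."""
    resolved = []
    for acronym, number in SPECIMEN_CODE.findall(text):
        record = catalog.get((acronym, number))
        # Codes with no catalog entry are skipped rather than guessed at.
        if record is not None:
            resolved.append({"code": f"{acronym} {number}", **record})
    return resolved


for hit in resolve_specimens("Material examined: EXMH 10234 (holotype), EXMH 99999."):
    print(hit["code"], "|", hit["location"], "|", hit["contact"])
```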
Curators or registrars of artistic, archaeological, and cultural
history collections also assign inventory or accession numbers to
items in their collections. Identification numbers may also be found
in scholarly catalogues raisonnés. Links similar to those
described for natural history can be made between text related to
works of art and the physical work in a particular collection. An
article about a work of art can be linked to additional information
about the physical specimen by linking the identification number
in the text with an online catalog containing the number and additional
information about the work.
As museums digitize their collections to establish a presence on
the Web or to reduce the handling of the physical objects, KOSs that
can link the digital library resources to the physical object are
being developed. If there is a museum with a collection that complements
that of the digital library, it is worthwhile to discuss ways in
which the digital library and digital museum collections may "co-evolve."
Summary
Digital libraries can use KOSs to link digital resources to other
digital resources or, indirectly, to physical objects. A simple example
is the expansion of codes and acronyms. Descriptive records may also
be provided either directly from the KOS or indirectly by using the
KOS to capture a search key that can be used to access another resource.
This concept may be taken a step further by using a KOS, such as
a museum or exhibition catalog, to provide information about the
location of the physical object.