Someone recently compared the Web with a large room filled with books that were scattered all over the floor. The Web is the world’s largest mass of bits and bytes. It is a meeting place that brings together disparate communities. The “Internet Commons,” as this meeting place has been called, requires connections between and among disparate communities in order for an “economy” to develop (Weibel 1999). This economy will provide the framework within which both commercial and noncommercial transactions can occur. KOSs are one means of connecting these disparate communities. Knowledge organization systems can be used to (1) provide alternate subject access, (2) add modes of understanding to digital library resources, (3) support multilingual access, and (4) supply terms for expansion of free-text searches in domains that are relatively unknown to the user.
Providing Alternate Subject Access
Alternate subject access refers to the provision of one or more additional subject orientations that make the resources of the digital library accessible to different audiences. This approach is particularly valuable when the digital library resources appeal to groups that do not share a common terminology. The alternate scheme can be a system of subject headings, a classification scheme, or any other subject-oriented system. Alternate subject access can be provided by
- indexing or classifying the resources using multiple schemes,
- retaining original schemes from organizations that contribute to the digital library, or
- mapping between the primary scheme and an alternate scheme.
Indexing the Material with Multiple Schemes
The most direct method for providing alternate subject access to a collection is by classifying or indexing the resources with multiple schemes, but it may also be the most costly. This approach requires redundant cataloging or catalogers who are knowledgeable in both schemes. It may also require modifications to the cataloging tools and procedures. However, if the cataloging is at a high level (resources versus individual documents), or if the schemes are not difficult or detailed, it may be a reasonable approach.
Retaining Alternate Indexing from Contributors
If the digital library is being built through contributions from a variety of sources, the originating organization may have applied an alternate scheme that could be used. For example, the NASA database on aeronautics and astronautics receives relevant bibliographic records from other U.S. agencies, such as the Department of Defense and the Department of Energy. The controlled vocabulary terms assigned by the contributing organization are processed through a machine-aided indexing process to create candidate indexing terms from the NASA Thesaurus for review by NASA’s indexers. However, the final records contain both the NASA Thesaurus terms and the controlled vocabulary terms from the contributing organization, with the alternate indexing terms retained in a separate data element in the bibliographic record. The terms collected from other organizations can be viewed as an alternate access point, so that at least part of the collection is accessible through another discipline’s terminology.
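The record layout described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not NASA's actual process: the `BibRecord` class, the `CANDIDATE_MAP` table, and every term in it are hypothetical. The sketch shows contributor terms kept in their own field, candidate primary terms suggested for review, and retrieval through either vocabulary.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical lookup from a contributor's controlled terms to candidate terms
# in the primary thesaurus. All entries are invented for illustration.
CANDIDATE_MAP = {
    "rocket engines": ["Rocket Engines"],
    "heat shields": ["Heat Shielding", "Thermal Protection"],
}


@dataclass
class BibRecord:
    title: str
    # Terms from the primary thesaurus, assigned after indexer review.
    primary_terms: List[str] = field(default_factory=list)
    # Terms from the contributing organization, retained as received.
    contributor_terms: List[str] = field(default_factory=list)


def suggest_candidates(contributor_terms):
    """Return de-duplicated candidate primary terms for an indexer to review."""
    candidates = []
    for term in contributor_terms:
        for candidate in CANDIDATE_MAP.get(term.lower(), []):
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


def matches(record, term):
    """A record is retrievable through either vocabulary."""
    wanted = term.lower()
    all_terms = record.primary_terms + record.contributor_terms
    return any(t.lower() == wanted for t in all_terms)


record = BibRecord(title="Ablative materials test report",
                   contributor_terms=["Heat shields"])
candidates = suggest_candidates(record.contributor_terms)
# Stand-in for the human indexer, who here accepts only the first candidate.
record.primary_terms = candidates[:1]
```

Keeping the contributor terms in a separate field, rather than merging them into the primary field, preserves the distinction between reviewed and unreviewed indexing while still letting both serve as access points.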
Mapping Multiple Schemes
The third method for providing alternate subject access is the most indirect: mapping between the primary scheme and one or more alternate schemes. Several examples of this approach can be found among A&I services. Both BIOSIS, the world’s largest private sector A&I service in the life sciences, and the NLM apply MeSH to BIOSIS documents. The records that BIOSIS contributes to NLM’s TOXLINE database are processed automatically to have appropriate MeSH terms added. This processing is based on a mapping of the natural language terms that occur in the toxicology literature and BIOSIS’ normalized natural language keyword indexing with the MeSH terminology. In the new BIOSIS relational indexing structure, BIOSIS builds and maintains authority files that connect natural language disease names to the MeSH-controlled disease terms. When the BIOSIS indexer assigns the free text keyword for the disease name, the appropriate MeSH term is also added to the record as an alternate access point (BIOSIS 1999). The assignment is based on the development over time of a mapping between the terminology used by BIOSIS and the MeSH-controlled terms.
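The authority-file lookup described in this paragraph might be sketched as follows. This is an illustration under stated assumptions, not the BIOSIS implementation: the `AUTHORITY_FILE` table, the `normalize` and `index_record` functions, and the term pairs are all hypothetical and are not copied from any real authority file.

```python
# Hypothetical authority file: normalized free-text disease names mapped to a
# controlled heading in the alternate scheme. The pairs are illustrative only.
AUTHORITY_FILE = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarct": "Myocardial Infarction",
    "high blood pressure": "Hypertension",
}


def normalize(keyword):
    """Lowercase the keyword and collapse runs of whitespace."""
    return " ".join(keyword.lower().split())


def index_record(free_text_keywords):
    """Keep the indexer's keywords and add controlled terms as alternate access points."""
    alternate_terms = []
    for keyword in free_text_keywords:
        heading = AUTHORITY_FILE.get(normalize(keyword))
        if heading is not None and heading not in alternate_terms:
            alternate_terms.append(heading)
    return {"keywords": list(free_text_keywords), "alternate_terms": alternate_terms}


indexed = index_record(["Heart  Attack", "myocardial infarct", "aspirin"])
```

Keywords with no entry in the authority file, such as "aspirin" here, simply receive no alternate term; the mapping grows over time as new names are added to the file.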
In addition to providing alternate access points to BIOSIS products, the inclusion of the MeSH terms makes it possible to perform cross database searching on the indexing field with MEDLINE and other databases that include MeSH terms. From 1999 forward, users can search BIOSIS databases using MeSH disease terms. The disease terms can be extracted from the MeSH authority file or from a MEDLINE record and then used in a search against the BIOSIS files, or vice versa. This helps users find relevant records that are unique to either BIOSIS or MEDLINE. The inclusion of terms from an alternate KOS, such as MeSH, therefore supports the use of BIOSIS by medical librarians and practitioners who are familiar with MeSH terminology.
A more extensive example of mapping variant schemes is the metathesaurus developed by the NLM’s Unified Medical Language System (UMLS). This system has linked more than 40 separate KOSs from various medical specialties. They range from MeSH to coding and classification schemes used by insurance companies and physicians to describe treatments and diseases on patient records. The UMLS is licensed by many other organizations for inclusion in applications that can bridge various health care communities.
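One way to picture linking many schemes at once is a hub of concept identifiers, each recording the label or code that every source scheme uses for that concept. The sketch below assumes that design for illustration; it does not reproduce the UMLS data model, and the scheme names, identifiers, and codes are invented.

```python
# Hypothetical concept table: one identifier per concept, with the label or code
# that each source scheme uses for it. All identifiers and codes are invented.
CONCEPTS = {
    "C001": {
        "thesaurus_a": "Myocardial Infarction",
        "billing_codes": "X10.1",
        "consumer_terms": "heart attack",
    },
    "C002": {
        "thesaurus_a": "Hypertension",
        "consumer_terms": "high blood pressure",
    },
}


def find_concept(scheme, label):
    """Return the concept identifier for a label in the given scheme, or None."""
    wanted = label.lower()
    for concept_id, labels in CONCEPTS.items():
        if labels.get(scheme, "").lower() == wanted:
            return concept_id
    return None


def translate(label, source, target):
    """Translate a label between schemes by way of the shared concept identifier."""
    concept_id = find_concept(source, label)
    if concept_id is None:
        return None
    return CONCEPTS[concept_id].get(target)
```

With a hub, each scheme is mapped once to the concept identifiers rather than pairwise to every other scheme, and `translate` returns `None` when the target scheme has no label for the concept.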
How can digital libraries use alternate indexing? While many digital libraries do not have the A&I resources of large database producers such as NLM and BIOSIS, the concept of applying alternate indexing can be scaled to fit. While the systems described deal with item-level bibliographic records, alternate indexing can be applied at several levels. Alternate subject access can be applied only at the resource level, for the database, electronic book, electronic journal, or image collection, so that other communities can identify resources of interest that must then be searched or browsed individually. This concept is conducive to use with portals that provide access to the same resources with different views for different audiences. Alternatively, if the digital library has bibliographic records or metadata records at a very detailed level, it may be possible to develop switching programs that will translate concepts from the original organization of the digital library or resource to that of the alternate scheme.
Adding New Modes of Understanding to the Digital Library
People perceive the world through many modes, including textual and graphical. Some people comprehend information more easily in one mode than another. Most people benefit from a variety of modes that reinforce one another or that can be used when appropriate to the context. Many digital library projects remain text-based; however, this text-only dimension is changing as digital libraries become oriented more to multimedia and as other modes of information presentation become viable on the Internet.
KOSs can be used to bring new dimensions to an information resource or a collection in a digital library. In the digital library environment, these dimensions can be viewed as layers that can be added on top of one or more objects. Various tools and services can be developed that are geared to a particular mode. For example, the results of a text search can be presented in graphical or visual form, based on the number of occurrences of a term or concept or on the occurrences of documents from a particular country, journal title, or author.
A more complex dimension that can be added is the geospatial dimension, which emphasizes access by place. A “geolibrary” is defined as a digital library consisting of “geoinformation,” or material that can be accessed by place (National Research Council 1999). This so-called georeferencing can be either direct (by a geospatial footprint, a series of latitudes and longitudes for the location) or indirect (by a textual place name). Georeferencing of textual objects is facilitated by a gazetteer, which brings together the place name and the spatial footprint for its location (see note 2). Many gazetteers also include feature types for each footprint. The vocabulary used for the feature types varies among gazetteers, but may include terms such as “airport,” “harbor,” and “railroad station.”
Although many organizations, including federal and state agencies, are currently required to provide geospatial referencing as part of the National Spatial Data Infrastructure Program, the geospatial referencing is not readily available for older works. How can the data sets of today be integrated with the textual information of yesterday? The answer is by adding geospatial referencing to the text resource. Geospatial referencing requires that the text name for a place have an associated spatial footprint. This can be achieved by using a georeferenced, digital gazetteer that provides geospatial footprints for place names.
Through this type of knowledge organization system, place names in a library catalog or bibliographic database can have footprints assigned (Blair 1999; Tahirkheli 1999). If one or more of the library’s resources have latitude and longitude coordinates in the catalog record or in the full text but no place name, the coordinates can be extracted and submitted to the gazetteer service. The service will return the place name for the footprint. Alternatively, the resource may have a textual place name. This place name can be extracted and searched against the gazetteer, and the footprint can be provided to a mapping application. The latter search may result in more than one footprint, since place names may be ambiguous. Therefore, it is important that the user interface be designed to allow the user to distinguish the locations. Once the footprint has been determined, a user can access the text resource through a geographic mapping tool. Alternatively, a user of the text resource can find a set of results and have the place names displayed as footprints on a map.
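The two lookups described above, from coordinates to place name and from place name to footprint, can be sketched with a small in-memory gazetteer. Everything here is an assumption made for illustration: the `GazetteerEntry` class, the use of a bounding box as the footprint, and the coordinates, which are rough placeholders and not authoritative values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GazetteerEntry:
    name: str
    feature_type: str
    # Footprint simplified to a bounding box in decimal degrees.
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Placeholder entries; the boxes are approximate and for illustration only.
GAZETTEER = [
    GazetteerEntry("Springfield, Illinois", "populated place", 39.7, -89.8, 39.9, -89.5),
    GazetteerEntry("Springfield, Massachusetts", "populated place", 42.0, -72.7, 42.2, -72.4),
]


def footprints_for(place_name: str) -> List[GazetteerEntry]:
    """Return every entry whose full name or short name matches; may be more than one."""
    wanted = place_name.lower()
    return [
        entry for entry in GAZETTEER
        if wanted in (entry.name.lower(), entry.name.lower().split(",")[0])
    ]


def name_for(lat: float, lon: float) -> Optional[str]:
    """Return the name of the first entry whose footprint contains the point."""
    for entry in GAZETTEER:
        if entry.contains(lat, lon):
            return entry.name
    return None
```

A search for "Springfield" returns two entries, which is the ambiguity the interface has to let the user resolve, for example by showing the qualified names or both boxes on a map.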
In disciplines such as ecology, environmental science, and even public health and epidemiology, it would be beneficial to build a digital library with access to such a digital gazetteer service. Users could then access the system through the text mode or the geographic mode, depending on their comfort level and the type of information needed. Presenting the results on a map allows users to make new associations and analyze the results more easily. Through a geospatial KOS, they can see connections between disparate data, because the data are presented in an alternate mode.
Providing Multilingual Access
A third way that KOSs can support the use of digital libraries by disparate communities is to provide multilingual access. A variety of sources, including multilingual dictionaries and multilingual thesauri, can support this type of access.
One of the most extensive multilingual thesaurus efforts is the General Multilingual Environmental Thesaurus (GEMET), a European Environment Agency (EEA) thesaurus produced by Italy’s research council, the Consiglio Nazionale delle Ricerche (CNR). The GEMET is available in 12 languages, and plans for a global environmental thesaurus in many more languages were recently announced. GEMET is available by agreement with the EEA.
The European Topic Centre on Catalogue of Data Sources in Germany is developing a system that will link data sources and metadata information in a virtual library. GEMET will be used to convert a search in one language into searches for the same concepts in other languages. Users will retrieve documents not only in their native language but also in other languages. This will allow data systems from throughout the EEA and beyond to be accessed as a virtual library collection with both controlled vocabulary and free-text term searching in multiple languages.
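Converting a search in one language into searches for the same concept in other languages can be sketched with a concept table that holds one label per language. This is a minimal illustration and not the EEA system: the `THESAURUS` table, the concept identifiers, and the `expand_query` function are hypothetical, and the labels are ordinary translations rather than entries taken from GEMET.

```python
# Hypothetical multilingual thesaurus: concept identifier -> label per language.
# The labels are ordinary translations, not entries copied from GEMET.
THESAURUS = {
    "c1": {
        "en": "water pollution",
        "de": "Wasserverschmutzung",
        "fr": "pollution de l'eau",
        "it": "inquinamento idrico",
    },
    "c2": {"en": "waste", "de": "Abfall", "fr": "déchets", "it": "rifiuti"},
}


def expand_query(term, lang):
    """Return the term's label in every available language, keyed by language code.

    If the term is not in the thesaurus, the query falls back to the
    original term in the original language only.
    """
    wanted = term.lower()
    for labels in THESAURUS.values():
        if labels.get(lang, "").lower() == wanted:
            return dict(labels)
    return {lang: term}


queries = expand_query("Water Pollution", "en")
```

Each value in `queries` can then be submitted to the data sources in the corresponding language, so that one search retrieves documents across languages.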
Expanding Free-Text Search Terms
Free-text searching is the main method of searching on the Web. Only a small percentage of Web resources have metadata, and an even smaller percentage have controlled vocabulary assigned. However, variations in natural language make free-text searching problematic. Even a knowledgeable user may not know all the terminology (synonyms or related terms) that can be used in the literature to express a concept. The problem is exacerbated when the user is unfamiliar with the topic or is interested in an interdisciplinary area. How can the user expand his or her search to overcome these terminology differences? One possibility is to use KOSs as aids to the selection of free-text keywords.
The Getty Vocabulary Project emphasizes support for searching as a significant application of its vocabularies. Harpring (1999) reports that the vocabularies are increasingly being used in search engines to look for different terms that refer to the same concept. The Getty vocabularies (the Art and Architecture Thesaurus, the Union List of Artist Names, and the Thesaurus of Geographic Names) are particularly rich in equivalence relationships. “When these equivalence relationships are exploited in search engines, there are typically two possible scenarios: the user may be allowed to first query the vocabulary database, locating appropriate terms, and then applying those chosen terms in a query across target databases; or there may be little or no user interaction with the vocabulary, when the vocabularies are used behind the scenes [to expand the search] . . . ” (Harpring 1999). Getty developed a prototype called a.k.a. to experiment with the use of equivalence terms to broaden or narrow searches across databases on the Web.
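The two scenarios in the quotation can be sketched as two small functions over a table of equivalent names. This is an illustration under stated assumptions, not the a.k.a. prototype: the `EQUIVALENCE_GROUPS` table, the function names, and the quoted OR query syntax are hypothetical, and the name variants are well-known examples rather than data drawn from the Getty vocabularies.

```python
# Hypothetical equivalence groups: each list holds names for the same entity.
# Well-known variants chosen for illustration, not taken from a Getty file.
EQUIVALENCE_GROUPS = [
    ["El Greco", "Domenikos Theotokopoulos"],
    ["Titian", "Tiziano Vecellio"],
]


def equivalents(term):
    """Scenario 1: show the user the equivalent terms so they can choose among them."""
    wanted = term.lower()
    for group in EQUIVALENCE_GROUPS:
        if wanted in (name.lower() for name in group):
            return list(group)
    return [term]


def build_query(term):
    """Scenario 2: expand the query behind the scenes with every equivalent term."""
    return " OR ".join(f'"{name}"' for name in equivalents(term))


expanded = build_query("el greco")
```

Because only equivalence relationships are followed, the expansion widens recall without changing the concept being searched; following related-term links instead would need a separate, more cautious step.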
In addition to expanding routine search queries, KOSs can be used in Web mining tools. Northern Light has developed a Web mining tool that reportedly returns a high proportion of relevant hits. The KOS that supports the Northern Light site was built by ingesting large existing vocabularies and thesauri. The result was then organized under an extensive classification scheme developed by Northern Light. The terms can be used to extend a user’s search or to distinguish between multiple meanings of the terms supplied by the user. The results of a search are organized into “folders” based on the classification scheme. These high-level categories, represented by the folders, help distinguish multiple meanings of the same term. For example, an ambiguous word such as “pitcher” might result in two folders being presented to the user. One folder would be titled “Sports” (as in baseball pitcher), the second “Decorative Arts” (as in water pitcher). The user who chooses only the Sports folder will be presented with only those Web resources that use “pitcher” in the baseball sense. The user who selects the folder called “Decorative Arts” will be presented only with those resources that are related to water pitchers.
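The folder presentation in the "pitcher" example amounts to grouping hits by a high-level category. The sketch below assumes each hit already carries a category assigned from a classification scheme; the hit records, titles, and the `group_into_folders` function are invented for illustration and say nothing about how Northern Light assigns categories.

```python
from collections import defaultdict

# Hypothetical hits for the query "pitcher". Each hit is assumed to have been
# classified under a high-level category before the search was run.
HITS = [
    {"title": "Left-handed pitcher signs contract", "category": "Sports"},
    {"title": "Glazed earthenware water pitcher", "category": "Decorative Arts"},
    {"title": "Relief pitcher statistics", "category": "Sports"},
]


def group_into_folders(hits):
    """Group hit titles under their category so each folder holds one sense of the term."""
    folders = defaultdict(list)
    for hit in hits:
        folders[hit["category"]].append(hit["title"])
    return dict(folders)


folders = group_into_folders(HITS)
```

Opening `folders["Sports"]` shows only the baseball sense and `folders["Decorative Arts"]` only the vessel sense, which is the disambiguation the classification scheme provides.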
KOSs can be very powerful in supporting free-text searching within digital libraries and in integrating Web resources into existing digital libraries. However, these systems must be used with caution. KOSs have generally been developed for a specific discipline, task, or function, or for the indexing of a specific collection or database. Therefore, depending on the domain in which the KOS is being used and the complexity of the system, it may or may not suggest relevant free-text terms. Expanding a search with related terms, rather than pure synonyms, may return hits that are only peripherally relevant to the user.
One of the benefits of the Internet, the Web, and digital libraries is the degree to which resources can be made available to broader audiences. The technology facilitates the connection of disparate knowledge communities at the network level. However, discovery of the resources and true accessibility require that the content and its organization be understood by these disparate communities. By providing alternate subject access, adding modes of understanding, supporting multilingual access, and supplying terms for expanding free-text searching, KOSs can facilitate discovery and understanding by disparate communities, and allow these communities to interact in new ways.
2. A recent National Science Foundation-sponsored workshop, “Digital Gazetteer Information Exchange,” addressed the issues of digital gazetteers. One of the critical issues is that there is no standard for the interchange of information, either to provide gazetteer information physically to another gazetteer or to interoperate with one or more distributed gazetteers through the Internet. The workshop participants emphasized the need for such protocols and for enhancements to current gazetteers. (Many gazetteers do not include coordinates or are incomplete in this regard.) The goal is to develop a digital gazetteer service that can be accessed by any application.
Such a service is central to the vision of a geolibrary. A report on distributed geolibraries from the National Research Council (1999) envisions the geolibrary as a physical globe. One would walk into such a geolibrary and be confronted not by a card catalog or an OPAC terminal but by a large physical globe. The user would indicate his or her area of interest by pointing to a place on the globe. The librarian would use the geospatial location information to retrieve and present materials related to that place. By comparing feature types, the user could ask for other place names and locations that were similar to the original.
Significant work on digital gazetteer services and geospatial libraries has been conducted by the Alexandria Digital Library (ADL) Project at the University of California at Santa Barbara, with support from the National Science Foundation’s Digital Library Initiative-1 (Hill and Zheng 1999). An ADL Gazetteer was created by merging place name authority files from the National Imagery and Mapping Agency and the U.S. Board on Geographic Names of the U.S. Geological Survey. The project also added controlled feature types to the gazetteer. With the aid of a visualization tool, the information can be provided on a map and accessed using other geographic visualization tools.