This section provides general guidelines that may be useful for an organization that wants to use knowledge organization systems to organize a digital library. The framework described is applicable for KOSs of any type or subject.
Planning Knowledge Organization Systems
Analyzing User Needs
Of primary importance to any digital library project is an analysis of its users’ needs, in terms of content and functionality. Many volumes have already been written about needs assessment, and providing detailed guidance on this subject is beyond the scope of this paper. However, when analyzing how a KOS might be used with a particular digital library, it is essential to thoroughly understand the environment of the user. One must look not only at the needs for organizing the digital library materials but also at possible links between content within and outside the digital library walls. This is particularly important for KOSs that are acting as intermediate authority files, because in such cases the links may not be readily apparent. It is important to consider other views that might be valuable for users and peripheral communities that might benefit from the digital library’s content were it accessible to them through a KOS.
Locating Knowledge Organization Systems
Once the user’s needs have been analyzed, it is necessary to locate KOSs that meet those needs. While an alternate system can be built locally, it is preferable to find an existing KOS for several reasons. First, it is costly and time-consuming to build a KOS. Second, KOSs often benefit by having been built over time. Many of the systems described in this report have been built over decades; some existed in paper before digitization. Third, the value of a KOS comes from its acceptance by the user community; sources built by noted authorities such as learned societies, trade associations, or standards groups will be viewed as more trustworthy than those built internally. Finally, the networked environment has resulted in both an explosion of primary materials, including documents, electronic journals, and Web-based databases, and in an equivalent explosion of KOSs on the Web.
There are several ways to identify KOSs that may be of interest. Many users are already aware of KOSs on the Web within their discipline. Developers may also turn to directories, librarians in the field, and reference sources, or they may perform a general search of the Internet.
Planning the Infrastructure
It is necessary to make decisions about the architecture of the KOS in the context of the digital library setting. The physical location of the KOS is important. Will the system be held externally or internally? There are pros and cons to either approach.
If the system is available on the Web, it is possible to consider linking to the KOS as an external system. This architecture requires a script or some search query to locate the resource. One must then launch a query against the resource to obtain the piece of information that will serve as the key between the two files. This key could be a uniform resource locator (URL) or input to another search query. A query may be necessary if the KOS is stored in a database. The script may transfer log-on information (including user ID and password) from the digital library system to the external KOS, in order to provide access to the Web-enabled database. In the case of a more direct link, the access may be by URL.
However, the use of a URL as the link has the same problem with persistence as does direct access via a URL from a browser. The organization may move the KOS, thereby changing the URL that is being used as the key. It is important to determine how often the URLs in the KOS change, whether there is a means of notification of these changes, and whether it is possible to consider an alternative that would be more persistent. Schemes such as the Digital Object Identifier and the Persistent URL have been devised to enable resources to be physically moved among servers without having their names changed. Another alternative is the use of other Uniform Resource Identifier (URI) schemes and the Uniform Resource Name (URN), which can be sent from the newer Web browsers. The benefit of linking to a remote resource is that the resource will always be up-to-date. The maintenance of the KOS is in the hands of the owner, not the digital librarian. It may also be more apparent to users that the KOS is not owned by the digital library.
Linking to a remote KOS also has disadvantages. Persistence and unexpected changes in the organization and content of the system may cause problems. The software or telecommunications route between the digital library server and the KOS may be unreliable. In systems requiring fast response time or large amounts of data transfer, and, therefore, high bandwidth (such as full-motion video or detailed graphics), the fact that a connection must be made between the digital library and the external KOS may make the system unacceptable to the user.
Alternatively, the KOS may be obtained from the owner and loaded locally. In many cases, this requires licensing that may not be required when the KOS is accessed remotely, because a copy of the whole resource is being provided to the digital library. Loading a KOS locally also requires that one consider issues such as maintenance, local system administration, and disk storage. If the KOS uses special software, such as a database management system, loading the KOS locally will require a copy of that software, which may require additional purchase or licensing. Other considerations are the need for firewalls and interface design. On the positive side, the KOS is under more local control. Therefore, it may be possible to improve the response time by not accessing the KOS over the Internet. If the KOS is to be used behind the scenes (that is, the system is not visible to the user), concerns of speed and integration become more important. If additional modifications (including digitization) need to be made to the KOS to integrate it with the digital library, it will also be necessary to load the KOS locally.
If the digital library intends to incorporate numerous secondary KOSs, it is important to consider the degree to which the architecture is scalable. The National Library of Medicine’s UMLS incorporates more than 40 different sources. While its main purpose has been to develop a metathesaurus for moving among these vocabularies, the management of the systems, regardless of the mapping issues, has been a major consideration. Ingest has been a major concern, with the need to develop a system that can handle a variety of input formats, from ASCII text files to highly structured database output. The architecture must also accommodate the character sets of the incoming sources. This is particularly important if a mark-up language has been used to represent special characters and diacritical marks. Systems that have been developed in Unicode, which extends ASCII to accommodate diacritical marks and non-Roman character sets, cannot be handled by systems that deal only with ASCII or extended ASCII sets.
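The character-set concern can be illustrated with a minimal sketch. It assumes an ingest step that must decide whether an incoming term fits an ASCII-only system. The function names `is_ascii_safe` and `fold_to_ascii` are hypothetical, and the folding shown is a lossy fallback (it drops diacritical marks), not a recommended general practice.

```python
import unicodedata


def is_ascii_safe(term):
    """Return True if the term can be stored by an ASCII-only system."""
    try:
        term.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def fold_to_ascii(term):
    """Lossy fallback: decompose characters and drop anything non-ASCII.

    Diacritical marks are removed, so distinct terms may collapse
    into the same string.
    """
    decomposed = unicodedata.normalize("NFKD", term)
    return decomposed.encode("ascii", "ignore").decode("ascii")


incoming = "M\u00fcller"              # "Müller", as a Unicode source supplies it
print(is_ascii_safe(incoming))        # the term does not fit an ASCII-only system
print(fold_to_ascii(incoming))        # folded form with the diacritic dropped
```

A check of this kind at ingest time at least makes the loss visible, so that terms which cannot be represented are flagged rather than silently corrupted.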
Since many digital library systems are being built as extensions or applications of existing integrated library systems (ILS), it is important to consider how the KOSs will integrate with the library system. Unfortunately, many ILS vendors have not considered links to external files or databases in their system designs. In some cases, the vendor may require that the information be stored in the proprietary format of the ILS. The system may require that the files be on the same directory or server as the accessing ILS. The fields that can be linked to the Web or searched may be limited. Outside communications may require Z39.50 client-server connections. With relatively closed systems, ILSs may be a difficult environment in which to implement alternative and nontraditional KOSs.
Digital libraries that are interested in using KOSs should consider this integration when developing requirements for the procurement of a system to support them. Vendors should be encouraged to support relatively open architectures and to consider the extension of traditional library systems to support broader digital library functionality.
In addition to these immediate concerns, it is important to consider the incorporation of future KOSs. Initial success may spur the desire for integration of additional KOSs or enhanced functionality for the existing KOS. Success may breed additional requirements and increase the strain on hardware, software, and network architectures.
Maintaining the Knowledge Organization System
For a digital library, an outdated KOS can be more of a hindrance than a benefit. Maintenance, both of content and of the system, should be considered when planning a KOS. This is particularly important if the digital library is to be self-supporting or revenue generating.
Version control of the KOS is extremely important. Reloading a new version from the system provider is one way to accommodate changes; however, this may not be acceptable if the locally held version differs substantially from that held by the system’s provider. If there has been significant transformation or processing of the original KOS, it may be difficult, or impossible, to reload the original and recreate the changes that have been made.
A transaction-based approach, whereby only changes are transferred between the KOS provider and the library, is also possible; however, this requires that the system provider have the infrastructure, both machine and human, to produce these transactions. It also requires that the changes to the original KOS be identifiable in order to create change transactions. For example, Stuart Nelson of the NLM’s UMLS Project recently reported that many systems can create annual transaction records to inform the UMLS about the changes that have occurred to the original system. However, the changes are often not indicated with enough detail to support automatic change transactions in the UMLS. If a change date, for example, is recorded only at the level of the concept record, it is impossible to tell whether the term has changed (a correction of a typographic error, for example) or whether the relationship between this concept and another concept has changed. Since the UMLS splits the incoming terminology and its relationships into a variety of files, it is often difficult to tell how the UMLS files must be changed based on the changes made during the maintenance of the original KOS (NISO 1999).
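A transaction-based update might look like the following minimal sketch. It assumes the provider records each change at the level of the individual term rather than as a change date on the concept record. The `Change` record, its operation names, and `apply_changes` are hypothetical and are not drawn from the UMLS or any other real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One change transaction from the KOS provider (hypothetical format)."""
    op: str            # "add_term", "rename_term", or "remove_term"
    concept_id: str
    old: str = ""
    new: str = ""


def apply_changes(kos, changes):
    """Return a copy of the local KOS with the transactions applied.

    `kos` maps a concept identifier to the set of terms for that concept.
    The input is left untouched so the previous version can be retained.
    """
    updated = {cid: set(terms) for cid, terms in kos.items()}
    for change in changes:
        terms = updated.setdefault(change.concept_id, set())
        if change.op == "add_term":
            terms.add(change.new)
        elif change.op == "rename_term":
            terms.discard(change.old)
            terms.add(change.new)
        elif change.op == "remove_term":
            terms.discard(change.old)
        else:
            raise ValueError(f"unknown operation: {change.op}")
    return updated


local = {"C1": {"pnuemonia"}}
fixes = [Change("rename_term", "C1", old="pnuemonia", new="pneumonia")]
print(apply_changes(local, fixes))
```

Because each transaction names the operation and the affected term, the update can be applied automatically. A bare change date on concept `C1` would not say whether a term was corrected or a relationship was altered.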
Presenting the Knowledge Organization System to the User
In addition to deciding which KOS should be used and what functions it should serve, the digital library will need to determine how to present the KOS to its users. A KOS may be exposed to the user or made relatively transparent.
The KOS can be exposed to the user in different ways. Material can be grouped into KOS-related themes or categories on the digital library’s Web site. The KOS may be used at a higher level to identify specific portals for different uses or users. If the content of the digital library includes metadata records, the KOS may be displayed as index terms on the records or in its entirety as a navigation aid to searching.
In other cases, the KOS may be transparent. For example, a thesaurus can be used behind the scenes to extend the user’s search to include synonyms, to connect the digital library’s resources to other information and resources, or to filter or rank the information obtained.
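As an illustration of the behind-the-scenes case, the following minimal sketch expands a user’s search term with synonyms before the query is run. The thesaurus content and the function name `expand_query` are hypothetical; a real thesaurus would normally be loaded from the KOS rather than written inline.

```python
# Hypothetical thesaurus: preferred term -> list of synonyms.
THESAURUS = {
    "heart attack": ["myocardial infarction", "cardiac infarction"],
}


def expand_query(term, thesaurus=THESAURUS):
    """Return the user's term followed by any synonyms found for it.

    The lookup works in both directions: entering a synonym also
    retrieves the preferred term and the other synonyms.
    """
    key = term.strip().lower()
    expanded = [key]
    for preferred, synonyms in thesaurus.items():
        group = [preferred] + synonyms
        if key in group:
            expanded.extend(t for t in group if t != key)
    return expanded


print(expand_query("Heart Attack"))
```

The user sees only the search box; the expanded list is what is actually submitted to the search engine.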
Implementing Knowledge Organization Systems
Acquisition and Intellectual Property Issues
It is critical to properly handle the acquisition of knowledge organization systems. The first question is whether the KOS is under copyright. If so, the copyright holder should be contacted concerning the KOS. It is important to ensure that the apparent contact is the official one. Many references have been reprinted or put on the Web without proper acknowledgment of the real owner.
Once the contact has been made, there are several points for discussion:
- If the provider maintains the KOS, how will the digital library find out about any changes that may be made in it? Is there a notification mechanism in place? How frequently must the information be updated to be of benefit to the digital library’s users? Will the maintenance be self-evident, or must the agreement include notification requirements? What will the owner do if the maintenance can no longer be performed?
- What will happen if the provider discontinues the product or sells or transfers it to someone else?
- What uses can the digital library make of the KOS under the proposed agreement? As with other licensing, it is advisable to aim for the broadest permissions and the longest term possible. At a minimum, the library should be able to renegotiate the terms of the agreement relatively easily.
- In a networked environment, it is beneficial to develop mechanisms for linking to online versions rather than to maintain a local copy of the resource. This ensures that what is presented is up-to-date, and acknowledges more clearly the ownership of the KOS. However, there are numerous factors to consider. Will the KOS be used on an intranet or behind a firewall, where access to the outside or information coming into the organization might be prohibited? Does the KOS service use “cookies” or require knowledge of the user’s Internet provider address? Does it require a user ID and password?
- If the KOS is to be accessed remotely, are there service issues? Is it likely to be accessed with bandwidth, modem, and computer speeds that are adequate for outside connections of this type? Is the use of such a critical nature that unreliable service on the part of the KOS or the Internet connection will cause the digital library itself to be viewed as less useful? Does the KOS require a specialized search engine or search query formulation? Can the digital library system properly display the results, or would the results be better displayed through the KOS system? Will the resulting information be used in its native form or must it be extracted or transformed? If the KOS is to be loaded locally, in what formats can the content be received?
- If the KOS is not available electronically, can it be digitized? Is the owner interested in a cooperative venture, and are the human and financial resources for such an effort available?
Making the Link
There are two parts to establishing the link between the digital library and the KOS. The first is locating the key anchor information in the digital library’s resource. The second involves the look up against the target file. The creation of this link may be more or less automatic, depending on the particular situation. The characterization of this activity is meant to be general and to allow both “on-the-fly” links and embedded links.
Regardless of what function the KOS is going to serve in the digital library, the essential information contained in the digital library resource from which the link is to be made must be identified. The mechanism for doing this depends on the type of object from which the link is being made and on the information that is expected to be identified in the digital library’s resource.
The first step is to review any metadata related to the digital library resource. Do the metadata carry the term (such as SIC code, artist’s name, place name, geographic coordinates) that is needed to make the link? If this information is included, the level at which the metadata are assigned should be reviewed. If the metadata indicate the subject matter of the specific resource in which the user will be interested, the metadata can be used to make the links. However, in some cases, the terms that appear in the title or description at the resource level (e.g., the book) may not be indicative of the subject at the individual item level (e.g., the chapter). Automatically making a link on the basis of the content description for an entire book may misrepresent the content of a chapter. Whether or not the metadata can be used will depend on the amount and type of information given in the metadata and the level at which the metadata are assigned.
If a text resource in the digital library provides no appropriate metadata, the procedure for identifying the key information may involve text analysis. A program to perform simple string searching or a search engine that can preserve hit locations can be used if the text string has distinguishing characteristics, such as a database acronym, or a specific structure, such as a latitude and longitude coordinate. If the text string has no such cues, text mining or more complex text-analysis tools may be necessary. These tools use a variety of semantic and syntactic algorithms to locate key information. There have been significant advances in commercially available text-mining tools, such as IBM’s Intelligent Agent, which includes specific algorithms for identification of names of places and persons.
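For text strings that do have a specific structure, simple pattern matching that preserves hit locations may be enough. The sketch below assumes coordinates written as decimal degrees with hemisphere letters (for example, "38.9 N, 77.0 W"); other notations would need other patterns. The names `COORD` and `find_coordinates` are hypothetical.

```python
import re

# Assumed notation: "<degrees> N|S, <degrees> E|W" in decimal degrees.
COORD = re.compile(
    r"(?P<lat>\d{1,2}(?:\.\d+)?)\s*(?P<lat_h>[NS])"
    r"\s*,?\s*"
    r"(?P<lon>\d{1,3}(?:\.\d+)?)\s*(?P<lon_h>[EW])"
)


def find_coordinates(text):
    """Return (start, end, latitude, longitude) for each coordinate found.

    The start and end offsets are kept so that a link can later be
    anchored at the exact position of the hit in the text.
    """
    hits = []
    for match in COORD.finditer(text):
        lat = float(match.group("lat"))
        lon = float(match.group("lon"))
        if match.group("lat_h") == "S":
            lat = -lat
        if match.group("lon_h") == "W":
            lon = -lon
        hits.append((match.start(), match.end(), lat, lon))
    return hits


sample = "The site lies at 38.9 N, 77.0 W near the river."
for start, end, lat, lon in find_coordinates(sample):
    print(sample[start:end], lat, lon)
```

The signed latitude and longitude can then serve as the key for a lookup against a gazetteer or other place-based KOS.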
The second step of the linking activity is to make the connection to the KOS. The methods for doing this vary, depending on whether the system is being loaded locally or is referenced remotely. If the system is loaded locally, it is possible to perform a significant amount of processing to match the two files, assuming that computer resources of this type are available to the digital library organization. If the system is only available remotely over the Web, the interaction will require knowledge of scripting and various Web-based access techniques. Scripting should be considered in both local and remote approaches, since the more integrated the linking is with the resource, the more maintenance may be required if there are changes in either the resource or the KOS. Regardless of the approach that is taken, making the link requires an analysis of both the information in the original digital library material and the corresponding information in the KOS.
If the KOS is being used as an intermediate file to bridge between the digital library’s resource and another resource, it is also important to understand the data and the process whereby the search is performed and information returned from the target resource. If the KOS must return a value to the original digital library resource, the data and process must be evaluated in a bidirectional sense.
Choosing the linking mechanism is equally important. The link may be fixed or “on-the-fly.” In the case of a fixed link, a specific URL is embedded at the link point in the digital library material. However, as stated before, problems of persistence are inherent in this approach. Alternatively, a URN can be used. The URN requires the creation of a namespace for the target file, and the search is to this namespace rather than to a specific URL. Persistent URLs (PURLs) and digital object identifiers (DOIs) can also solve this problem. These schemes are sufficient if the material is an HTML document.
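The idea common to URNs, PURLs, and DOIs is a level of indirection: the digital library stores a stable name, and a resolver maps that name to the current location. The following minimal sketch illustrates the principle only. The resolver table, the name `kos:example-thesaurus`, and the `example.org` addresses are all hypothetical, and real schemes rely on a resolver service rather than a local dictionary.

```python
# Hypothetical resolver table: persistent name -> current base URL.
RESOLVER = {
    "kos:example-thesaurus": "http://kos.example.org/thesaurus/v2/",
}


def resolve(persistent_name, resolver=RESOLVER):
    """Return the location currently registered for a persistent name."""
    try:
        return resolver[persistent_name]
    except KeyError:
        raise LookupError(f"no location registered for {persistent_name!r}")


def link_for(persistent_name, term_id, resolver=RESOLVER):
    """Build the link to one KOS entry at the time the link is followed."""
    return resolve(persistent_name, resolver) + term_id


print(link_for("kos:example-thesaurus", "T123"))

# If the owner moves the KOS, only the resolver entry changes;
# the name embedded in the digital library material stays the same.
RESOLVER["kos:example-thesaurus"] = "http://new-host.example.org/kos/"
print(link_for("kos:example-thesaurus", "T123"))
```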
Content in databases is more difficult to retrieve. The National Library of Medicine now supports the searching of a variety of its databases through its Internet Grateful Med (IGM) URL function. IGM users can create URLs that will actually perform searches against the databases. For example, the following script would perform a search for “pneumonia” in the HealthSTAR file: http://igm-02.nlm.nih.gov/cgi-bin/IGM_robot.pl?datafile=HealthSTAR&search=Subject=pneumonia.
Information on the syntax for creating such a URL is provided on the NLM Web site. While the intent is that the search URL will be bookmarked by an individual user, the same concept can be used for creating an active link at the anchor point for the link. With additional scripting, each occurrence of the term pneumonia in the text can be automatically replaced with an active link that passes the term to the search from the point where the link is made.
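The scripting described here can be sketched as follows. The base address and parameter layout are copied from the IGM example above and may no longer be live; the function names `search_url` and `link_term` are hypothetical, and the input is assumed to be plain text rather than existing HTML.

```python
import html
import re
from urllib.parse import quote

# Base address taken from the IGM example; it may no longer resolve.
BASE = "http://igm-02.nlm.nih.gov/cgi-bin/IGM_robot.pl"


def search_url(term, datafile="HealthSTAR"):
    """Build a search URL in the pattern of the IGM example."""
    return f"{BASE}?datafile={quote(datafile)}&search=Subject={quote(term)}"


def link_term(text, term):
    """Wrap each whole-word occurrence of `term` in an HTML anchor."""
    href = html.escape(search_url(term), quote=True)
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return pattern.sub(
        lambda m: f'<a href="{href}">{html.escape(m.group(0))}</a>',
        text,
    )


print(search_url("pneumonia"))
print(link_term("Cases of pneumonia rose.", "pneumonia"))
```

Building the link when the page is served, rather than embedding it in the stored text, means that a change in the search syntax requires a change only to the script.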
Summary
The framework for developing an infrastructure to support the use of KOSs in digital libraries requires an analysis of user needs, the identification and location of the appropriate KOSs, and the development of the hardware, software, and network architecture to support their integration and maintenance. The digital librarian must make decisions concerning the degree to which the KOSs will be presented to the user, acquisition and intellectual property issues, and maintenance and update procedures. There are several technical ways to make the link between the digital library and the KOS. As knowledge organization systems are increasingly available on the Web, requirements are beginning to be defined to improve the interoperability and general use of these resources through the development of knowledge organization services on the Web.