5. The Future of Knowledge Organization Systems on the Web • CLIR

As online databases moved to the Web, their providers began to offer their products, including vocabulary aids, in this environment. Portable document format (PDF) versions of printed vocabulary aids are common, since PDF can be easily produced from a PostScript file and retains the look of the printed product. With Adobe’s tools for indexing and searching, the PDF file can provide some level of support for linking. Many of these aids, however, remain in the form of HTML files only; there is no database structure to support linking and searching easily. In some cases, the full structure of the KOS is not made available on the Web; the only format for a Web-based thesaurus may be an alphabetical list of terms that does not enable the user to navigate the hierarchical structure easily. As new ways of using these resources are developed, it is hoped that more KOS providers will be encouraged to provide their systems in formats that are conducive to such networked uses.

Some of the requirements for such electronic KOSs were identified at a NISO-sponsored workshop entitled “Electronic Thesauri: Planning for a Standard” (NISO 1999). While the focus of this meeting was digital thesauri, consideration was also given to other KOSs in digital form. The identified requirements include persistent identification at the concept level, a simple protocol for sending distributed queries to a KOS and receiving its responses, and a standard set of metadata attributes for describing a remote KOS.

To facilitate the search and display of information from a previously unknown KOS, the system must have unique and persistent identifiers for each of the concepts in the system. For example, the California Environmental Resources Evaluation System (of the California Natural Resources Agency) and the U.S. Geological Survey have developed a system for remote querying and response (CERES 1999). It requires that each concept in the thesaurus have a unique identifier. In the case of the previously described ITIS, which is accessed remotely by the CERES system, the ITIS record number is used as the identifier. Other unique identifiers could include the DOI, or a classification notation that has been made unique by appending the scheme name or the URL to the notation.
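The last option, qualifying a classification notation with its scheme name or URL, can be sketched in a few lines of Python. This is a minimal illustration only: the function name `make_concept_id`, the `scheme:notation` separator, and the `example.org` URL are assumptions made for the example, not conventions defined by CERES, ITIS, or any standard.

```python
from urllib.parse import quote


def make_concept_id(notation: str, scheme: str) -> str:
    """Return a notation qualified by its scheme name or scheme URL.

    Hypothetical helper: the separator and URL layout are illustrative.
    """
    notation = notation.strip()
    scheme = scheme.strip()
    if not notation or not scheme:
        raise ValueError("both a notation and a scheme are required")
    if scheme.startswith(("http://", "https://")):
        # URL form: append the percent-encoded notation as a path segment.
        return scheme.rstrip("/") + "/" + quote(notation, safe="")
    # Name form: prefix the notation with the scheme name.
    return f"{scheme}:{notation}"


print(make_concept_id("025.4", "DDC"))
print(make_concept_id("QA 76", "https://example.org/kos/lcc/"))
```

The same notation can occur in more than one scheme, so only the qualified form is safe to exchange between systems. Persistence is a separate matter: it depends on the provider never reassigning a notation once it has been published.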

The second requirement is a protocol for the distributed querying and response of KOSs. This is particularly critical for highly structured systems such as thesauri, semantic networks, and ontologies. Work has been done in this area within the Z39.50 community. (Z39.50 is the NISO standard for searching distributed bibliographic databases.) A profile has been proposed by the Zthes Working Group to tailor the Z39.50 protocol to operate on thesauri that follow the Z39.19 standard.

A similar effort is under way at the CERES Project. Instead of a Z39.50-based protocol, CERES has developed a structure that is based on the Resource Description Framework (RDF) and HTTP, the protocol used by standard browsers. RDF’s concept of containers is a natural fit for managing the hierarchical structure of complex systems such as thesauri. The structure proposed by CERES is likely to be encoded using XML, a mark-up format that lends itself to structured information. This protocol for linking distributed vocabularies will support both searching and cataloging. The user will be presented with remote vocabularies that can be displayed and navigated by a local client.
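To show why a nested mark-up format suits a thesaurus hierarchy, the following Python sketch serializes a small hierarchy as nested XML elements and then navigates it the way a local client might. It is an assumption-laden illustration: the `concept` element, the `label` attribute, and the sample terms are invented for the example, and the output is plain nested XML, not RDF container syntax and not the structure proposed by CERES.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def to_xml(term: str, narrower: Dict[str, dict]) -> ET.Element:
    """Build one element per concept, nesting narrower terms inside it."""
    node = ET.Element("concept", {"label": term})
    for child, grandchildren in narrower.items():
        node.append(to_xml(child, grandchildren))
    return node


def path_to(node: ET.Element, label: str, trail: tuple = ()) -> Optional[List[str]]:
    """Return the chain of broader terms leading to `label`, or None."""
    trail = trail + (node.get("label"),)
    if node.get("label") == label:
        return list(trail)
    for child in node:
        found = path_to(child, label, trail)
        if found is not None:
            return found
    return None


# Invented sample hierarchy: each key maps to its narrower terms.
hierarchy = {"Rivers": {}, "Lakes": {"Reservoirs": {}}}
root = to_xml("Water bodies", hierarchy)

print(ET.tostring(root, encoding="unicode"))
print(path_to(root, "Reservoirs"))
```

Because the broader and narrower relationships are carried by the nesting itself, a client that has never seen the vocabulary can still display it as a tree and walk from any term back to the top.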

The third major finding from the NISO workshop was the need for a metadata content standard for the description of KOSs. Such a standard is key to the provision of knowledge organization services over the Internet. The metadata identify the Web resource as a KOS and provide the information an application needs to use it remotely without prior knowledge of its content or structure.

A draft set of attributes for describing KOSs available in a networked environment has been developed by a task group of the Network Knowledge Organization Systems (NKOS) Working Group, an ad hoc group of terminology experts from organizations that are interested in issues related to the use and interoperability of KOSs over the Internet. The draft attributes are based on work originally done by Linda Hill (Alexandria Digital Library at the University of California at Santa Barbara) and Michael Raugh (Interconnect Technologies).

The attributes describe the KOS so that content from the system can be transferred over the Internet and handled by a remote browser or client application. The attributes include the depth of hierarchy, the types of relationships included, the subject (described by free text or by a declared classification scheme), storage format, and copyright and rights management. To facilitate the transfer of information, the attribute set also includes information on character set and file size. To facilitate the acquisition and licensing of the KOSs, the draft content description includes point-of-contact information.
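A description of this kind can be pictured as a small record that a client inspects before it requests any content. The Python sketch below is illustrative only: the class `KosDescription`, its field names, the helper `can_load`, and the sample values are assumptions made for the example and do not reproduce the NKOS draft attribute set.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional, Set


@dataclass
class KosDescription:
    """Hypothetical record; field names are illustrative, not the NKOS draft."""
    title: str
    hierarchy_depth: int
    relationship_types: List[str]
    subject: str
    storage_format: str
    character_set: str
    rights: str
    contact: str
    file_size_bytes: Optional[int] = None


def can_load(desc: KosDescription, formats: Set[str], max_bytes: int) -> bool:
    """Decide from the description alone whether a client can take the KOS."""
    if desc.storage_format.lower() not in formats:
        return False
    # An unknown size is treated as too risky to transfer.
    return desc.file_size_bytes is not None and desc.file_size_bytes <= max_bytes


# Invented sample values for a fictional thesaurus.
example = KosDescription(
    title="Example Thesaurus",
    hierarchy_depth=4,
    relationship_types=["BT", "NT", "RT", "USE"],
    subject="Environmental resources",
    storage_format="XML",
    character_set="UTF-8",
    rights="Contact the provider for licensing terms",
    contact="kos-admin@example.org",
    file_size_bytes=250_000,
)

print(asdict(example)["relationship_types"])
print(can_load(example, formats={"xml"}, max_bytes=1_000_000))
```

The client consults only the description, never the vocabulary itself, which is what allows it to work with a KOS it has not encountered before.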

During discussions about the metadata content standard, workshop attendees identified three methods for storing the metadata for a KOS. First, the metadata could be stored with the KOS, as metadata elements for that resource. Second, the metadata could be stored in a physically separate knowledge organization registry. The third possibility is a hybrid approach, where a minimal set of metadata elements is contained in a central registry (i.e., sufficient information to identify the resource, where it is located, and how more information can be obtained). The more detailed information would be stored with the KOS itself.
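The hybrid approach can be sketched as a two-step lookup: the registry supplies identity and location, and the fuller description is retrieved from the KOS itself. In the Python sketch below, the registry layout, the key names (`id`, `location`, `metadata_url`), the `example.org` addresses, and the function `describe_kos` are all hypothetical. The retrieval step is passed in as a function so that the example runs without network access.

```python
from typing import Callable, Dict

# Minimal registry entry: enough to identify the KOS, locate it,
# and say where the detailed description can be obtained.
registry: Dict[str, Dict[str, str]] = {
    "example-thesaurus": {
        "id": "example-thesaurus",
        "location": "https://example.org/kos/thesaurus",
        "metadata_url": "https://example.org/kos/thesaurus/about",
    }
}


def describe_kos(
    kos_id: str,
    registry: Dict[str, Dict[str, str]],
    fetch_detail: Callable[[str], dict],
) -> dict:
    """Merge the registry entry with the detail stored alongside the KOS."""
    entry = registry[kos_id]  # raises KeyError for an unregistered KOS
    detail = fetch_detail(entry["metadata_url"])
    # Registry values take precedence for identity and location.
    return {**detail, **entry}


def fake_fetch(url: str) -> dict:
    """Stand-in for a real retrieval; returns invented detail metadata."""
    return {"hierarchy_depth": 4, "storage_format": "XML", "location": "outdated"}


full = describe_kos("example-thesaurus", registry, fake_fetch)
print(full["location"], full["hierarchy_depth"])
```

Letting the registry win on identity and location is one possible design choice; it keeps the central record authoritative for finding the resource while leaving the provider responsible for everything else.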

There is significant interest in the use of KOSs to organize and search material on the Internet. It is hoped that this interest will result in knowledge organization services that will make these sources more readily accessible to a variety of software applications and to a variety of users. As services and enabled software proliferate, it will be easier to integrate these KOSs into digital libraries.