Image Retrieval Benchmark Database Service:

A Needs Assessment and Preliminary Development Plan

 

 

A Report Prepared for the Council on Library and Information Resources and the Coalition for Networked Information

 

 

Jennifer Trant
Archives & Museum Informatics
www.archimuse.com

 

 

Original draft: October 1, 2003; Updated January 2004


Table of Contents

 

Acknowledgments

1.       Executive Summary

2.       Problem Statement

2.1         Images in Digital Libraries

2.2         The Issue for Digital Libraries

3.       Image Retrieval to Date

3.1         Overview

3.2         Evaluating Image Retrieval

4.       Toward an Image Retrieval Benchmarking Database and Related Services

4.1         Why Benchmarking?

4.2         Who Does Benchmarking?

4.3         How Is Benchmarking Done?

4.4         When Is Benchmarking Valuable?

4.5         What Could Be Benchmarked, and How?

4.6         How Are Benchmarks Established?

4.7         Questions a Benchmarking Database Cannot Answer

4.8         Other Issues

4.9         An Environment for Research

5.       Planning an Image Retrieval Benchmark Service

5.1         Goals for a Research Benchmarking Service

5.2         Audiences/Users of the Benchmarking Service

5.3         Components of an Image Retrieval Benchmark System

5.3.1        Collections of Test Images

5.3.2        Benchmark Queries

5.3.3        Relevance Assessments

5.3.4        Quantitative Evaluation Metrics

5.3.5        Community of Researchers

5.4         Success Factors in the Creation of an Image Retrieval Benchmarking Service

5.4.1        Sponsorship

5.4.2        Community Buy-In

5.4.3        Governance

5.4.4        Creating Incentives to Use

5.4.5        Technical Success Factors

5.5         Ancillary Costs to the Research Community

6.       Scenarios for Developing the Image Retrieval Benchmark Database

6.1         TREC Model

6.1.1        New TREC Video Tracks

6.1.2        Emerging TREC Communities

6.1.3        An Image Retrieval TREC Track?

6.2         Genesis from within the Computer Science Research Community: Benchathlon Expansion

6.3         Music Retrieval

6.4         An Industry Consortium

7.       Stages in Developing an Image Retrieval Benchmark Database

7.1         Phased Approach

7.2         Phase 1: Establish a Case, Identify Sponsors, and Recruit Research Participants

7.2.1        Form Steering Committee

7.2.2        Issue Request for Comment

7.2.3        Hold Workshops

7.2.4        Draft Implementation Plan

7.3         Phase 2: Establish Organization

7.3.1        Establish Governance

7.3.2        Issue Request for Proposals for Host

7.3.3        Issue Formal Call for Participation

7.3.4        Issue Call for Data Sets

7.3.5        Prototype Integration of Initial Data Sets

7.3.6        Establish Test Queries

7.3.7        Establish Test Ground-Truth Assessments

7.3.8        Release Test Data Sets without Ground Truth

7.4         Analyse and Report Prototype Results

7.5         Phase 3: Launch Service

7.5.1        Construct Production Systems

7.5.2        Obtain Data (Image and Metadata Sets)

7.5.3        Establish Queries

7.5.4        Establish Relevance Judgments

7.5.5        Launch Test

7.5.6        Convene an Image Retrieval Conference

7.6         Phase 4: Operationalize Service

8.       Conclusion


Acknowledgments

This study was initiated as a joint project between the Coalition for Networked Information (CNI) and the Council on Library and Information Resources (CLIR) and was funded by the Atlantic Philanthropies. It has benefited from the thoughtful groundwork of Clifford Lynch (CNI) and Anne Kenney (then CLIR, now Cornell University Library).

 

The problems of assessing image retrieval were explored by the participants in several planning meetings held prior to the commissioning of this report. One such session, entitled "Planning Meeting for Test Database for Digital Visual Resources" and convened by Clifford Lynch and Anne Kenney in May 2001, was particularly helpful in shaping my initial thinking. Participating in this session were Joseph Bush, director, Solutions Architecture, Interwoven; Don D'Amato, MITRE Corporation; Corinne Jörgensen, School of Informatics, Florida State University; Donna Harman, Text REtrieval Conference (TREC)/National Institute of Standards and Technology (NIST); Peter Hirtle, Cornell University Library, Cornell University; Matthew Kirschenbaum, Department of English, University of Kentucky; Max Marmor, Arts Library, Yale University (now with ARTstor); Worthy Martin, Department of Computer Science, University of Virginia; Beth Sandore, University Libraries, University of Illinois; Don Waters, The Andrew W. Mellon Foundation; and John Weiss, Digital Library Production Service, University of Michigan.

 

I would also like to thank all those with whom I've discussed this question. I am particularly grateful to Margaret Graham, Corinne Jörgensen, Anne Kenney, and James Wang, who commented on drafts of the manuscript.

 

My personal thanks to David Bearman for his insight into our discussions of this and many other problems in the digitization of cultural heritage information.


 

1. EXECUTIVE SUMMARY

The rapid increase in the quantity of visual materials in digital libraries, supported by significant advances in digital imaging technologies, has not been matched by a corresponding advance in image retrieval technologies and techniques. Digital librarians sense that much could be done to improve access to visual collections and hope, perhaps vainly, that users' needs to identify relevant digital visual resources might be met more satisfactorily through search strategies based on visual characteristics rather than on textual metadata associated with the image, which are expensive to produce. However, digital librarians currently have no tools for evaluating either content-based or metadata-based image retrieval systems. Consequently, they have difficulty assessing existing systems of image access, evaluating proposed changes in these systems, or comparing metadata-based and content-based image retrieval.

 

Some have proposed benchmarking as a solution to this problem. An image retrieval benchmark database could provide a controlled context within which various approaches could be tested. Equally important, it might provide a focus for image retrieval research and help bridge the significant divide between researchers exploring these two search paradigms: metadata-based vs. content-based image retrieval. If so, such a database could spur advances in research, as comparative results make it possible to evaluate the effectiveness of particular strategies and thereby add value to studies supported by many funding agencies.

 

Creating an image retrieval benchmarking service would be a significant undertaking. A benchmarking database is more than a collection of images. Benchmarking requires a set of queries to be put to that test collection. Each image in the test collection must be assessed to determine whether it is relevant to that query. Assessing the performance of systems requires a set of evaluation metrics that make it possible to compare one system with another and to rank results. Developing a test collection requires an investment in data collection, documentation, enhancement, and distribution. Most significantly, maintaining an image reference benchmarking service requires that a community of researchers make a long-term commitment to its use. Without a community vested in the development of the database—and publishing research based on it—the collection remains a chimerical solution to advancing the state of research and improving the retrieval of visual materials in the digital library.

 

2. PROBLEM STATEMENT

2.1 Images in Digital Libraries

Digital libraries—managed collections of digital information assembled and curated by institutions (such as libraries, archives, and museums) or individuals and made available for use—are complex, hybrid environments in which many kinds of materials are brought together for the first time. Their promise of integrated access to information is not being fully realized because of the different ways in which apparently similar materials are currently described. As digital librarians—those professionals responsible for the creation, management, maintenance, and provision of access to digital libraries—struggle to improve services to their traditional users, they also feel pressure to increase the use of their collections in new communities whose needs are less well known and understood. The desire to make collections accessible to nontraditional users makes clear the need for measures of success in information retrieval and evaluation of the user experience.

 

Images appear to offer great potential for interdisciplinary use. Researchers concur that retrieval of images is going to be increasingly important for a range of commercial, governmental, and academic purposes. They also concur that large aggregations of images, measured in hundreds of millions of images and petabits of data, will soon be a standard searching target. Space science, medicine, trademarks, and patents are far along in the implementation of large-scale image databases. Scholarly databases of art and cultural images are still in their relative infancy but are growing fast (a comprehensive cultural heritage database would comprise many hundreds of millions of images). New applications are emerging in home-based image services addressing the hundreds of millions of personal images created with digital cameras.

 

Collections of visual materials bring the information retrieval problem into sharp focus. Resources originally created for use by a single department, such as art history, become a resource for users from many departments when digital image collections are made available on the campus network. Ironically, the potential for broad use of images is often frustrated by the same highly developed disciplinary descriptive and indexing systems that make them especially useable within a specialist community. One could imagine the same images being of interest in the creative arts, art history, the humanities (including history, languages, and philosophy), the social sciences, and anthropology. But for the historian, the artist-focused organizational systems of art history do little to surface the subject matter of images. For the cultural theorist, taxonomies in a biological database documenting an exploration do little to identify images depicting early cross-cultural contact. Scientific imagery is also growing greatly in volume, as satellite, meteorological, biological, and geographical images are gathered that document current and past conditions in detail; however, their electronically captured metadata do not identify the things depicted, making it difficult for the historical geographer to find a road's site using its name. These image collections are themselves of interest to computer scientists and those in information retrieval as well as to earth and space scientists.

 

Visual information has developed a significant role in our culture, and, it appears, in our research methods (Rhyne 1995, 1996). As more and more information resources are made available, their use and reuse become more difficult to predict and to assess. Recent surveys of image retrieval make the point that the users of such systems are drawn from many disciplines. Those cited by Venters and Cooper (2000) include "art galleries and museum management; architectural and engineering design; interior design; remote sensing and earth resource management; geographic information systems; scientific database management; weather forecasting; retailing; fabric and fashion design; trademark and copyright database management; law enforcement and criminal investigation; picture archiving and communications systems." Known users of the Art Museum Image Consortium (AMICO) Library range from art-history researchers, to information and computer scientists, to language teachers and other individuals at the graduate, undergraduate, and K–12 levels, to lifelong learners. E-commerce audiences for image access to product catalogs are growing rapidly; they are just one of many areas of heavy image use outside the research and educational communities. [1]

 

In framing a study of image use at Pennsylvania State University, Henry Pisciotta asks, "Are all picture collections now interdisciplinary?" (2001). If we accept the premise of this rhetorical question, how does it complicate our challenge to provide access to collections for all users?

 

2.2 The Issue for Digital Libraries

As the content of digital libraries increasingly varies in form and grows in size, librarians more and more frequently wish that they could reliably move beyond text-based retrieval. Those developing digital library services fervently hope, as they expressed in the meetings leading up to this report, that user needs for identifying relevant digital visual resources might be met through search strategies based on visual characteristics rather than solely on those represented in textual metadata associated with the image. This hope is in part born from a frustration with current access methods and in part a reflection of the presumed cost of creating image metadata for retrieval, even if metadata-based image retrieval did work. Even among specialist librarians, there is a sense that much could be done to improve access to visual collections, both in the use of existing description and indexing schemes and in the application of new technologies (Eakins and Graham 2000). Perhaps the solution is to match the retrieval methods with the materials and to use more visually oriented retrieval tools to provide access to visual collections.

 

This sense of dissatisfaction with current retrieval methods co-exists with a (perhaps unfounded) sense that content-based image retrieval (CBIR or CBR) systems could enhance the effectiveness of resource delivery in the digital library. Unfounded, because within the digital library community there appears to be a lack of understanding of what CBIR systems do, how they function, and how well they work. Digital librarians have no readily available assessments of the various strategies for content-based image retrieval; the few comparative studies that exist, such as the recent Joint Information Systems Committee/Joint Technologies Application (JISC/JTAP) report (Venters and Cooper 2000), are neither expressed in librarians' language nor focused on the humanistic researcher or library administrator. This lack of knowledge exists along with a technology transfer gap. The call for applied research (expressed by Beth Sandore and others during meetings leading up to this report) echoes the sense that ongoing work is not being related to the needs and requirements of the digital library community and is not being integrated into their service-delivery environments. Pure research that identifies more-effective search algorithms is not being transferred into tools that could be deployed in digital libraries. Users, and the digital librarians who serve them, do not see the benefits of these technological developments.

 

The failure of technology transfer into the digital library application realm does not result from a paucity of image retrieval research. A review of the literature shows that content-based image retrieval is a vital area of computer science. The challenge is to operationalize services based on these technologies and to integrate them into digital libraries. Before we reach that point, we must develop methods to compare and contrast various strategies and to assess where progress has been made and where investment is required in order to create robust technical services. Such methods of direct comparison could improve the caliber of image retrieval research because the relative value of differing strategies could be directly known and the results of different experiments compared.

3. IMAGE RETRIEVAL TO DATE

3.1 Overview

Image retrieval is a large and active area of computer and information science, described as "breathtaking" in its pace in a recent survey (Smeulders et al. 2000). Many large groups maintain extensive teams and pursue multiple avenues of research, well supported by national funding agencies and foundations (see References: Funding Support). Governments and industry are investing substantial amounts, and significant portions of their information research budgets, in the issue (though the field is expanding to include moving image retrieval as well). Several conferences each year are devoted exclusively to these issues, and many others have major components on image retrieval (see References: Conferences). Hardly a major university worldwide is without a research group working on the problem (see References: Research Groups for a sampling).

 

Image retrieval research has taken two distinct, and discrete, paths. The first is focused on metadata-based retrieval, where images are found on the basis of associated textual descriptions and indexing. The second is based on feature-driven CBR or CBIR, where computational methods are used to identify and abstract the visual elements of an image. In metadata-based retrieval, a searcher's chosen text strings are matched to those used to describe the image (with or without lexical aids such as thesauri or word stemming). In CBIR, a query image (selected or drawn) is compared against the image database, and images similar to it are retrieved. This ability to use the inherent features of an image to retrieve it is attractive to digital librarians, as metadata-based image retrieval brings with it many problems of disciplinary perspective, intercataloger consistency, and incompatible metadata schemas.
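
The contrast between the two paradigms can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three-image toy collection, the crude plural-stripping stem function, the four-bin gray-level histogram used as the visual feature, and histogram intersection as the similarity measure. It does not represent any particular system discussed in this report.

```python
from collections import Counter

# Hypothetical toy collection: each image has a free-text description and a
# short list of gray-level pixel values (0-255) standing in for its content.
IMAGES = {
    "img1": {"text": "Portrait of a woman reading", "pixels": [10, 20, 30, 200, 220, 240]},
    "img2": {"text": "Landscape with bridges", "pixels": [100, 110, 120, 130, 140, 150]},
    "img3": {"text": "Woman crossing a bridge", "pixels": [15, 25, 35, 210, 225, 245]},
}


def stem(word):
    """Crude illustrative stemmer: lowercase and strip a trailing plural 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def metadata_search(query):
    """Metadata-based retrieval: match stemmed query terms to stemmed descriptions."""
    terms = {stem(w) for w in query.split()}
    return sorted(
        image_id
        for image_id, record in IMAGES.items()
        if terms & {stem(w) for w in record["text"].split()}
    )


def histogram(pixels, bins=4):
    """Normalized gray-level histogram: a very simple visual feature."""
    counts = Counter(min(p * bins // 256, bins - 1) for p in pixels)
    return [counts[b] / len(pixels) for b in range(bins)]


def similarity(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def content_search(query_id, top_k=1):
    """Content-based retrieval: rank the other images by similarity to a query image."""
    query_hist = histogram(IMAGES[query_id]["pixels"])
    scored = [
        (similarity(query_hist, histogram(record["pixels"])), image_id)
        for image_id, record in IMAGES.items()
        if image_id != query_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [image_id for _, image_id in scored[:top_k]]


print(metadata_search("bridge"))  # images whose descriptions mention a bridge
print(content_search("img1"))     # image most visually similar to img1
```

In this toy collection the text query finds the two images described with "bridge" or "bridges", while the image query finds the picture with the most similar tonal distribution regardless of how it was described. The two result sets need not overlap, which is why comparing the approaches requires a common set of queries and relevance judgments.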

 

A summary review of the literature shows an exceptionally active community of researchers in CBIR. Smeulders et al. recently reviewed the research focus of more than 200 papers judged important to the field (2000); Rui et al. have summarized research in more than 100 papers (1997). Numerous conferences on the subject are listed in the references. Some specialized subsets of this research area, such as latent semantic indexing (Brinkley 2001, Zoran 1997) or progressive feature searching (Castelli et al. 1998), are quite large and have developed their own conferences, publications, and research and evaluation methods. CBIR research has been well funded and is most often concentrated in large, ongoing research teams within computer science.

 

Metadata-based image retrieval is a less coherent field (Chu 2001) that conducts research focused on image retrieval for particular disciplines (Shatford 1986, 1999; Roberts 2001) or formats (Hunter 1999, 2002) or on theoretical improvements in the way images are indexed (Jaimes and Chang 2000, Greenberg 2001). Metadata-oriented image retrieval research has less funding, is conducted by individuals or small transient teams, reports its results in a wider range of journals, and is concentrated in information science schools.

 

Metadata-based image retrieval research appears to have little if any impact on indexers or image metadata developers. For example, at the CNI/Online Computer Library Center (OCLC) Image Metadata Workshop, the third Dublin Core workshop (Weibel 1997), no reference was made to retrieval research in the two days of deliberation over data elements minimally required for image retrieval. Instead, an element set with a focus on information retrieval was created by practitioners, on the basis of their experience describing images. This irony is not unique to image retrieval (Bates 1999).

 

Cawkell (1992) identified the fundamental flaw in image retrieval research: little or no crossover between researchers using "visual" vs. metadata-based methods of image retrieval. This is borne out by a citation analysis conducted by Persson (no date) that was based on authors cited in Rasmussen (1997) and reaffirmed in recent reviews of the literature (Chu 2001) and the research agenda (Jorgensen 2001). Chu's citation study reinforces the gap, noting that the journals with the highest citation rates in CBIR were outside the normal disciplinary discourse of the metadata-based image description community. Her study also illustrates the significantly greater volume of literature published by the CBIR community: in choosing the most frequently cited authors for further analysis, she inadvertently dropped all metadata-based researchers. (Margaret Graham of the School of Informatics, Northumbria University, reported that this gap is one of the motivators for the creation of the Challenge of Image Retrieval/Challenge of Image and Video Retrieval conference series; personal communication, 2003.)

 

Methods using both visual characteristics and textual descriptions in retrieval have recently emerged (Enser 2002, Goodrum et al. 2000, Perez-Lopez et al. 2000). Lewis et al. (2002) summarize the issues as they relate to cultural heritage objects and posit the development of a visual [multimedia] thesaurus that will assist in identifying concepts in images, and thus bridge the "semantic gap." Barnard et al., in Clustering Art (2001), Barnard and Forsyth (2001), and Li and Wang (2003), among many others, including a group of papers presented at Internet Imaging 2003, have been exploring the relationships between visual characteristics and keywords. Image searching tools on the Web such as MetaSEEK (http://ana.ctr.columbia.edu/metaseek) and SIMPLIcity (http://wang.ist.psu.edu/IMAGE) use image and text in combination. However, evaluation is lacking (Chen and Rasmussen 1999). The methods and measures for comparative evaluation of the two approaches, or of hybrid approaches, are poorly articulated and untested (Sormunen et al. 1999, Wang et al. 2003, and Barnard and Shirahatti 2003 represent early attempts).

 

3.2 Evaluating Image Retrieval

Research is beginning into methods of measuring, evaluating, and benchmarking image retrieval systems. In particular, a collaborative effort to develop a CBIR benchmarking environment is under way involving the Viper Group at the University of Geneva (http://www.viper.unige.ch/benchmarking/) and the Benchathlon group (http://www.benchathlon.net), with benchmarking events taking place at imaging conferences such as Internet Imaging.

 

Although some researchers have created data sets against which to test their own methods and some have made these data available to others, there is no widely used data set and no generally accepted set of benchmarks against which to evaluate new methods. Sormunen et al. (1999) explored the use of a task-oriented evaluation framework and a test collection to evaluate CBIR. Müller et al. (2001) described a process for evaluating image browsers that attempts to define the "contribution of low-level, feature based systems to retrieval success" and posits the existence of a set of well-described images as a means of evaluating CBIR systems (though they do not define "well described"). Assessing the relative effectiveness of any image retrieval methods can be costly and frustrating (Venters and Cooper 2000). It is difficult to assess the relative utility of metadata-based systems and to compare them to CBIR.

 

There are few significant studies of users' needs for or experiences with image retrieval systems. The Consortium for the Computer Interchange of Museum Information (CIMI) (1995) summarized work about access points in the cultural heritage community. Jorgensen (1999) looked at the relationships between naive user query presentation language and some image classification systems. Rodden (1999) explored the utility of incorporating CBIR "intelligence" into interfaces. Markkula and Sormunen (2000) looked at the specific use of a digital newspaper photo archive, building on the work of Eakins, Enser, and others. Pisciotta et al. (2001) report on an ambitious study now under way at Pennsylvania State University. But as Rasmussen (2002) points out, no balance has been achieved between system-centered and user-centered evaluation of information retrieval. It should therefore come as no surprise that there is neither a consensus on the most promising approaches to image retrieval nor an agreement on how proposed approaches, systems, and tools should be evaluated for effectiveness (Eakins and Graham 1999).

 

The digital library community, which is a major organized consumer and creator of image databases and has historically had a substantial interest in retrieval effectiveness as part of its service mission, is concerned that the various approaches to image retrieval have not been assessed against common standards and that too little is known about success factors, user needs, and retrieval methods. On the basis of the perceived effectiveness of the NIST/TREC program (http://trec.nist.gov/overview.html), the Council on Library and Information Resources (CLIR) and the Coalition for Networked Information (CNI) sponsored a series of meetings in 2000/2001 to explore whether a shared testbed for image retrieval research could address these problems. These meetings probed a specific question:


Could an image retrieval benchmarking database focus the image retrieval community and further its results?

 

4. TOWARD AN IMAGE RETRIEVAL BENCHMARKING DATABASE AND RELATED SERVICES

4.1 Why Benchmarking?

While exploring the efficacy of an Image Retrieval Benchmarking Database, the participants in the planning meetings (see Acknowledgments and References: Project Documents) articulated a number of questions that reflected the desire of the digital library community to understand aspects of retrieval and possibly integrate CBIR methods into their services. These included:

 

Each of these questions can be answered only by benchmarking.

 

Benchmarking involves the comparison of the results of two or more different methods of performing a known task with a known result (the benchmark) in order to establish relative effectiveness. Benchmarking is a critical component of establishing best practices, key performance indicators, or performance metrics, all of which are other terms often applied to benchmarking or its results.

 

We can compare two methods of CBIR by using them to answer the same questions by searching the same data set.

 

We can compare different methods of metadata-based retrieval by seeing how well images described using different methods are found when the same questions are asked of them and they exist in the same data set.

 

We can assess optimum levels of metadata for the description of images and establish things such as cost effectiveness in image description if we can compare the ease with which images are retrieved in controlled circumstances against the cost of creating metadata.

 

We can test the balance between descriptive and retrieval metadata by assessing how well images with different levels of metadata are retrieved when standard queries are run against a common data set.

 

We can begin to identify what needs to be described in images if we review the results of query effectiveness tests and compare them to the standard kinds of metadata assigned to images. Provided that the queries reflect real user need, we can see what kinds of metadata are most and least likely to be used.

 

We can compare CBIR and metadata-based image retrieval and, possibly, assess the degree to which these two approaches are complementary, if these two methods are used to ask the same questions of the same data set and the results are compared.

 

Key to the success of all these studies is the existence of a benchmarking environment where research can be done in controlled circumstances.
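
A minimal Python sketch of such a controlled comparison follows. It assumes a hypothetical benchmark of two queries with hand-assigned relevance judgments and two invented systems ("system_a" and "system_b") that each return a ranked list of image identifiers for the same queries against the same collection. Precision, recall, and average precision are standard information retrieval measures, but their use here is an illustrative choice, not a recommendation of the planning meetings.

```python
# Hypothetical ground truth: for each benchmark query, the set of images
# that assessors judged relevant.
RELEVANT = {
    "q1": {"img1", "img4", "img7"},
    "q2": {"img2", "img3"},
}

# Hypothetical ranked results from two systems answering the same queries
# against the same data set.
RUNS = {
    "system_a": {"q1": ["img1", "img4", "img9", "img7"], "q2": ["img5", "img2", "img3"]},
    "system_b": {"q1": ["img9", "img1", "img8", "img4"], "q2": ["img2", "img6", "img5"]},
}


def precision_recall(ranked, relevant):
    """Precision: share of retrieved images that are relevant.
    Recall: share of relevant images that were retrieved."""
    hits = sum(1 for image_id in ranked if image_id in relevant)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant image
    appears, divided by the total number of relevant images."""
    hits, total = 0, 0.0
    for rank, image_id in enumerate(ranked, start=1):
        if image_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(run):
    """Single summary score for one system over all benchmark queries."""
    scores = [average_precision(run[query], RELEVANT[query]) for query in RELEVANT]
    return sum(scores) / len(scores)


for name, run in sorted(RUNS.items()):
    print(name, round(mean_average_precision(run), 3))
```

Because both systems are scored against the same queries and the same relevance judgments, the resulting numbers are directly comparable. The same harness could in principle score a metadata-based system and a CBIR system side by side, provided the ground truth is established independently of either method.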

4.2 Who Does Benchmarking?

What kind of community of interest is required to sustain an image reference database?

 

Benchmarking is well established throughout the economy, in areas from automotive manufacturing to knowledge management to higher education. All benchmarking initiatives are committed to sharing information and to improving business processes or performance. Through shared measures, assessments can be conducted that provide comparable results in different contexts. The emphasis in benchmarking is on relia