Image Retrieval Benchmark Database Service:
A Needs Assessment and Preliminary Development Plan
A Report Prepared for the Council on Library and Information Resources
and the Coalition for Networked Information
Jennifer Trant
Archives & Museum Informatics
www.archimuse.com
Original draft: October 1, 2003; Updated January 2004
Table of Contents
2.1 Images in Digital Libraries
2.2 The Issue for Digital Libraries
3.1 Overview
3.2 Evaluating Image Retrieval
4. Toward an Image Retrieval Benchmarking Database and Related Services
4.1 Why Benchmarking?
4.2 Who Does Benchmarking?
4.3 How Is Benchmarking Done?
4.4 When Is Benchmarking Valuable?
4.5 What Could Be Benchmarked, and How?
4.6 How Are Benchmarks Established?
4.7 Questions a Benchmarking Database Cannot Answer
4.8 Other Issues
4.9 An Environment for Research
5. Planning an Image Retrieval Benchmark Service
5.1 Goals for a Research Benchmarking Service
5.2 Audiences/Users of the Benchmarking Service
5.3 Components of an Image Retrieval Benchmark System
5.3.1 Collections of Test Images
5.3.2 Benchmark Queries
5.3.3 Relevance Assessments
5.3.4 Quantitative Evaluation Metrics
5.3.5 Community of Researchers
5.4 Success Factors in the Creation of an Image Retrieval Benchmarking Service
5.4.1 Sponsorship
5.4.2 Community Buy-In
5.4.3 Governance
5.4.4 Creating Incentives to Use
5.4.5 Technical Success Factors
5.5 Ancillary Costs to the Research Community
6. Scenarios for Developing the Image Retrieval Benchmark Database
6.1 TREC Model
6.1.1 New TREC Video Tracks
6.1.2 Emerging TREC Communities
6.1.3 An Image Retrieval TREC Track?
6.2 Genesis from within the Computer Science Research Community: Benchathlon Expansion
6.3 Music Retrieval
6.4 An Industry Consortium
7. Stages in Developing an Image Retrieval Benchmark Database
7.1 Phased Approach
7.2 Phase 1: Establish a Case, Identify Sponsors, and Recruit Research Participants
7.2.1 Form Steering Committee
7.2.2 Issue Request for Comment
7.2.3 Hold Workshops
7.2.4 Draft Implementation Plan
7.3 Phase 2: Establish Organization
7.3.1 Establish Governance
7.3.2 Issue Request for Proposals for Host
7.3.3 Issue Formal Call for Participation
7.3.4 Issue Call for Data Sets
7.3.5 Prototype Integration of Initial Data Sets
7.3.6 Establish Test Queries
7.3.7 Establish Test Ground-Truth Assessments
7.3.8 Release Test Data Sets without Ground Truth
7.4 Analyse and Report Prototype Results
7.5 Phase 3: Launch Service
7.5.1 Construct Production Systems
7.5.2 Obtain Data (Image and Metadata Sets)
7.5.3 Establish Queries
7.5.4 Establish Relevance Judgments
7.5.5 Launch Test
7.5.6 Convene an Image Retrieval Conference
7.6 Phase 4: Operationalize Service
8. Conclusion
This study was initiated as a joint project between the Coalition for Networked Information
(CNI) and the Council on Library and Information Resources (CLIR) and
was funded by the Atlantic Philanthropies. It has benefited from the
thoughtful groundwork of Clifford Lynch (CNI) and Anne Kenney (then
CLIR, now Cornell University Library).
The problems of assessing image retrieval were explored by the participants
in several planning meetings held prior to the commissioning of this
report. One such session, entitled "Planning Meeting for Test Database
for Digital Visual Resources" and convened by Clifford Lynch and Anne
Kenney in May 2001, was particularly helpful in shaping my initial
thinking. Participating in this session were Joseph Bush, director,
Solutions Architecture, Interwoven; Don D'Amato, MITRE Corporation;
Corinne Jörgensen, School of Informatics, Florida State University;
Donna Harman, Text REtrieval Conference (TREC)/National Institute
of Standards and Technology (NIST); Peter Hirtle, Cornell University
Library, Cornell University; Matthew Kirschenbaum, Department of English,
University of Kentucky; Max Marmor, Arts Library, Yale University (now
with ARTstor); Worthy Martin, Department of Computer Science, University
of Virginia; Beth Sandore, University Libraries, University of Illinois;
Don Waters, The Andrew W. Mellon Foundation; and John Weiss, Digital
Library Production Service, University of Michigan.
I would also like to thank all those with whom I've discussed this question.
I am particularly grateful to Margaret Graham, Corinne Jörgensen, Anne
Kenney, and James Wang, who commented on drafts of the manuscript.
My personal thanks to David Bearman for his insight into our discussions
of this and many other problems in the digitization of cultural heritage
information.
The rapid increase in the quantity of visual materials in digital libraries,
supported by significant advances in digital imaging technologies, has not been
matched by a corresponding advance in image retrieval technologies
and techniques. Digital librarians sense that much could be done to
improve access to visual collections and hope, perhaps vainly, that
users' needs to identify relevant digital visual resources might be
met more satisfactorily through search strategies based on visual characteristics
rather than on textual metadata associated with the image, which are
expensive to produce. However, digital librarians currently have no
tools for evaluating either content-based or metadata-based image retrieval
systems. Consequently, they have difficulty assessing existing systems
of image access, evaluating proposed changes in these systems, or comparing
metadata-based and content-based image retrieval.
Some have proposed benchmarking as a solution to this problem. An image
retrieval benchmark database could provide a controlled context within
which various approaches could be tested. Equally important, it might
provide a focus for image retrieval research and help bridge the significant
divide between researchers exploring these two search paradigms: metadata-based
vs. content-based image retrieval. If so, such a database could spur
advances in research, as comparative results make it possible to evaluate
the effectiveness of particular strategies and thereby add value to
studies supported by many funding agencies.
Creating an image retrieval benchmarking service would be a significant undertaking.
A benchmarking database is more than a collection of images. Benchmarking
requires a set of queries to be put to that test collection. Each image
in the test collection must be assessed to determine whether it is
relevant to that query. Assessing the performance of systems requires
a set of evaluation metrics that make it possible to compare one system
with another and to rank results. Developing a test collection requires
an investment in data collection, documentation, enhancement, and distribution.
Most significantly, maintaining an image reference benchmarking service
requires that a community of researchers make a long-term commitment
to its use. Without a community vested in the development of the database—and
publishing research based on it—the collection remains a chimerical
solution to advancing the state of research and improving the retrieval
of visual materials in the digital library.
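The components listed above (a test collection, a set of queries, relevance assessments, and evaluation metrics) can be pictured with a minimal sketch. It assumes the common information retrieval measures of precision and recall, which this passage does not itself prescribe; every identifier, image name, and judgment below is hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: all identifiers and data below are hypothetical.

def precision_recall(retrieved, relevant):
    """Score one query: share of retrieved images that are relevant
    (precision) and share of relevant images that were retrieved (recall)."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_scores(run, judgments):
    """Average precision and recall over every benchmark query."""
    scores = [precision_recall(run.get(q, []), rel) for q, rel in judgments.items()]
    mean_precision = sum(p for p, _ in scores) / len(scores)
    mean_recall = sum(r for _, r in scores) / len(scores)
    return mean_precision, mean_recall


# Relevance assessments: for each benchmark query, the images judged relevant.
judgments = {
    "q1": {"img1", "img2", "img3"},
    "q2": {"img4", "img5"},
}

# Results returned by two (hypothetical) retrieval systems for the same queries.
system_a = {"q1": ["img1", "img2", "img9"], "q2": ["img4"]}
system_b = {"q1": ["img1", "img8"], "q2": ["img4", "img5", "img6", "img7"]}

scores_a = mean_scores(system_a, judgments)
scores_b = mean_scores(system_b, judgments)
print("System A (mean precision, mean recall):", scores_a)
print("System B (mean precision, mean recall):", scores_b)
```

Because both systems are scored against the same queries and the same judgments, their numbers are directly comparable, which is the property a shared benchmark is meant to supply.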
2.1 Images in Digital Libraries
Digital libraries—managed collections of digital information assembled
and curated by institutions (such as libraries, archives, and museums)
or individuals and made available for use—are complex, hybrid environments
in which many kinds of materials are brought together for the first
time. Their promise of integrated access to information is not
being fully realized because of the different ways in which apparently
similar materials are currently described. As digital librarians—those
professionals responsible for the creation, management, maintenance,
and provision of access to digital libraries—struggle to improve
services to their traditional users, they also feel pressure to
increase the use of their collections in new communities whose
needs are less well known and understood. The desire to make collections
accessible to nontraditional users makes clear the need for measures
of success in information retrieval and evaluation of the user
experience.
Images appear to offer great potential for interdisciplinary use.
Researchers concur that retrieval of images is going to be increasingly
important for a range of commercial, governmental, and academic
purposes. They also concur that large aggregations of images, measured
in hundreds of millions of images and petabits of data, will soon
be a standard searching target. Space science, medicine, trademarks,
and patents are far along in the implementation of large-scale
image databases. Scholarly databases of art and cultural images
are still in their relative infancy but are growing fast (a comprehensive
cultural heritage database would comprise many hundreds of millions
of images). New applications are emerging in home-based image services
addressing the hundreds of millions of personal images created
with digital cameras.
Collections of visual materials bring the information retrieval
problem into sharp focus. Resources originally created for use
by a single department, such as art history, become a resource
for users from many departments when digital image collections
are made available on the campus network. Ironically, the potential
for broad use of images is often frustrated by the same highly
developed disciplinary descriptive and indexing systems that make
them especially useable within a specialist community. One could
imagine the same images being of interest in the creative arts,
art history, the humanities (including history, languages, and
philosophy), the social sciences, and anthropology. But for the
historian, the artist-focused organizational systems of art history
do little to surface the subject matter of images. For the cultural
theorist, taxonomies in a biological database documenting an exploration
do little to identify images depicting early cross-cultural contact.
Scientific imagery is also growing greatly in volume, as satellite,
meteorological, biological, and geographical images are gathered
that document current and past conditions in detail; however, their
electronically captured metadata do not identify the things depicted,
making it difficult for the historical geographer to find a road's
site using its name. These image collections are themselves of
interest to computer scientists and those in information retrieval
as well as to earth and space scientists.
Visual information has developed a significant role in our culture,
and, it appears, in our research methods (Rhyne 1995, 1996). As
more and more information resources are made available, their use
and reuse become more difficult to predict and to assess. Recent
surveys of image retrieval make the point that the users of such
systems are drawn from many disciplines. Those cited by Venters
and Cooper (2000) include "art galleries and museum management;
architectural and engineering design; interior design; remote sensing
and earth resource management; geographic information systems;
scientific database management; weather forecasting; retailing;
fabric and fashion design; trademark and copyright database management;
law enforcement and criminal investigation; picture archiving and
communications systems." Known users of the Art Museum Image Consortium
(AMICO) Library range from art-history researchers, to information
and computer scientists, to language teachers and other individuals
at the graduate, undergraduate, and K–12 levels, to lifelong learners.
E-commerce audiences for image access to product catalogs are growing
rapidly; they are just one of many areas of heavy image use outside
the research and educational communities. [1]
In framing a study of image use at Pennsylvania State University,
Henry Pisciotta asks, "Are all picture collections now interdisciplinary?" (2001).
If we accept the premise of this rhetorical question, how does it complicate
our challenge to provide access to collections for all users?
2.2 The Issue for Digital Libraries
As the content of digital libraries increasingly varies in form
and grows in size, librarians more and more frequently wish that
they could reliably move beyond text-based retrieval. Those developing
digital library services fervently hope (as they said in the meetings
leading up to this report) that user needs for identifying
relevant digital visual resources might be met through search strategies
based on visual characteristics rather than solely on those represented
in textual metadata associated with the image. This hope is in
part born from a frustration with current access methods and in
part a reflection of the presumed cost of creating image metadata
for retrieval, even if metadata-based image retrieval did work.
Even among specialist librarians, there is a sense that much could
be done to improve access to visual collections, both in the use
of existing description and indexing schemes and in the application
of new technologies (Eakins and Graham 2000). Perhaps the solution
is to match the retrieval methods with the materials and to use
more visually oriented retrieval tools to provide access to visual
collections.
This sense of dissatisfaction with current retrieval methods coexists
with a sense, perhaps unfounded, that content-based image retrieval
(CBIR or CBR) systems could enhance the effectiveness of resource
delivery in the digital library. It may be unfounded because the digital
library community appears to lack an understanding of what CBIR systems
do, how they function, and how well they work.
Digital librarians have no readily available assessments of the
various strategies for content-based image retrieval; the few comparative
studies that exist, such as the recent Joint Information Systems
Committee/Joint Technologies Application (JISC/JTAP) report (Venters
and Cooper 2000), are neither expressed in librarians' language
nor focused on the humanistic researcher or library administrator.
This lack of knowledge exists along with a technology transfer
gap. The call for applied research (expressed by Beth Sandore and
others during meetings leading up to this report) echoes the sense
that ongoing work is not being related to the needs and requirements
of the digital library community and is not being integrated into
their service-delivery environments. Pure research that identifies
more-effective search algorithms is not being transferred into
tools that could be deployed in digital libraries. Users, and the
digital librarians who serve them, do not see the benefits of these
technological developments.
The failure of technology transfer into the digital library application
realm does not result from a paucity of image retrieval research.
A review of the literature shows that content-based image retrieval
is a vital area of computer science. The challenge is to operationalize
services based on these technologies and to integrate them into
digital libraries. Before we reach that point, we must develop
methods to compare and contrast various strategies and to assess
where progress has been made and where investment is required in
order to create robust technical services. Such methods of direct
comparison could improve the caliber of image retrieval research
because the relative value of differing strategies could be directly
known and the results of different experiments compared.
Image retrieval is a large and active area of computer and information science,
described as "breathtaking" in its pace in a recent survey (Smeulders
et al. 2000). Many large groups maintain extensive teams and support
multiple avenues of research that is well supported by national funding
agencies and foundations (see References: Funding Support). Governments
and industry are investing substantial amounts, and significant portions
of their information research budgets, in the issue (though the field
is expanding to include moving image retrieval as well). Several
conferences each year are devoted exclusively to these issues and there
are many conferences with major components for image retrieval (see
References: Conferences). Hardly a major university worldwide is without
a research group working on the problem (see the References: Research
Groups for a sampling).
Image retrieval research has taken two distinct, and discrete, paths. The
first is focused on metadata-based retrieval, where images are found
on the basis of associated textual descriptions and indexing. The second
is based on feature-driven CBR or CBIR, where computational methods
are used to identify and abstract the visual elements of an image.
In metadata-based retrieval, a searcher's chosen text strings are matched
to those used to describe the image (with or without lexical aids such
as thesauri or word stemming). In CBIR, a query image (selected or
drawn) is compared against the image database, and images similar to
it are retrieved. This ability to use the inherent features of an image
to retrieve it is attractive to digital librarians, as metadata-based
image retrieval brings with it many problems of disciplinary perspective,
intercataloger consistency, and incompatible metadata schemas.
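The two search paradigms described above can be contrasted in a minimal sketch. It assumes whole-word matching for the metadata-based search and a simple brightness-histogram intersection for the content-based search; research CBIR systems use far richer features, and all names, descriptions, and pixel values here are hypothetical.

```python
# Illustrative sketch only: a toy contrast of the two paradigms.
# All identifiers, descriptions, and pixel values are hypothetical.

def metadata_search(query, records):
    """Metadata-based retrieval: return images whose textual description
    contains every word of the searcher's query."""
    terms = set(query.lower().split())
    return [image_id for image_id, description in records.items()
            if terms <= set(description.lower().split())]


def histogram(pixels, bins=4, max_value=256):
    """Abstract an image (a list of 0-255 gray values) into a
    normalized brightness histogram."""
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // max_value] += 1
    return [count / len(pixels) for count in counts]


def similarity(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def cbir_search(query_pixels, images, top_k=2):
    """Content-based retrieval: rank stored images by visual similarity
    to the query image and return the closest matches."""
    query_hist = histogram(query_pixels)
    ranked = sorted(images,
                    key=lambda image_id: similarity(query_hist, histogram(images[image_id])),
                    reverse=True)
    return ranked[:top_k]


records = {
    "img1": "portrait of a woman in oil",
    "img2": "river landscape at dusk",
    "img3": "landscape with river and bridge",
}
images = {
    "img1": [10, 20, 30, 40],
    "img2": [200, 210, 220, 230],
    "img3": [10, 20, 30, 200],
}

text_results = metadata_search("river landscape", records)
visual_results = cbir_search([15, 25, 35, 220], images)
print(text_results)
print(visual_results)
```

The metadata search succeeds only if the cataloger happened to use the searcher's words, while the visual search depends entirely on which features are extracted; neither result set says anything about relevance to a user until it is judged against a shared benchmark.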
A summary review of the literature shows an exceptionally active community
of researchers in CBIR. Smeulders et al. recently reviewed the research
focus of more than 200 papers judged important to the field (2000);
Rui et al. have summarized research in more than 100 papers (1997).
Numerous conferences on the subject are listed in the references. Some
specialized subsets of this research area, such as latent semantic
indexing (Brinkley 2001, Zoran 1997) or progressive feature searching
(Castelli et al. 1998), are quite large and have developed their own
conferences, publications, and research and evaluation methods. CBIR
research has been well funded and is most often concentrated in large,
ongoing research teams within computer science.
Metadata-based image retrieval is a less coherent field (Chu 2001) that conducts research
focused on image retrieval for particular disciplines (Shatford 1986,
1999; Roberts 2001) or formats (Hunter 1999, 2002) or on theoretical
improvements in the way images are indexed (Jaimes and Chang 2000,
Greenberg 2001). Metadata-oriented image retrieval research has less
funding, is conducted by individuals or small transient teams, reports
its results in a wider range of journals, and is concentrated in information
science schools.
Metadata-based image retrieval research appears to have little if any impact on indexers
or image metadata developers. For example, at the CNI/Online Computer
Library Center (OCLC) Image Metadata Workshop, the third Dublin Core workshop
(Weibel 1997), no reference was made to retrieval research in the two
days of deliberation over data elements minimally required for image
retrieval. Instead, an element set with a focus on information retrieval
was created by practitioners, on the basis of their experience describing
images. This irony is not unique to image retrieval (Bates 1999).
Cawkell (1992) identified the fundamental flaw in image retrieval research:
little or no crossover between researchers using "visual" vs. metadata-based
methods of image retrieval. This is borne out by a citation analysis
conducted by Persson (no date) that was based on authors cited in Rasmussen
(1997) and reaffirmed in recent reviews of the literature (Chu 2001)
and the research agenda (Jorgensen 2001). Chu's citation study reinforces
the gap, noting that the journals with the highest citation rates in
CBIR were outside the normal disciplinary discourse of the metadata-based
image description community. She also illustrates the significantly
greater volume of literature published by the CBIR community by inadvertently
dropping all metadata-based researchers when choosing the most
frequently cited authors for further analysis. (Margaret Graham of the School
of Informatics, Northumbria University, reported that this gap is one
of the motivators for the creation of the Challenge of Image Retrieval/Challenge
of Image and Video Retrieval conference series; personal communication,
2003.)
Methods using both visual characteristics and textual descriptions in retrieval
have recently emerged (Enser 2002, Goodrum et al. 2000, Perez-Lopez
et al. 2000). Lewis et al. (2002) summarize the issues as they relate
to cultural heritage objects and posit the development of a visual
[multimedia] thesaurus that will assist in identifying concepts in
images, and thus bridge the "semantic gap." Barnard et al., in Clustering
Art (2001), Barnard and Forsyth (2001), and Li and Wang (2003), among
many others, including a group of papers presented at Internet Imaging
2003, have been exploring the relationships between visual characteristics
and keywords. Image searching tools on the Web such as MetaSEEK (http://ana.ctr.columbia.edu/metaseek)
and SIMPLIcity (http://wang.ist.psu.edu/IMAGE) use image and text in
combination. However, evaluation is lacking (Chen and Rasmussen 1999).
The methods and measures for comparative evaluation of the two approaches,
or of hybrid approaches, are poorly articulated and untested (Sormunen
et al. 1999, Wang et al. 2003, and Barnard and Shirahatti 2003 represent
early attempts).
Research is beginning into methods of measuring, evaluating, and benchmarking
image retrieval systems. In particular, a collaborative effort to develop
a CBIR benchmarking environment is under way involving the University
of Geneva, the Viper Group (http://www.viber.nige.ch/benchmarking/)
and the Benchathlon group (http://www.benchathlon.net), with
benchmarking events taking place at imaging conferences such as Internet
Imaging.
Although some researchers have created data sets against which to test their
own methods and some have made these data available to others, there
is no widely used data set and no generally accepted set of benchmarks
against which to evaluate new methods. Sormunen et al. (1999) explored
the use of a task-oriented evaluation framework and a test collection
to evaluate CBIR. Müller et al. (2001) described a process for evaluating
image browsers that attempts to define the "contribution of low-level,
feature based systems to retrieval success" and posits the existence
of a set of well-described images as a means of evaluating CBIR systems
(though they do not define "well described"). Assessing the relative
effectiveness of any image retrieval methods can be costly and frustrating
(Venters and Cooper 2000). It is difficult to assess the relative utility
of metadata-based systems and to compare them to CBIR.
There are few significant studies of users' needs for or experiences with
image retrieval systems. The Consortium for the Computer Interchange
of Museum Information (CIMI) (1995) summarized work about access points
in the cultural heritage community. Jorgensen (1999) looked at the
relationships between naive user query presentation language and some
image classification systems. Rodden (1999) explored the utility of
incorporating CBIR "intelligence" into interfaces. Markkula and Sormunen
(2000) looked at the specific use of a digital newspaper photo archive,
building on the work of Eakins, Enser, and others. Pisciotta et al.
(2001) report on an ambitious study now under way at Pennsylvania
State University. But as Rasmussen (2002) points out, no balance has
been achieved between system-centered and user-centered evaluation
of information retrieval. It should therefore come as no surprise that
there is neither a consensus on the most promising approaches to image
retrieval nor an agreement on how proposed approaches, systems, and
tools should be evaluated for effectiveness (Eakins and Graham 1999).
The digital library community, which is a major organized consumer and
creator of image databases and has historically had a substantial interest
in retrieval effectiveness as part of its service mission, is concerned
that the various approaches to image retrieval have not been assessed
against common standards and that too little is known about success factors,
user needs, and retrieval methods. On the basis of the perceived effectiveness
of the NIST/TREC program (http://trec.nist.gov/overview.html),
the Council on Library and Information Resources (CLIR) and the Coalition
for Networked Information (CNI) sponsored a series of meetings in 2000/2001
to explore whether a shared testbed for image retrieval research could
address these problems. These meetings probed a specific question:
Could an image retrieval benchmarking database focus the image retrieval
community and further its results?
While exploring the efficacy of an Image Retrieval Benchmarking
Database, the participants in the planning meetings (see Acknowledgements
and References: Project Documents) articulated a number of questions
that reflected the desire of the digital library community to understand
aspects of retrieval and possibly integrate CBIR methods into their
services. These included:
Each of these questions can be answered only by benchmarking.
Benchmarking involves the comparison of the results of two or more different methods
of performing a known task with a known result (the benchmark) in order
to establish relative effectiveness. Benchmarking is a critical component
of establishing best practices, key performance indicators, or performance
metrics, all of which are other terms often applied to benchmarking
or its results.
We can compare two methods of CBIR by using them to answer the same questions
against the same data set.
We can compare different methods of metadata-based retrieval by seeing
how well images described using different methods are found when the
same questions are asked of them and they exist in the same data set.
We can assess optimum levels of metadata for the description of images
and establish measures such as the cost-effectiveness of image description
if we can compare the ease with
which images are retrieved in controlled circumstances against the
cost of creating metadata.
We can test the balance between descriptive and retrieval metadata by
assessing how well images with different levels of metadata are retrieved
when standard queries are run against a common data set.
We can begin to identify what needs to be described in images if we review
the results of query effectiveness tests and compare them to the standard
kinds of metadata assigned to images. Provided that the queries reflect
real user need, we can see what kinds of metadata are most and least
likely to be used.
We can compare CBIR and metadata-based image retrieval and, possibly,
assess the degree to which these two approaches are complementary,
if these two methods are used to ask the same questions of the same
data set and the results are compared.
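One way to picture the complementarity test described in the last comparison is to split the images judged relevant to a query according to which method found them. The sketch below is only an illustration of that idea under simple set-based assumptions; the function name, image identifiers, and judgments are hypothetical.

```python
# Illustrative sketch only: all identifiers and data below are hypothetical.

def complementarity(metadata_hits, cbir_hits, relevant):
    """Split the relevant images for one query by which method retrieved them."""
    found_by_metadata = set(metadata_hits) & relevant
    found_by_cbir = set(cbir_hits) & relevant
    return {
        "both": found_by_metadata & found_by_cbir,
        "metadata_only": found_by_metadata - found_by_cbir,
        "cbir_only": found_by_cbir - found_by_metadata,
        "missed": relevant - (found_by_metadata | found_by_cbir),
    }


# Same query, same data set, same relevance judgments for both methods.
relevant = {"img1", "img2", "img3", "img4"}
metadata_hits = ["img1", "img2", "img7"]
cbir_hits = ["img2", "img3", "img8"]

breakdown = complementarity(metadata_hits, cbir_hits, relevant)
for group, image_ids in breakdown.items():
    print(group, sorted(image_ids))
```

Large "metadata_only" and "cbir_only" groups across many queries would suggest that the two approaches find different relevant images and might usefully be combined; a large "both" group would suggest they are largely redundant.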
Key to the success of all these studies is the existence of a benchmarking
environment where research can be done in controlled circumstances.
What kind of community of interest is required to sustain an image reference
database?
Benchmarking is well established throughout the economy, in areas from automotive manufacturing to knowledge management to higher education. All benchmarking initiatives are committed to sharing information and to improving business processes or performance. Through shared measures, assessments can be conducted that provide comparable results in different contexts. The emphasis in benchmarking is on relia