Richard E. (Rick) Luce
(Rick Luce is Vice Provost and Director of Libraries at Emory University.)
Emergence of eScience
A convergence of exponential increases in computing, storage, online sensors, and bandwidth enabling collaboration in new ways has led to the rise of eScience. Characterized by large-scale, distributed global collaboration using distributed information technologies, eScience is supported by the next generation of cyberinfrastructure. eScience is typically conducted by a multidisciplinary team working on problems that have only become solvable in recent years with improved data collection and data analysis capabilities.
These characteristics fundamentally alter the ways in which scientists carry out their work, the tools and workflows they use, the types of problems they address, and the communications resulting from their research. The revolutionary potential of eScience is the ability to work at a much greater scale and intensity using distributed networks and powerful tools. Examples range from distributed computational astronomy to complex systems such as social networks, climate change, multifactorial diseases, and pollution remediation.
Virtually every field in science and engineering has been changed by the convergence of these technologies, yielding entirely new ways of thinking about and understanding physical, biological, and social phenomena (see footnote 1). These revolutionary developments will require a corresponding disruptive change in the ways in which libraries serve scientists’ needs.
A Growing Convergence: eResearch
While these new eScience developments initially characterized only the science, technology, engineering, and medicine disciplines, that distinction has faded; we now see these developments beginning to penetrate the social sciences and the humanities. The rise of transdisciplinary work, coupled with social scholarship (which is characterized by openness and the use of social tools, including virtual conversations, access, sharing, collaboration, and transparent revision), will continue to erode boundaries.
Corresponding with increased engagement in the social sciences and humanities is the broader notion of eResearch. eResearch refers to the development of, and the support for, advanced information and computational technologies to enhance all phases of research processes. A fundamental enabler of innovations and new discoveries, eResearch is becoming just as critical for the advancement of the social sciences and the humanities as it already is in the sciences.
Implications for Research Libraries
Preserving knowledge is one of the most vital and rapidly changing fundamental roles of the research library. For libraries that are now positioning themselves to support eResearch, preserving knowledge entails at least four key challenges:
- ensuring the quality, integrity, and curation of digital research information;
- sustaining today’s evolving digital service environments;
- bridging and connecting different worlds, disciplines, and paradigms for knowing and understanding; and
- archiving research data in a data world.
Discussion surrounding support of eResearch environments has focused on the overwhelming volume of data produced, with attendant challenges of scaling up capture and preservation capabilities. The more significant challenge, however, is the changing paradigm for capturing and reflecting research communication in an eResearch environment. Instead of simply storing objects of assorted types, researchers need libraries that reflect a Web 2.0 service environment in which communication is continuous and synchronous. This reality introduces significantly greater complexity to digital capture, curation, and preservation.
Innovative thinking is essential; a few existing lessons from external models tackling grand challenge problems are instructive. Institutional organizational silos alone cannot scale sufficiently to support this environment; the challenge requires transnational approaches and a matrix of capabilities. The speed of organizational deployment matters. The ability to move quickly and with agility is a competitive asset; slow-moving organizations are severely handicapped in this environment. Continuous adaptation is required; a diversity of approaches resulting in a variety of experiments should be celebrated.
Supporting Creation: A Key Role
A shift is emerging in the importance of different products of research. Increasingly, value is placed not on the publication(s) resulting from a research project but on the data-modeling and data-generation phases that occur earlier in the research life cycle. This shift to a more dynamic and collaborative process of doing science has led to a less formal means of communicating. In some areas of science this is leading to a less well-defined medium that is part publication and part ongoing communication process. Supporting this shift requires actively enabling and sustaining these communication processes rather than simply archiving the end result as a formal publication. Librarians and informaticians must be involved in the early planning and data-modeling phases of eResearch to ensure the collection, preservation, ease of use, and availability of data today and in the future.
There is a need for workflow tools that capture emerging communication modalities, and libraries and appropriate partners have the opportunity to fill that critical gap. It is at this early creation stage that the establishment of policies concerning data description, management, access, and sharing should be addressed, with particular attention paid to the demand for unfettered access to the research literature corpora. The level of knowledge and engagement required to effectively fill this role, however, goes well beyond knowledge of the literature. It requires being a trusted member of the community with recognized authority in information-related matters. This new paradigm entails shifting library foci from managing specialized collections to emphasizing proactive outreach and engagement.
Connecting Communities: A Second Key Role
The interactions required to facilitate eResearch differ in time and space from other methods. As a neutral commons, research libraries could provide collaborative facilities that allow startup efforts to congeal and connections to evolve. Centering startup activities within these co-laboratory facilities provides rich opportunities to connect with, and consult on, data practices ranging from collection and description to publication and preservation. Success, however, will require far more dynamic and proactive engagement than current institutional repository models do.
In the virtual world a neutral mechanism to create community interactions is needed. Groups conducting research will need access to information in collaborative Web spaces. These collaborative Web spaces will be populated by information feeds customized for individual teams of researchers. Some of these feeds will be customized for researchers fitting specific profiles; others will be pulled from external sites. Still others will be created by intelligent agents crawling the Web, remote repositories, and local resources.
Hybrid teams of information science experts working closely with researchers would determine the information requirements for these Web spaces. After determining the requirements, staff members would create and customize information feeds, serving as RSS channel editors and using tools to aggregate and filter RSS feeds from external sites. They could augment artificial intelligence with human intelligence by creating search strategies for intelligent agents with the help of taxonomies that map current terms to emerging terms, and terms from one domain to another domain.
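The channel-editing workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing library service: the taxonomy entries, the sample feeds, the example.org links, and the function names (`expand_terms`, `parse_items`, `aggregate_and_filter`) are all hypothetical. It expands a search term through a small taxonomy that maps a current term to emerging terms, then aggregates items from several RSS 2.0 feeds, dropping duplicates and keeping only items that match.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxonomy: maps a current term to emerging or cross-domain terms.
TAXONOMY = {
    "data curation": ["data stewardship", "research data management"],
}

# Hypothetical sample feeds (RSS 2.0); the example.org links are placeholders.
FEED_A = """<rss version="2.0"><channel><title>Feed A</title>
<item><title>New data stewardship guidelines</title>
<link>http://example.org/a1</link><description>Policy note</description></item>
<item><title>Campus parking update</title>
<link>http://example.org/a2</link><description>Logistics</description></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Feed B</title>
<item><title>Research data management workshop</title>
<link>http://example.org/b1</link><description>Training</description></item>
<item><title>New data stewardship guidelines</title>
<link>http://example.org/a1</link><description>Policy note</description></item>
</channel></rss>"""


def expand_terms(terms, taxonomy):
    """Return the lowercased search terms plus any terms the taxonomy maps them to."""
    expanded = set()
    for term in terms:
        key = term.lower()
        expanded.add(key)
        expanded.update(mapped.lower() for mapped in taxonomy.get(key, []))
    return expanded


def parse_items(rss_xml):
    """Extract title, link, and description from every <item> in an RSS string."""
    root = ET.fromstring(rss_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


def aggregate_and_filter(feeds, terms, taxonomy=TAXONOMY):
    """Merge feeds, skip items whose link was already kept, keep items matching any term."""
    wanted = expand_terms(terms, taxonomy)
    seen_links = set()
    selected = []
    for rss_xml in feeds:
        for entry in parse_items(rss_xml):
            if entry["link"] in seen_links:
                continue
            text = (entry["title"] + " " + entry["description"]).lower()
            if any(term in text for term in wanted):
                seen_links.add(entry["link"])
                selected.append(entry)
    return selected


channel = aggregate_and_filter([FEED_A, FEED_B], ["data curation"])
for entry in channel:
    print(entry["title"], entry["link"])
```

Neither sample item contains the phrase "data curation" itself; both are selected only because the taxonomy maps that term to newer vocabulary. That mapping is the human contribution the paragraph describes. A production service would also need scheduled fetching, error handling, and a curated taxonomy, all of which are omitted here.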
Support for social networks through advanced social software capabilities is another potential service for research libraries. Social software in an eResearch environment has three dimensions: (1) conversational interactions, in which software applications facilitate synchronous and asynchronous communication among individuals and groups; (2) collaborative social networks, which allow individuals to discover and interact with colleagues who have related interests; and (3) social feedback systems, which use behavioral data, such as statistical log analyses, to create relationships and evaluative metrics.
One additional dimension of connection via the commons bears mentioning. Libraries can be the conveners that establish a common ground among different players. Collaboration and partnering are essential in the eResearch environment. While some organizations will specialize in building tools and others in building relationships, both are required.
Curation: A Third Key Role
The generation of vast amounts of primary data gives rise to data-curation questions. Data used to be hidden behind office walls, scribbled in notebooks, stored in file cabinets, and recorded on hard drives. Now data are more often “loose” and available to be repurposed and recombined. Caring for these data requires life cycle data management, covering acquisition and integration, treatment, provenance, persistence, and digital preservation.
Over the next five years we will collect more scientific data than we have collected in all of human history. Access and cross-domain usage of distributed collections depend heavily on the application of uniform methods of description when the data are created. Metadata are an essential component of research data. Research libraries can lead the development of standardized, ontologically rich automated metadata for such data sets. Developing and managing metadata are already established tasks in the library community, although current practices will not handle the scale envisioned. The pervasive use of machine-aided semantic annotation, using well-structured metadata, is the only feasible approach for effectively organizing and describing eResearch data.
Standardizing approaches to metadata collection is fundamental, and metadata must be a required part of the eResearch communication process. We should not underestimate the cost and effort that will be required to collect metadata on this scale, nor should we underestimate the cost of redoing the work if it is not done properly. Given the challenges of scale, the potential of socially tagging data (similar to the process of social bookmarking currently used to catalog photos on Flickr) should be robustly explored.
With the rise of the semantic Web, we can forecast the age of distributed personal publication, a new paradigm where individuals and teams publish their own results, rather than relying on conventional centralized databases with their corresponding curatorial staff. Future eResearch will include communication about a variety of dimensions surrounding data, published locally by individuals, institutional or domain repositories, or the next generation of journals, complete with semantically rich metadata.
Research libraries could take responsibility for assisting with curation and preservation of smaller-scale data repositories arising from the work of local or domain-specific research groups. The level of description used with research data is critical to discovering new ways of combining and using data. Research libraries focusing on their core competencies are well positioned to lead this strategic work.
Developing the Supporting Infrastructure
Research libraries will be best served by focusing on their critical core competencies while partnering with other organizational players. It is already clear that public and nonprofit institutions, no matter how large, will be singularly unable to meet the ever-expanding massive-scale data storage needs of eScience projects, such as the Large Synoptic Survey Telescope project, which generates 30 terabytes of data nightly. Whether aggregations of public or academic research communities banding together will be up to the task remains unclear.
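To give a sense of the scale involved, the nightly figure quoted above can be turned into a rough yearly volume. The sketch below is back-of-the-envelope arithmetic only: the 30 terabytes per night comes from the text, while the number of observing nights per year and the decimal conversion (1 petabyte = 1,000 terabytes) are assumptions chosen for illustration, and the function name is hypothetical.

```python
TB_PER_NIGHT = 30  # figure quoted in the text


def yearly_volume_pb(tb_per_night, nights_per_year, tb_per_pb=1000):
    """Raw data volume per year in petabytes (decimal units assumed)."""
    return tb_per_night * nights_per_year / tb_per_pb


# Upper bound: data produced every night of the year (an assumption).
upper_bound = yearly_volume_pb(TB_PER_NIGHT, 365)

# A more cautious, purely illustrative assumption of 300 observing nights.
cautious = yearly_volume_pb(TB_PER_NIGHT, 300)

print(f"365 nights: {upper_bound:.2f} PB per year")
print(f"300 nights: {cautious:.2f} PB per year")
```

Under either assumption the raw volume is on the order of 10 petabytes per year, before any derived data products, replication, or backup copies are counted.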
Compounding the issue of scale are the challenges of providing adequate electrical power to run the necessary storage and server farms. Today, electrical power and cost per megawatt are the limiting factors in expanding large data centers. This is driving the global corporate push toward distributed computing infrastructures. Regardless of the many far-reaching public policy issues inherent in privatizing research data, economies of scale have positioned the private sector as a serious player for cyberinfrastructure support in the United States.
Because digital component performance is continually improving, the scale of information technology (IT) environments is constantly increasing. As a result, IT networks are best managed as a unified “whole.” This “cloud computing,” also known as “fabric,” “application virtualization,” and “datacenter virtualization,” involves linking large pools of systems to provide IT services. This approach allows corporate data centers to operate more like the Internet by enabling computing across a distributed, globally accessible fabric of resources, rather than only on local machines or remote server farms. The private sector, with more investment capital and experience running massive data farms, is aggressively positioning itself for this role. Yahoo, Google, IBM, and Microsoft have announced initiatives to promote new software development methods that will help researchers address the challenges of Internet-scale eScience applications in the future.
Morphing Digital Research Libraries
The coming eResearch tsunami will profoundly affect the role of research libraries today and tomorrow. The scale of change confronting research libraries is unprecedented, and successfully responding will require disruptive thinking and novel solutions.
The dominance of Google’s search services, book digitization program, organization of information in the broader context, and ubiquitous presence constantly challenge research libraries to more finely focus their role in information delivery. In addition, researchers create and use massive data sets, and increasingly rely on interdisciplinary teams, rather than subject-specific colleagues, drawn from numerous institutions around the globe. The grand challenge for research libraries will be to provide data services to researchers in the new era.
A variety of integrated, end-user information resources, all of which ideally should be available in accessible user environments, are missing today. A cursory list includes profiles of scientists and research groups; toolkits for data integration, text and data mining analysis, and validation; registries of instruments and sensors; registries of software toolkits; registries of data sets; and more.
Professionals responsible for managing such data repository collections are beginning to be called data scientists. They could just as well be data librarians or informationists. Regardless of the label, this is an emerging profession; libraries could play a significant role in building teams of professionals ready to assume these roles. Further, eResearch data collections tend to be distributed, requiring coordination across institutions. Research libraries have a long tradition of creatively coordinating resource sharing across multiple institutions. Putting this concept on steroids, they could work in the same vein with distributed data collections.
First on our priority list ought to be formulating new partnerships with data-driven researchers in all fields. Libraries can foster collaboration networks and provide collaboration space (both virtual and physical) where researchers can work, in addition to building institutional data repositories.
New Organizational Structures
New hybrid organizations likely will emerge to tackle questions surrounding long-term custodianship of data repositories. It is premature to predict which organization(s) will succeed at that task. Any number of organizations, including commercial ventures, the grid community, supercomputer centers, research libraries, dedicated research groups, or new organizations we have not yet envisioned, could combine capabilities to ensure success.
Research libraries have traditionally been structured and staffed around disciplines. In contrast, eResearch embraces multidisciplinary approaches. eScience often requires virtual teams to form dynamically in the initial planning phases of a research project, work on a project, and then morph into something else when a less intense presence is needed. This requires fluid staffing structures and a more dynamic structural model than our current practice of assigning departmental or subject liaisons. Such professionals may be well integrated, but are not usually able to dynamically respond to emerging trends with intense needs. The agility required to mobilize support in this environment will require research libraries to work seamlessly across institutional boundaries.
New organizational models should reflect the environments they are attempting to support, recognizing the synergy and interdependence between scholars and information pioneers. To proactively support this environment, librarians must become part of the research process, as full members of the research team. To do this, library staff members need to “go native” and embed themselves among the teams they support. Clearly this will have significant implications for the library’s staffing profile and workforce skill set.
What Research Libraries Can Do Now
At this stage, research libraries should focus on developing the functional requirements of a data-archiving infrastructure, and let the appropriate organizational forms emerge from those requirements. As with any paradigm shift, there are many challenges and opportunities for organizations that have the agility to adapt and move quickly, as well as for new players.
Changes in research libraries must be driven by and reflect the needs of the research communities they seek to support. Researchers will expect the same level of ubiquitous convenience and advanced capabilities from their reconstructed digital libraries as they get from widely available eScience workflows. Our responses will require a shift in focus from delivering products (e.g., reference services or publications) to supporting processes (e.g., team science).
Collaboration, partnerships, and de facto best practices are vital for researchers to exploit heterogeneous sources of data. Many types of organizations, including research centers, libraries, supercomputing centers, archives, and Internet companies, have expertise in some dimension of data-driven scholarship. Such expertise is nearly always incidental to the major expertise of the organization. The challenge facing research libraries is to articulate our role and to extend our unique capabilities into the virtual laboratory environment. Success will require developing a deep anticipatory understanding of what these researchers require to perform their work successfully.
Limited space precludes more than the briefest sketch of other transformation opportunities, among which are the following:
- a transparent system of grid-like libraries and library data services supporting data science and curation;
- formation of eResearch communities that are multidisciplinary and international;
- support for personal information management, as data sets and associated information become increasingly portable; and
- a research agenda and development of sustained information science research capabilities.
Economic Sustainability: A Grand Challenge
Adequate and sustained funding for long-lived data collections and their associated facilities remains a vexing problem, spawning a call for creative approaches both nationally and internationally. Data preservation facilities must be able to support and provide for their collections over the long term. However, the widely decentralized and nonstandard mechanisms for generating data of every type and format imaginable make this problem an order of magnitude more difficult than our experiences to date with archiving and preservation. Infrastructure needs to be funded to enable research, and we need to be prepared to make the point repeatedly that libraries are part of the infrastructure.
Many questions remain to be resolved, such as:
- Who owns the data, especially when they are collaboratively collected?
- Who can access the data, and under what use and export conditions?
- Which research data need to be retained, for how long, and in what format(s)?
- What level of data reliability is required?
- Who pays the costs for curation and preservation, and for what period(s) of time?
In an era of information, software, and systems openness, we control less and less. The cost of owning and managing data, hardware, and software is very high. How do we offset and share multi-institutional infrastructure investments? Because it takes a community to meet these challenges, how many research libraries need to work together to meet specific eResearch needs, and how do we collaborate in new, more effective ways? There are many questions for which we do not have the answers. Research libraries ought to be committed to finding them.
Conclusion
The emergence of eResearch, with its associated large data repositories, heralds not only a new way of doing science but also a challenging new world for libraries, provided that we aggressively seize the opportunities. Traditional library roles (organization, access, and preservation) must be augmented by new capabilities in automatically describing, annotating, and manipulating a wide spectrum of collaborative, data-intensive information resources. Spanning the gamut of capabilities from raw data to informal and formal communications, the ability to discover and track research results remains an essential, although radically different-looking, component of the research infrastructure. A powerful user-centric infrastructure that supports collaborative multidisciplinary science is now required. A grand challenge now faces us: the next generation of research infrastructure requires dynamic data repositories. Are we ready to step up to center stage?
Related Sources
Berman, Francine, and Henry Brady. 2005. Final Report: NSF SBE-CISE Workshop on Cyberinfrastructure and the Social Sciences. Available at vis.sdsc.edu/sbe/reports/SBE-CISE-FINAL.pdf.
National Science Foundation. 2003. Revolutionizing Science and Engineering through Cyberinfrastructure: Report of the National Science Foundation Blue Ribbon Advisory Panel on Cyberinfrastructure. Available at http://www.nsf.gov/od/oci/reports/toc.jsp.
National Science Foundation Cyberinfrastructure Council. 2007. Cyberinfrastructure Vision for 21st Century Discovery. Available at http://www.nsf.gov/od/oci/CI_Vision_March07.pdf.
Welshons, Marlo, ed. 2006. Our Cultural Commonwealth: The Report of the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences. New York, N.Y.: American Council of Learned Societies. Available at http://www.acls.org/programs/Default.aspx?id=644.
FOOTNOTES
1 The 2020 Science Group. 2002. Towards 2020 Science. Redmond, Wash.: Microsoft Corporation. Available at http://research.microsoft.com/towards2020science/background_overview.htm.