Number 88 • July/August 2012
ISSN 1944-7639 (online version)
Giving Language to the Data Curation Challenge: Interview with Lynn Yarmey
Not Just Another Aggregator: The Challenge Facing the Digital Public Library of America
Hidden Collections Registry is Live!
Register Now for the DuraSpace/ARL/DLF E-Science Institute
AIC Collections Care Survey Closes Aug. 30
Call for Editors—Databib
CLIR Issues is produced in electronic format only. To receive the newsletter electronically, please sign up at https://www.clir.org/pubs/issues/signup.html. Content is not copyrighted and can be freely distributed.
Follow us on Twitter @CLIRNews
Giving Language to the Data Curation Challenge:
Interview with Lynn Yarmey
Lynn Yarmey is lead data curator for the Advanced Cooperative Arctic Data and Information Services (ACADIS) project at the National Snow and Ice Data Center (NSIDC) in Boulder, Colorado. A 2010 recipient of the A. R. Zipf Fellowship in Information Management that CLIR manages, she spoke with CLIR Issues about her work and the challenges of data curation.
CI: You have a geophysics degree and an MLIS. Tell us why you felt the need to pursue a library degree given your science background and how your current job straddles both realms.
I was working at the Scripps Institution of Oceanography as a programmer analyst when I met Karen Baker, an information manager and informatics researcher with the Long-term Ecological Research (LTER) Program. She gave me a language for all the invisible work I had been doing but had no way of expressing. As a data analyst, I had experienced the difficulty in writing informally shared code, dealing with standardization, documenting “tribal knowledge,” etc. Library school provided a natural fit in terms of introducing me to the information science field and community. As a result of my oceanographic experience, I already had a sense of the data landscape going into library school, so it was much more about formalizing that knowledge base. Library school really helped me translate my professional practices into the context and language of LIS research.
My data curator position is new within NSIDC and in fact represents a new model, emphasizing generalization rather than specialization. I am funded by the NSF’s Advanced Cooperative Arctic Data and Information Services (ACADIS) project, in which NSIDC has partnered with the National Center for Atmospheric Research’s (NCAR) Earth Observing Laboratory (EOL), NCAR’s Computational and Information Systems Laboratory (CISL), and Unidata. I do a little of everything; my position includes a good chunk of project management in addition to facilitating communication among all four partners, developing and helping the research community implement practices upstream that will improve data standardization and reusability, and advertising data from other Arctic communities through development of a metadata broker, among other responsibilities. I am leading an ACADIS metadata subgroup that is working across our partner organizations on issues such as data citation metadata and minimum metadata sharing within the project.
CI: With respect to data curation, you’ve said that libraries have at least accurately articulated the problems and worked to define the realm but have not been ambitious enough in staking claim to the realm of data curation. What did you mean by that? How do you think libraries can stake their claim?
I had just started graduate school when I made this comment, so I was a bit naïve at the time. But Chuck Henry’s blog post (“One Culture,” June 14, 2012) about facilitating collaboration really resonated with me. He talks about the paradox of “silos of intellectual achievement… segregated within the most robust, interconnected, and flexible system of information distribution in our history” and the revolutionary research that can be done if and when we bridge that gap. I see data as living within this paradigm, embodying the challenges and potential that exist in higher education. I believe data curation requires a holistic and collaborative sensibility.
A comprehensive data program would bring library and archives expertise on metadata, information organization and discovery, collection development, and preservation together with research, curricula, contracts and grants, visualization labs, technical support, technical storage, intellectual property communities, and others. We need to engage multiple perspectives to create this program; data work cannot be done sustainably in isolation. Each element requires the interconnectedness and interdependence that Chuck discusses in his post, and some of these relationships may need to be built from scratch. Add in the interinstitutional, national, and international aspects of modern research, and the complexities of meeting data curation needs multiply. However, given the existing connections with faculty, campus departments, consortia, and national governance models, libraries are ideally placed to facilitate data work.
Many projects are already expanding their definition of data work to include a set of collaborative services. For example, Purdue is leading the CLIR-funded Data Information Literacy project through which data and subject librarians are working with campus scientists to address educational needs. Librarians across the country are partnering with researchers to assist with data management plan reviews, many with the blessing of campus contracts and grants officers. My project, ACADIS, is creating field templates in coordination with the International Permafrost Association to help permafrost researchers capture contextual metadata.
CI: How can we engender more collaboration and what kind of nationwide conversation would facilitate that?
There are opportunities for leadership in various quarters to engage with data issues at a structural level. We need clear and open communication among technologists, researchers, librarians, and data curators. In many cases, this is an organizational and political task. But at a functional level, data can and should be part of library conversations across the board, from collection development, metadata and cataloging, and information literacy through repository development and archiving. Nobody needs to change one hundred percent, but if everyone shifts a little bit and comes to the table with an open mind, we’ll get there.
Data curation is not just a library issue. We have so much expertise spread out across libraries, data centers, archives, and individual labs, but we don’t necessarily have a forum to share experiences and knowledge. I think we are at a time, or approaching the time, where the notion of data curation organizations as a federation is gaining traction. People are realizing that there isn’t a single data solution, and there are pockets of expertise focusing on various aspects of the data lifecycle.
Researchers, data curators, and technologists, inside and outside libraries, all need to be brought to a national table. Within each of those categories, we could talk about cross-scale, cross-lifecycle, and cross-community representation, although I am not sure that we as a data community are at a point where that level of detail would be worthwhile. However, the structure, placement, and leadership of that community would make for some very interesting discussions. My hope for such a group would be the beginnings of a federated web of data curation repositories, services, and infrastructures. What can data centers provide that libraries can’t, and vice versa? Where do local-scale data management efforts fit? What do we actually need from a metadata perspective, and who is best positioned to assist in each part of the metadata creation process? Where do top-down standards structures meet bottom-up needs? I think the data discussion has matured to the point where coordination across efforts is not only viable but also really important.
CI: What have we learned about data curation in the last two years and what lessons are there for libraries?
The most significant change is that we’re no longer looking at data as a purely technical problem with technical solutions. Data curation is being recognized for its multi-faceted role. This is fantastic, but we have to keep going in that direction. At the end of the day, we’re all data managers and data curators. We shouldn’t see data curation as a completely unique, external set of skills because librarians and scientists share these skills too. It’s more a matter of recognizing the commonalities, being able to articulate them, and starting the discussion there. This isn’t a new role for librarians; everyone has something to contribute and should go into data curation discussions with confidence.
Not Just Another Aggregator: The Challenge Facing the Digital Public Library of America
By Rachel Frick
How can the Digital Public Library of America (DPLA) be different? How can it leverage the research and development done by the digital library community and the broader computer science and network communities to serve the greater good? These questions are guiding many of the conversations about the Content and Scope workstream and the technical development work for the DPLA.
The DPLA is slated to launch its first build in 2013. Much of this year’s focus has been on the DPLAtform: a set of services to gather metadata about content and collections made accessible through the DPLA, and to enable developers to use the metadata to build new applications and integrate the metadata into existing sites and services.
The first build has the following goals:
- provide a technology platform that supports the DPLA front-end;
- build a front-end that demonstrates the potential of the DPLA platform; and
- provide an application programming interface (API) for open access to metadata in the repository.
The RFP for front-end development was released on August 6. According to the release, this first iteration of the DPLA is to be “a gesture toward the possibilities for a future, fully built out DPLA.”
Managing expectations is a key challenge of working on a national-scale digital library. It will take time and many iterations. It will involve talented individuals and require the support of a wide variety of stakeholders. Balancing the long-term vision—and seemingly limitless possibilities—with what can be done today, while managing a diverse set of expectations, will require herculean effort. But the potential for realizing the DPLA makes the effort worthwhile.
Content and Scope Workstream
The Content and Scope workstream is starting with what is readily available: collections that are already digitized, in the public domain, and free of copyright restrictions. The first step is to collect the metadata representing digital objects in these collections. There are many examples of successful cultural heritage aggregations, including the Mountain West Digital Library, Calisphere, the Minnesota Digital Library, Kentuckiana, the Georgia Digital Library, the IMLS DCC, Europeana, and the now-defunct DLF Aquifer project.
One criticism of offering a national aggregation of metadata is that it is not enough—that there isn’t enough “new” and that it is, on the surface, a very basic service we have seen before. I challenge the notion, as we have never attempted aggregation at a national scale. If we view the idea of providing a national aggregation as we know it today and proceed as we have in the past, without rethinking how it is done, we will waste effort and lose an opportunity. We need to proceed thoughtfully and intentionally, identifying areas to improve and how to transform the way the service is built and presented, with an eye to how content will be expanded and the platform hacked and forked. There are numerous “big wins” and areas for potential transformational change. Building a cultural heritage data-store at a national level affords a leverage and range of possibilities not present at a local or even regional level.
The concept of “big wins” kicked off the second DPLA Content and Scope workshop held August 6 in San Diego. We started the day by resolving that the initial collection of the DPLA would target already-digitized cultural heritage content. The idea was that DPLA could serve as a “datastore” of metadata representing digitized primary resources that were not part of existing large-scale book digitization projects. If this is our primary collection development objective, how do we as a community advance digital library development and solve some of the bigger challenges that we could not resolve even when we tried to aggregate content on a local scale? What are the achievable big wins when we look at aggregating through a distributed network, on a national scale?
Content and Scope: Looking Forward
Content and Scope workshop meeting participants identified the following as offering potential transformative change:
A means to enable the discovery and creation of emergent collections, and dynamic collection building by individual end-users as well as traditional memory institutions. Emergent collections are created virtually by combining information from a wide variety of collecting institutions, and ultimately, individuals, creating new collections that are not physically held by one institution.
Creating tools and applications to facilitate local and individual collection building, as envisioned by the DPLA use case of Joanie Utter. The DPLA could create collections in response to current events, historical event anniversaries, or topical events. A good example is the joint immigration/emigration exhibit with Europeana. Emergent collections could also engage local communities in conversation about themselves and their local history and how they and their communities connect to others throughout America.
An agnostic framework that can handle any type of metadata. There is a tendency to work with what we know (e.g., OAI-PMH/DC). The DPLA presents an opportunity to go beyond our comfort zone and to forge something new. Current trends in this area are the NISO-sponsored ResourceSync and entrepreneurial efforts of UVA and Princeton that are “atomizing” metadata records into their basic relationship counterparts, “remixing” the data, and representing it on the fly depending on the service that “calls” the data. Is it possible to break the metadata record, atomize the record’s elements, and store only the relationships represented in the record on a national scale? This can be one of many approaches.
The ability to provide geographical, thematic, and time (date) points of reference and navigation through enriched metadata. Location, date, subject or theme, and event information can be used to help users navigate through a large pool of data, to provide context to individual items, and to build other collection-based services. See the UIUC/DLF Beta Sprint entry for examples of this type of navigation. This requires metadata enrichment at the point of ingestion, and a front-end service that can interpret the data into a navigational interface or service.
The potential to transform DPLA aggregated data to linked data, serving as a linked open data datastore that in turn can enrich other digital resources and collections. Europeana provides this service, enriching many resources, such as Wikipedia, as a result.
CC0 for metadata: It is the only way we can provide remix, reuse, and/or semantic services, like a data store of linked open data. It is our special collections, our unique materials, that will provide the greatest impact in a national digital library.
Tiered discovery. Not everything has to go in the big DPLA bucket and be indexed locally, but it should be presented in some unifying way through the main DPLA search function, as well as through data harvesting/remixing services. Again, the UIUC/DLF beta sprint offers an example of this type of discovery approach.
The highest tier could be locally indexed within DPLA, with fully parsed and contextualized data. Initially this is seen as data harvested or pushed via Atom to a central cultural heritage DPLA datastore. Several DPLA data stores might exist, such as a scholarly communications data store (data from university and college institutional repositories, etc.). A second tier might involve using a partner’s API. The example used was HathiTrust. A third tier might be matching to an existing collection where item-level metadata isn’t fully available and only collection-level metadata can be found, exposing the longest tail.
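The three tiers described above can be read as a simple routing rule. The following Python sketch is only an illustration under stated assumptions, not a DPLA design: the `Source` class, its fields, the `discovery_tier` and `plan_search` functions, the source names, and the example URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Source:
    """A contributing collection (hypothetical model, not a DPLA schema)."""
    name: str
    # True when item-level records have been harvested or pushed
    # into a central datastore and can be indexed locally.
    item_metadata_harvested: bool = False
    # Set when the partner exposes its own search API instead.
    partner_api_url: Optional[str] = None


def discovery_tier(source: Source) -> int:
    """Assign a source to one of the three tiers described in the text."""
    if source.item_metadata_harvested:
        return 1  # locally indexed, fully parsed item metadata
    if source.partner_api_url is not None:
        return 2  # queried through the partner's API
    return 3      # only collection-level description is available


def plan_search(sources: List[Source]) -> Dict[int, List[str]]:
    """Group source names by tier so a search front end can treat each tier differently."""
    plan: Dict[int, List[str]] = {1: [], 2: [], 3: []}
    for source in sources:
        plan[discovery_tier(source)].append(source.name)
    return plan


# Hypothetical sources, one per tier.
sources = [
    Source("Regional heritage hub", item_metadata_harvested=True),
    Source("Partner book repository", partner_api_url="https://partner.example/api"),
    Source("Unprocessed archival collection"),
]
print(plan_search(sources))
```

A unifying search function could then query the local index for tier one, forward the query to each partner API for tier two, and match against collection-level descriptions for tier three, merging the results for presentation.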
Producing a national-scale metadata data-store representing our country’s cultural heritage collections provides a way to push the limits, test new waters, and be something greater than “just” another aggregation. For updates and more detailed technical information, follow the DPLA development wiki at http://dp.la/wiki/Dev_portal.
Hidden Collections Registry is Live!
CLIR has launched a registry of information that Cataloging Hidden Special Collections and Archives staff have accumulated about unprocessed and recently processed library, archival, and museum collections. The registry’s 376 records reflect great diversity in collection holdings. Not only are hidden books, images, manuscripts, and artworks nominated for cataloging, but increasing numbers of audio and audiovisual formats, maps, architectural drawings, artifacts, and items of ephemera are also brought to the attention of reviewers each year.
The registry enables searching by subject, keyword, format, institution type, and year added to registry. We hope that increased awareness of these materials will help cultural heritage institutions attract partners, volunteers, or funding to support better preservation of their collections. In addition, we hope scholars and their students will use the registry to find previously unused and underused materials and promising new avenues for research. As we continue to add to our records, we will be working to optimize the registry’s usability and functionality, as well as to provide ways for contributors to correct or add to the information there.
The Cataloging Hidden Special Collections and Archives program was launched in 2008 with funding from The Andrew W. Mellon Foundation. Since then, the Foundation has invested nearly $16 million in revealing previously hidden collections of high scholarly value.
Recent Releases from CLIR
The Problem of Data, by Lori Jahnke and Andrew Asher; Spencer D. C. Keralis, with an introduction by Charles Henry (August 2012)
The Problem of Data examines data management and curation practices among university researchers and the current state of data curation education. It finds that few researchers and scholars are prepared to deal with the growing challenge.
“The massive scale of data creation and accumulation, together with increasing dependence on data in research and scholarship, are profoundly changing the nature of knowledge, discovery, organization, and reuse,” writes CLIR President Chuck Henry in his introduction. Yet we are responding with considerable difficulty to “what may be the most complex and urgent contemporary challenge for research and scholarship.”
In part one of the report, Lori Jahnke and Andrew Asher examine data curation practices among scholars at five institutions of higher education. Jahnke, anthropology librarian at Emory University and former CLIR postdoctoral fellow, and Asher, digital initiatives coordinator and scholarly communications officer at Bucknell University, conducted ethnographic interviews of graduate students, faculty, and researchers in a range of social science disciplines. Among their key findings:
- None had received formal training in data management practices, nor did they express satisfaction with their level of expertise
- Few researchers think about long-term preservation of their data
- The demands of publication output overwhelm long-term considerations of data curation
- There is a great need for more effective collaboration tools, as well as online spaces that support the volume of data generated and provide appropriate privacy and access controls
- Few researchers are aware of the data services that the library might be able to provide.
In part two of the report, Spencer D. C. Keralis, director of the Digital Scholarship Co-Operative at the University of North Texas and former CLIR postdoctoral fellow, provides a snapshot of the current state of data curation education. He finds that while LIS and iSchool programs are making efforts to develop data curation curricula, “much work still needs to be done to prepare LIS graduates for roles as data professionals in and out of libraries.” He adds that “the LIS world largely remains a closed circuit, providing concentrations within tracks restricted to LIS enrollees.” Keralis notes that the trend in emerging curriculum development programs is to open up this closed circuit and allow post-baccalaureate students and professionals to take courses in data curation.
The report is available at https://www.clir.org/pubs/reports/pub154.
Core Infrastructure Considerations for Large Digital Libraries, by Geneva Henry (July 2012)
The study examines basic functional aspects of large digital libraries and draws on examples of existing digital libraries to illustrate their varying approaches to storage and content delivery, metadata approaches and harvesting, search and discovery, services and applications, and system sustainability.
“The decision to establish a large digital library leads necessarily to a complex set of considerations,” writes report author Geneva Henry. “Decisions in one area will affect decisions in other areas.” Henry, executive director of digital scholarship services at Rice University’s Fondren Library, wrote the report as part of a grant to CLIR from The Andrew W. Mellon Foundation to develop a prototype for the Digital Public Library of America (DPLA).
The author stresses that scalability is of fundamental importance to enable long-term growth of the system, and she recommends a modular system, following SOA principles, to enable flexibility, code reusability, and stronger system sustainability. She also underscores the importance of understanding the target audience and its needs when interacting with the digital library. As implementation begins, she recommends establishing a sandbox environment to experiment with differing technologies and architectures. Finally, it is important to decide on a realistic sustainability plan and publish the policies and guidelines that will help enforce the plan.
Core Infrastructure Considerations for Large Digital Libraries is available at https://www.clir.org/pubs/reports/pub153.
Register Now for the DuraSpace/ARL/DLF E-Science Institute
A few spaces are still available for the E-Science Institute, which will run from September 6, 2012 through December 13, 2012.
The E-Science Institute is designed to help academic and research libraries develop a strategic agenda for e-research support, with a particular focus on the sciences. The Institute consists of a series of interactive modules that take small teams of individuals from academic institutions through a dynamic learning process to strengthen and advance their strategy for supporting computational scientific research. The coursework begins with a series of exercises for teams to complete at their institutions, and culminates with an in-person workshop. Local institution assignments help staff establish a high-level understanding of research support background needs and issues.
For more information on the Institute, or to register, visit http://duraspace.org/esi-logistics.
AIC Collections Care Survey Closes Aug. 30
The American Institute for Conservation (AIC) invites collections managers, preparators, and other collection specialists in preservation to participate in a survey to help identify the opportunities and challenges facing collections care professionals today. The survey, available at http://www.surveymonkey.com/s/collectionscaresurvey, closes Aug. 30.
Call for Editors—Databib
Databib is a tool for helping people identify and locate online repositories of research data. More than 200 data repositories have been cataloged in Databib, with more being added every week. Users and bibliographers create and curate records that describe data repositories that users can browse and search.
Nominations for an editorial board are being solicited to ensure the coverage and accuracy of Databib. Editors ideally will have expertise in a specific research domain or knowledge of research data repositories in a particular geographic region as well as experience with descriptive metadata. The primary role of an editor is to review, edit, and approve submissions to Databib and contribute to the enhancement of the metadata and functionality of Databib for a voluntary, three-year term. The editorial board will meet (virtually) at least twice a year and will correspond as needed by email.
Please send nominations or questions to email@example.com, or visit http://databib.org/about.php for more information.