By Mark Thompson-Kolar
Domain Repository Activities
Beyond functioning simply as storehouses for digital data, repositories also advance scientific research by:
- Advocating for research transparency, data access, and data sharing.
- Collaborating with scientific disciplines to achieve interoperability across research communities.
- Facilitating data discovery and reuse through application of discipline-specific metadata.
- Innovating to develop systems that facilitate future archiving.
- Managing data to maintain its usability now and in the future.
- Providing access while ensuring critical protections related to confidentiality and intellectual property.
Source: Conference on “Sustaining Domain Repositories for Digital Data,” June 24-25, 2013 (funded by the Alfred P. Sloan Foundation).
Recent mandates to provide open access to research data funded with federal dollars have resulted in a new emphasis on the importance of data sharing. Domain repositories — data archives with close ties to specific scientific communities — play a vital role in facilitating data sharing and in ensuring that information is properly preserved for future research. However, in most disciplines no dedicated funding is available to support domain repositories in this critical data stewardship mission.
Agencies like the National Science Foundation provide short-term grants to a small number of repositories, but this type of funding mechanism is a poor fit for the repositories' long-term role. As a result, repositories face an uncertain future in the United States.
“There is a mismatch between our mission and the way we are funded,” said George Alter, director of the Inter-university Consortium for Political and Social Research, the largest domain repository for social and behavioral sciences research data. “One of our missions is to ensure data will be available for a long time, yet we're being funded by short-term grants.” ICPSR is a unit within the University of Michigan's Institute for Social Research.
Alter's concern is shared by about two dozen directors of domain repositories across the natural and social sciences who gathered at a 2013 conference supported by the Alfred P. Sloan Foundation to discuss their funding challenges and devise solutions.
“I was dubious how much we would have in common with people who manage repositories in diverse areas such as sociology and religion and biochemistry,” said attendee Robert Hanisch, senior scientist at the Space Telescope Science Institute and director of the U.S. Virtual Astronomical Observatory. “But the more we talked to each other, the more apparent it became that we were all dealing with exactly the same problems.”
Sometimes inaccurately understood as “just” storage places for scientific data, domain repositories advance research by enhancing the value of data through careful curation. A key role of domain-specific data curators is developing precise metadata — context-rich, domain-relevant descriptions of data — and applying them consistently to research data files to ensure researchers, librarians, and other archivists can easily find and understand the files.
Curation also keeps data usable for future researchers by migrating files to newer formats. This very detailed curatorial work is costly, as it requires highly trained staff using specialized software and computers. (See sidebar story for additional activities of domain repositories.)
“We curate and distribute data; we have to know a lot about the science, and we have to know a lot about technology,” said Helen Berman, director of the RCSB Protein Data Bank (PDB) at Rutgers University, a domain repository for biological science data. “We have to know a lot to serve a scientific community.”
Researchers download PDB's molecular data between 350 million and 400 million times annually, she said, underscoring the importance of the repository's curated files to the scientific domain.
Sharing research data stimulates additional science, making the domain repositories important partners in the scientific enterprise. A 2010 study, “The Enduring Value of Social Science Research,” by ICPSR researchers Amy Pienta, Jared Lyle, and Alter analyzed publication metrics tied to research data collected with NSF and National Institutes of Health funding. It found that data sharing increases scientific productivity: about twice as many scientific publications resulted when data were shared. It also found that data archiving yields the greatest return on investment, with research productivity (as measured by number of publications) higher when data are archived.
Similarly, a 2011 article in Nature, “Data archiving is a good investment,” by Heather Piwowar, Todd Vision, and Michael Whitlock, indicates that “ongoing financial investment in data-archiving infrastructure yields an impressive scientific return.”
In a specific example, about 65 percent of the peer-reviewed publications based on Hubble Space Telescope data are by scientists doing their research with curated data from the Mikulski Archive for Space Telescopes (MAST), said Hanisch of the Space Telescope Science Institute. “This is reuse, repurposing of data at a tremendous scale,” he said.
Clearly, careful investment in domain repositories provides an efficient and cost-effective way to utilize limited public funds to maximize scientific productivity. However, the funding process fails to reflect this benefit.
Repositories utilize a variety of funding models. Most common are short-term federal or private grants, and fee-based memberships. Others include deposit fees from researchers or sponsors, and institutional support from universities. None of these models directly provides for long-term storage of and access to research data.
Ruth Duerr is project lead for data stewardship at the National Snow and Ice Data Center in Boulder, Colo. A major problem repositories such as hers face is that securing data-management funding is “hard and getting harder,” she said. “For example, with NSF funding, we're competing against research projects. You always have to be showing that you're doing something new and interesting, but ‘new’ and ‘interesting’ and ‘data management’ are not necessarily compatible concepts.”
Carol Ember, president of the Human Relations Area Files at Yale University, a cultural anthropology data repository, said she is concerned about risks to data over time. “Particularly worrisome to me is funding to continue to have this data be part of the research record for the indefinite future. You never get a grant to do your research indefinitely; you get it for three or five years. But where's the funding going to come from for the future, because people don't realize it costs money to handle the data — to migrate it to new formats and have it be there indefinitely.
“People think about keeping something in Dropbox,” she said. “It's there, but it's not. It's floating out in cyberspace.”
Duerr expressed related concerns. “Digital data are not that stable. Formats go in and out of fashion. Media go in and out of fashion. So unless you have very active programs to keep moving data forward, data will become unusable.”
Additionally, the audiences who need archived data files can change over time, she said. For example, archived sea ice research that originated within the cryospheric passive microwave remote-sensing community is now requested by polar bear biologists — or even journalists — who don't understand the concepts behind how the data were gathered. To make the data meaningful to the new audiences, her repository needed to create support materials that were more explanatory.
Complicating the situation for domain repositories, funders, and researchers is a new emphasis on open access to federally funded research data, related to the 2013 U.S. Office of Science and Technology Policy memorandum calling for large federal agencies to create plans for public access to the results of the research they fund.
“The movement toward open access is a good thing in that it creates more equal access for the user community, but it also creates more of a burden for domain repositories because their funding avenues are narrowed,” Alter said.
For repositories that are membership organizations, dues are a way of covering the costs of curation and preservation. But the membership model, said Ember, “is almost incompatible with open access because the members are the ones supporting it, and therefore, they want access for themselves.
“Open access is an enormous, great thing for scholars or institutions that are not funded well,” she added, but then the burden shifts, most often to authors or donors, who may have to pay the costs of publishing and archiving themselves. “The message is, ‘Somebody's got to pay,’ ” she said.
One solution endorsed by repository managers would be for government agencies to fund data repositories directly as research infrastructure. Under this model, a percentage of federal research funding would be set aside for data archiving and preservation in all disciplines.
“The infrastructure model basically says, ‘The U.S. funds research,’ ” Ember said. “If the government provided the money to support infrastructure for preservation, it wouldn't disadvantage people who didn't have enough, or who are at a poorly funded institution, or did their research 20 years ago and no longer have support.”
Hanisch, at the Space Telescope Science Institute, commended NASA's approach. “NASA has provided a certain level of stable funding for many years for the data centers in astronomy and astrophysics. What we suggest is that if the NSF could put aside a few percent of the overall budget for grants to assure there is an infrastructure in place that researchers can contribute to, that would be the ideal thing. It need not require NSF to build their own infrastructure; they could partner with organizations that are already doing this.”
William Michener, principal investigator at DataONE, said, “The argument would be that data repositories are key research infrastructure that we need to support if we, in fact, want to support good research and store data for the long term.” DataONE is a federation that provides services for several environmental science data archives.
“We're sticking our heads in the sand if we don't recognize that this is a major challenge—one that we have to solve,” he added. “I would love to see a National Academy or National Research Council-level study of the problem. Bring in all the stakeholders: funders, scientists, institutional representatives, and others. Really dig deep and address this challenge.”
Mark Thompson-Kolar is Senior Editor of the Inter-university Consortium for Political and Social Research. He can be reached at 734-615-7904 or firstname.lastname@example.org.