
Coders and librarians team up to save scientific data

Sharon Gaudin | March 21, 2017
Volunteers rush to archive data before it disappears from government websites

Margaret Janz, a data curation librarian at the University of Pennsylvania, is on the planning committee for the DataRefuge effort. Credit: Naomi Waltham-Smith

Janz is on the planning committee for DataRefuge, one of the organizations working to archive scientific data that has been sitting on government websites.

DataRefuge, which is a joint project between the Penn Libraries and the Penn Program for Environmental Humanities, was put together in November after the presidential election.

The group, working with the Environmental Data and Governance Initiative, helps organize data rescue events.

DataRefuge has held about 30 data archiving events, each one drawing about 100 attendees, according to Janz. The New Hampshire event, held March 10, was one of the smaller turnouts. The organizers are also working on ways to keep their community engaged for the long haul.

"Deleting data is like burning books," said Matt Jones, a software developer at Massachusetts-based Yieldbot who was archiving data at the New Hampshire event. "I'm passionate about data and information.... I don't believe in throwing anything out. All data is relevant to somebody."

Volunteers with DataRefuge don't hack into sites or steal data. They are working to make copies of data that is in the public domain.

The volunteers receive training and then work during the events, sometimes continuing the effort at home.

Part of the work being done is called seeding, in which participants nominate URLs to be stored in the Internet Archive, a San Francisco-based nonprofit, public digital library. If the archive's web crawler can extract the necessary data from a nominated page, it archives the page automatically.
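In practice, nominating a page to the Internet Archive can be as simple as submitting it to the archive's Save Page Now endpoint (`https://web.archive.org/save/<url>`). The sketch below illustrates that idea; the validation logic and function name are illustrative assumptions, not DataRefuge's actual tooling.

```python
# Minimal sketch of "seeding": building a Save Page Now request URL for a
# nominated page. The endpoint is the Internet Archive's public capture URL;
# the validation here is a hypothetical example, not DataRefuge's real script.
from urllib.parse import urlparse

SAVE_ENDPOINT = "https://web.archive.org/save/"

def nominate_url(url: str) -> str:
    """Check that a nominated URL is crawlable and return the capture URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a crawlable URL: {url!r}")
    return SAVE_ENDPOINT + url

# Example: nominating a (hypothetical) government data page.
print(nominate_url("https://www.epa.gov/ozone-layer-protection"))
```

A real seeding workflow would then issue an HTTP GET to the returned URL (for example with `urllib.request.urlopen`), which asks the archive to capture the page.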

Matt Jones, a software developer, archives data at a DataRefuge event in Dover, N.H. Credit: Sharon Gaudin

If a page is too complicated for the web crawler to handle (say, it has 100 different files or is highly interactive), the seeders will note that and volunteers will get to work "harvesting" the information.

Using scripts and tools built in the Python or R programming languages, the harvesters work through those pages manually, collecting the data sets they need to save, such as weather maps or GIS files.
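A harvesting script of this kind often starts by scanning a page's HTML for links to downloadable data files. The stdlib-only sketch below shows one way to do that; the file extensions and the hardcoded HTML snippet are assumptions chosen for illustration, not DataRefuge's actual harvest code.

```python
# Hedged sketch of "harvesting": collecting links to data files from a page
# the crawler can't handle on its own. The extension list is illustrative.
from html.parser import HTMLParser

DATA_EXTENSIONS = (".csv", ".zip", ".json", ".shp", ".nc")

class DataLinkFinder(HTMLParser):
    """Collect href attributes that point at downloadable data files."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(DATA_EXTENSIONS):
                self.links.append(value)

# In practice the HTML would come from urllib.request.urlopen(page_url);
# a hardcoded snippet keeps the example self-contained.
page = """
<html><body>
  <a href="/data/temperature_2016.csv">Temperature data</a>
  <a href="/maps/region.shp">GIS shapefile</a>
  <a href="/about.html">About this site</a>
</body></html>
"""

finder = DataLinkFinder()
finder.feed(page)
print(finder.links)  # the two data-file links, without the HTML page
```

Each collected link would then be downloaded and checksummed so the copy can be verified against the original.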

At the New Hampshire event, volunteers were divided into two groups – one using Python and one using R. Then they got to work harvesting from complicated pages.

Event organizers couldn't say how much data was harvested at that event, but at an earlier DataRescue event that was held at the University of New Hampshire in February, about 40 people volunteering one night were able to seed about 1,100 pages that could be harvested by the web crawler.


