Stop Hosting Data and Code on your Lab Website

January 8, 2013
By

(This article was originally published at Getting Genetics Done, and syndicated at StatsBlogs.)

It's happened to all of us. You read about a new tool, database, webservice, software, or some interesting and useful data, but when you browse to http://instititution.edu/~home/professorX/lab/data, there's no trace of what you were looking for.

THE PROBLEM

This isn't an uncommon problem. See the following two articles:
Schultheiss, Sebastian J., et al. "Persistence and availability of web services in computational biology." PLoS one 6.9 (2011): e24914. 
Wren, Jonathan D. "404 not found: the stability and persistence of URLs published in MEDLINE." Bioinformatics 20.5 (2004): 668-672.
The first gives us some alarming statistics. In a survey of nearly 1000 web services published in the Nucleic Acids Web Server Issue between 2003 and 2009:
  • Only 72% were still available at the published address.
  • The authors could not test the functionality for 33% because there was no example data, and 13% no longer worked as expected.
  • The authors could only confirm positive functionality for 45%.
  • Only 274 of the 872 corresponding authors answered an email.
  • Of these 78% said a service was developed by a student or temporary researcher, and many had no plan for maintenance after the researcher had moved on to a permanent position.
The Wren et al. paper found that of 1630 URLs identified in Pubmed abstracts, only 63% were consistently available. That rate was far worse for anonymous login FTP sites (33%).

OpenHelix recently started this thread on Biostar as an obituary section for bioinformatics tools and resources that have vanished.

It's a fact that most of us academics move around a fair amount. Often we may not deem a tool we developed or data we collected and released to be worth transporting and maintaining. After some grace period, the resource disappears without a trace. 

SOFTWARE

I won't spend much time here because most readers here are probably aware of source code repositories for hosting software projects. Unless you're not releasing the source code to your software (aside: starting an open-source project is a way to stake a claim in a field, not a real risk for getting yourself scooped), I can think of no benefit for hosting your code on your lab website when there are plenty of better alternatives available, such as Sourceforge, GitHub, Google Code, and others. In addition to free project hosting, tools like these provide version control, wikis, bug trackers, mailing lists and other services to enable transparent and open development with the end result of a better product and higher visibility. For more tips on open scientific software development, see this short editorial in PLoS Comp Bio:

Prlić A, Procter JB (2012) Ten Simple Rules for the Open Development of Scientific Software. PLoS Comput Biol 8(12): e1002802. 

Casey Bergman recently analyzed where bioinformaticians are hosting their code, where he finds that the growth rate of Github is outpacing both Google Code and Sourceforge. Indeed, Github hosts more repositories than there are articles in Wikipedia, and has an excellent tutorial and interactive learning modules to help you learn how to use it. However, Bergman also points out how easy it is to delete a repository from Github and Google Code, where repositories are published by individuals who hold the keys to preservation (as opposed to Sourceforge, where it is extremely difficult to remove a project once it's been released).

DATA, FIGURES, SLIDES, WEB SERVICES, OR ANYTHING ELSE

For everything else there's Figshare. Figshare lets you host and publicly share unlimited data (or store data privately up to 1GB). The name suggests a site for sharing figures, but Figshare allows you to permanently store and share any research object. That can be figures, slides, negative results, videos, datasets, or anything else. If you're running a database server or web service, you can package up the source code on one of the repositories mentioned above, and upload to Figshare a virtual machine image of the server running it, so that the service will be available to users long after you've lost the time, interest, or money to maintain it.

Research outputs stored at Figshare are archived in the CLOCKSS geographically and geopolitically distributed network of redundant archive nodes, located at 12 major research libraries around the world. This means that content will remain available indefinitely for everyone after a "trigger event," and ensures this work will be maximally accessible and useful over time. Figshare is hosted using Amazon Web Services to ensure the highest level of security and stability for research data. 

Upon uploading your data to Figshare, your data becomes discoverable, searchable, shareable, and instantly citable with its own DOI, allowing you to instantly take credit for the products of your research. 

To show you how easy this is, I recently uploaded a list of "consensus" genes generated by Will Bush where Ensembl refers to an Entrez-gene with the same coordinates, and that Entrez-gene entry refers back to the same Ensembl gene (discussed in more detail in this previous post).

Create an account, and hit the big upload link. You'll be given a screen to drag and drop anything you'd like here (there's also a desktop uploader for larger files).



Once I dropped in the data I downloaded from Vanderbilt's website linked from the original blog post, I enter some optional metadata, a description, a link back to the original post:



I then instantly receive a citeable DOI where the data is stored permanently, regardless of Will's future at Vanderbilt:

Ensembl/Entrez hg19/GRCh37 Consensus Genes. Stephen Turner. figshare. Retrieved 21:31, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.103113

There are also links to the side that allow you to export that citation directly to your reference manager of choice.

Finally, as an experiment, I also uploaded this entire blog post to Figshare, which is now citeable and permanently archived at Figshare:

Stop Hosting Data and Code on your Lab Website. Stephen Turner. figshare. Retrieved 22:51, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.105125.



Please comment on the article here: Getting Genetics Done

Tags: , , ,


Subscribe

Email:

  Subscribe