Data Data Everywhere, and not a drop to drink
This is a blog entry I've been meaning to write for a while.

I am the guilty party who manages this messy desk. I've been working on this pile for the better part of a year. If you stopped by and asked me for something, I could easily root through one of the piles and locate it for you. While the model works, it is not scalable, nor is it sustainable over time. At some point the fire inspector is going to show up.
How many of us approach our long term data storage in a similar manner? In an earlier blog I touched on the subject of long term data management. While NSF currently does not have a specific policy regarding data retention, there exists a Council on Governmental Relations policy on access to and retention of research data. NSF will be requiring a data plan and that part of the research grant be set aside for data retention. While researching information for this blog I came across a good example of such a plan, the Community Climate System Model (CCSM) Data Management Plan. I expect this will be typical for large grants.
We can buy oodles and oodles of disk and tape but without a clear and concise strategy, data management and access will look like my desk, frustrating not only ourselves but all those who have a rightful access to the data.
So while I have been procrastinating on writing about this subject, Joe at scalability.org wrote an excellent piece on long term data storage. In his blog, Joe picks up on a post from a few days ago by Robin Harris at StorageMojo about Microsoft eliminating "easy" access to old file formats.
Joe makes some excellent points on the long term viability of stored data. What good is storing it if we don't have the means to access it in the future? As Robin points out, we can't trust closed source solutions. If you have only a couple of dozen files maybe you can remember to resave them in the new format every few years, but this is an administrative nightmare we do not need.
Not to pick on Microsoft; other vendors and our own practices are just as guilty. From a personal perspective, nearly 20 years ago (gasp, has it been that long??) I wrote my master's thesis using a then new-fangled text processor called Sprint from Borland. Using Sprint made sense at the time. Today, however, I can no longer access any of the text data. Don't even ask me to open the original data files or even try to open the plot files. I still have the old 8088 machine that I used to create my text and data, but I doubt it will even start up, let alone read in these files. (In case you're wondering, I use that old machine to hold up some stacks of paper in my basement. Don't ask why...)
With the coming data explosion we're going to need a different approach than business as usual when it comes to data management. The storage part is going to be easy. All that takes is money. A fundamental question for the storage part of the equation will be whether it is cheaper to store the data or to rerun the experiment. Disks are cheap and getting cheaper, and, like CPU power, the cost will eventually approach 'zero'.
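To make the store-or-rerun question concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration: the function names, the flat per-terabyte yearly price, and the optional yearly price decline are hypothetical, and the numbers in the usage lines are placeholders, not real quotes.

```python
def storage_cost(size_tb, cost_per_tb_year, years, annual_price_decline=0.0):
    """Total cost of keeping `size_tb` terabytes for `years` years.

    Assumes (hypothetically) a flat per-terabyte yearly price that drops by
    the fraction `annual_price_decline` each year.
    """
    return sum(
        size_tb * cost_per_tb_year * (1.0 - annual_price_decline) ** year
        for year in range(years)
    )


def cheaper_to_store(size_tb, cost_per_tb_year, years, rerun_cost,
                     annual_price_decline=0.0):
    """True if keeping the data costs less than rerunning the experiment."""
    total = storage_cost(size_tb, cost_per_tb_year, years, annual_price_decline)
    return total < rerun_cost


# Placeholder numbers only: 10 TB at 100 per TB per year, kept for 5 years,
# compared against an experiment that would cost 10000 to rerun.
print(storage_cost(10, 100, 5))
print(cheaper_to_store(10, 100, 5, rerun_cost=10000))
```

Of course, this ignores the experiments that cannot be rerun at any price, which is exactly where retention matters most.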
It is in the long term management where the challenge lies. What file formats do we use? What are the storage and search criteria we use so we can reference our data both now and in the future? My prediction for the retrieval and access part is that we'll turn to some type of keyword-based search engine for management, with access rights built on top. We're going to need some help from our friends in Digital Library Technologies.
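To show what I mean by keyword search with access rights on top, here is a minimal sketch in Python. It is a toy, not a proposal for a real system: the `DatasetCatalog` class, its methods, and the dataset and user names are all made up for illustration.

```python
from collections import defaultdict


class DatasetCatalog:
    """Toy keyword index over datasets, with a per-dataset reader list."""

    def __init__(self):
        self._keyword_index = defaultdict(set)  # keyword -> dataset ids
        self._readers = {}                      # dataset id -> allowed users

    def add(self, dataset_id, keywords, readers):
        """Register a dataset under its keywords and record who may read it."""
        self._readers[dataset_id] = set(readers)
        for keyword in keywords:
            self._keyword_index[keyword.lower()].add(dataset_id)

    def search(self, user, *keywords):
        """Return ids of datasets matching ALL keywords that `user` may read."""
        if not keywords:
            return []
        matches = set.intersection(
            *(self._keyword_index.get(k.lower(), set()) for k in keywords)
        )
        # The access check is applied on top of the keyword match.
        return sorted(d for d in matches if user in self._readers[d])


# Hypothetical datasets and users.
catalog = DatasetCatalog()
catalog.add("fmri-run1", ["fMRI", "raw"], readers=["alice"])
catalog.add("fmri-run2", ["fMRI", "processed"], readers=["alice", "bob"])

print(catalog.search("alice", "fmri"))
print(catalog.search("bob", "fmri"))
```

The point of the design is that the keywords live outside the data files themselves, so the catalog can keep answering queries even if the file formats underneath change.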
As for file formats, let's be careful out there. Two that pop to mind are Common Data Format and Hierarchical Data Format. These are not the end-all and be-all of storage formats but I firmly believe our thinking should be along these lines.
After we spend a good hunk of our careers collecting terabytes of fMRI data, we should have a way to access it 20 years hence.
Microsoft's bricking some old data formats that are not even old enough to vote should be a stark reminder that we cannot continue business as usual.
The Community Climate System Model (CCSM) Data Management Plan sure looks like a good model to follow.
Comments, as always, are welcome!

