Understanding (e.g.) DOIs for data sets

Data citation is a topic that frequently comes up in conversations around data management. During a call with a community of data curators yesterday, I was asked whether ScholarSphere supported DOIs for citing data sets.

I have to admit that while I understand the value of data citation (tracking use & re-use, measuring impact of data sets independent of their publications, giving credit to data publishers, &c.), I continually get stuck on how identifiers such as DOIs from DataCite or ARKs from EZID fit into the picture. Or, rather, why such indirect identifiers are valued more than the native HTTP URIs that are minted and managed by data repositories. Here I assume that these data repositories are run by institutions whose missions & business interests include a commitment to persistence of content and identifiers held within their repositories. (Is that a faulty or naïve assumption?)

The argument for indirect identifiers — identifiers that point at and resolve to other identifiers — like DOIs usually goes like this: hey there, cultural heritage organizations and publishers have done a pretty poor job of persisting their identifiers so far, partly because they didn’t grok the commitment they were undertaking, or because they weren’t deliberate about crafting sustainable URIs from the outset, or because they selected software with brittle URIs, or because they fell flat on some area of sustainability planning (financial, technical, or otherwise), and so because you can’t trust these organizations or their software with your identifiers, you should use this other infrastructure for minting and managing quote persistent unquote identifiers.

SIDEBAR: That’s a lot of becauses, all of which (to be perfectly frank) are painfully true. As an employee of a service provider within a very large academic library, I find this unacceptable. The solution from my perspective is not to punt responsibility for persistent identifiers. The solution is to confront each of those becauses and learn from our mistakes, and (as information service providers who oughta know better) to better steward and manage identifiers for data sets (and other deposits). I digress.

Are there other compelling arguments for using indirect identifiers to cite data sets? This is where you come in.

Back to the main point. Here is the million-dollar question about using (e.g.) DOIs for data sets: who manages these DOIs? Is it the service provider (such as DataCite, or Penn State ScholarSphere)? Or is it the owner of the data set?

If it’s the service provider, how are they to know when data owners move their content elsewhere? And how does that scale?

If it’s the data owner, uh, really? Do we realistically expect data owners to manage their own DOIs? I may be being cynical here, but I somehow don’t see that happening on any scale that has an appreciable impact on the broader issue of data citability and identifier persistence.

Wow. I thought only I got identifier angst...

This deserves a much more considered and lengthy reply, but I am traveling at the moment and drafting anything coherent is a challenge. I'll note a few things though:

I agree with a lot of what was said in the Twitter stream on this subject. There is a lot of magical thinking surrounding DOIs, some of it technical, some of it social. A lot more of it is coming to the fore as we see new DOI registration agencies, new models for assigning DOIs, and as we see DOIs assigned to different content types. There is certainly some advantage to harnessing some of that hokum because it latches onto existing researcher citation behavior, but we also have to be very careful of the potential side effects. This is something that we at CrossRef are very aware of and are trying to address.

The DOI is not the answer to everything. I work for CrossRef and so naturally, I think we're awesome and that the CrossRef DOI infrastructure for persist-able citation is da bomb, but I also drafted the specification for the ORCID identifier (http://goo.gl/cL4WY) which does not use the DOI. This was an easy decision for ORCID to make technically, but it took a lot of convincing politically, largely due to some of the cargo-cult ideas that exist around DOIs.

Anyway, your post is just prompting me to write up something I've been meaning to write for ages. As soon as I am repatriated I'll draft something more coherent.

My first inclination was to say that it is faulty or naive to assume the repository operator is committed and competent to ensure persistence.

But then I realized, if you don't have this, a DOI won't do you much good either -- who else is going to update the DOI to correctly resolve to the actual content -- which of course requires the content to continue to be accessible _somewhere_.

No matter what, the content host has to have a commitment to persistence (and resources and competence to provide it), or the whole game is lost.

So one thing a DOI might be is _part_ of a content host's technical strategy for persistence -- an extra layer of indirection on top, that will let you switch your software infrastructure at a later date, even if the new software does not use identical conventions for URIs. An extra layer of abstraction/indirection is a pretty useful thing for ensuring persistence, is it not? And why should you re-invent and re-implement such a layer yourself, if DOI (or similar) serves perfectly well and is already there for you, and you find the costs reasonable?

Similarly, DOI (or similar) as this indirection layer is useful in that it is _organizationally neutral_. Most HTTP URIs are, by the nature of DNS, tied to a particular organization: jhu.edu or psu.edu or what have you. But what if, at some future point, Penn State wants to transfer the hosting of some data it holds to some other organization? (A disciplinary-specific repo host? A cooperative consortial repo? The government? The Internet Archive? Penn State ceases to exist and merges with another university? Who knows, any of these could happen in the next few decades.) The indirection can be useful there in not tying the identifier (or the URL for resolving the identifier to data) to a particular business entity.

So I think DOI (or similar) means of identification and resolution abstracted from a particular hosting infrastructure or a particular business organization -- can be a useful part of a persistence-maintaining strategy.
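
To make the indirection concrete, here is a minimal Python sketch of the idea, assuming a toy in-memory lookup table. The class name `Resolver`, the DOI-like string `10.9999/example.1`, and both URLs are made up for illustration; this is not how the DOI system is actually implemented, only the shape of the argument.

```python
class Resolver:
    """Toy indirection layer: a stable identifier maps to a current URL.

    Hypothetical illustration only; real DOI resolution is far more involved.
    """

    def __init__(self):
        self._targets = {}

    def register(self, identifier, url):
        # Mint a new identifier and point it at its first location.
        if identifier in self._targets:
            raise ValueError(f"already registered: {identifier}")
        self._targets[identifier] = url

    def update(self, identifier, new_url):
        # Someone (the host? the data owner?) has to call this when data moves.
        if identifier not in self._targets:
            raise KeyError(identifier)
        self._targets[identifier] = new_url

    def resolve(self, identifier):
        # The citation never changes; only the target behind it does.
        return self._targets[identifier]


resolver = Resolver()
cited_id = "10.9999/example.1"  # made-up identifier that appears in citations

resolver.register(cited_id, "https://repo.example.edu/old-software/item/123")
before_move = resolver.resolve(cited_id)

# The data moves to a different host; the cited identifier stays the same.
resolver.update(cited_id, "https://consortium.example.org/datasets/abc")
after_move = resolver.resolve(cited_id)
```

The sketch also shows the catch: nothing resolves correctly after a move unless somebody actually calls `update`, which is the same commitment to persistence the content host needed in the first place.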

But, where I think you are right about something you imply -- DOIs are NEITHER necessary NOR sufficient for a good reliable persistence-maintaining strategy. It is wrong to think that just because you have a DOI you have a good persistence strategy (most repo hosts are not doing a very good job of persistence planning at the moment in my opinion), and it is wrong to think that just because someone _doesn't_ use DOIs they _don't_ have a good persistence strategy (most consumers and other actors aren't very good at _evaluating_ the persistence strategy or competence of those hosting data they care about either). DOI is potentially a useful tool, but it's a mistake to think it's a surrogate for careful planning or critical analysis of someone else's planning.

My understanding of the value proposition for DOIs is that by going with DOIs, we better align with existing publisher practices. It isn't so much that DOIs offer a better 'solution' for data identification/location/citation/disambiguation than others (see Duerr et al 2011 for a comparison: http://www.springerlink.com/content/52760gq3h200gw38/) but there is existing buy-in from at least one stakeholder.

I agree that the value of indirect identifiers is in the abstraction out of the organizational context. People and data are mobile, and organizations are not always stable. Rather than being a statement about the failings of service providers, I think the push for identifiers is more about recognizing the need to cross boundaries, to better align the technology with reality.
