Ingest: Lessons learned

Now that we have a by-no-means-complete-but-still-useful list of common barriers to ingest, I thought I'd share the lessons learned.  We hope to apply these lessons in building CAPS, our prototype curation services platform.

  • Create a namespace, or namespaces, for identifiers that far exceeds foreseeable needs -- Our first namespace can accommodate 7,072,810,000 identifiers.  We're using the Archival Resource Key specification for identifiers (each of which will be mapped to HTTP URIs), and the Python-based arkpy library for minting.
  • Decouple the ingest process from the publication process -- We plan to build a small suite of applications and tools upon our curation services platform, the first of which is for what we've been unimaginatively calling "generic ingest & management."  The application is for authenticated, authorized users only -- it's a tool for curatorial operations, not for end-user display.  The ingest application will never automatically publish objects, and it assumes that all objects are private to the curator until the curator decides otherwise.
  • Plan for scale and test performance from the outset -- The current phase of development on CAPS was given a very ambitious deadline so we have not had the time to focus on performance and scale as much as we would have liked.  We have a list of areas to address in the next phase, however, and a laundry list of technologies to vet and test for our scaling needs.  We've also lined up a small team of folks to help out with system testing & QA.
  • Make metadata input optional -- We believe that curators, not systems, curate, and thus allow them to decide how richly objects ought to be described.  We intend to provide curators with tools that allow (and perhaps encourage) rich metadata to be attached to objects but as far as the "generic" ingest application (and the curation services platform underneath) is concerned, all elements in the data dictionary are repeatable and none are required.  We will be building similar "profiled" ingest applications for specific purposes in the near future, such as an ingest application for electronic business records, which will, however, be more stringent about metadata (and also about file formats, which the generic ingest app couldn't care less about).
  • Allow stakeholders to drive decisions and, above all, communicate with users -- This may be a meta-lesson, and it feels like the most important of them all.  Our development team for CAPS consists of our lead developer, a digital curator, an archivist, a metadata librarian, a project manager, and an architect.  Our project team is made up of our development team plus stakeholders from across the University Libraries including representation from our Digitization & Preservation department, the Arts & Architecture library, and University Archives.  Our project team meets for one hour a week -- a Herculean task to find a mutually convenient slot, let me tell you -- and our development team meets for fifteen minutes every morning.  The point here is that our stakeholders -- the primary eventual users of the ingest application -- are invested in the project, and they get a chance to see, criticize, and drive what we've developed every week as it evolves. Because we're in the same room so often, we get to communicate certain points regularly, e.g., what you ingest is only as permanent as you wish; identifiers are not precious at all; the curation platform is there for you to use in ways useful to you, so the typical "don't put that into the repo yet" mindset doesn't apply; your stuff is as findable as the richness of your metadata, but we can provide other ways to find your stuff (full-text search, etc.); and so forth.
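
The namespace figure in the first lesson can be reproduced with some quick arithmetic.  What follows is a minimal sketch, assuming a NOID-style minting template in which "d" stands for a decimal digit (10 values) and "e" for an "extended digit" drawn from a 29-character, vowel-free alphabet.  The template "eeeedddd" is only a guess that happens to yield the quoted total, not a confirmed CAPS setting, and namespace_size is a hypothetical helper, not part of arkpy.

```python
# Characters allowed per template position (NOID-style convention, assumed here).
DIGITS = "0123456789"
EXTENDED = DIGITS + "bcdfghjkmnpqrstvwxz"  # digits plus consonants; no vowels, no 'l'

CHARSETS = {"d": DIGITS, "e": EXTENDED}


def namespace_size(template: str) -> int:
    """Return how many distinct identifiers a template can mint."""
    size = 1
    for position in template:
        # Each position multiplies the namespace by its alphabet size.
        size *= len(CHARSETS[position])
    return size


if __name__ == "__main__":
    # Hypothetical template: four extended digits followed by four digits.
    print(f"{namespace_size('eeeedddd'):,}")  # prints 7,072,810,000
```

Adding a single extended-digit position multiplies the namespace by 29, which is part of why identifiers need not be treated as a scarce resource.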


Last week I attended the latest iteration of one of my favorite conferences, Code4Lib 2011, which included a full-day CURATEcamp hackfest as a pre-conference session (sponsored by the Digital Library Federation).  Rather than writing up a full report of the event -- no one really reads those, right? -- I wanted to comment on a conversation from the hackfest.

A group gathered to discuss digital forensics, specifically in the context of forensics work done pre-ingest.  I've heard other folks talk about pre-ingest processes and so I wondered aloud: what does it say about our repositories, and the ingest process, that we do so much pre-ingest?  The consensus was that the ingest process is frequently expensive.  A subgroup split off to explore this.

The ingest process is a topic I'm keenly interested in since we (Penn State's digital stewardship program) are in the middle of building a prototype ingest application ("CAPS").  If we can learn some lessons from our peers about how to make ingest easier and faster, the timing is right to build on these lessons and make novel, more interesting mistakes rather than boring, well-known ones.

Here are the barriers to ingest that were identified:

  • Identifiers are precious -- Ingesting an object usually kicks off a series of processes, one of which mints a new identifier for an object.  There is a perception that identifiers are a limited commodity, that they are somehow precious or rare.  
  • Promise of permanence -- There is a perception that ingesting an object creates a contract for the permanence of that object.  The contract may be illusory depending on the "repository" into which the object was ingested.
  • Findability -- Once an object is ingested, it is difficult to find.  I would have liked to pursue this point a bit further.  What it suggests to me is that in some contexts, the repository has not been sufficiently incorporated into the workflows or work environments of those doing the ingest, so it feels like alien territory rather than the local filesystems and mapped drives they are accustomed to.  Pure speculation on my part.
  • Complex downstream workflows -- Given that ingest is a series of processes, there is concern that "just ingesting something" might cause breakage downstream.  For instance, if an object is ingested, is it automatically published somewhere end-users can get to it, and has the object been fully prepared for publication?  One such workflow might be automatic generation of derivatives, which is an expensive operation for certain formats and large files.
  • Rights -- Related to the above bullet, there is concern that end-user access rights be cleared in advance of ingest, for fear that the object will wind up in the wrong hands.
  • Metadata -- The ingest process requires too much metadata input.  This concern is tied to findability above, and together they suggest an all-too-familiar tension: how much metadata is enough to make an object findable later, and how much is enough to make the ingest process cumbersome?
  • Psychological factors -- There is a mindset wherein curation happens outside of the repository and preservation happens inside -- that these are distinctly different activities which happen serially if at all -- in which case one might be loath to ingest an object until it's "ready" for the repository, whatever that means.
  • Personal time -- The ingester simply lacks the time to push the right buttons.
  • Software performance -- The ingest process is slow due to lack of optimization, lack of attention to scale, lack of performance tuning, and so forth.
There are a number of lessons to be learned from the above.  I'll write soon about those and how we're applying them to our CAPS project at Penn State.

Have any barriers to add to the list?
One minor concern I brought to the conference, which has roots in my attendance at the 2007 conference, was whether it would be too system-oriented to be relevant, since Penn State doesn't plan to use Fedora, DSpace, or ePrints.  I was pleased to see the increased attention to alternative approaches to preservation and to repositories as a set of services rather than (necessarily) as a system.

Penn State's institutional digital stewardship program is investigating curation microservices, such as those developed by the University of California Curation Center, as an architecture for digital curation. So I came to OR2010 with an eye towards development in this space.  I wasn't the only one; both the PASIG session and the DuraSpace strategic overview identified microservices as a trend, and a number of microservices seem likely to be built into the 1.7 release of DSpace.

I attended the curation microservices BOF, which was well attended considering it was up against a developer challenge event.  Institutions represented included Universitat Autònoma de Barcelona, Harvard, U. of Hull, California Digital Library, MIT, UNC-Chapel Hill, San Diego Supercomputer Center, Penn State, Northwestern, U. of Pennsylvania, and Princeton.

We discussed our interests in the topic, experiences with the microservices approach, development of a community around microservices, the California Digital Library's role in sustaining said community, and governance of collaborative software development and of the community.

The BOF covered a lot of ground in a short period of time, and we agreed to start having periodic open teleconferences to share information about microservices development.  We'll also utilize the digital-curation Google Group for virtual communication, and use events such as Open Repositories, IDCC, and iPRES -- in addition to Curation Technology Camp (CURATEcamp) events -- for microservices get-togethers.
  • Reviewing digital library platforms for the e-Content Stewardship Council -- Patricia and I have completed all user interviews and platform demonstration sessions, and have finished evaluating all four in-scope platforms (CONTENTdm, Olive, DPubS, and ETD-db) along a set of twenty-odd criteria defined in a comparative analysis project at Purdue.  Next up is identifying themes from the evaluation for our report's executive summary.  We had hoped to finish this work in May, but apparently summer is a hard time to get stuff done.  Who knew?
  • Institutional repository of electronic records -- Work has begun on our e-records system via the inclusion of records use cases in another pilot project.  More on that later.
  • Learning more about "big data" and continuing the data management discussion -- I attended the Research Data Access and Preservation Summit in April. A number of themes emerged from RDAP: 1) methods for involving researchers in curation activities, 2) the user-friendliness of the data deposit process, and 3) the boundary between preservation and curation, caused by the dynamic nature of research data and barriers to repository ingest such as complicated processes and a write-once assumption.  We at Penn State have not yet gotten our big data focus group, under ITANA, off the ground but hope to do so later this year.
  • Storage strategies -- Following the dissolution of the Data Storage Working Group, Digital Library Technologies continued the discussion of storage strategies to guide purchase, allocation, and management of storage from the short- to the mid-term.  We have just this week written a project charter to explore the idea, culminating in a strategic plan for storage in December.
  • Evaluating next-generation information discovery tools for the libraries with the Libraries' Department of Information Technology -- The RFP process has finished and we have selected a product that meets our many needs.  We will be announcing our decision as soon as the ink dries on the paper.
  • Working on requirements for a draft institutional identifier standard with the NISO I2 working group -- The I2 group distributed a survey about features and requirements of the draft I2 standard, and has begun analyzing the results.  Feedback has been provided primarily from the library sector, and has largely validated our work thus far.
  • Attending Open Repositories 2010 -- See conference report.
  • Planning Curation Technology Camp (CURATEcamp) 2010 -- Since I last wrote about the camp, the conference planning group has been busy dotting "i"s and crossing "t"s.  We're all looking forward to the camp which is coming up soon (mid-August).
  • Curation microservices pilot -- A short-term pilot project involving software developers and curators will explore a number of strategic aims of the Content Stewardship Program: defining curatorial requirements, building and testing a curation architecture, engaging software developers and curators at other institutions, treating data in a cross-platform manner, exploring roles and workflows that cross unit boundaries, and building a testbed for electronic records curation services.  Project work will include curating copies of a small sample of data selected from e-records, CONTENTdm, Olive, DPubS, and ETD-db; building and integrating existing lightweight digital curation tools based upon curation microservice specifications; applying those tools and specifications to curate the sample dataset; examining the benefits, costs, and limitations of the microservice approach; and determining if microservice-based curation architecture is viable at Penn State.
  • MetaArchive implementation roadmap -- Penn State is now a member of the MetaArchive distributed digital preservation cooperative.  I am working with a team of four on an implementation roadmap, detailing a timeline, new roles that will need to be defined for our involvement, and hardware specifications.  This is a short-term project.
I wrote before about a potential curation technology unconference which has been dubbed CURATEcamp 2010.  Not in my wildest dreams could I have imagined just how receptive folks have been to the idea. 

An ad hoc planning team -- consisting of folks from Penn State, the California Digital Library, the University of California-San Diego, and the Library of Congress -- has been hard at work bringing this idea to life.  On June 15th, we announced that registration for the event had opened.  Eight days later, we announced that all seventy-five slots had been filled.  Fret not, though; you can still be added to a waitlist.

The camp is now yours, digital curation community -- let's see what you've got.

CURATEcamp 2010 is but one of many events within our community.  I'd like to highlight some others.

Whereas CURATEcamp 2010 focuses on the curation microservices approach, CURATEcamp II is a bit more general.

ABSTRACT: As the community of digital curation practitioners has grown, so has the need for collaboration and community.  A small number of communities have been formed around digital curation, a few of which focus on the technical aspects of the practice.  Extant communities address the implementation and support needs of specific curation platforms, without broader focus on common services and potential points of intersection. There is however a rich ecosystem of tools, practices, and standards around these platforms, and some that require no such platforms, that have potential to benefit the wider community of practitioners. CURATEcamp II is an unconference-style workshop for practitioners of digital curation to share best practices and discuss tools and technologies in a free-form and highly interactive forum.  Topics of interest might include identifiers, versioning, transfer, packaging, object structure, filesystem usage, archiving / storage, metadata standards / vocabularies, discovery, and interoperability. The unconference format ensures that all participants are actively engaged in the workshop and gives everyone an opportunity to contribute.  Activities may include roundtable discussions, presentations, whiteboard sessions, collaborative software development, and whatever else emerges from the collective creativity of participants.

AIMS: CURATEcamp II is an opportunity to build a community of practice around curation tools, that bridges system-specific gaps that have formed in the community. It will encourage discussion about curation tools and practices across software-, project-, and institution-specific boundaries, and attempt to identify best practices and points of collaboration across these boundaries.  The community that CURATEcamp II nurtures is intended to persist beyond the end of the IDCC, so another point of discussion will be around how to maintain connections between face-to-face gatherings. The informal approach of CURATEcamp II might also serve as a way to model knowledge sharing for the curator community, not unlike what occurs at BarCamp events, which are loosely structured but highly productive participatory sessions.

AUDIENCE: CURATEcamp II will be of interest to digital curation practitioners (curators and technologists alike), especially those who have been using and building tools and architectures, and digital curators with experience assessing or evaluating curation tools and services.  
This is an exciting time to be working in the digital curation community!  Wondering how to get involved?  Hop on over to the digital-curation Google group and join the discussion; it's just getting started.

P.S. CURATEcamp 2010 would not be happening without the active engagement of all the folks doing the planning.  Thanks to: Declan from UCSD for proposing the idea for the camp over Belgian beer at the Thirsty Monk during Code4Lib 2010; Dan from LC for focusing on the practical; Ed from LC for evangelism and support; Perry from CDL for all of his work with the conference venue and the registration system; my colleagues at Penn State for their support; and both Penn State and CDL, without whose contributions and commitment the camp would not have been possible.
In my last braindump, I wrote:

The I2 working group is putting finishing touches on a draft standard and on core metadata required to identify institutions.  We hope to share this draft and put out a request for comments in the coming months.  I've been modeling the I2 domain in RDF both for more RDF experience and also with the hope that an eventual I2 core service will be exposed as linked data.

I've now documented this experience on my other blog, which is the home of my I2 ramblings.

My my, has it really been three months since I wrote up my agenda?  I've been busy chipping away at the agenda so I thought I'd document my progress now that Q2 is underway.

  • Reviewing digital library platforms for the e-Content Stewardship Council

    The platform review project that our digital collections curator and I have undertaken continues.  We began the project by having folks demonstrate each platform and how they use it, and have been busy with small, informal interview sessions with many of the same folks but also others who work outside of the Libraries.  We have a few more interview sessions to conduct and document, so the data gathering portion of the project is nearly complete.  In the meantime we've been discussing evaluation criteria.  We started off with a short list of criteria, but then noticed the criteria Purdue are using for their comparative analysis of institutional repository software and adopted those instead.  We sketched out a structure for the final report, which we hope to finish in May.

  • Reviewing functional requirements for an institution-wide repository of electronic records

    This work is still under way.  We have a set of well-documented functional requirements for an e-records repository service but have yet to make progress on building anything.  We've been talking about applying for a grant to help fund some additional staffing which might be used to help build out proof-of-concept curation services (preservation, provenance, description, discovery) for e-records.  I'm really keen on applying curation micro-services, such as those used at CDL, to the e-records domain.  I see this effort as benefiting both the curation micro-services community and the e-records community -- not to mention our own electronic records initiatives here at Penn State.  An all-around win, if you ask me, but then I'm biased.  This will be a major activity in the latter half of this year continuing into the next.

  • Learning more about "big data" and continuing the data management discussion

    Our content stewardship program will doubtless need to address research data.  We're not there yet.  In the meantime, Penn State's ITANA chapter will be pulling together a working group on the technological and architectural challenges of research data.  Jeff Nucciarone and I will be chairing the group.  In the meantime, research data has been on my mind for two reasons: Michael Lesk gave a talk at the information school urging libraries to turn their attention to research data; and I'll be attending the Research Data Access and Preservation Summit in Phoenix later this week.

  • Evaluating the DLT archival storage prototype and joining the technical team of the Data Storage Working Group

    The Data Storage Working Group effort has been repurposed.  The steering team will continue to meet informally and discuss archival storage and curation needs across the campuses.  The technical team has been dissolved, and the majority of us (who already work together in DLT in support of the same mission) will continue to work in this space.

  • Evaluating next-generation information discovery tools for the libraries with the Libraries' Department of Information Technology

    The RFP process continues.  We hope to have wrapped up our evaluation by the beginning of summer.

  • Evaluating change management solutions with a team from Penn State's ITANA group

    I haven't found much time to stay involved with this team, unfortunately, but their work continues apace.

  • Working on requirements for a draft institutional identifier standard with the NISO I2 working group

    The I2 working group is putting finishing touches on a draft standard and on core metadata required to identify institutions.  We hope to share this draft and put out a request for comments in the coming months.  I've been modeling the I2 domain in RDF both for more RDF experience and also with the hope that an eventual I2 core service will be exposed as linked data.

  • Attending Code4Lib 2010

    You can tell how good a code4lib conference is by how little you remember of it.  By that measure, this year's conference was the best yet.  Some of the highlights for me: 1) Linked data, a pattern for exposing resources and metadata via the web, continues to be a hot topic among cutting-edge library developers. There was a focus this year on how to participate in the linked data web in practical and lowish-barrier ways.  The speed with which concepts move, at code4lib, from "novel, and interesting to a few" to "widely talked about and deployed" is dizzying; 2) Software development practices continue to mature in libraries.  We're talking more and more about test-driven design and agile development.  While these methodologies are beneficial to developers themselves, I find this remarkable because it means the gap between coders and stakeholders is being bridged, and that means better and more usable software, and happier users; 3) Repositories are not typically a hot topic at code4lib, but there were a number of prepared talks, lightning talks, and breakout sessions on the topic.  Fedora tends to be the repository most often talked about, if only because it is the repository that requires the most hacking -- and these are the people doing the hacking.  What I found interesting this year was the dissatisfaction with monolithic repository software packages, and the movement towards "homebrewed", though standards-based, repository services, such as those being advocated by the California Digital Library.

  • Meetings, meetings, meetings

    The meetings, they continue. 

  • Continuing to absorb as many of the following as possible: strategic plans, project portfolios, process management documents, and various and sundry reports, wikis, and blogs

    And this continues as well, though it's hard to find time to contextualize when you've got actual tasks and deadlines.

And here are some new and upcoming things.

  • I've written about my search for a practice-oriented curation technology/architecture community, and I'm glad to say I've made some progress on finding said community.  I've been part of a conversation revolving loosely around the digital-curation group and that conversation has now turned to planning a curation technology workshop which we're calling CURATEcamp (CURAtion TEchnology Camp).  I hope to have more details to share soon.
  • I am attending Open Repositories 2010 in Madrid this July.  I expect to learn about how folks are using repository systems such as Fedora, DSpace, and ePrints, but am more interested in all the other stuff happening on the periphery.  There has also been talk of a curation micro-services birds-of-a-feather session, which might serve as a good event to get potential CURATEcampers talking.
  • I'll be in Washington, DC in a few weeks working on a team to evaluate IMLS National Leadership grant applications.  This will be a new experience for me, and one to which I need to devote a significant chunk of time between now and then, so I'm excited.  It will be interesting to see what folks are doing outside of Penn State, and also to get an idea for what sorts of projects wind up getting funded.
  • I have some vague ideas for project charters but have yet to really flesh them out.  One involves some collaborative development on tools around curation microservices, to be used and evaluated by honest-to-goodness curators with honest-to-goodness data, and the other is about benchmarking some distributed filesystems.
  • Techies at Penn State need to talk more.  I want a BarCamp-style event for PSU techies so that we can discuss issues across departmental boundaries.  Administrators have been nothing but supportive of the idea, and now I just need to find some time to sketch what I have in mind.
  • Digital Library Technologies, my department, is hiring!  We're looking for someone to come develop software to support our content stewardship program.  Like writing code?  Interested in how data is curated, stored, and discovered at scale?  Consider applying.  (Will link to position when it goes public later this week.)
Braindump complete.  Brain now empty, except to say: boy, State College sure is lovely in the spring.

I attended the Penn State library faculty research colloquium on Wednesday, during which I learned all sorts of things about the interesting research being done by my colleagues in the University Libraries.  One very interesting talk was by Doris Malkmus, one of our archivists, who was studying how history professors use primary sources, online and otherwise, in their undergraduate lectures.  It was no surprise to me that the #1 discovery method for online primary sources was Google, and that institutional repositories hardly rank at all.

(Sidebar: I wonder: with online content fragmenting, multiplying, and getting remixed and aggregated, does the definition of "primary source" strain for digital networked resources?)

This discovery elicited a number of responses about how difficult search engine optimization is and how we really need to ramp up our marketing efforts.

I wouldn't argue with either reaction, really.  I do sense a huge missed opportunity here, though, one that we are perfectly capable of not missing.  And let me be perfectly clear: I'm no SEO expert.  But let me also say that I've seen, firsthand, major SEO advancements in libraries I've worked at, and much of the work was pretty straightforward.

I tweeted my "SEO for dummies" list and got a couple of very good responses in addition to a retweet or two.  Here's what I said:

Googleability = increased findability + low-cost marketing. How do to it: 1. allow crawlers; 2. clean URLs; 3. rich item metadata; 4. links.
To this list, folks suggested I add "0. stable application" and "5. sitemaps", both key suggestions, though I don't have much experience with sitemaps so I won't say more about those.

What's my point?  It's not rocket science to get our web resources discoverable on Google and the other major search engines. 

What's the value in that?  More people are going to find library materials via a Google search than by navigating the dark alleys and dead ends of library websites.  Yes, our silo boundaries have been useful to us to keep dissimilar materials apart for management and such, but no, they are totally useless to our users.  My former colleague Ed Summers reminded me today that a silo is not really a silo if it's on the web.  Merely being on the web isn't enough, though, and here are the simple and practical lessons I've learned that may be the difference between getting found on Google and "NO JUICE FOR YOU!"

  1. Stable application: If your site isn't reliably up, user-agents will have a hard time finding it.  That means disgruntled users and crawlers who never fully find you.
  2. Allow web crawlers: Unless you have a really compelling (read: legal) reason to disallow crawlers (and robots and spiders, oh my), you really ought to allow them.  But only if you care about discoverability.  If your app cannot handle the load of crawlers, go back to #1 and start over.  Hire an engineer who knows about scale and performance, preferably.  (See anecdote 1 later in this post.)
  3. Clean URLs: I'm not sure this is entirely necessary for SEO, to be honest, but it does seem like a common practice among those who are good web citizens.
  4. Rich item metadata: Collection-level metadata is not good enough.  Collections are a useful abstraction for librarians but less so for users.  Rather than impose a collection view upon users, move relationships among items and common metadata elements into item pages.  (See anecdote 2 later in this post.)
  5. Links: Link out to stuff.  Get folks to link in to your stuff.  (See anecdote 3 later in this post.)
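
Step 2 is easy to check before a crawler ever shows up.  Here is a minimal sketch using the robots.txt parser in Python's standard library; the two robots.txt bodies, the "Googlebot" user-agent string, and the example.org URL are illustrative assumptions, and crawler_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# An empty Disallow rule permits everything; "Disallow: /" blocks the whole site.
OPEN_SITE = "User-agent: *\nDisallow:\n"
CLOSED_SITE = "User-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    item_url = "https://example.org/items/42"  # hypothetical item page
    print(crawler_allowed(OPEN_SITE, "Googlebot", item_url))
    print(crawler_allowed(CLOSED_SITE, "Googlebot", item_url))
```

Running a check like this against your own item pages is a cheap way to catch a leftover "Disallow: /" that quietly hides an entire collection.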

Anecdote 1: The Library of Congress has a digital newspaper application called Chronicling America.  At the time it was created, it served as a test bed for some technologies that had not seen wide uptake at the Library, but in time its developers realized the architecture couldn't keep up with the traffic coming in from the web crawlers.  A robots.txt file was created restricting crawlers and time went by.  The application was rebuilt from the ground up with the intent "to increase the usability of [the] application by providing faster responses to HTTP requests, allowing these requests via standardized APIs, as well as allowing all pages to be crawled by search engines."  The results were remarkable: average hits per day grew from roughly 75,000 to nearly 500,000. 

Anecdote 2: When the Library of Congress went live with the World Digital Library, clearly helped by a massive press event at UNESCO in Paris (the largest such event in UNESCO history, apparently), its developers watched the mentions roll in via Twitter Search.  The most interesting thing I learned that day is that despite all the cool maps and timelines and facets, users were primarily linking directly to item pages (each of which was helped by surfacing all of the rich descriptive metadata as well as links to related and similar items).

Anecdote 3: The digital initiatives team at the University of Washington libraries has done some studies assessing the impact of adding links to their digital collections from Wikipedia pages.  Usage spiked after the links were put in place, thanks to Wikipedia's popularity and the mechanics of Google's PageRank algorithm for judging relevance.
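
To make the PageRank point in anecdote 3 concrete, here is a rough sketch of the simplified, textbook form of the algorithm (power iteration with a damping factor).  It is not Google's production ranking and it does not reproduce the University of Washington numbers; the page names and link graph are made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            # A page with no outbound links spreads its rank over all pages.
            recipients = targets if targets else pages
            share = damping * rank[page] / len(recipients)
            for target in recipients:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three blogs link to a popular encyclopedia page.
before = {
    "blog_a": ["wikipedia"],
    "blog_b": ["wikipedia"],
    "blog_c": ["wikipedia"],
    "wikipedia": ["blog_a"],
    "library_item": [],
}
# Same web, except the encyclopedia page now also links to the library item.
after = dict(before, wikipedia=["blog_a", "library_item"])

if __name__ == "__main__":
    print(f"before: {pagerank(before)['library_item']:.3f}")
    print(f"after:  {pagerank(after)['library_item']:.3f}")
```

In this toy graph the library item's score rises once the popular page links to it, which is the mechanism the anecdote leans on: a link from a well-linked page passes along some of that page's standing.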

These are practical steps we can take, and frankly may be the best marketing (judging cost v. impact) libraries can do to increase usage of our digital materials.

What's in a title?

Hamlet.  The Declaration of Independence.  Gutenberg Bible.  Holy Roman Emperor.  All of these are words, but more than being words they are titles.  Titles are names we give things, things such as works of art and stations we hold. 

Names are important to the extent that being able to talk about things is important.  Names identify things.  The way we indicate things when we talk about them, typically, is by names or substitutes therefor, e.g., deixis.

Ergo, titles are important.  That's been my line of thinking, at least, as I continue to struggle -- here meant in a light sense, such as "I am struggling with eating this delicious cookie" -- with what precisely it means to have a title with the word "architect" in it. 

I say that with a twinge of cognitive dissonance, for I know and understand very well what I do on a daily basis and what I will be doing in the near future.  Perhaps then titles are not very important, or I should say, more important than titles is knowing what is expected of you and exceeding those expectations.  Shape your title rather than allowing it to shape you.

Ergo, maybe titles aren't equally important in all contexts.

Self-help tropes aside, I still wonder about what folks' expectations of a digital library architect are.  There is a line of thinking in libraries that our problems are unique rather than of a class.  Some argue fiercely that library issues are, in fact, not special.

I'm undecided.  For instance, would a digital library architect have any concerns or areas of expertise an IT/enterprise architect would not?  Or does digital library architecture amount to little more than a re-brand reflecting the "we're special" way of thinking?

Related, I would wager that the number of titular digital library architects is much smaller than the number of folks doing architecture work in digital libraries.  Digital repository librarians and library systems analysts, etc., I'm looking at you.

Why am I thinking about this?  In my last two jobs as a software developer in academic and research libraries I was spoiled by being in a large and vibrant community of similar folks: code4lib and also the Access folks up north.  I'm looking for the same in my current job: some forum, conference, mailing list, or what have you, where there is discussion of architectural issues in the digital libraries context.

For now, I have contented myself with the idea that a digital library architect is a technologist who thinks architecturally about digital libraries.  What does that mean?  Someone who, to mix metaphors mightily, puts his or her arms around the big picture (rather like an art thief).

What does that mean?  Someone who knows all of the systems and standards and protocols and workflows and operations and the connections between them, in an institution, in the context (typically) of serving digital content over the web (though this context is expanding into other areas such as institutional e-records management and research data curation).  Said someone will probably have been hired in fact not merely to know all of that mess but to think systematically and strategically about whether all of that mess meets needs and requirements and best practices, and not only think about that but work deliberately to make that so.

Is there a community for such folks?  There are many possible related communities (and here I'm intentionally casting a broad net by mingling conferences, lists, and professional organizations): code4lib, Access, Open Repositories, digital-curation, ITANA, EDUCAUSE, ASIS&T, DigCCurr, iPres, SPARC, CNI, DLF/CLIR, and so on.  Heck if I know.  Do you?

My agenda for Q1 2010

| 0 Comments | 3 TrackBacks
My agenda for the first few months of 2010 is becoming clearer. Consider this a snapshot of the things that will be crossing my desk and bouncing around my mind.

  • Reviewing digital library platforms for the e-Content Stewardship Council
  • Reviewing functional requirements for an institution-wide repository of electronic records
  • Learning more about "big data" and continuing the data management discussion
  • Evaluating the DLT archival storage prototype and joining the technical team of the Data Storage Working Group
  • Evaluating next-generation information discovery tools for the libraries with I-Tech
  • Evaluating change management solutions with a team from Penn State's ITANA group
  • Working on requirements for a draft institutional identifier standard with the NISO I2 working group
  • Attending Code4Lib 2010
  • Meetings, meetings, meetings
  • Continuing to absorb as many of the following as possible: strategic plans, project portfolios, process management documents, and various and sundry reports, wikis, and blogs

My plate is rather full; fortunately, for now, I've got a big appetite.
