Sunday, January 2, 2011

Our hidden digital libraries (July 27, 2008)

Two of my very talented colleagues, Kat Hagedorn and Josh Santelli, just published a nice piece in D-Lib entitled "Google Still Not Indexing Hidden Web URLs." Kat and Josh and I have discussed this problem off and on, stimulated in part by our frustrations in getting the OAI data collected by OAIster into search services like Google.

In preparation for a recent talk in China (on the challenges and opportunities for digital libraries), I talked to Kat and Josh about the extent of OAIster data not findable through standard web searches. That so much of our digital library content is not findable through standard search engines has always been a troublesome issue, and I would have expected that with the passage of time, this particular problem would have been solved. It hasn't, and that has made me wonder about what we do in digital libraries and how we do it.

Kat's and Josh's numbers are compelling. OAIster focuses on the hidden web--resources not typically stored as files in a crawlable web directory--and so OAIster, with its 16 million records, is a particularly good resource for finding digital library resources. Kat and Josh conclude that more than 55% of the content in OAIster can't be found in Google.

As much as I like Kat's and Josh's analysis, I draw a different conclusion from the data. They write that, "[g]iven the resulting numbers from our investigation, it seems that Google needs to do much more to gather hidden resources." This perspective is one many of us share. We're inclined to point a finger at Google (or other search engines) and wish they tried harder to look into our arcane systems. We believe that if only Google and others had a deeper appreciation of our content or tried harder, this problem would go away. I've been fortunate enough to be able to try to advance this argument one-on-one with the heads of Google and Google Scholar, and their responses are similar--too much trouble for the value of the content. As time has passed, I've come to agree.

Complexity in digital library resources is at the heart of our work, and is frankly one reason why many of us find the work so interesting. Anyone who thinks that the best way to store the 400,000 pages (140+ million words) of the texts in the Patrologia Latina is as a bunch of static web pages knows nothing of the uses or users of that resource or what's involved in managing it. Similarly, to effectively manage the tens of thousands of records for a run-of-the-mill image collection, you can't store them as individual HTML pages lacking well-defined fields and relationships. These things are obvious to people in our profession.

We often go wrong, however, when we try to share our love of complexity with the consumers. We've come to understand that success in building our systems involves making complicated uses possible without at the same time requiring the user to have a complicated understanding of the resource. What we must also learn is that a simplified rendering of the content, so that it can be easily found by the search engines, is not an unfortunate compromise, but rather a necessary part of our work.

Will it be possible in all cases to break down the walls between the complex resources and the simple ways that web crawlers need to understand them? Absolutely not. The growing sophistication of the search engines does ensure that it gets easier with time, however. About a decade ago, we tried populating directories with tiny HTML files created from records in image databases. The crawlers gave up after picking up a few thousand records, apparently daunted by the vastness of this content. Now, however, this sort of approach works and only requires patience as the crawlers make repeated passes over the content. Large and complex text collections can be modeled as simplified text files, and the search engines can be tricked into pointing the user to the appropriate entry point for the work from which the text is drawn.
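For readers who want to picture the "tiny HTML files" approach, here is a minimal sketch in Python. It is not the code we used; the record fields, the `ENTRY_POINT` URL, and the function names `render_record_page` and `write_pages` are all hypothetical, and it assumes each record carries an `id` that is safe to use as a file name. Each record becomes one small, crawlable page that links back to the item in the full collection interface.

```python
from html import escape
from pathlib import Path

# Hypothetical URL prefix for the full-featured collection interface.
ENTRY_POINT = "https://example.org/collection/view?id="


def render_record_page(record: dict) -> str:
    """Render one database record as a small static HTML page."""
    title = escape(record["title"])
    # Every field except the internal id is shown as a term/definition pair.
    rows = "\n".join(
        f"    <dt>{escape(k)}</dt><dd>{escape(str(v))}</dd>"
        for k, v in record.items()
        if k != "id"
    )
    # Link back to the real entry point so searchers land in the full system.
    link = escape(ENTRY_POINT + str(record["id"]), quote=True)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{title}</title></head>\n"
        "<body>\n"
        f"  <h1>{title}</h1>\n"
        "  <dl>\n"
        f"{rows}\n"
        "  </dl>\n"
        f'  <p><a href="{link}">View this item in the full collection</a></p>\n'
        "</body></html>\n"
    )


def write_pages(records, out_dir):
    """Write one HTML file per record into a crawlable directory."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for record in records:
        path = out / f"{record['id']}.html"
        path.write_text(render_record_page(record), encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_pages(records, "static/")` with a list of such records would fill a directory that a crawler can walk, while the complex system behind the entry point stays exactly as it is.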

One thing the analysis of the OAIster data shows is that, as a community, we have not availed ourselves of these relatively simple solutions to making our resources more widely discoverable. Not all of the challenges of modeling digital library resources are this easy. There are bigger challenges that require more creative solutions, but creating these solutions is part of the job of putting the resources online, not a nuisance or distraction from that job.

1 comment:

  1. When I first published this, the feedback ranged from an argument for being suspicious of corporations that profit from discovery, to pointers to efforts to improve access to the Deep Web, to encouragement to use XML sitemaps. I also noted that I'd overlooked Roy Tennant's important piece on this very problem ("A Map to Destinations Uncrawled").

    Regarding trust of corporations, Ryan Shaw at the Berkeley i-School argued that "This is precisely why we must avoid putting corporations in charge of our cultural heritage: despite claims to be 'organizing the world’s information,' they really are only interested in organizing the subset of information that will bring them advertising revenues. Unless one believes that this subset is all that’s worth organizing, one ought to firmly reject the idea that Google and its ilk are anything more than advertising companies that provide useful tools. Adopt SEO techniques and manipulate them to draw traffic to our resources? Certainly. Trust them as stewards of those resources? Hell no."