Sunday, January 2, 2011

Did I say "theoretical"? Openness and Google Books digitization (Apr. 25, 2008)

When I wrote this piece in 2008, HathiTrust was in the works but had not yet been born, and the piece stimulated the sort of dialogue that I believe is very important. Rather than try to reattach comments made at that time by Carl Malamud and Brewster Kahle, I include them here in this post because of their relevance to the debate. Many circumstances have since changed (e.g., HathiTrust, with its preservation orientation, is now a significant piece of the landscape, HathiTrust makes full book downloads available to authenticated users, and our efforts have grown so much that the mere availability of this content--what I argue is a form of openness--has been embraced as transformational). These changes make some of the argument feel dated, but it feels like an important record nonetheless. I've attempted to capture the flow and content of the original blog entry and comments, and for reference purposes have stored a PDF of the original piece with comments on Scribd.
I was recently quoted in an AP article (published here in Salon) as saying that Brewster Kahle's position with regard to the openness of Google-digitized public domain content is "theoretical." Well, I sure thought I said "polemical," but them's the breaks. Brewster argues that Google's work in digitizing the public domain essentially locks it up--puts it behind a wall and makes it their own--and that this is a loss in a world that loves openness. The contrast here is meant to be with the work of the Open Content Alliance, where the same public domain work might be shared freely, transferred to anyone, anywhere, and used for any purpose. I don't want to get into the quibble here about the constraints on that apparently open-ended set of permissions (i.e., that an OCA contributor may end up putting constraints on materials that look worse than Google's constraints). What's key here for me, though, is the real practical part of openness--what most people want and what's possible through what Michigan puts online.

I think all of this debate pushes us to ask the question "what is open?" For the longest time (since the mid-1990s), Michigan digitized public domain content and made it freely viewable, searchable and printable. Anyone, anywhere could come to a collection like Making of America and read, search and print to his heart's delight. If the same user wanted to download the OCR, that too was made possible and, in fact, the Distributed Proofreaders project has made good use of this and other MOA functionality. We didn't make it possible for anyone to get a collection of our source files because we were actively involved in setting up Print-on-Demand (POD); POD typically has up-front, per-title costs, and making the source files available would have cost us some sales that might otherwise pay for that initial investment. As we moved into the agreement with Google, we made clear our intention to do the same "open" thing with the Google-digitized content, and to throw in our lot with a (then) yet-to-be-defined multi-institutional "Shared Digital Repository." In fact, now we have hundreds of thousands of public domain works online, all of which are readable, searchable and printable by anyone in the world in much the same way.

So, what's the beef? The OCA FAQ states that for them this openness means that "textual material will be free to read, and in most cases, available for saving or printing using formats such as PDF." By all means! I hope it's clear from what I wrote above that this is an utterly accurate description of what happens when Google digitizes a volume from Michigan's collection and Michigan puts it online. It's also, incidentally, what Google makes possible, but even if Google didn't, Michigan could and would be rushing in to fill that breach. The challenges to Google's openness always seem to ignore what's actually possible through our copies at Michigan. This sort of polarizing rhetoric seems to be about making a point that's not accurate in the service of an attack on Google's primacy in this space: we don't want them to dominate the landscape, so let's characterize their Bad version as being the opposite of our Good version. This notion that what Google does is closed is not an accurate description of Google's version of these books, and even less so a description of Michigan's.

Could the Google books be more open? Absolutely. Along with Carl Malamud, for example, I would love to see all of the government documents that have been digitized by Google available for transfer to other entities so that the content could be improved and integrated into a wide variety of systems, thus opening up our government as well as our libraries. I believe that will happen, in fact, and that Google will one day (after they've had a chance to gain some competitive advantage) open up far more. In the meantime, however, when we talk about "open," let's mean it the way that the OCA FAQ means it. Let's mean it in the same way that the bulk of our audience means it. Let's talk about the ability to read, cite and search the contents of these books, and let's call the Google Books project and particularly Michigan's copies Open. Let's stop being theoretical, er, I mean polemical.

Carl Malamud responded by quoting my "for saving or printing using formats such as PDF" and then went on to argue:
> for saving or printing using formats such as PDF

John, pardon me if I don’t grok Mirlyn, but I pulled up a public domain document (a congressional hearing). I was able to pull the text up and page through, but there didn’t appear to be an easy way to save a single page, let alone the entire hearing. Perhaps that function is available to Michigan students, but I suspect the rest of the citizens of Michigan are in the same boat as the rest of us using the crippled interface.

I think it is fine that Michigan and Google have their arrangement, but it is disturbing when we see a state-funded institution like U. of Michigan putting up artificial barriers to access.

Your Mirlyn site is ok as far as web sites go, but letting a thousand flowers bloom always leads to more innovation. It would be great if any grad student in Ann Arbor (or anyplace else) could download your govdocs docs and come up with a better user interface.

(In addition to more innovation, that policy would lead to a more informed citizenry, which is generally considered an important part of democracy and I suspect is part of your state-sponsored mandate.)
I responded to Carl, saying:
Gosh, Carl, I think the best way I can respond is not only to say that I whole-heartedly agree with your call for more vigorous sharing, but also to point to my fourth paragraph, where I cite your work and urge the same thing. Look, my point is that while this is good, and we are fighting for deeper sharing, this sort of thing is a fairly narrow piece of the openness issue.

On your point about the functionality and the opaqueness of getting PDFs, we’ll take that into account in our usability work. It’s there, and we can do better. I should note that for us larger PDF chunks are also a resource issue, but we’re very close to releasing a new version that gives you 10 pages at a time. Personally, I like the screen resolution PNG files and very much dislike PDF as a format, but that’s a usability position and not a philosophical one.
Later, Brewster Kahle weighed in on the issue of openness with:
John– it may not be appropriate to start this in a comment, but I am quite taken aback by your seeming implication that “open” includes what Google is doing and what UMich is doing.

“Open” started to be widely used in the Internet community in association with certain software. Richard Stallman calls it “free”, but “open” has come to be used as well. Let’s start with that.

“Open Source” in that community means the source code can be downloaded in bulk, read, analyzed, modified, and reused.

“Open Content” has followed much the same trajectory. Creative Commons evolved a set of licenses to help the widespread downloading of creative works, or “content”. Downloading, and downloading in bulk, is part of this overall approach as we see it at the Internet Archive.

Researchers (and more general users, but we can stick with researchers because they are a community that research libraries are supposed to serve) require downloadability of materials so they can be read, compared, analyzed, and recontextualized.

Page at a time interfaces, therefore, would not be “open” in this sense. Downloadable crippled versions would not be open in the Open Source or Open Content sense either.

As a library community, we can build on the traditions from the analog world of sharing widely even as we move into the digital world. We see this as why we get public support.

Let’s build that open world.

We would be happy to work with UMich to support its open activities.
My response:
I think this is precisely the sort of rhetoric that’s muddying the waters right now, Brewster. There is no uniformly defined constituency called “researchers” who “require downloadability.” I know ‘em, I work with ‘em, and I know that’s not true. Access (and openness) is defined on a continuum. What we do is extraordinarily open and has made a tremendous difference for research and in the lives of ordinary users. This sort of differentiation in the full accessibility of source materials is one of the key incentives that has brought organizations like Google and Microsoft to the table, and if it didn’t make sense, the OCA wouldn’t go to pains to stipulate that “all contributors of collections can specify use restrictions on material that they contribute.” Is more open better? Damned right. That’s one reason why for two years we’ve been offering OCA the texts Michigan digitizes as part of its own in-house work. But is what we’re doing with Google texts open? Absolutely.
Carl followed with a practical example:
I’m not sure I get all these degrees of open … let me add a hypothetical if that helps clear this up.

What if a bunch of students in Ann Arbor organized themselves into a Democracy Club and started grabbing all the public domain documents they can find on MBooks and uploading them to some site such as scribd.com or pacer.resource.org for recycling? If the docs are open (and we’re just talking “works of the government” which are clearly in the public domain), would you consider that a mis-use of your system and try and stop it or would that fall inside of the open side of the open continuum we’re all trying to mutually understand in this dialogue?

Hypothetically speaking, of course. I’m not advocating that students form a Democracy Club and crawl your site to recycle public domain materials, I’m just trying to understand if the restrictions on reuse are passive ones like obscuring how to download files or if these are active restraints where the library is involved in enforcing restrictions on access to public domain materials.

Again, I’m not at all suggesting that students interested in furthering the public domain form Democracy Clubs and start harvesting documents from the public taxpayer-financed web sites at UMich and re-injecting them into the public domain.
My response brought in technologies that have since been introduced:
What if? If there really were that sort of interest, I’d hope that we’d have a chance to talk to the students and make sure they were aware of powerful options to make “in situ” use of the openly accessible government documents that they find in MBooks. I’d want to make sure they knew that in late June we’re releasing a “collection builder” application that will allow them to leverage our investment in permanent (did I say permanent?) curation of these materials so that the materials could be found and used after the current crop of students comes and goes, that the students could add to the body of works as more get digitized from our collection and the collections of other partner libraries (e.g., Wisconsin’s are coming in soon) and that we would want to hear what sorts of services (an RSS feed of newly added gov docs?) might aid them in their work. I’d want to talk to them about the issue of authority and quality, and would see if there were ways that their efforts could help improve the works in MBooks rather than dispersing the effort to copies in multiple places. And if they needed computational resources to do things like data mining, I’d let them know that we’re glad to help. But if none of this satisfied them, would we try to stop them? Assuming Google digitized the works, according to our agreement (4.4.1) we would make “reasonable efforts … to prevent [them] from … automated and systematic downloading” of the content, something we currently do and which does not undermine the ability of those same students to read, search and print the documents. Lots of openness there.
Finally, an anonymous writer wondered "Why give Google competitive advantage? How is this the role of libraries vis-à-vis a vendor?" to which I responded:
An interesting question. Whether it *should* be (the role) or not isn’t really a question at this point, with decades of examples of libraries working with “vendors” in ways that leveraged library collections for what you might view as a vendor’s competitive advantage. Obviously, some deals were more selfless than others, and many have involved royalties or discounts provided to the libraries. In fact, *some* sort of relationship is absolutely necessary because of the inaccessibility of the materials in our collections (publishers, for example, don’t have the titles and frequently don’t know what they’ve published). On the other hand, the nature of the deal is what’s at issue, and we believe that having a copy of the files for long-term preservation and meaningfully open access, and fairly liberal terms for the way that we use the materials, is a good exchange for the hundreds of millions of dollars worth of work entailed in doing the scanning.

My first Big Green Egg pizza (Aug. 11, 2008)

I have to say that I was skeptical about the Big Green Egg making a real difference in cooking a pizza, but I'm convinced that this is a game changer. But let's start at the beginning.

If you're devoted to making pizza, you know heat is a big part of success. Our home oven, part of a big dual-fuel setup, uses convection and does a solid 550 degrees without resorting to crazy stuff like hacking the latch for the self-cleaning oven. (Yeah, believe it or not, it's been done.) A colleague with a similar obsession has complained that his oven doesn't reach these temps, and pictures of his pizzas show it. It's gotta be hot and it's gotta cook quickly: you want it brown without the pizza getting dried out. And although heat is a key piece, what every pizza maker knows s/he really wants is a wood-fired pizza oven. Now this is not as absurd a dream as you'd think. There are several models designed for home use (see forno bravo or le panyol, for example), and at least one I've run into is designed so that you can use it indoors--maybe it doubles as an inefficient heat source. However, as technically feasible as a home wood-fired oven is, it feels like a big investment. I can dream, of course.

In the meantime, we recently ran into something called the Big Green Egg. This thing, a kamado-style cooker, generates extremely high heat, serves primarily as a grill, doubles as an oven (and a smoker), and (though still pricey) costs a lot less than a wood-fired oven. Maria and I grill a lot and wanted to incorporate things like spatchcocked chickens into our repertoire, so we decided to give the Big Green Egg a shot.

My first effort at this was relatively successful, with a few problems that leave me opportunities for refining things. I should note that it's fairly easy to get the Big Green Egg up to a mighty 650 degrees, and the BGE has accessories available (like the "plate setter") that make this process pretty straightforward. With the plate setter in place, I went with an American Metalcraft PS1575, a pizza stone made from fire brick and thus much safer for these high heats. (The PS1575 is supposed to be 15.75" in diameter. Mine was a full 16" and may have contributed to some minor damage to my gasket.) This left ample room to slide the pizza onto the stone without losing ingredients over the side. At 650 degrees, the pizza cooked in slightly more than 10 minutes. As you'll see in the two pictures below, this created a nicely cooked crust with a little (and very tasty) burning below, and browned toppings. For this first effort, I stuck with a classic margherita:

top view of pizza

side view of pizza

This was, without a doubt, the best home pizza crust we've ever done, and considering the number of pizzas we've cooked, that's saying something. The crust was noticeably more flavorful and the whole thing did have a slightly smoky taste. I'll admit that I expected the pizza to be no different from the ones we've cooked in the oven, but I was definitely wrong.

As usual, I won't try to reproduce the wealth of information on the web about cooking a pizza on a Big Green Egg. The very helpful Naked Whiz site does a nice job covering all elements of cooking pizza on the Big Green Egg, including addressing issues of the size of the pizza stone.

What challenges lie ahead? I've had a hard time getting my BGE over 650 degrees and would like to try a slightly higher temperature. In putting the pizza into the BGE, I need to get in and out a little more quickly to avoid losing temperature. And I'm going to need to explore what the issues are around the gasket burning, a problem that might be related to the size of the stone, but which was just as likely to be a result of the gasket having been poorly installed (and protruding into the BGE).

Next generation Library Systems (Nov. 16, 2007)

The problem
With the backdrop of the widely touted lessons of Amazoogle—an expression I can barely stand to write—three of the more interesting emerging developments of late have been OCLC’s WorldCat Local, Google Book Search, and Google Scholar. As Lorcan Dempsey argued, the "massive computational and data platforms [of Google, Amazon and EBay] exercise [a] strong gravitational web attraction," a sort of undeniable central force in the solar system of our users’ web experience. What has happened with WorldCat Local, Google Book Search and Google Scholar has extended that same sort of pull to key scholarly discovery resources. No one needed the OCLC environmental scans to be reminded that our users look to Google before they turn to the multi-million dollar scholarly resources that we purchase for them, and everyone was aware that Amazon satisfied a broad range of discovery needs more effectively than the local catalog. Now, however, mainstream “network services” like Amazon and Google web search, deficient in their ability to satisfy scholarly discovery, are complemented by similarly “massive computational and data platforms” that specialize in just that—finding resources in the scholarly sphere. These forces, and perhaps more like them in the future, should influence the way that we design and build our library systems. If we ignore these types of developments, choosing instead to build systems with ostensibly superior characteristics, systems that sit on the margins, we effectively ensure our irrelevance, building systems for an idealized user who is practically non-existent.

Our resources, skills and investments have helped to create an opportunity for us to shape a next generation of library systems, simultaneously cognizant of the strong network layer and our needs and responsibilities as a preeminent research library. At Michigan, we have designed and built our past systems, each in partial isolation from the others, reflecting the state of library technology and our response to user needs. We were not wrong in the way that we developed our systems, but rather we were right for those times. In building things in this way, we have developed an LMS support team with extraordinary talent and responsiveness, a digital library systems development effort that blazed trails and continues to be valued for the solidity of its product, and base-funded IT infrastructure that is utterly rock-solid--all great, but generally as independently conceived efforts.[1] What libraries like ours must do now is reconceive our efforts in light of the changed environment. The reconceptualization should, as mentioned, not only be built with an awareness of the new destinations our users choose, but also with a recognition that we have a special responsibility for the long-term curation of library assets. Even at its most successful, Google Scholar does not include all of the roughly $8m in electronic resources that we purchase for the campus, and Google Book Search is not designed to support the array of activities that we associate with scholarship.

Knowing that we must change where we invest our resources is one thing; knowing where we must invest is another. I don’t believe I should (or could) paint an accurate picture of the sorts of shifts we should make. On the other hand, I can lay out here a number of key principles that should guide our work.

Principles
1. Balanced against network services: I believe this is probably the most important principle in the design of what we must build. We must not try to do what the network can do for us. We must find ways to facilitate integration with network services and ensure that our investment is where our role is most important (e.g., not trying to compete with the network services unless we think we can and should displace them in a key area). For example, we have recognized that Google will be a point of discovery, and so rather than trying to duplicate what they do well for the broad masses of people, we should (1) put all things online in a way that Google can discover; and (2) because we recognize that Google won’t build services in ways that serve all scholarly needs, work to strategically complement what they do. In the first instance (i.e., making sure that Google can discover resources), there will always be some content that we must, for legal or other reasons, block them from discovering.[2] These types of exceptions should add nuance to what we do in exposing content; a minimal sketch of this expose-and-block approach follows this list of principles. In the second instance, when it comes to building complementary services, we’ll need to be both smart (and well-informed) and strategic.

2. Openness: What we develop should easily support our building services and, even more importantly, should allow others to build them. It should take advantage of existing protocols, tools and services. Throughout this document, I want to be very clear that these principles or criteria don’t necessarily point to a specific tool or a specific way of doing things. Here, I would like to note that the importance of openness, though great, does not necessarily point to the need to do things as open source. As O’Reilly has written in his analysis of the emergence of Web 2.0, this is what we see in Amazon’s and Google’s architectures, where the mechanisms for building services are clearly articulated, but no one sees the code for their basic services: the investment shifts from shareable software to services. Similarly, our being open to having external services built on top of our own should not imply that our best or only route is open source software. What is particularly important is the need to have data around which others would like to build tools and services: openness in resources that few wish to include is really only beautifying a backwater destination.

3. Open source: Despite what I noted above about openness, we should try, wherever possible, to do our work with open source licensing models and we should try to leverage existing open source activities. In part, this is simply because, in doing so, we’ll be able to leverage the development efforts of others. We should also aim for this because of the increasing cost of poorly functioning commercial products in the library marketplace. Note, though, that when we choose to use open source software, it’s important to pick the right open source development effort—one that is indeed open and around which others are developing. Much open source software is isolated, with few contributions. We should aim for openness in our services over slavish devotion to open source. We should also choose this route when we can simply because it's the best economic model for software in our sphere.

4. Integration: Tight integration is not the most important characteristic of the systems we should build, nor should this sort of integration be an end in itself; however, we have an opportunity to optimize integration across all or most of our systems, making an investment in one area count for others. In Michigan’s MBooks repository, we have already begun to demonstrate some of the value in this type of integration by relying on the Aleph X-Server for access to bibliographic information, and we should continue to make exceptions to tighter integration only after careful deliberation. A key example is the use of metasearch for discovery of remote and local resources: we should need to address only a single physical or virtual repository for locally-hosted content. We should give due consideration to the value of “loose” integration (e.g., automatically copying information out of sources and into target systems), but the example of the Aleph X-Server has been instructive and shows the way this sort of integration can provide both increased efficiency and greater reliability in results.

5. Rapid development: If we take a long time to develop our next generation architecture, it will be irrelevant before we deploy it. I know this pressure is a classic tension point between Management and Developers: one perspective holds that we’re spending our time on fine-looking code rather than getting a product to the user, and the other argues that work done rapidly will be done poorly. This dichotomy is false. The last few years of Google’s “perpetual beta” and a rapidly changing landscape have underlined the need to build services quickly, while the importance of reliability and unforgiving user expectations have helped to emphasize the value of a quality product. We can’t do one without the other, and I think the issue will be scaling our efforts to the available resources, picking the right battles, and not being overambitious.
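
To make the expose-and-block idea in principle 1 concrete, here is a minimal sketch of how a repository might publish a sitemap for its public domain items while steering crawlers away from restricted paths. The hostnames, paths, and identifiers are invented for illustration; a real implementation would draw them from the repository's rights metadata.

```python
# A minimal, hypothetical sketch: expose what crawlers may see via a sitemap,
# and explicitly block what they may not via robots.txt. Paths and item IDs
# are invented; a real deployment would generate these from a rights database.

RESTRICTED_PREFIXES = ["/cgi/restricted/", "/fulltext/licensed/"]  # hypothetical paths

def make_robots_txt(restricted_prefixes):
    """Build a robots.txt that blocks crawlers only from restricted paths."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {prefix}" for prefix in restricted_prefixes]
    lines.append("Sitemap: https://example.edu/sitemap.xml")  # hypothetical host
    return "\n".join(lines) + "\n"

def make_sitemap(item_ids):
    """Build a simple XML sitemap listing one stable URL per public domain item."""
    urls = "\n".join(
        f"  <url><loc>https://example.edu/cgi/pt?id={item_id}</loc></url>"
        for item_id in item_ids
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{urls}\n</urlset>\n"
    )

if __name__ == "__main__":
    print(make_robots_txt(RESTRICTED_PREFIXES))
    print(make_sitemap(["item-0001", "item-0002"]))  # invented identifiers
```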

Directions
These sorts of defining principles are familiar and perhaps obvious, but what is less obvious is where all of this points. Although there are some clear indications that these sorts of principles are at play in, for example, the adoption of WorldCat Local or the integration of Fedora in VTLS’s library management system, there are also contradictory examples (e.g., the rush to enhance the local catalog, and many more silo-like systems like DSpace), and I’ve heard no articulations of an overarching integrated environment. If we undertake a massive restructuring of our IT infrastructure rather than strategic changes in some specific areas, or tweaking in many areas, it may appear to be an idiosyncratic and expensive development effort that robs the larger library organization of limited cycles for enhancements to existing systems. On the other hand, if we don’t position ourselves to take advantage of the types of changes I mentioned at the outset, we will polish the chrome on our existing investments for a few years until someone else gets this right or libraries are entirely irrelevant. Moreover, if we make the right sorts of choices in the current environment, we should also be able to capitalize on the efforts of others, thus compounding the return on each library’s investment. And of course, situating this discussion in a multi-institutional, cooperative effort minimizes the possibility that building the new architecture robs our institutions of scarce cycles.

It’s important, also, to keep in mind that this kind of perspective (i.e., the one I’m positing here) doesn’t presume to replace our existing technologies with something different. Many libraries have made many good choices on technologies that are serving their institutions well, and to the extent that they are the best or most effective tool for aligning with the principles I’ve laid out, we should use them. The X-Servers of Aleph and MetaLib are excellent examples of tools that allow the sort of integration we imagine. At UM, our own DLXS and the new repository software we developed are powerful and flexible tools without the overhead of some existing DL tools. But in each case, it may make more sense to migrate to a new technology because we are elaborating a model of broader integration (both locally and with the ‘net) that others may also use. Where there is a shared development community (e.g., Fedora, Evergreen or LibraryFind), we can benefit from a community of developers. In all of this, we’ll need a strategy, and a strategy that remains flexible as the landscape changes.

It’s time to see our environment as being comprised of a set of inventory management responsibilities (both print and digital, both local and remote) that leverages a growing and maturing array of network services so that our users can effectively discover and use the resources available to them. I think that requires a change in the way we think about our technologies and a much more strategic arrangement of those technologies in relation to each other. We may be stuck with a bunch of local print “repositories” because of the nature of print and the history of library development. That’s not the case for our digital repository, however. On top of this, we need to conceptualize the sorts of services we need (e.g., ingest, exposure, other types of dissemination, archiving, etc.) and the tools that can best accomplish these things.

Notes
[1] Incidentally, I also believe that Michigan’s organizational model, comprised as it is of five distinct IT departments, is ideally suited to building the next generation of access and management technologies. Core Services should continue to provide a foundation of technology relevant to all of our activities, and should continue to develop and maintain system integration services used by all of the Library’s IT units. Library Systems will need to continue to support operational activities such as circulation and cataloging at the same time that it manages our most important database of descriptive metadata. DLPS should continue to focus on technologies that manage and provide access to the digital objects themselves—the data described by those metadata. Web Systems is ideally suited to provide a top layer of discovery and “use” tools that tap into both local data resources and those things we license remotely. I believe that our current organizational model shares out responsibility effectively and allows for a sort of specialization that is complementary; however, I wouldn’t rule out different organizational models if they made sense in the course of this process. For those readers outside the UM Library, the fifth department is Desktop Support Services, responsible not only for the desktop platform but also for the infrastructure supporting it.

[2] For example, with regard to Deep Blue, our institutional repository, in Michigan’s agreement with Wiley, approximately 33% of the Wiley-published/UM-authored content is restricted to UM users; and in our agreement with Elsevier, we may make it possible for Google to discover metadata but not fulltext. Similar things are bound to occur in the materials we put online in services other than Deep Blue.

Our hidden digital libraries (July 27, 2008)

Two of my very talented colleagues, Kat Hagedorn and Josh Santelli, just published a nice piece in D-Lib entitled "Google Still Not Indexing Hidden Web URLs." Kat and Josh and I have discussed this problem off and on, stimulated in part by our frustrations in getting the OAI data collected by OAIster into search services like Google.

In preparation for a recent talk in China (on the challenges and opportunities for digital libraries), I talked to Kat and Josh about the extent of OAIster data not findable through standard web searches. That so much of our digital library content is not findable through standard search engines has always been a troublesome issue, and I would have expected that with the passage of time, this particular problem would have been solved. It hasn't, and that has made me wonder about what we do in digital libraries and how we do it.

Kat's and Josh's numbers are compelling. OAIster focuses on the hidden web--resources not typically stored as files in a crawlable web directory--and so OAIster, with its 16 million records, is a particularly good resource for finding digital library resources. Kat and Josh conclude that more than 55% of the content in OAIster can't be found in Google.
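
As a side note for readers unfamiliar with how an aggregator like OAIster gathers these records, the sketch below shows a bare-bones OAI-PMH harvesting loop. The endpoint URL is a placeholder, but the request parameters (verb=ListRecords, a metadata prefix such as oai_dc, and resumptionToken paging) follow the OAI-PMH 2.0 protocol.

```python
# A bare-bones OAI-PMH harvesting loop of the kind aggregators rely on.
# The endpoint URL is a placeholder; the verbs and paging follow OAI-PMH 2.0.

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://repository.example.edu/oai"  # placeholder endpoint

def harvest(base_url, metadata_prefix="oai_dc"):
    """Yield raw <record> elements, following resumption tokens until exhausted."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI_NS + "record"):
            yield record
        token = tree.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

if __name__ == "__main__":
    for i, rec in enumerate(harvest(BASE_URL)):
        if i >= 5:  # just peek at the first few records
            break
        identifier = rec.find(OAI_NS + "header/" + OAI_NS + "identifier")
        print(identifier.text if identifier is not None else "(no identifier)")
```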

As much as I like Kat's and Josh's analysis, I draw a different conclusion from the data. They write that, "[g]iven the resulting numbers from our investigation, it seems that Google needs to do much more to gather hidden resources." This perspective is one many of us share. We're inclined to point a finger at Google (or other search engines) and wish they tried harder to look into our arcane systems. We believe that if only Google and others had a deeper appreciation of our content or tried harder, this problem would go away. I've been fortunate enough to be able to try to advance this argument one-on-one with the heads of Google and Google Scholar, and their responses are similar--too much trouble for the value of the content. As time has passed, I've come to agree.

Complexity in digital library resources is at the heart of our work, and is frankly one reason why many of us find the work so interesting. Anyone who thinks that the best way to store the 400,000 pages (140+ million words) of the texts in the Patrologia Latina is as a bunch of static web pages knows nothing of the uses or users of that resource or what's involved in managing it. Similarly, to effectively manage the tens of thousands of records for a run-of-the-mill image collection, you can't store them as individual HTML pages lacking well-defined fields and relationships. These things are obvious to people in our profession.

We often go wrong, however, when we try to share our love of complexity with the consumers. We've come to understand that success in building our systems involves making complicated uses possible without at the same time requiring the user to have a complicated understanding of the resource. What we must also learn is that a simplified rendering of the content, so that it can be easily found by the search engines, is not an unfortunate compromise, but rather a necessary part of our work.

Will it be possible in all cases to break down the walls between the complex resources and the simple ways that web crawlers need to understand them? Absolutely not. The growing sophistication of the search engines does ensure that it gets easier with time, however. About a decade ago, we tried populating directories with tiny HTML files created from records in image databases. The crawlers gave up after picking up a few thousand records, apparently daunted by the vastness of this content. Now, however, this sort of approach works and only requires patience as the crawlers make repeated passes over the content. Large and complex text collections can be modeled as simplified text files, and the search engines can be tricked into pointing the user to the appropriate entry point for the work from which the text is drawn.
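
As a rough illustration of that kind of simplified rendering, here is a sketch that flattens database records into tiny static HTML surrogates a crawler can index, each linking back into the full interface. The record fields, file paths, and URLs are hypothetical.

```python
# A sketch of the "simplified rendering" approach: flatten each database record
# into a small static HTML page that a crawler can index, with a link back into
# the full interface. Record fields and URLs are invented for illustration.

import html
import pathlib

SAMPLE_RECORDS = [  # stand-ins for rows pulled from an image or text database
    {"id": "img-0001", "title": "Lantern slide of the Diag, 1905", "creator": "Unknown"},
    {"id": "img-0002", "title": "Engineering Arch, winter view", "creator": "Unknown"},
]

def record_to_html(record):
    """Render one record as a minimal, crawlable HTML surrogate."""
    title = html.escape(record["title"])
    creator = html.escape(record["creator"])
    full_view = f"https://images.example.edu/view?id={record['id']}"  # hypothetical
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{title}</title></head><body>\n"
        f"<h1>{title}</h1>\n<p>Creator: {creator}</p>\n"
        f'<p><a href="{full_view}">View in the full collection interface</a></p>\n'
        "</body></html>\n"
    )

def write_surrogates(records, out_dir="surrogates"):
    """Write one static HTML file per record into a crawlable directory."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for record in records:
        (out / f"{record['id']}.html").write_text(record_to_html(record), encoding="utf-8")

if __name__ == "__main__":
    write_surrogates(SAMPLE_RECORDS)
```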

One thing the analysis of the OAIster data shows is that, as a community, we have not availed ourselves of these relatively simple solutions to making our resources more widely discoverable. Not all of the challenges of modeling digital library resources are this easy. There are bigger challenges that require more creative solutions, but creating these solutions is part of the job of putting the resources online, not a nuisance or distraction from that job.