Sunday, January 2, 2011

Did I say "theoretical"? Openness and Google Books digitization (Apr. 25, 2008)

When I wrote this piece in 2008, HathiTrust was in the works but had not yet been born, and the piece stimulated the sort of dialogue that I believe is very important. Rather than try to reattach comments made at that time by Carl Malamud and Brewster Kahle, I include them here in this post because of their relevance to the debate. Many circumstances have since changed (e.g., HathiTrust, with its preservation orientation, is now a significant piece of the landscape; HathiTrust makes full book downloads available to authenticated users; and our efforts have grown so much that the mere availability--what I argue is a form of openness--has been embraced as transformational). These changes make some of the argument feel dated, but it remains an important record nonetheless. I've attempted to capture the flow and content of the original blog entry and comments, and for reference purposes have stored a PDF of the original piece with comments on Scribd.
I was recently quoted in an AP article (published here in Salon) as saying that Brewster Kahle's position with regard to the openness of Google-digitized public domain content is "theoretical." Well, I sure thought I said "polemical," but them's the breaks. Brewster argues that Google's work in digitizing the public domain essentially locks it up--puts it behind a wall and makes it their own--and that this is a loss in a world that loves openness. The contrast here is meant to be with the work of the Open Content Alliance, where the same public domain work might be shared freely, transferred to anyone, anywhere, and used for any purpose. I don't want to get into the quibble here about the constraints on that apparently open-ended set of permissions (i.e., that an OCA contributor may end up putting restrictions on materials that look worse than Google's). What's key here for me, though, is the real practical part of openness--what most people want and what's possible through what Michigan puts online.

I think all of this debate compels us to ask the question "what is open?" For the longest time (since the mid-1990s), Michigan has digitized public domain content and made it freely viewable, searchable and printable. Anyone, anywhere could come to a collection like Making of America and read, search and print to his heart's delight. If the same user wanted to download the OCR, that too was made possible and, in fact, the Distributed Proofreaders project has made good use of this and other MOA functionality. We didn't make it possible for anyone to get a collection of our source files because we were actively involved in setting up Print-on-Demand (POD): POD typically has up-front, per-title costs, and making the source files available would have cost us some sales that might otherwise have paid for that initial investment. As we moved into the agreement with Google, we made clear our intention to do the same "open" thing with the Google-digitized content, and to throw in our lot with a (then) yet-to-be-defined multi-institutional "Shared Digital Repository." In fact, we now have hundreds of thousands of public domain works online, all of which are readable, searchable and printable by anyone in the world in much the same way.

So, what's the beef? The OCA FAQ states that for them this openness means that "textual material will be free to read, and in most cases, available for saving or printing using formats such as PDF." By all means! I hope it's clear by what I wrote above that this is an utterly accurate description of what happens when Google digitizes a volume from Michigan's collection and Michigan puts it online. It's also, incidentally, what Google makes possible, but even if Google didn't, Michigan could and would be rushing in to fill that breach. The challenges to Google's openness always seem to ignore what's actually possible through our copies at Michigan. This sort of polarizing rhetoric seems to be about making a point that's not accurate in the service of an attack on Google's primacy in this space: we don't want them to dominate the landscape, so let's characterize their Bad version as being the opposite of our Good version. This notion that what Google does is closed is not an accurate description of Google's version of these books, and even less so a description of Michigan's.

Could the Google books be more open? Absolutely. Along with Carl Malamud, for example, I would love to see all of the government documents that have been digitized by Google available for transfer to other entities so that the content could be improved and integrated into a wide variety of systems, thus opening up our government as well as our libraries. I believe that will happen, in fact, and that Google will one day (after they've had a chance to gain some competitive advantage) open up far more. In the meantime, however, when we talk about "open," let's mean it the way that the OCA FAQ means it. Let's mean it in the same way that the bulk of our audience means it. Let's talk about the ability to read, cite and search the contents of these books, and let's call the Google Books project and particularly Michigan's copies Open. Let's stop being theoretical, er, I mean polemical.

Carl Malamud responded by quoting my "for saving or printing using formats such as PDF" and then went on to argue:
> for saving or printing using formats such as PDF

John, pardon me if I don’t grok Mirlyn, but I pulled up a public domain document (a congressional hearing). I was able to pull the text up and page through, but there didn’t appear to be an easy way to save a single page, let alone the entire hearing. Perhaps that function is available to Michigan students, but I suspect the rest of the citizens of Michigan are in the same boat as the rest of us using the crippled interface.

I think it is fine that Michigan and Google have their arrangement, but it is disturbing when we see a state-funded institution like U. of Michigan putting up artificial barriers to access.

Your Mirlyn site is ok as far as web sites go, but letting a thousand flowers bloom always leads to more innovation. It would be great if any grad student in Ann Arbor (or anyplace else) could download your govdocs docs and come up with a better user interface.

(In addition to more innovation, that policy would lead to a more informed citizenry, which is generally considered an important part of democracy and I suspect is part of your state-sponsored mandate.)
I responded to Carl, saying:
Gosh, Carl, I think the best way I can respond is not only to say that I whole-heartedly agree with your call for more vigorous sharing, but to point to my fourth paragraph, where I point to your work and urge the same thing. Look, my point is that while this is good, and we are fighting for deeper sharing, this sort of thing is a fairly narrow piece of the openness issue.

On your point about the functionality and the opaqueness of getting PDFs, we’ll take that into account in our usability work. It’s there, and we can do better. I should note that for us larger PDF chunks are also a resource issue, but we’re very close to releasing a new version that gives you 10 pages at a time. Personally, I like the screen-resolution PNG files and very much dislike PDF as a format, but that’s a usability position and not a philosophical one.
Later, Brewster Kahle weighed in on the issue of openness with:
John– while it may not be appropriate to start this in a comment, I am quite taken aback by your seeming implication that “open” includes what Google is doing and what UMich is doing.

“Open” started to be widely used in the Internet community in association with certain software. Richard Stallman calls it “free”, but “open” has come to be used as well. Let’s start with that.

“Open Source” in that community means the source code can be downloaded in bulk, read, analyzed, modified, and reused.

“Open Content” has followed much the same trajectory. Creative Commons evolved a set of licenses to help the widespread downloading of creative works, or “content”. Downloading, and downloading in bulk, is part of this overall approach as we see it at the Internet Archive.

Researchers (and more general users, but we can stick with researchers because they are a community that research libraries are supposed to serve) require downloadability of materials so they can be read, compared, analyzed, and recontextualized.

Page at a time interfaces, therefore, would not be “open” in this sense. Downloadable crippled versions would not be open in the Open Source or Open Content sense either.

As a library community, we can build on the traditions from the analog world of sharing widely even as we move into the digital world. We see this as why we get public support.

Let’s build that open world.

We would be happy to work with UMich to support its open activities.
My response:
I think this is precisely the sort of rhetoric that’s muddying the waters right now, Brewster. There is no uniformly defined constituency called “researchers” who “require downloadability.” I know ‘em, I work with ‘em, and I know that’s not true. Access (and openness) is defined on a continuum. What we do is extraordinarily open and has made a tremendous difference for research and in the lives of ordinary users. This sort of differentiation in the full accessibility of source materials is one of the key incentives that has brought organizations like Google and Microsoft to the table, and if it didn’t make sense, the OCA wouldn’t go to pains to stipulate that “all contributors of collections can specify use restrictions on material that they contribute.” Is more open better? Damned right. That’s one reason why for two years we’ve been offering OCA the texts Michigan digitizes as part of its own in-house work. But is what we’re doing with Google texts open? Absolutely.
Carl followed with a practical example:
I’m not sure I get all these degrees of open … let me add a hypothetical if that helps clear this up.

What if a bunch of students in Ann Arbor organized themselves into a Democracy Club and started grabbing all the public domain documents they can find on MBooks and uploading them to some site such as scribd.com or pacer.resource.org for recycling? If the docs are open (and we’re just talking “works of the government” which are clearly in the public domain), would you consider that a mis-use of your system and try and stop it or would that fall inside of the open side of the open continuum we’re all trying to mutually understand in this dialogue?

Hypothetically speaking, of course. I’m not advocating that students form a Democracy Club and crawl your site to recycle public domain materials, I’m just trying to understand if the restrictions on reuse are passive ones like obscuring how to download files or if these are active restraints where the library is involved in enforcing restrictions on access to public domain materials.

Again, I’m not at all suggesting that students interested in furthering the public domain form Democracy Clubs and start harvesting documents from the public taxpayer-financed web sites at UMich and re-injecting them into the public domain.
My response brought in technologies that have since been introduced:
What if? If there really were that sort of interest, I’d hope that we’d have a chance to talk to the students and make sure they were aware of powerful options to make “in situ” use of the openly accessible government documents that they find in MBooks. I’d want to make sure they knew that in late June we’re releasing a “collection builder” application that will allow them to leverage our investment in permanent (did I say permanent?) curation of these materials so that the materials could be found and used after the current crop of students comes and goes, that the students could add to the body of works as more get digitized from our collection and the collections of other partner libraries (e.g., Wisconsin’s are coming in soon) and that we would want to hear what sorts of services (an RSS feed of newly added gov docs?) might aid them in their work. I’d want to talk to them about the issue of authority and quality, and would see if there were ways that their efforts could help improve the works in MBooks rather than dispersing the effort to copies in multiple places. And if they needed computational resources to do things like data mining, I’d let them know that we’re glad to help. But if none of this satisfied them, would we try to stop them? Assuming Google digitized the works, according to our agreement (4.4.1) we would make “reasonable efforts … to prevent [them] from … automated and systematic downloading” of the content, something we currently do and which does not undermine the ability of those same students to read, search and print the documents. Lots of openness there.
Finally, an anonymous writer wondered "Why give Google competitive advantage? How is this the role of a library vis-à-vis a vendor?" to which I responded:
An interesting question. Whether it *should* be (the role) or not isn’t really a question at this point, with decades of examples of libraries working with “vendors” in ways that leveraged library collections for what you might view as a vendor’s competitive advantage. Obviously, some deals were more selfless than others, and many have involved royalties or discounts provided to the libraries. In fact, *some* sort of relationship is absolutely necessary because of the inaccessibility of the materials in our collections (publishers, for example, don’t have the titles and frequently don’t know what they’ve published). On the other hand, the nature of the deal is what’s at issue, and we believe that having a copy of the files for long-term preservation and meaningfully open access, and fairly liberal terms for the way that we use the materials, is a good exchange for the hundreds of millions of dollars worth of work entailed in doing the scanning.

My first Big Green Egg pizza (Aug. 11, 2008)

I have to say that I was skeptical about the Big Green Egg making a real difference in cooking a pizza, but I'm convinced that this is a game changer. But let's start at the beginning.

If you're devoted to making pizza, you know heat is a big part of success. Our home oven, part of a big dual-fuel setup, uses convection and does a solid 550 degrees without resorting to crazy stuff like hacking the latch for the self-cleaning oven. (Yeah, believe it or not, it's been done.) A colleague with a similar obsession has complained that his oven doesn't reach these temps, and pictures of his pizzas show it. It's gotta be hot and it's gotta cook quickly: you want it brown without the pizza getting dried out. And although heat is a key piece, what every pizza maker knows s/he really wants is a wood-fired pizza oven. Now this is not as absurd a dream as you'd think. There are several models designed for home use (see forno bravo or le panyol, for example), and at least one I've run into is designed so that you can use it indoors--maybe it doubles as an inefficient heat source. However, as technically feasible as a home wood-fired oven is, it feels like a big investment. I can dream, of course.

In the meantime, we recently ran into something called the Big Green Egg. This thing, a kamado-style cooker, generates extremely high heat, serves primarily as a grill, doubles as an oven (and a smoker), and (though still pricey) costs a lot less than a wood-fired oven. Maria and I grill a lot and wanted to incorporate things like spatchcocked chickens into our repertoire, so we decided to give the Big Green Egg a shot.

My first effort at this was relatively successful, with a few problems that leave me opportunities for refining things. I should note that it's relatively easy to get the Big Green Egg up to a mighty 650 degrees, and the BGE has accessories available (like the "plate setter") that make this process pretty straightforward. With the plate setter in place, I went with an American Metalcraft PS1575, a pizza stone made from fire brick and thus much safer at these high heats. (The PS1575 is supposed to be 15.75" in diameter. Mine was a full 16" and may have contributed to some minor damage to my gasket.) This still left ample room to slide the pizza onto the stone without losing ingredients over the side. At 650 degrees, the pizza cooked in slightly more than 10 minutes. As you'll see in the two pictures below, this created a nicely cooked crust with a little (and very tasty) burning below, and browned toppings. For this first effort, I stuck with a classic margherita:

top view of pizza

side view of pizza

This was, without a doubt, the best home pizza crust we've ever done, and considering the number of pizzas we've cooked, that's saying something. The crust was noticeably more flavorful and the whole thing did have a slightly smoky taste. I'll admit that I expected the pizza to be no different from the ones we've cooked in the oven, but I was definitely wrong.

As usual, I won't try to reproduce the wealth of information on the web about cooking a pizza on a Big Green Egg. The very helpful Naked Whiz site does a very nice job covering all elements of cooking pizza on the Big Green Egg, including addressing issues of the size of the pizza stone.

What challenges lie ahead? I've had a hard time getting my BGE over 650 degrees and would like to try a slightly higher temperature. In putting the pizza into the BGE, I need to get in and out a little more quickly to avoid losing temperature. And I'm going to need to explore what the issues are around the gasket burning, a problem that might be related to the size of the stone, but which was just as likely to be a result of the gasket having been poorly installed (and protruding into the BGE).

Next generation Library Systems (Nov. 16, 2007)

The problem
With the backdrop of the widely touted lessons of Amazoogle—an expression I can barely stand to write—three of the more interesting emerging developments of late have been OCLC’s WorldCat Local, Google Book Search, and Google Scholar. As Lorcan Dempsey argued, the "massive computational and data platforms [of Google, Amazon and EBay] exercise [a] strong gravitational web attraction," a sort of undeniable central force in the solar system of our users’ web experience. What has happened with WorldCat Local, Google Book Search and Google Scholar has extended that same sort of pull to key scholarly discovery resources. No one needed the OCLC environmental scans to be reminded that our users look to Google before they turn to the multi-million dollar scholarly resources that we purchase for them, and everyone was aware that Amazon satisfied a broad range of discovery needs more effectively than the local catalog. Now, however, mainstream “network services” like Amazon and Google web search, deficient in their ability to satisfy scholarly discovery, are complemented by similarly “massive computational and data platforms” that specialize in just that—finding resources in the scholarly sphere. These forces, and perhaps more like them in the future, should influence the way that we design and build our library systems. If we ignore these types of developments, choosing instead to build systems with ostensibly superior characteristics, systems that sit on the margins, we effectively ensure our irrelevance, building systems for an idealized user who is practically non-existent.

Our resources, skills and investments have helped to create an opportunity for us to shape a next generation of library systems, simultaneously cognizant of the strong network layer and our needs and responsibilities as a preeminent research library. At Michigan, we have designed and built our past systems, each in partial isolation from the others, reflecting the state of library technology and our response to user needs. We were not wrong in the way that we developed our systems, but rather we were right for those times. In building things in this way, we have developed an LMS support team with extraordinary talent and responsiveness, a digital library systems development effort that blazed trails and continues to be valued for the solidity of its product, and base-funded IT infrastructure that is utterly rock-solid--all great, but generally as independently conceived efforts.[1] What libraries like ours must do now is reconceive our efforts in light of the changed environment. The reconceptualization should, as mentioned, not only be built with an awareness of the new destinations our users choose, but also with a recognition that we have a special responsibility for the long-term curation of library assets. Even at its most successful, Google Scholar does not include all of the roughly $8m in electronic resources that we purchase for the campus, and Google Book Search is not designed to support the array of activities that we associate with scholarship.

Knowing that we must change where we invest our resources is one thing; knowing where we must invest is another. I don’t believe I should (or could) paint an accurate picture of the sorts of shifts we should make. On the other hand, I can lay out here a number of key principles that should guide our work.

Principles
1. Balanced against network services: I believe this is probably the most important principle in the design of what we must build. We must not try to do what the network can do for us. We must find ways to facilitate integration with network services and ensure that our investment is where our role is most important (e.g., not trying to compete with the network services unless we think we can and should displace them in a key area). For example, we have recognized that Google will be a point of discovery, and so rather than trying to duplicate what they do well for the broad masses of people, we should (1) put all things online in a way that Google can discover; and (2) because we recognize that Google won’t build services in ways that serve all scholarly needs, work to strategically complement what they do. In the first instance (i.e., making sure that Google can discover resources), we will always need to block them, for legal or other reasons, from discovering some content.[2] These types of exceptions should add nuance to what we do in exposing content. In the second instance, when it comes to building complementary services, we’ll need to be both smart (and well-informed) and strategic. (A brief sketch of the exposure side of this principle follows the list of principles.)

2. Openness: What we develop should easily support our building services and, even more importantly, should allow others to build them. It should take advantage of existing protocols, tools and services. Throughout this document, I want to be very clear that these principles or criteria don’t necessarily point to a specific tool or a specific way of doing things. Here, I would like to note that the importance of openness, though great, does not necessarily point to the need to do things as open source. As O’Reilly has written in his analysis of the emergence of Web 2.0, this is what we see in Amazon’s and Google’s architectures, where the mechanisms for building services are clearly articulated, but no one sees the code for their basic services: the investment shifts from shareable software to services. Similarly, our being open to having external services built on top of our own should not imply that our best or only route is open source software. What is particularly important is the need to have data around which others would like to build tools and services: openness in resources that few wish to include is really only beautifying a backwater destination.

3. Open source: Despite what I noted above about openness, we should try, wherever possible, to do our work with open source licensing models and we should try to leverage existing open source activities. In part, this is simply because, in doing so, we’ll be able to leverage the development efforts of others. We should also aim for this because of the increasing cost of poorly functioning commercial products in the library marketplace. Note, though, that when we choose to use open source software, it’s important to pick the right open source development effort—one that is indeed open and around which others are developing. Much open source software is isolated, with few contributions. We should aim for openness in our services over slavish devotion to open source. We should also choose this route when we can simply because it's the best economic model for software in our sphere.

4. Integration: Tight integration is not the most important characteristic of the systems we should build, nor should this sort of integration be an end in itself; however, we have an opportunity to optimize integration across all or most of our systems, making an investment in one area count for others. In Michigan’s MBooks repository, we have already begun to demonstrate some of the value in this type of integration by relying on the Aleph X-Server for access to bibliographic information, and we should continue to make exceptions to tighter integration only after careful deliberation. A key example is the use of metasearch for discovery of remote and local resources: we should need to address only a single physical or virtual repository for locally-hosted content. We should give due consideration to the value of “loose” integration (e.g., automatically copying information out of sources and into target systems), but the example of the Aleph X-Server has been instructive and shows the way this sort of integration can provide both increased efficiency and greater reliability in results.

5. Rapid development: If we take a long time to develop our next generation architecture, it will be irrelevant before we deploy it. I know this pressure is a classic tension point between Management and Developers: one perspective holds that we’re spending our time on fine-looking code rather than getting a product to the user, and the other argues that work done rapidly will be done poorly. This dichotomy is false. The last few years of Google’s “perpetual beta” and a rapidly changing landscape have underlined the need to build services quickly, while the importance of reliability and unforgiving user expectations have helped to emphasize the value of a quality product. We can’t do one without the other, and I think the issue will be scaling our efforts to the available resources, picking the right battles, and not being overambitious.
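
To make that first principle concrete (and as promised above), here is a minimal sketch, in Python, of the "expose what we can, withhold what we must" posture: walk a set of repository records and write a sitemap listing only the items a crawler is allowed to discover. The record fields, rights codes and URL pattern are invented for illustration; this is not our production scheme.

    # Minimal sketch: emit a sitemap of crawler-discoverable items only.
    # Record structure, rights codes and URLs are hypothetical.
    from xml.sax.saxutils import escape

    records = [
        {"id": "pd-0001", "rights": "pd"},   # public domain: expose
        {"id": "ic-0002", "rights": "ic"},   # in copyright: block discovery
    ]

    with open("sitemap.xml", "w") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for rec in records:
            if rec["rights"] != "pd":        # legal or other reasons to withhold
                continue
            loc = "https://books.example.edu/cgi/pt?id=" + escape(rec["id"])
            out.write("  <url><loc>%s</loc></url>\n" % loc)
        out.write("</urlset>\n")

The file format matters less than the posture: discovery happens on the network's terms, while the exceptions stay under our control.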

Directions
These sorts of defining principles are familiar and perhaps obvious, but what is less obvious is where all of this points. Although there are some clear indications that these sorts of principles are at play in, for example, the adoption of WorldCat Local or the integration of Fedora in VTLS’s library management system, there are also contradictory examples (e.g., the rush to enhance the local catalog, and many more silo-like systems like DSpace), and I’ve heard no articulations of an overarching integrated environment. If we undertake a massive restructuring of our IT infrastructure rather than strategic changes in some specific areas, or tweaking in many areas, it may appear to be an idiosyncratic and expensive development effort that robs one's larger library organization of limited cycles for enhancements to existing systems. On the other hand, if we don’t position ourselves to take advantage of the types of changes I mentioned at the outset, we will polish the chrome on our existing investments for a few years until someone else gets this right or libraries are entirely irrelevant. Moreover, if we make the right sorts of choices in the current environment, we should also be able to capitalize on the efforts of others, thus compounding the return on each library’s investment. And of course, situating this discussion in a multi-institutional, cooperative effort minimizes the possibility that building the new architecture robs our institutions of scarce cycles.

It’s important, also, to keep in mind that this kind of perspective (i.e., the one I’m positing here) doesn’t presume to replace our existing technologies with something different. Many libraries have made many good choices on technologies that are serving their institutions well, and to the extent that they are the best or most effective tool for aligning with the principles I’ve laid out, we should use them. The X-Servers of Aleph and MetaLib are excellent examples of tools that allow the sort of integration we imagine. At UM, our own DLXS and the new repository software we developed are powerful and flexible tools without the overhead of some existing DL tools. But in each case, it may make more sense to migrate to a new technology because we are elaborating a model of broader integration (both locally and with the ‘net) that others may also use. Where there is a shared development community (e.g., Fedora, Evergreen or LibraryFind), we can benefit from a community of developers. In all of this, we’ll need a strategy, and a strategy that remains flexible as the landscape changes.
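
For readers who have not worked with it, the X-Server style of integration looks roughly like the following: one call runs a query against the catalog and returns a result set, and a second call pulls individual records from that set as XML that other systems (MBooks, for instance) can consume. This is a hedged sketch only; the host, logical base and parameter names are assumptions that vary by installation, so check them against your own Aleph X-Services documentation.

    # Rough sketch of pulling bibliographic records via the Aleph X-Server.
    # Host, base and parameter details below are assumptions, not a spec.
    import urllib.parse
    import urllib.request

    ALEPH_X = "http://aleph.example.edu/X"   # hypothetical X-Server endpoint

    def find_set(query, base="ABC01"):
        """Run a search and return the raw XML, which includes a set number."""
        params = urllib.parse.urlencode({"op": "find", "base": base, "request": query})
        with urllib.request.urlopen(ALEPH_X + "?" + params) as resp:
            return resp.read().decode("utf-8")

    def present(set_number, entry="000000001"):
        """Fetch one record from a previously created set as XML."""
        params = urllib.parse.urlencode({"op": "present",
                                         "set_number": set_number,
                                         "set_entry": entry})
        with urllib.request.urlopen(ALEPH_X + "?" + params) as resp:
            return resp.read().decode("utf-8")

Wrapped this way, the catalog stays the single source of bibliographic truth while any number of local systems read from it.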

It’s time to see our environment as being comprised of a set of inventory management responsibilities (both print and digital, both local and remote) that leverages a growing and maturing array of network services so that our users can effectively discover and use the resources available to them. I think that requires a change in the way we think about our technologies and a much more strategic arrangement of those technologies in relation to each other. We may be stuck with a bunch of local print “repositories” because of the nature of print and the history of library development. That’s not the case for our digital repository, however. On top of this, we need to conceptualize the sorts of services we need (e.g., ingest, exposure, other types of dissemination, archiving, etc.) and the tools that can best accomplish these things.

Notes
[1] Incidentally, I also believe that Michigan’s organizational model, comprised as it is of five distinct IT departments, is ideally suited to building the next generation of access and management technologies. Core Services should continue to provide a foundation of technology relevant to all of our activities, and should continue to develop and maintain system integration services used by all of the Library’s IT units. Library Systems will need to continue to support operational activities such as circulation and cataloging at the same time that it manages our most important database of descriptive metadata. DLPS should continue to focus on technologies that manage and provide access to the digital objects themselves—the data described by those metadata. Web Systems is ideally suited to provide a top layer of discovery and “use” tools that tap into both local data resources and those things we license remotely. I believe that our current organizational model shares out responsibility effectively and allows for a sort of specialization that is complementary; however, I wouldn’t rule out different organizational models if they made sense in the course of this process. For those readers outside the UM Library, the fifth department is Desktop Support Services, responsible not only for the desktop platform but also for the infrastructure supporting it.

[2] For example, with regard to Deep Blue, our institutional repository, in Michigan’s agreement with Wiley, approximately 33% of the Wiley-published/UM-authored content is restricted to UM users; and in our agreement with Elsevier, we may make it possible for Google to discover metadata but not fulltext. Similar things are bound to occur in the materials we put online in services other than Deep Blue.

Our hidden digital libraries (July 27, 2008)

Two of my very talented colleagues, Kat Hagedorn and Josh Santelli, just published a nice piece in D-Lib entitled "Google Still Not Indexing Hidden Web URLs." Kat and Josh and I have discussed this problem off and on, stimulated in part by our frustrations in getting the OAI data collected by OAIster into search services like Google.

In preparation for a recent talk in China (on the challenges and opportunities for digital libraries), I talked to Kat and Josh about the extent of OAIster data not findable through standard web searches. That so much of our digital library content is not findable through standard search engines has always been a troublesome issue, and I would have expected that, with the passage of time, this particular problem would have been solved. It hasn't, and that has made me wonder about what we do in digital libraries and how we do it.

Kat's and Josh's numbers are compelling. OAIster focuses on the hidden web--resources not typically stored as files in a crawlable web directory--and so OAIster, with its 16 million records, is a particularly good resource for finding digital library resources. Kat and Josh conclude that more than 55% of the content in OAIster can't be found in Google.
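
For readers outside the metadata world, OAIster gathers those millions of records with the OAI-PMH protocol rather than by crawling web pages, which is exactly why so much of this material never shows up in a crawler's index. A harvesting loop looks roughly like this minimal sketch; the repository URL is a placeholder, while the verb and parameters are standard OAI-PMH.

    # Minimal OAI-PMH harvesting loop (the mechanism behind OAIster).
    # The base URL is a placeholder for a real data provider.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
    BASE = "https://repository.example.edu/oai"

    def list_records(metadata_prefix="oai_dc"):
        """Yield every <record> element, following resumption tokens."""
        token = None
        while True:
            params = {"verb": "ListRecords"}
            if token:
                params["resumptionToken"] = token
            else:
                params["metadataPrefix"] = metadata_prefix
            url = BASE + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                root = ET.fromstring(resp.read())
            for record in root.iter(OAI_NS + "record"):
                yield record
            token_el = root.find(".//" + OAI_NS + "resumptionToken")
            if token_el is None or not (token_el.text or "").strip():
                break
            token = token_el.text.strip()

Each harvested record carries a header and, for oai_dc, a small Dublin Core block that an aggregator can index or re-expose; none of that, by itself, puts the described object anywhere a web crawler will find it.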

As much as I like Kat's and Josh's analysis, I draw a different conclusion from the data. They write that, "[g]iven the resulting numbers from our investigation, it seems that Google needs to do much more to gather hidden resources." This perspective is one many of us share. We're inclined to point a finger at Google (or other search engines) and wish they tried harder to look into our arcane systems. We believe that if only Google and others had a deeper appreciation of our content or tried harder, this problem would go away. I've been fortunate enough to be able to try to advance this argument one-on-one with the heads of Google and Google Scholar, and their responses are similar--too much trouble for the value of the content. As time has passed, I've come to agree.

Complexity in digital library resources is at the heart of our work, and is frankly one reason why many of us find the work so interesting. Anyone who thinks that the best way to store the 400,000 pages (140+ million words) of the texts in the Patrologia Latina is as a bunch of static web pages knows nothing of the uses or users of that resource or what's involved in managing it. Similarly, to effectively manage the tens of thousands of records for a run-of-the-mill image collection, you can't store them as individual HTML pages lacking well-defined fields and relationships. These things are obvious to people in our profession.

We often go wrong, however, when we try to share our love of complexity with the consumers. We've come to understand that success in building our systems involves making complicated uses possible without at the same time requiring the user to have a complicated understanding of the resource. What we must also learn is that a simplified rendering of the content, so that it can be easily found by the search engines, is not an unfortunate compromise, but rather a necessary part of our work.

Will it be possible in all cases to break down the walls between the complex resources and the simple ways that web crawlers need to understand them? Absolutely not. The growing sophistication of the search engines does ensure that it gets easier with time, however. About a decade ago, we tried populating directories with tiny HTML files created from records in image databases. The crawlers gave up after picking up a few thousand records, apparently daunted by the vastness of this content. Now, however, this sort of approach works and only requires patience as the crawlers make repeated passes over the content. Large and complex text collections can be modeled as simplified text files, and the search engines can be tricked into pointing the user to the appropriate entry point for the work from which the text is drawn.
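
As a deliberately simplified illustration of that decade-old tactic, something like the following flattens database records into tiny static pages, plus an index page for the crawler to walk; the record fields and URLs are invented for the example rather than drawn from any actual collection.

    # Sketch: render database records as small, crawlable HTML pages.
    # Fields and URLs are invented for illustration.
    import os
    from xml.sax.saxutils import escape

    records = [
        {"id": "img-0001", "title": "Shoreline view, 1905"},
        {"id": "img-0002", "title": "Street scene, 1921"},
    ]

    os.makedirs("crawlable", exist_ok=True)
    links = []
    for rec in records:
        with open("crawlable/%s.html" % rec["id"], "w") as page:
            page.write("<html><head><title>%s</title></head><body>\n" % escape(rec["title"]))
            page.write("<h1>%s</h1>\n" % escape(rec["title"]))
            # Send the user (and the search engine) to the full-featured interface.
            page.write('<p><a href="https://images.example.edu/record/%s">Full record</a></p>\n'
                       % rec["id"])
            page.write("</body></html>\n")
        links.append('<li><a href="%s.html">%s</a></li>' % (rec["id"], escape(rec["title"])))

    with open("crawlable/index.html", "w") as index:
        index.write("<html><body><ul>\n%s\n</ul></body></html>\n" % "\n".join(links))

The pages themselves are deliberately dumb; their only job is to give the crawler a path in and then hand the user over to the full-featured interface.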

One thing the analysis of the OAIster data shows is that, as a community, we have not availed ourselves of these relatively simple solutions to making our resources more widely discoverable. Not all of the challenges of modeling digital library resources are this easy. There are bigger challenges that require more creative solutions, but creating these solutions is part of the job of putting the resources online, not a nuisance or distraction from that job.

Thursday, December 30, 2010

Revisiting Pizza on the Big Green Egg (May 25, 2009)

For the record, I wanted to add a few small modifications to what I've written about pizza on the Big Green Egg.  When I first started these, I thought the pizzas on the BGE were pretty good, but didn't really measure up to what I've had from a wood-fired pizza oven.  I've learned a few things, though, and can sincerely say that when the stars align I can produce something as good as any pizza I've ever had from a wood-fired oven.

pizza margherita

So, a few notes:
  1. Smaller pizza stone:  Earlier, I recommended a larger (16") pizza stone like the American Metalcraft PS1575, despite warnings by none other than the guru of the ceramic cooker, the Naked Whiz, that the larger stone may result in scorching your gasket.  Indeed, my gasket is long gone and not really missed, even for low-and-slow cooking.  Nevertheless, after breaking my fire brick stone in an adventure I'll explain later, I went for the 14" stone from BGE, which I believe contributed to my being able to get a higher temperature more easily.
  2. Raise your grid; skip the plate setter:  You need to get your pizza stone up to the level of the opening of your BGE.  Most of my early efforts were done with the BGE plate setter.  I recently switched to using a raised grid, without the plate setter, and this had two very beneficial results.  First, there was a clearer path to the dome, and with the heat's upward path unimpeded, you seem to get a hotter temp more quickly.  Second, with little or nothing between your stone and the fire, the stone seems to get hotter.  Without an IR thermometer, I couldn't swear to it, but the difference in the crust was obvious from the first time I did it.
  3. Give your dough some time:  I've refined my dough recipe and wrote about that earlier.  The proportions have turned out to be dead-on, but one thing I've added to that process after having read it in a number of places is letting the dough proof for 14 hours or more.  Yeah, I know, sounds like a complicator, but it's actually a simplifier.  Get everything set up to divide the dough (the first 30 minutes or so of work), split it into two plastic containers and pop them into the fridge overnight.  When you're a couple of hours away from cooking, take them out and transfer them to covered bowls.  They'll get to room temperature and rise a bit more, and will also have even more elasticity.
Here are a couple of pizzas that illustrate what I'm writing about here.  The first (at the top of this post) is a standard pizza margherita.  Note the slight bit of char on the crust, which was very tasty.  The dough sprang up and got that wonderful loft within a minute or two of going onto the stone.  The total cooking time was four and a half minutes, and though it could have possibly gone a shorter amount of time, everyone agreed that it was soft, neither dry nor underdone, and tasted wonderful.  The second, the bacon/arugula pizza shown here, was a mixture of some locally cured red pepper bacon (Tracklements) with buffalo mozzarella, topped with some fresh local arugula.  You can also see, here in the cut-away close-up, the nice job the crust did, also in 4.5 minutes.  The temperature inside the dome of the BGE was about 600 degrees.

I mentioned earlier breaking my fire brick pizza stone.  I thought I might experiment with trying to simulate the wood-burning oven by keeping the top on and propping open the lid of the BGE with fire-proof ceramic wedges.  This was a disaster.  The pizza blackened on the bottom, the heat seemed out of control and irregular, and even the pizza stone--even though it was fire brick--cracked down the middle.

The future of LIS programs (Nov. 30, 2007)

In late October 2007, I was invited to a summit on the future of Library and Information Science (LIS) programs in our I-schools. The LIS specialization, particularly at Michigan, has been in some disarray. Surrounded by compelling and successful programs in areas such as archives and records management and human-computer interaction, the LIS specialization has been seen by some as the rearguard program, supporting the last remnants of a profession that, if not dying, is assumed to be significantly threatened. This stands in stark contrast to librarianship itself, where in nearly every sphere (e.g., public and academic libraries) we see vital issues being addressed and new futures being forged. For the summit, each invitee was asked to write a short position paper organized around the notions represented in the headings below. Mine follows.

Introduction
I am an academic librarian who works in research libraries, so I see the questions being posed here (and the issue of LIS education generally) through that lens. My perspective is tied significantly to the interplay of information resources and the research uses to which they are put. There are, I think, many reasonable ways to approach these questions, but mine is about this interplay and the need for professionals in my sphere to support an array of activities around research and teaching, including authentication and curation of the products of research.

Technical and social phenomena we see coming in the next 10 years
The technical and social phenomena that seem most significant surround a tension in the perception that disintermediation plays an increasingly evident role in the information space of research institutions.

On the one hand, we see intensifying disintermediation, and along with that an increasingly rich array of tools and technology that facilitate academic users interacting directly with their sources, and directly with the means for dissemination. At the same time, in tension with this disintermediation, we see a drive by competing mediating open systems to facilitate that disintermediation: Google's preeminence makes it an obvious example of this sort of mediation; smaller players (Flickr, Facebook, others) may only fill niche roles, but have come to play the same sort of mediating role.

The irony in this dynamic is that many (or even most) of the most compelling resources have not been peer-to-peer resources, but networked resources like Google or even WorldCat. Consequently, in this world of growing disintermediation, we do not see, primarily, peer-to-peer services predominating, but rather very compelling social networking services that act as a powerful set of intermediaries. Openness at the network layer has become much more important than even "open source" because the services (rather than the software) are the destinations. At the outset, then, in this small space, what I would like to highlight is a growing sense of agency by users in the academic research world, and agency facilitated not by specialized software on their desktops, but by mediating services that those users can leverage to accomplish remarkable things.

In this context of what we've come to think of as "in the flow" (i.e., in the flow of engagement between the user and the mediating network resource), academic research libraries are challenged to perform core functions (functions, such as archiving and instruction, that have not diminished in importance) at the same time that they are challenged to perform their work with users "in the flow." Significantly, the research library must continue to serve a critical curatorial role for cultural heritage information despite the sense that the information being used is everywhere and perhaps thus cared for by the network. While they engage with this challenge of what sometimes feels like trying to catch the wind in a net, academic research libraries must craft a new role more clearly focused on engagement with scholarly communication. They must simultaneously reach out to and become a natural part of the working environment and methods of their users, and engage in the strategic curation of the human record.[1] Around this apparent or real disintermediation with increasingly powerful intermediaries, we need to ensure perpetual access and the right sorts of services to our communities.

Key unanswered questions that should drive research
The problem, as I see it, is that the set of questions evolves as quickly as the environment. So, for example, some current questions include:
  • What are the tools, services and systems that optimize the information seeking, use and creation activities of our users? Even in the age of Google, Amazon and Flickr, academic research library systems play a role in discovery of information. For example, although Google Scholar has been shown to be more effective in discovery than metasearch applications, vast numbers of key resources are not indexed by GS and are only found through the cumbersome and arcane specialized interfaces provided by publishers and vendors.[2] Finding effective ways to intercede and assist users (without also putting cumbersome "help" in their way) is one of the challenges for our community. Similarly, a better understanding of the way our users interact with resources is beginning to make it possible for us to layer onto the network an array of tools (e.g., Zotero or the LibX toolbar) that make it possible for users to integrate networked resources into their scholarship. And, finally, libraries have become the equivalent of publishers in the new, networked environment, and ensuring that we perform that role along with curation in seamless and effective ways is one of our current challenges.[3] All of this raises a number of embedded questions, some related to understanding the behavior of users, others to deploying the most effective technologies, and yet others to judging what the next great technological innovation will be and where we can situate ourselves.
  • How can we most effectively curate the human record in a world that is simultaneously more interconnected and, in some ways, more fragmented?
    • It’s worth noting that even though the network holds out promise for unifying formally-defined "library collections" in a way never before imagined, the fact that many resources are rare or valuable or have significant artifactual value means that the "scatter" of unique parts of collections that we already know well will only become more pronounced (if only by contrast). For example, our making digital surrogates available will remove most, but not all, need for scholars to travel to Michigan to use the papyrus collection.
    • This problem of the artifact obviously represents a marginal case. More significantly, as we are increasingly able to provide electronic access to our print collections, we are faced with the need to develop effective strategies for storing print and balancing access with minimizing waste. It obviously doesn’t make sense to store a copy of ordinary works at each of more than 100 research libraries in the United States, but how can an amalgamation of collections be performed in ways that respect current user preferences for print and takes into account bibliographic ambiguity (e.g., is my copy the same as your copy, and when there are differences, how much variation should be preserved)? We need to document this in a way that ensures a comprehensive sense of curatorial responsibility so that, for example, one institution does not withdraw a "last copy" of a volume by assuming (incorrectly) that it is acting in isolation.
    • Finally, and perhaps most compellingly, there is the question of what constitutes effective digital curation and how (and to what extent) we should balance that curation with access. There is much that we know about appropriate digital formats, migration, and the design of effective archiving services, but this has not been put to the test with the grand challenge that is looming. Moreover, as we provide access, we are challenged by questions of usability, and even more by the question of how we best situate our access services relative to network services. We should not duplicate Google's work in Google Book Search, but there are services Google may not or will not offer, and those we should provide in agile and relevant ways.
The curriculum we should provide to train professionals in this changing environment
Working from this perspective, it strikes me that the LIS curriculum should focus on developing a method of engagement rather than primarily training to answer specific questions. Of course that focus on methodology must be grounded in an exploration of specific contemporary questions, but it should be made clear that the circumstances of those questions are likely to change (i.e., the journey will be more important than the destination). Perhaps this is obvious or has always been the case, but the incredible fluidity of the environment now calls for precisely this type of response. Some recent experience may help to illustrate this:
  • In our efforts to better understand how mass digitization work succeeds and fails, we have needed to understand the distribution of certain types of materials in our collection. Being able to articulate the question and then pursue strategies for mitigating problems (and increasing opportunities) has called for analytical skills and an understanding of research methods, including statistical skills. In a recent specific case, we needed to understand the interaction between particular methods of digitization and different methods of printing (e.g., reproduction of typescript versus offset printing). The methods of digitization are squarely within the field of current librarianship, as is an understanding of the types of materials we collect and own; and it is equally true that both the digitization methods and types of materials will change with time. What I would emphasize is that it is the skills involved in the inquiry that are paramount. Though they are in no way divorced from the specific problems that one tackles, they are the most important part of the educational process.
  • In filling the niche left by Google because of legal constraints and a genuine lack of interest in academic uses of materials, we have embarked on a process of system design and software development. This effort has required of staff not only the ability to write effective code (or manage writing that code), but also the ability to chart courses informed by usability, by an understanding of the law (particularly copyright law), and by a deep understanding of the digital archiving effort (both in formats and in strategies for storage). There is no doubt in my mind that librarians will continue to play a role in the effective design of information systems, and that navigating these parameters (i.e., usability, legal issues, sustainability of the systems and, more importantly, the content) will continue to play a role in the systems we design. Just as with the previous example, those skills cannot be developed or exercised in some way that is abstracted from the materials, the users, and the uses. Again, just as with the previous example, current contexts will change, and the skills and instincts will continue to be the enduring element in our future librarians.
Because of space constraints, these are only two examples, but examples that show the range of skills and approaches necessary in the current environment. The current environment is extremely fluid in the ways that information is made available and in the ways that users, specifically those in our academic community, interact with it. Too often, academic libraries are defined by that which is held in them (witness the importance of the ARL volume count for defining research libraries). Libraries are, above all else, the people, processes, and resources that connect users and information and, unlike organizations like Google or Amazon, libraries are predicated on a commitment to enduring, reliable access to that information. Libraries curate the growing body of human knowledge and through that curation ensure its longevity and reliability; libraries need to make sure that the right kinds of services and interactions are taking place "in the flow," where (disintermediation or not) users have much more agency and much more direct interaction with networked resources. LIS education should focus its efforts on ensuring that the next generation of academic librarians has an awareness of the issues and an aptitude for designing solutions in that world.

Notes
[1] It is probably also the case that libraries, in order to have the opportunity to play these service roles in the future, must prove the importance of the curatorial function and their ability to perform it.
[2] For example, see Haya, Glenn et al. "Metalib and Google Scholar: a user study," in Online Information Review, Vol. 31 No. 3, 2007, pp. 365-375.
[3] See, for example, the work of the UM Library's Scholarly Publishing Office (http://spo.lib.umich.edu/) in creating new scholarly publications with sustainable methods, or Deep Blue, the Library's institutional repository (http://deepblue.lib.umich.edu/).

Mastering the crust (Nov. 15, 2007)

It's probably just that I'm a slow learner, but getting a great crust took me a few years. A good crust is fairly easily within reach and a good crust alone is worth the effort, but stepping it up a notch requires finding the right balance of temperature, tools and ingredients.

Temperature: While the dough is rising, pre-heat your oven with the pizza stone inside it. Here's one of the big challenges. Of course you'd prefer a wood-fired pizza oven, but that's not gonna happen for most of us. You'll want an oven that holds a very high temperature and keeps fairly even heat. I tend to run our electric convection oven at about 530°. This allows the crust to brown nicely in a very short period of time and avoids drying out the crust. Although putting the pizza stone at the top of the oven will make sure it's in the hottest part of the oven, if you're able to get the temp up that high, it won't really matter, and having a few extra inches of working space in sliding the pizza off the peel can be helpful; put the stone on a middle rack with lots of room above.

Tools: In addition to the oven, you'll want a few things like a nice pizza stone (a good, heavy one will hold the heat better) and a decent peel. It also helps to have a brush (to brush oil on the dough).

Dough: Getting a good dough is about balance. If your water is too hot, it'll kill the yeast; too cold, and the yeast won't become active enough. In my opinion, ditto on the flours: too much white flour, you'll lose out on texture and taste; and, for my approach, too much whole wheat and semolina, you'll miss out on the delicate flavors that balance against everything else. All that said, I've found that the preparation of the sponge is one of the most forgiving parts of making a good dough.
Yeast "sponge"
   approx. 2t active dry yeast
   a little less than 2/3c of warm water (about 105°)
   1T whole wheat flour
   1T honey
   about 2T white wine
   about 1t olive oil
Combine these ingredients, minus the white wine, and let sit for about 5 minutes. The yeast should begin to foam. (If the yeast doesn't foam, it may be because the yeast was too old or because the water temperature wasn't right. If you suspect the culprit was the yeast, the only solution is to toss the sponge and the yeast and start all over.) After the yeast begins to foam, add the wine and mix well.
Flour
While the yeast is activating, combine the following dry ingredients in a bowl:
   1/2c semolina
   1/3c fresh organic whole wheat flour (it'll give your dough a nice, almost nutty flavor)
   about 1/2c unbleached white flour, preferably organic
   1-2t sea salt
Mix the sponge into the flour mixture and turn out onto a floured surface. Knead 5-10 minutes, until the dough has a springy, resilient feel. In addition to the unbleached white flour you mixed in at the outset, as you're kneading, add as much additional flour as you need to have the dough be just a tad less than sticky. When you've kneaded enough, you'll be able to push the dough down with your hand and it'll rebound in a few seconds. Drizzle a small amount (1/2t) of oil in a bowl, roll the ball of dough around the inside of the bowl, and let rise for an hour in a slightly warm, draft-free place. I place a slightly damp towel over the bowl and put the bowl in the unused side oven in our two-oven range. After the dough has risen to about 1.5-2 times its original size, put it out on the counter with a bit of flour and knead it down so that the air is out of the dough--about two minutes.

Rolling out the dough
You'll want to avoid using a rolling pin to roll out the dough, as a rolling pin is likely to take too much of the air out of the dough and give you a harder, less flavorful crust. Start by pressing the ball of dough out with the heel of your hand until it begins to form a flattish circle, and then continue to press the dough from the inside out, again with the heel of your hand. Rotate the ball around as you press outward. Occasionally sprinkle the ball with a small amount of flour and flip it over, using the flour on the bottom to keep the dough from sticking to your surface. Once it reaches roughly half the size of your pizza, tossing the dough in the air (spinning it as it goes up) actually helps to stretch it without taking out more air. Continue to rotate the crust on your surface, pushing outward with the heel of your hand, until it's reached the size you'd like for your pizza, about 14" in diameter.

Finishing up
Put a liberal amount of rough cornmeal on a pizza peel and then toss the dough onto the peel.
Brush a thin coat of olive oil on the dough, particularly the outside edges.
When you top it, avoid being overgenerous with the toppings, particularly the cheese. A thinner layer is better for the flavor of the dough and the toppings.
Especially if you've been able to get 530° for your oven, cook for about 10-12 minutes. I try to turn the pizza from back to front about halfway through, even though the convection oven evenly distributes the heat, as the back of the oven still cooks more quickly.