Sunday, January 2, 2011

Did I say "theoretical"? Openness and Google Books digitization (Apr. 25, 2008)

When I wrote this piece in 2008, HathiTrust was in the works but had not yet been born, and the piece stimulated the sort of dialogue that I believe is very important. Rather than try to reattach comments made at that time by Carl Malamud and Brewster Kahle, I include them here in this post because of their relevance to the debate. Many circumstances have since changed (e.g., HathiTrust, with its preservation orientation, is a significant piece of the landscape, HathiTrust makes full book downloads available to authenticated users, and our efforts have grown so much that the mere availability--what I argue is a form of openness--has been embraced as transformational). These changes make some of the argument feel dated, but it feels like an important record nonetheless. I've attempted to capture the flow and content of the original blog entry and comments, and for reference purposes have stored a PDF of the original piece with comments on Scribd.
I was recently quoted in an AP article (published here in Salon) as saying that Brewster Kahle's position with regard to the openness of Google-digitized public domain content is "theoretical." Well, I sure thought I said "polemical," but them's the breaks. Brewster argues that Google's work in digitizing the public domain essentially locks it up--puts it behind a wall and makes it their own--and that this is a loss in a world that loves openness. The contrast here is meant to be with the work of the Open Content Alliance, where the same public domain work might be shared freely, transferred to anyone, anywhere, and used for any purpose. I don't want to get into the quibble here about the constraints on that apparently open-ended set of permissions (i.e., that an OCA contributor may end up putting constraints on materials that look worse than Google's constraints). What's key here for me, though, is the real practical part of openness--what most people want and what's possible through what Michigan puts online.

I think all of this debate begs us to ask the question "what is open"? For the longest time (since the mid-1990s), Michigan digitized public domain content and made it freely viewable, searchable and printable. Anyone, anywhere could come to a collection like Making of America and read, search and print to his heart's delight. If the same user wanted to download the OCR, that too was made possible and, in fact, the Distributed Proofreaders project has made good use of this and other MOA functionality. We didn't make it possible for anyone to get a collection of our source files because we were actively involved in setting up Print-on-Demand (POD). POD typically has up-front, per-title costs, and making the source files available would have cost us some sales that might otherwise pay for that initial investment. As we moved into the agreement with Google, we made clear our intention to do the same "open" thing with the Google-digitized content, and to throw in our lot with a (then) yet-to-be-defined multi-institutional "Shared Digital Repository." In fact, now we have hundreds of thousands of public domain works online, all of which are readable, searchable and printable by anyone in the world in much the same way.

So, what's the beef? The OCA FAQ states that for them this openness means that "textual material will be free to read, and in most cases, available for saving or printing using formats such as PDF." By all means! I hope it's clear by what I wrote above that this is an utterly accurate description of what happens when Google digitizes a volume from Michigan's collection and Michigan puts it online. It's also, incidentally, what Google makes possible, but even if Google didn't, Michigan could and would be rushing in to fill that breach. The challenges to Google's openness always seem to ignore what's actually possible through our copies at Michigan. This sort of polarizing rhetoric seems to be about making a point that's not accurate in the service of an attack on Google's primacy in this space: we don't want them to dominate the landscape, so let's characterize their Bad version as being the opposite of our Good version. This notion that what Google does is closed is not an accurate description of Google's version of these books, and even less so a description of Michigan's.

Could the Google books be more open? Absolutely. Along with Carl Malamud, for example, I would love to see all of the government documents that have been digitized by Google available for transfer to other entities so that the content could be improved and integrated into a wide variety of systems, thus opening up our government as well as our libraries. I believe that will happen, in fact, and that Google will one day (after they've had a chance to gain some competitive advantage) open up far more. In the meantime, however, when we talk about "open," let's mean it the way that the OCA FAQ means it. Let's mean it in the same way that the bulk of our audience means it. Let's talk about the ability to read, cite and search the contents of these books, and let's call the Google Books project and particularly Michigan's copies Open. Let's stop being theoretical, er, I mean polemical.

Carl Malamud responded by quoting my "for saving or printing using formats such as PDF" and then went on to argue:
> for saving or printing using formats such as PDF

John, pardon me if I don’t grock Mirlyn, but I pulled up a public domain document (a congressional hearing). I was able to pull the text up and page through, but there didn’t appear to be an easy way to save a single page, let alone the entire hearing. Perhaps that function is available to Michigan students, but I suspect the rest of the citizens of Michigan are in the same boat as the rest of us using the crippled interface.

I think it is fine that Michigan and Google have their arrangement, but it is disturbing when we see a state-funded institution like U. of Michigan putting up artificial barriers to access.

Your Mirlyn site is ok as far as web sites go, but letting a thousand flowers bloom always leads to more innovation. It would be great if any grad student in Ann Arbor (or anyplace else) could download your govdocs docs and come up with a better user interface.

(In addition to more innovation, that policy would lead to a more informed citizenry, which is generally considered an important part of democracy and I suspect is part of your state-sponsored mandate.)
I responded to Carl, saying:
Gosh, Carl, I think the best way I can respond is not only to say that I whole-heartedly agree with your call for more vigorous sharing, but to point to my fourth paragraph, where I point to your work and urge the same thing. Look, my point is that while this is good, and we are fighting for deeper sharing, this sort of thing is a fairly narrow piece of the openness issue.

On your point about the functionality and the opaqueness of getting PDFs, we’ll take that into account in our usability. It’s there, and we can do better. I should note that for us offering larger PDF chunks is also a resource issue, but that we’re very close to releasing a new version that gives you 10 pages at a time. Personally, I like the screen resolution PNG files and very much dislike PDF as a format, but that’s a usability position and not a philosophical one.
Later, Brewster Kahle weighed in on the issue of openness with:
John– while it may not be appropriate to start this in a comment, but I am quite taken aback by your seeming implication that “open” includes what google is doing and what UMich is doing.

“Open” started to be widely used in the Internet community in association with certain software. Richard Stallman calls it “free”, but “open” has also come to be used as well. Lets start with that.

“Open Source” in that community means the source code can be downloaded in bulk, read, analyzed, modified, and reused.

“Open Content” has followed much the same trajectory. Creative Commons evolved a set of licenses to help the widespread downloading of creative works, or “content”. Downloading, and downloading in bulk, is part of this overall approach as we see it at the Internet Archive.

Researchers (and more general users, but we can stick with researchers because they are a community that research libraries are supposed to serve) require downloadability to materials so they can be read, compared, analyzed, and recontextualized.

Page at a time interfaces, therefore, would not be “open” in this sense. Downloadable crippled versions would not be open in the Open Source or Open Content sense either.

As a library community, we can build on the traditions from the analog world of sharing widely even as we move into the digital world. We see this as why we get public support.

Lets build that open world.

We would be happy to work with UMich to support its open activities.
My response:
I think this is precisely the sort of rhetoric that’s muddying the waters right now, Brewster. There is no uniformly defined constituency called “researchers” who “require downloadability.” I know ‘em, I work with ‘em, and I know that’s not true. Access (and openness) is defined on a continuum. What we do is extraordinarily open and has made a tremendous difference for research and in the lives of ordinary users. This sort of differentiation in the full accessibility of source materials is one of the key incentives that has brought organizations like Google and Microsoft to the table, and if it didn’t make sense, the OCA wouldn’t go to pains to stipulate that “all contributors of collections can specify use restrictions on material that they contribute.” Is more open better? Damned right. That’s one reason why for two years we’ve been offering OCA the texts Michigan digitizes as part of its own in-house work. But is what we’re doing with Google texts open? Absolutely.
Carl followed with a practical example:
I’m not sure I get all these degrees of open … let me add a hypothetical if that helps clear this up.

What if a bunch of students in Ann Arbor organized themselves into a Democracy Club and started grabbing all the public domain documents they can find on MBooks and uploading them to some site such as scribd.com or pacer.resource.org for recycling? If the docs are open (and we’re just talking “works of the government” which are clearly in the public domain), would you consider that a mis-use of your system and try and stop it or would that fall inside of the open side of the open continuum we’re all trying to mutually understand in this dialogue?

Hypothetically speaking, of course. I’m not advocating that students form a Democracy Club and crawl your site to recycle public domain materials, I’m just trying to understand if the restrictions on reuse are passive ones like obscuring how to download files or if these are active restraints where the library is involved in enforcing restrictions on access to public domain materials.

Again, I’m not at all suggesting that students interested in furthering the public domain form Democracy Clubs and start harvesting documents from the public taxpayer-financed web sites at UMich and re-injecting them into the public domain.
My response brought in technologies that have since been introduced:
What if? If there really were that sort of interest, I’d hope that we’d have a chance to talk to the students and make sure they were aware of powerful options to make “in situ” use of the openly accessible government documents that they find in MBooks. I’d want to make sure they knew that in late June we’re releasing a “collection builder” application that will allow them to leverage our investment in permanent (did I say permanent?) curation of these materials so that the materials could be found and used after the current crop of students comes and goes, that the students could add to the body of works as more get digitized from our collection and the collections of other partner libraries (e.g., Wisconsin’s are coming in soon) and that we would want to hear what sorts of services (an RSS feed of newly added gov docs?) might aid them in their work. I’d want to talk to them about the issue of authority and quality, and would see if there were ways that their efforts could help improve the works in MBooks rather than dispersing the effort to copies in multiple places. And if they needed computational resources to do things like data mining, I’d let them know that we’re glad to help. But if none of this satisfied them, would we try to stop them? Assuming Google digitized the works, according to our agreement (4.4.1) we would make “reasonable efforts … to prevent [them] from … automated and systematic downloading” of the content, something we currently do and which does not undermine the ability of those same students to read, search and print the documents. Lots of openness there.
Finally, an anonymous writer wondered "Why give Google competitive advantage? How is this the role of a libraries vis a vis a vendor?" to which I responded:
An interesting question. Whether it *should* be (the role) or not isn’t really a question at this point, with decades of examples of libraries working with “vendors” in ways that leveraged library collections for what you might view as a vendor’s competitive advantage. Obviously, some deals were more selfless than others, and many have involved royalties or discounts provided to the libraries. In fact, *some* sort of relationship is absolutely necessary because of the inaccessibility of the materials in our collections (publishers, for example, don’t have the titles and frequently don’t know what they’ve published). On the other hand, the nature of the deal is what’s at issue, and we believe that having a copy of the files for long-term preservation and meaningfully open access, and fairly liberal terms for the way that we use the materials, is a good exchange for the hundreds of millions of dollars worth of work entailed in doing the scanning.
