What the world needs now is not another metasearch engine. Mind you, having more, better, and even free metasearch engines is a good thing, but many metasearch engines already exist, each with different strengths and weaknesses, and some are free and open source (see, e.g., Oregon State’s LibraryFind). Metasearch simply isn’t an effective solution for the problem at hand.
Let’s start with the problem: each of our libraries invests millions of dollars each year in a wide array of electronic resources for the campus, and we’d like our users to get the best possible information from these resources in the easiest possible way. When presented with this problem over the years, libraries have tacitly posed two possible solutions: (1) bring all of the information together into a single database, or (2) find some way to search across all of these resources with a single search. I suspect no one in our community has the audacity to suggest the first option as a solution because it’s
crazy talk. On the other hand, though, for more than a decade we’ve held out the hope of being able to search across many databases as a solution. Wikipedia perhaps says it best in defining the term
metasearch: "Metasearch engines create what is known as a virtual database. They do not compile a physical database or catalogue of [all of their sources]. Instead, they take a user's request, pass it to several other heterogeneous databases and then compile the results in a homogeneous manner based on a specific algorithm." Elsewhere, in the more polished
entry for federated search (a more old-fashioned reference to the same concept), the author notes that federated searching solves the problem of scatter and lack of centralization, making a wide variety of documents “searchable without having to visit each database individually.”
Metasearch is a librarian’s idealistic solution to an intractable problem.[1] Metasearching works, and there are standards that help ensure that it does. So why doesn’t metasearch solve the larger problem I laid out at the beginning? There are many reasons: small variability in network performance, vast variations in the ways that different vendors’ database systems work, even greater variation in the information found in those databases, and an overwhelming number of sources. We complain at Michigan that our vendor product, MetaLib, can search only eight databases at once, but if there were no limits, would we ask it to search the roughly 800 resources we currently list for our users? Surely, one might object, these problems are tractable: networks get more robust, standards are designed to iron out differences between systems, and 800 hardly seems like a large number. Nevertheless, networks are already quite robust and metasearch still falls short; the standards iron out differences only by hamstringing vendors who are trying to distinguish themselves from their competitors; and 800
is a very large number. Despite all we do, even in the simplest metasearch applications today, when we repeat the
same query against the
same set of databases, we retrieve
different results (IMHO, one of the greatest sins imaginable in a library finding tool). We toss out important pieces of functionality in some of the resources in order to reach a workable lowest common denominator. (Think about the plight of our hapless user when one database consists of full text and another contains only bibliographic information: a search of the first must be crafted carefully to avoid drowning in hits, while a search of the second needs the broadest possible set of terms to avoid missing relevant records.) This is not to say that metasearch can’t make perfect sense for, say, a small group of similarly constructed and perhaps overlapping engineering databases, as an alternative to submitting the same search to each of them serially.
Although metasearch doesn’t work for discovery across the great big world of licensed content, a comprehensive database does work for discovery across a vast array of resources. Recent years have seen several contenders for the role of dominant comprehensive database. Elsevier’s Scopus (focusing on STM and social science content)
claims that its “[d]irect links to full-text articles, library resources and other applications like reference management software, make Scopus quicker, easier and more comprehensive to use than any other literature research tool.” Scopus is just one of the most recent entrants in an arena where California’s
Online Education Database, with its slogan of “Research Beyond Google,” can
claim to present “119 Authoritative, Invisible, and Comprehensive Resources.” Ironically, in describing the problem of getting at an “invisible web” estimated to be 500 times the size of the visible web, the OEDB positions itself as
going beyond Google, when the obvious place to turn in all of this is
Google Scholar.
Google Scholar (GS) is absolutely
not a replacement for the vast array of resources we license for our users. Criticisms of Google Scholar abound. Perhaps most troubling to an academic audience, GS is secretive about its coverage: neither GS itself nor any watchdog group offers an analysis of the extent of its coverage in any given area or for any given publisher. Moreover, it will probably always be the case that some enterprises in our sphere fund the work of finding and indexing the literature of a discipline, online and offline, by charging for subscriptions, putting them in direct opposition to GS and keeping their indexes out of it. (Consider, for example, the Association for Asian Studies with its
Bibliography of Asian Studies or the Modern Language Association and the
MLA Bibliography, each funding its bibliographic sleuthing by selling access to the resulting indexes. To give their information to GS would be to destroy the very funding that makes it possible for them to collect the information.) And yet, as we learned in the recent article “Metalib and Google Scholar: a User Study,” undergraduates are more effective at finding needed information through Google Scholar than through our metasearch tools.[2]
If metasearch is an ineffective tool for comprehensive “discovery” and Google Scholar has its own shortcomings, the need and the opportunity in this space lie not in creating a more effective metasearch tool; rather, the challenge is to bring these two strategies together in a way that best serves the interests of an insatiable academic audience, whether undergraduate, graduate, or faculty.
Recently, Ken Varnum (our head of Web Systems) and I brainstormed about a few approaches and followed this with a conversation with Anurag Acharya, who developed Google Scholar. I toss out the strategies that follow to seed this conversation space with a few ideas, not to pretend to be exhaustive or to point to the best possible solution; each needs further development and testing before being pursued in earnest. In each of these, the scenario begins with an authenticated user querying Google Scholar. While the GS results are coming back and are presented to the user, in either a separate frame (Anurag’s recommendation, based on usability work at Google) or a separate pop-up window, we present information about other sources that might prove useful.
1. Capitalize on user information to augment GS searches: When a user authenticates, we have at our disposal a number of attributes about the user, such as status, currently enrolled courses, and degree programs. With these, we initiate a metasearch of databases we deem relevant and return, in that frame or window, either ranked results or links to hit counts and databases. One advantage of this approach is that it’s fairly straightforward to implement. We would probably want to capitalize on work done by Groningen in their
Livetrix implementation, where they eschew the standard MetaLib interface in favor of a direct connection to the MetaLib X-Server, so that they can better tailor interaction with the remote databases and the presentation of results. The obvious disadvantage is that we make an assumption about a user’s interests based on his or her affiliation: when a faculty member in English searches Google Scholar for information on mortality statistics in 16th-century England, we’re likely to have missed the mark by searching
MLA Bibliography.
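A minimal sketch of this first approach, in Python. The attribute-to-database table and the xserver_search() function are hypothetical placeholders; the real MetaLib X-Server speaks an XML-over-HTTP API that isn’t shown here.

```python
# Sketch: pick databases from authenticated-user attributes, then metasearch.
# ATTRIBUTE_DB_MAP and xserver_search() are hypothetical placeholders,
# not the actual MetaLib X-Server API.

# A profile table mapping user attributes to database names.
ATTRIBUTE_DB_MAP = {
    "dept:english": ["MLA Bibliography", "ABELL"],
    "dept:history": ["Historical Abstracts", "America: History and Life"],
    "status:undergraduate": ["Academic Search Premier"],
}

def databases_for_user(attributes):
    """Collect the databases we deem relevant to this user's attributes."""
    seen = []
    for attr in attributes:
        for db in ATTRIBUTE_DB_MAP.get(attr, []):
            if db not in seen:
                seen.append(db)
    return seen

def xserver_search(database, query):
    """Placeholder for a MetaLib X-Server call; returns a dummy hit count."""
    return {"database": database, "query": query, "hits": 0}

def sidebar_results(user_attributes, query):
    """Content for the frame/pop-up shown alongside the GS results."""
    return [xserver_search(db, query) for db in databases_for_user(user_attributes)]

# An English faculty member searching GS:
print(sidebar_results(["dept:english"], "mortality statistics 16th-century England"))
```

The usage line at the bottom illustrates exactly the weakness described above: the English professor’s demographic query still lands in MLA Bibliography.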
2. Capitalize on query information to augment GS searches: In this scenario, we find some way to intercept query terms to try to map concepts to possible databases. We would use the same basic technical approach described above (i.e., GS in the main frame or window; other results in a separate frame or window) to ensure that the user immediately gets on-target results, but through sophisticated linguistic analysis we find and introduce the user to other databases that might bear fruit. This approach avoids the deficiency of the first by making no assumptions about a user’s interest based on his or her degree/departmental affiliation. It does, however, pose a significant challenge: building quick and accurate mappings between brief (one- or two-word) queries and databases. Although a library might be able to undertake the first strategy with only modest resources, this second approach requires partnership with researchers in areas such as computational linguistics.
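To make the mapping challenge concrete, here is a toy sketch; the hand-built CONCEPT_INDEX is a stand-in for the computational-linguistics machinery a real implementation would need, and the database names are merely illustrative.

```python
# Sketch: map brief query terms to candidate databases by simple lookup.
# CONCEPT_INDEX is a hypothetical hand-built table; a production system
# would need genuine linguistic analysis, not keyword matching.

CONCEPT_INDEX = {
    "mortality": ["Historical Abstracts", "PubMed"],
    "statistics": ["MathSciNet", "Historical Abstracts"],
    "england": ["Historical Abstracts"],
    "shakespeare": ["MLA Bibliography"],
}

def candidate_databases(query):
    """Rank databases by how many query terms point at them."""
    scores = {}
    for term in query.lower().split():
        for db in CONCEPT_INDEX.get(term, []):
            scores[db] = scores.get(db, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(candidate_databases("mortality statistics in England"))
# -> ['Historical Abstracts', 'PubMed', 'MathSciNet']
```

Even this toy shows why one- or two-word queries are hard: with fewer terms, the scores carry almost no signal.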
3. Introduce the user to the possibility of other resources: This more modest approach requires only that the library interface figuratively tap the user on the shoulder and point out that, in addition to GS, other resources may be helpful. So, for example, we might submit the user’s query to GS
while we submit the same query to Scopus and Web of Science, two other fairly comprehensive resources, produce hit counts, and suggest to the user that s/he look at results from these two databases or some of our other 800 resources.
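A sketch of this shoulder-tap follows. The hit_count() connector is a placeholder (each real connector would speak the vendor’s own search API); the queries run in parallel, with a timeout, so a slow vendor doesn’t delay the pane.

```python
# Sketch: fan the user's query out to a few comprehensive resources and
# show hit counts alongside the Google Scholar results.
# hit_count() is a placeholder for per-vendor connectors.
from concurrent.futures import ThreadPoolExecutor

RESOURCES = ["Scopus", "Web of Science"]

def hit_count(resource, query):
    """Placeholder: a real connector would query the vendor and return a count."""
    return (resource, 0)

def shoulder_tap(query, timeout=3.0):
    """Hit counts for the 'other resources may be helpful' pane."""
    results = []
    with ThreadPoolExecutor(max_workers=len(RESOURCES)) as pool:
        futures = [pool.submit(hit_count, r, query) for r in RESOURCES]
        for future in futures:
            try:
                results.append(future.result(timeout=timeout))
            except Exception:
                pass  # a slow or failing vendor simply drops out of the pane
    return results

print(shoulder_tap("mortality statistics 16th-century England"))
```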
4. Use GS results to augment GS: Use the results from GS, rather than the queries sent to GS, to derive the content of the “you could also look at…” pane. By clustering what comes back, we could suggest some subject areas that might be useful. Clustering is tricky, of course, for the same reason that metasearch is tricky: we’re working with very little text, and with texts of dissimilar lengths. But if we could pull back the full text of documents via the OpenURL links GS provides, and then cluster that, we might have some useful information. Again, a library might benefit from collaboration with some area of information science research, particularly on the semantic aspects. The biggest challenge here would be doing this without introducing significant delay (and thus annoyance); we might manage that by offering it as an
option to users (e.g., “good stuff here, but think you might want more and better?”).
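A sketch of the clustering step, assuming the full texts have already been pulled back via the OpenURL links; it uses scikit-learn’s TF-IDF vectorizer and k-means, one plausible clustering choice among many, and surfaces each cluster’s top terms as suggested subject areas.

```python
# Sketch: cluster full texts (fetched elsewhere via OpenURL) and surface
# each cluster's top TF-IDF terms as "you could also look at..." subjects.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def suggest_subjects(documents, n_clusters=2, terms_per_cluster=4):
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(documents)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(matrix)
    terms = vectorizer.get_feature_names_out()
    # For each cluster, take the highest-weighted terms in its centroid.
    return [
        [terms[i] for i in center.argsort()[::-1][:terms_per_cluster]]
        for center in km.cluster_centers_
    ]

# Stand-ins for full texts retrieved via the OpenURL links:
docs = [
    "parish registers mortality plague london demography",
    "infant mortality demography early modern england",
    "sonnet meter shakespeare verse prosody",
]
print(suggest_subjects(docs))
```

Running this only when the user opts in, as suggested above, keeps the expensive fetch-and-cluster step off the critical path.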
Our challenge is to help our users through the maze of e-resources without interrupting their journey, getting them to results as quickly as possible; by combining results from Google Scholar with licensed resources we can help them get fast results
and become more aware of the wealth of resources available to them. All of these ideas are off-the-cuff and purposely sketchy; Ken and I have spent little time exploring the opportunities or pitfalls. Some approaches will lend themselves to collaboration more than others (e.g., collaboration with HCI and linguistics researchers), but all would benefit from further study. (How much more effective is each approach than traditional metasearch? Than Google Scholar alone? How satisfied are users with the experience compared to those other approaches?)
Notes
[1] Note the interestingly self-serving article by Tamar Sadeh of Ex Libris, in which she concludes, “Metasearch systems have several advantages over Google Scholar. We anticipate that in the foreseeable future, libraries will continue to provide access to their electronic collections via their branded, controlled metasearch system” (HEP Libraries Webzine, issue 12, March 2006, http://library.cern.ch/heplw/12/papers/1/).
[2] Haya, Glenn, Else Nygren, and Wilhelm Widmark. “Metalib and Google Scholar: a User Study,” Online Information Review 31(3) (2007): 365-375. I found one review of the article by an enlightened librarian who concludes that the moral of the study is that we need to do a better job training our users to use metasearch.