
Metasearch vs. Google Scholar (Nov. 5, 2007)

What the world needs now is not another metasearch engine. Mind you, having more and better and even free metasearch engines is a good thing, but there are already many metasearch engines, each with different strengths and weaknesses, and even some that are free and open source (e.g., see Oregon State’s LibraryFind). Metasearch isn’t an effective solution for the problem at hand.

Let’s start with the problem: each of our libraries invests millions of dollars each year in a wide array of electronic resources for the campus, and we’d like to make it possible for our users to get the best possible information from these electronic resources in the easiest possible way. When presented with this problem over the years, libraries have tacitly posed two possible solutions: (1) bring all of the information together into a single database, or (2) find some way to search across all of these resources with a single search. I suspect no one in our community has the audacity to suggest the first option as a solution because it’s crazy talk. On the other hand, though, for more than a decade we’ve held out the hope of being able to search across many databases as a solution. Wikipedia perhaps says it best in defining the term metasearch: "Metasearch engines create what is known as a virtual database. They do not compile a physical database or catalogue of [all of their sources]. Instead, they take a user's request, pass it to several other heterogeneous databases and then compile the results in a homogeneous manner based on a specific algorithm." Elsewhere, in the more polished entry for federated search (a more old-fashioned reference to the same concept), the author notes that federated searching solves the problem of scatter and lack of centralization, making a wide variety of documents “searchable without having to visit each database individually.”

Metasearch is a librarian’s idealistic solution to an intractable problem.[1] Metasearching works, and there are standards that help ensure that it does. So why doesn’t metasearch solve the larger problem I laid out at the beginning? There are many reasons: small variations in network performance, vast variations in the ways different vendors’ database systems work, even greater variation in the information found in those databases, and an overwhelming number of sources. We complain at Michigan that our vendor product, MetaLib, can only search eight databases at once, but if there were no limits, would we ask it to search the roughly 800 resources we currently list for our users? Surely these problems are tractable: networks keep getting more robust, standards are designed to iron out differences between systems, and 800 hardly seems like a large number. And yet networks are already very robust, those standards survive only by hamstringing vendors who are trying to distinguish themselves from their competitors, and 800 is in fact a very large number. Despite all we do, even in the simplest metasearch applications today, repeating the same query against the same set of databases retrieves different results (IMHO, one of the greatest sins imaginable in a library finding tool). We toss out important pieces of functionality in some of the resources in order to find the right lowest common denominator. (Think about the plight of our hapless user when one database consists of fulltext and another holds only bibliographic information: a search of the first needs to be crafted carefully to avoid too-great recall, while a search of the second needs the broadest possible set of terms to avoid a vanishingly small result set.) This is not to say that metasearch doesn’t make perfect sense for attacking, say, a small group of similarly constructed and perhaps overlapping engineering databases, rather than submitting the same search to each in serial fashion.
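
To make the lowest-common-denominator problem concrete, here is a minimal sketch in Python. The capability profiles are entirely hypothetical (no real vendor API is assumed); the point is only that a metasearch broker must degrade a fielded query differently for each target:

```python
# Hypothetical capability profiles for two targets; real vendor systems
# vary far more widely than this sketch suggests.
TARGETS = {
    "fulltext_db": {"fields": {"title", "author", "fulltext"}, "boolean": True},
    "biblio_db": {"fields": {"title", "author"}, "boolean": False},
}

def degrade_query(query: dict, target: str) -> str:
    """Reduce a fielded query to whatever the target actually supports."""
    caps = TARGETS[target]
    # Drop any clause the target cannot field-search.
    kept = {f: v for f, v in query.items() if f in caps["fields"]}
    joiner = " AND " if caps["boolean"] else " "
    return joiner.join(f"{field}:{value}" for field, value in kept.items())

query = {"title": "mortality", "fulltext": "parish registers"}
for t in TARGETS:
    print(t, "->", degrade_query(query, t))
# The two targets end up running materially different searches,
# which is one reason "the same query" can return different results.
```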

Although metasearch doesn’t work for conducting discovery across the great big world of licensed content, a comprehensive database does work for conducting discovery across a vast array of resources. Recent years have seen several contenders for the role of dominant comprehensive database. Elsevier’s Scopus (focusing on STM and social science content) claims that its “[d]irect links to full-text articles, library resources and other applications like reference management software, make Scopus quicker, easier and more comprehensive to use than any other literature research tool.” Scopus is just one of the most recent entrants in an arena where California’s Online Education Database, with its slogan of “Research Beyond Google,” can claim to present “119 Authoritative, Invisible, and Comprehensive Resources.” Ironically, in describing the problem of getting at an “invisible web” estimated to be 500 times the size of the visible web, the OEDB positions itself as going beyond Google, when the obvious place to turn in all of this is Google Scholar.

Google Scholar (GS) is absolutely not a replacement for the vast array of resources we license for our users. Criticisms of Google Scholar abound. Perhaps most troubling to an academic audience, GS is secretive about its coverage: no information about the extent of its coverage, in any area or for any publisher, is available either from GS itself or from any watchdog group. Moreover, it will probably always be the case that some enterprises in our sphere fund the work of finding and indexing the literature of a discipline, online and offline, by charging for subscriptions, putting them in direct opposition to GS and keeping their indexes out of it. (Consider, for example, the Association for Asian Studies with its Bibliography of Asian Studies, or the Modern Language Association and the MLA Bibliography, each funding its bibliographic sleuthing by selling access to the resulting indexes. To give their information to GS would be to destroy the very funding that makes it possible for them to collect that information.) And yet, as we learned in the recent article “Metalib and Google Scholar: a user study,” undergraduates are more effective at finding needed information through Google Scholar than through our metasearch tools.[2]

If metasearch is an ineffective tool for comprehensive “discovery” and Google Scholar has shortcomings of its own, then the need and the opportunity in this space lie not in creating a more effective metasearch tool; rather, the challenge is to bring these two strategies together in a way that best serves the interests of an insatiable academic audience, whether undergraduate, graduate, or faculty.

Recently, Ken Varnum (our head of Web Systems) and I brainstormed a few approaches and followed this with a conversation with Anurag Acharya, who developed Google Scholar. I toss out the strategies that follow to seed this conversation space with a few ideas, not to pretend to be exhaustive or to point to the best possible solution. These ideas need to be developed and tested before any of them is pursued in earnest. In each, the scenario begins with an authenticated user querying Google Scholar. While the GS results come back and are presented to the user, we present, in either a separate frame (Anurag’s recommendation, based on usability work at Google) or a separate pop-up window, information about other sources that might prove useful.

1. Capitalize on user information to augment GS searches: When a user authenticates, we have at our disposal a number of attributes about that user, such as status, currently enrolled courses, and degree programs. With these, we initiate a metasearch of databases we deem relevant and return, in that frame or window, either ranked results or links to hit counts and databases. One advantage of this approach is that it’s fairly straightforward, with few significant challenges. We would probably want to capitalize on work done at Groningen in their Livetrix implementation, where they eschew the standard MetaLib interface in favor of a connection to the MetaLib X-Server so that they can better tailor interaction with the remote databases and the presentation of results. The obvious disadvantage is that we make an assumption about a user based on his or her subject focus: when a faculty member in English searches Google Scholar for information on mortality statistics in 16th-century England, we’re likely to have missed the mark by searching the MLA Bibliography.
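
As a rough illustration of this first strategy, here is a minimal sketch. The attribute-to-database mapping and the search_database stand-in are hypothetical; in practice the queries would go through something like the MetaLib X-Server, whose actual API is not shown here:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mapping from degree program to candidate databases;
# a real version would be curated by subject librarians.
PROGRAM_DATABASES = {
    "English": ["MLA Bibliography", "ABELL"],
    "Mechanical Engineering": ["Compendex", "Inspec"],
}

def search_database(db: str, query: str) -> tuple:
    """Stand-in for a remote search call (e.g., via an X-Server connection)."""
    return (db, 0)  # placeholder hit count; a real call would go over HTTP

def sidebar_results(user_attrs: dict, query: str) -> list:
    """Pick databases from the user's profile and query them in parallel."""
    dbs = PROGRAM_DATABASES.get(user_attrs.get("program", ""), [])
    with ThreadPoolExecutor(max_workers=8) as pool:  # mirrors MetaLib's 8-target limit
        return list(pool.map(lambda db: search_database(db, query), dbs))

print(sidebar_results({"program": "English"},
                      "mortality statistics 16th-century England"))
```

The sketch also shows the strategy’s flaw in miniature: the English professor’s query lands in the MLA Bibliography no matter what it is actually about.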

2. Capitalize on query information to augment GS searches: In this scenario, we find some way to intercept query terms to try to map concepts to possible databases. We would use the same basic technical approach described above (i.e., GS in the main frame or window; other results in a separate frame or window) to ensure that the user immediately gets on-target results, but through sophisticated linguistic analysis we find and introduce the user to other databases that might bear fruit. This approach avoids the deficiency of the first by making no assumptions about a user’s interest based on his or her degree/departmental affiliation. It does, however, pose the great challenge of building quick and accurate mappings between brief (one- or two-word) queries and databases. Although a library might be able to undertake the first strategy with only modest resources, this second approach requires partnership with researchers in areas such as computational linguistics.
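
A toy version of such a mapping might score each discipline’s term corpus against the query. The per-discipline vocabularies below are invented, and real linguistic analysis would be far more sophisticated; this only sketches the shape of the idea:

```python
# Hypothetical per-discipline term corpora; in practice these might be
# harvested from course descriptions, database blurbs, or thesauri.
CORPORA = {
    "Literature": {"poetry", "novel", "drama", "criticism", "renaissance"},
    "Public Health": {"mortality", "epidemiology", "statistics", "disease"},
}

def rank_disciplines(query: str) -> list:
    """Score each discipline by how many query terms its corpus contains."""
    terms = set(query.lower().split())
    scores = {name: len(terms & corpus) for name, corpus in CORPORA.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_disciplines("mortality statistics in 16th-century England"))
# Public Health outscores Literature, even if the searcher is in English.
```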

3. Introduce the user to the possibility of other resources: This more modest approach only requires the library interface to figuratively tap the user on the shoulder and point out that, in addition to GS, other resources may be helpful. So, for example, we might submit the user’s query to GS while we submit the same query to Scopus and Web of Science, two other fairly comprehensive resources, produce hit counts, and suggest to the user that s/he look at results from these two databases or some of our other 800 resources.
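
A sketch of this shoulder-tap, with a hypothetical hit_count stand-in for each vendor API: the key design constraint is a hard time budget, so that a slow source never delays the suggestion pane:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError, as_completed

def hit_count(source: str, query: str) -> int:
    """Stand-in for a vendor API call that returns only a result count."""
    return 0  # placeholder: a real call would query Scopus or Web of Science

def shoulder_tap(query: str, sources=("Scopus", "Web of Science"), budget=2.0):
    """Gather whatever hit counts arrive within the time budget; drop the rest."""
    counts = {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {pool.submit(hit_count, s, query): s for s in sources}
        try:
            for fut in as_completed(futures, timeout=budget):
                counts[futures[fut]] = fut.result()
        except TimeoutError:
            pass  # slow sources simply don't appear in the suggestion pane
    return counts

print(shoulder_tap("deforestation"))
```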

4. Use GS results to augment GS: Use the results from GS, rather than the queries to GS, to derive the content of the “you could also look at…” pane. By clustering what comes back, we could surface some subject areas that might be useful. Clustering is tricky, of course, for the same reason that metasearch is tricky: we’re working with very little text, of dissimilar lengths. But if we could pull back the full text of documents via the OpenURL links GS provides, and then cluster that, we might have some useful information. Again, a library might benefit from collaboration with some area of information science research, particularly on the semantic aspects. The biggest challenge here would be doing something that doesn’t introduce significant delay (and thus annoyance); we might manage it by offering this as an option to users (as in, “good stuff here, but think you might want more and better?”).
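
For the clustering idea, here is a minimal sketch using off-the-shelf scikit-learn rather than anything bespoke, and assuming the full text has already been pulled back via the OpenURL links (the fetching itself is not shown):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_topics(docs, k=3):
    """Cluster fetched documents and return the top terms in each cluster
    as candidate "you could also look at..." subject areas."""
    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    tfidf = vec.fit_transform(docs)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(tfidf)
    terms = vec.get_feature_names_out()
    topics = []
    for c in range(k):
        centroid = tfidf[labels == c].mean(axis=0).A1  # mean tf-idf per term
        topics.append([terms[i] for i in centroid.argsort()[::-1][:5]])
    return topics
```

Whether the resulting term clusters would be coherent enough to show users is exactly the open question.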

Our challenge is to help our users through the maze of e-resources without interrupting their journey, getting them to results as quickly as possible; by combining results from Google Scholar with licensed resources, we can help them get fast results and become more aware of the wealth of resources available to them. All of these ideas are off-the-cuff and purposely sketchy. Ken and I have spent little time exploring the opportunities or pitfalls. Some approaches will lend themselves to collaboration more than others (e.g., collaboration with HCI and linguistics researchers), but all would benefit from further study. (How much more effective is a given approach than traditional metasearch? Than Google Scholar alone? How satisfied is the user with the experience compared to the alternatives?)

Notes
[1] Note the interestingly self-serving article by Tamar Sadeh, from Ex Libris, where she concludes, “Metasearch systems have several advantages over Google Scholar. We anticipate that in the foreseeable future, libraries will continue to provide access to their electronic collections via their branded, controlled metasearch system” (HEP Libraries Webzine, Issue 12 / March 2006, http://library.cern.ch/heplw/12/papers/1/).

[2] Haya, Glenn, Else Nygren, and Wilhelm Widmark. “Metalib and Google Scholar: a user study.” Online Information Review 31(3) (2007): 365-375. I found one review of the article by an enlightened librarian where he concludes that the moral of the study is that we need to do a better job training our users to use metasearch.

26 comments:

  1. Jeremy Frumkin, 2007/11/07: The definition of metasearch presented here is narrower in scope than what I define metasearch to be. Metasearch and federated search are not synonyms; metasearch can incorporate federated search, but is not limited to it.

  2. Jeremy Frumkin, 2007/11/07: Actually, I have heard others talk about bringing all of the information together in a single database, and I myself have put forward the idea (most recently I at least referred to it in my LITA keynote). The problem is not that it is crazy talk; the largest barrier isn’t technical, but lies in our licensing and business agreements with database and content providers.

  3. Jeremy Frumkin, 2007/11/07: I wouldn’t say that metasearch (as you’ve defined it) is an idealistic solution; it’s a pragmatic bootstrap.

  4. Jeremy Frumkin, 2007/11/07: One of the problems you describe, namely the issue of searching across a small group of targeted collections vs. searching across a large aggregate, is not a problem limited to metasearch tools, but one that extends to the likes of Google Scholar et al. I believe you are trying to make the point that current federated search tools are limited to searching across a small number of collections, and this is true due to the latencies involved in dynamically querying disparate collections. However, you also downplay the time-lag issue here, and I don’t believe it is at all a small consideration with regard to the barriers inherent in federated search.

    Getting back to small, focused searches vs. searching everything, this is an extremely important point, one that you get to later in your post, and perhaps one that deserves to be broken out here in more detail.

  5. Jeremy Frumkin, 2007/11/07: Check your sources here – the invisible (or deep) web is not the same in scope as our licensed content. OEDB makes a bad reference.

    Again, you state a conclusion (the obvious place to go is google scholar) before you state your reasoning.

  6. Jeremy Frumkin, 2007/11/07: You touch on, but don’t delve into, a very important point here – the current licensing agreements and business models work against increased discovery of resources. If there were a move towards pay-per-use, instead of blanket licensing, there would be financial incentive for content providers to work with libraries, google, etc. to ensure the greatest amount of discoverability possible.

  7. jpw, 2007/11/07: Hmm. I suppose I should watch my syntax more closely. My only reference to OEDB was for its ironic characterization of the invisible/deep web (not for its coverage). Is the invisible web the same as our licensed content? No, they’re not the same; however, our licensed resources are frequently, even typically, part of the deep web. So, to be clearer, what I should have said is that “Scopus is just one of the most recent entrants in the arena of attempting to index that part of the deep web (licensed resources) that is my particular concern” (and shelved my pot-shot at OEDB).

  8. Jeremy Frumkin, 2007/11/07: I’m not sure there is an obvious disadvantage to making these assumptions about a user; perhaps if this approach were utilized and went no further, but as a starting point for providing a more tailored service to a user, it seems this would have more positives than negatives. The more we have the opportunity to understand user workflows related to information discovery, the more we can extend and combine approaches that reduce the time and effort a user needs to get the information they need.

  9. Jeremy Frumkin, 2007/11/07: In our first instance of LibraryFind, we actually did this; the problem was, even in the first instance, that we did not have rich enough descriptions of our licensed databases with which to match (even with linguistic computation) user queries, and therefore the recommendations and mappings to particular databases were generally unhelpful at best. That being said, this is an approach that would benefit from a concerted effort, and could prove fruitful if the barriers you describe can be overcome.

  10. Jeremy Frumkin, 2007/11/07: This is perhaps useful, but seems secondary. Our experience with usability and user testing strongly indicates that, at least for an undergraduate audience, users are unlikely to use auxiliary features: their most common techniques are to (a) use a different search tool, or (b) perform a different search. This has so far proven consistent whether the feature is a suggestion feature such as you describe, or even faceting (which is now all the rage). The only feature which so far has not followed this rule is spell-correction/recommendation (i.e., Google’s ‘did you mean…’ type of feature).

  11. Jeremy Frumkin, 2007/11/07: Again, users aren’t interested in options (Librarians are, but general users aren’t). However, take this concept one step further – don’t force the user to step through the process of selecting other items they could look at; put that under the hood and include those ‘better’ items in their results. Any extra lifting you force the user to do will be viewed as a barrier to using whatever tool / service is being provided.

  12. Jeremy Frumkin, 2007/11/07: Exactly! Well, almost – I would say the goal is to get the user to their needed resources as quickly as possible; the result set is one step in that process.

  13. David Walker, 2007/11/07: I wonder if selecting databases based on the user’s query is ever going to be feasible.

    I don’t think the answer here is richer descriptions of the databases, honestly. The problem we face is really on the other end: users often enter very simple terms, with a large number of unstated assumptions behind their queries.

  14. Jonathan Rochkind, 2007/11/07: Tito Sierra from NCSU presented on an experiment in doing this at the last Code4Lib. He crawled course descriptions and faculty web pages to collect a corpus of words corresponding to a particular discipline. He then indexed it with some off-the-shelf open source indexing system (I forget if it was Lucene), and I believe used the standard built-in relevancy ranking to determine which discipline’s corpus best matched a given query.

    He said it was actually fairly effective, although of course not perfect.

    You can do a surprising amount with standard tools these days, without custom stuff written by ‘an expert in computational linguistics’.

    Tito was just using the correspondence to deliver the proper ‘subject guide’. But a number of us in the audience immediately realized the applicability of this technique to automatically running a metasearch against the appropriate subject/discipline set of databases.

    I haven’t had time to explore this technique myself yet, but still hope to sometime.

  15. Jonathan Rochkind, 2007/11/07: Of course, we used to have more pay-per-use licenses, but have transitioned to the present standard of blanket licenses, which at the time we thought was a huge win for us.

    While I definitely see what you’re saying, there are still lots of reasons to be resistant to any move back to pay-per-use. A lot of the ideas presented in this essay rely on us, at the library, doing automatic searches of various databases in the background that the user didn’t specifically ask for, and presenting results to the user just in case they are useful. If we were paying per use, that would be a _disincentive_ for us to provide those kinds of services. As a developer, I’d have a lot less latitude to experiment with those kinds of features if introducing one would cost my organization money.

    On the other hand, I suppose if the pay-per-use were only for an actual fulltext view, not for a search, that might be more reasonable. But many of the databases we might want to search don’t offer fulltext, and even for those that do, the actual fulltext viewed (via link resolver) might not come from the same provider that delivered the search results, so I’m not sure that business model would work for the providers.

    Even if it would, there are good reasons we moved _to_ blanket licenses instead of pay-per-use; I’d be very cautious about moving backwards.

  16. Jonathan Rochkind, 2007/11/07: I agree with you completely, Jeremy. But it’s also good to remember that at their best the ‘social’ tools can in fact be _about_ discovery. There is such a thing as social discovery.

    But yeah, that’s no excuse for leaving our core-mission services in such a sorry state.

  17. Emily Lynema, 2007/11/08: Actually, Tito is directing users to a list of databases in a particular subject area, not just a subject guide. So it’s a more direct connection from the user’s search terms to databases than it sounds (although it doesn’t actually pass the query through to the database itself).

    Here’s an example where I queried ‘deforestation’ on our library website. You can see that the ‘specialty research databases’ listed at the top are for the subject areas environmental science, forestry, plant biology, atmospheric science, etc. Not too bad.

  18. Jonathan Rochkind, 2007/11/08: To toot my own horn, this is the topic of my article published last February in Library Journal:

    http://www.libraryjournal.com/article/CA6413442.html

  19. Jonathan Rochkind, 2007/11/08: Excellent. The next step is automatically using metasearch to _query_ those identified databases, and showing the user the results of that query, not just the databases that the user might want to query. Do it for them, and show them the results.

    Of course, I’m not sure my MetaLib instance could handle that level of traffic. When I think about doing this, I think about automatically querying in, say, 25% of searches, to see what happens without killing my MetaLib server. MetaLib is a resource hog.

  20. jpw, 2007/11/09: Nice cite, and a very important thread in the argument about how to make discovery work. I should modify my statement about “crazy talk”! I’m fairly sure I don’t buy the argument that we should be negotiating for local loading of this data, however. Having some competitors for GS would be great, but advocating for sending users back to local discovery mechanisms, however successful, runs counter to taking advantage of the natural draw users feel toward higher-profile network-based services. That’s another topic, though, and the argument you make is an important one to take into account.

  21. Sue Woodson, 2007/11/10: Has anyone tried to negotiate a license to locally house sets of databases? In the past lots of libraries loaded licensed content on local servers. At Hopkins we loaded both Wilson and SilverPlatter databases locally and only moved to vendor-served databases beginning in the late ’90s. We abandoned local loading in part because of the resources it took to manage and update the databases and the search engines.

    To me it’s not so much a crazy idea as it is a very expensive idea.

    It’s the kind of project that could be taken up by a consortium of research libraries. That approach would spread out the cost of supporting a ‘metabase’ and pull together a larger set of staff to cooperate in the development.

    Because, of course, after you get all the databases together you still have to develop algorithms that work across disparate types of data, and mechanisms for helping searchers move beyond the first set of results. Searching is, after all, only the first step.

  22. Mita, 2007/11/18: (1) is not crazy talk. In fact, it’s been done.

    OCUL, a Canadian consortium of university libraries from Ontario, negotiated with the major publishers (Elsevier, Springer, and the other usual suspects) and loaded their content onto a single server called Scholars Portal (http://www.scholarsportal.info/).

    Not only does this make the librarians involved feel better about the future preservation of this content, it also opens the opportunity to develop search beyond the constraints of metasearch.

    Scholars Portal will be moving to a new server, and a colleague and I have written about the possibilities for its development in a white paper called Scholr 2.0:
    http://www.scholarsportal.info/commentpress/

  23. Tori, 2007/11/18: A well-thought-out and helpful essay. I enjoyed the breakdown of possible solutions *very* much. However, one thing I always find missing in discussions of search (GS or otherwise) is its natural connection to fulfillment, or access. Finding something online is great. Now how do I “get” it in my hands or on my screen? Sometimes search ends up being a frustrating exercise in glimpsing a dangling carrot without ever being able to grasp it. I know this opens up the Pandora’s box of copyfight and ownership issues, but someone needs to acknowledge the elephant in the room at some point.

  24. WoW!ter, 2007/11/18: Perhaps of interest to you: Tamar Sadeh’s (Ex Libris) definitions of metasearch and federated search:
    Sadeh, T. (2006). Google Scholar versus metasearch systems. High Energy Physics Libraries Webzine (12). http://library.cern.ch/HEPLW/12/papers/1/

  25. Jeroan Bosman, 2007/11/30: Local indexing is also a reality in Utrecht, The Netherlands. Our system, called Omega (at http://omega.library.uu.nl/seal/omegasearch.php?lan=en), so far manages to search some 70% of our licensed journals. It is hard work (getting the data, building and updating the filters, tweaking the relevancy ranking), but it can be done. And the system is intensively used, by undergraduates in particular. The unique selling point is that each and every result is available in full text with one click. Another good thing is that the system is much, much faster than the typical federated solution. The main concerns right now are getting metadata from a remaining long list of smaller publishers and, another pressing issue, deciding how to approach local indexing of ebook content.

  26. Tito Sierra, 2007/12/05: Sorry for coming to this conversation so late… just stumbled upon this fine blog post today. The experiment Jonathan R. is referring to is called Smart Subjects (just google ‘smart subjects’ for more info). It basically attempts to do what John W. describes in strategy 2. It takes an arbitrary user query as input and returns a short list of related library subjects. Rather than rely on database descriptions, it searches across a large corpus of topical terms as a surrogate description. Since the system is based on structured search indices, the response time is milliseconds.

    Our ERM system has a short list of article databases mapped to each of these subjects, so it is possible to return a target list of eight databases based on a user query. I would just need to combine and de-dupe the databases associated with our library subjects. Jonathan, the only reason our current implementation doesn’t just send the user’s query into a targeted metasearch environment, or a specific database, is that our production metasearch application can’t handle targeting of this sort. When I have some time I would like to continue with this experiment and create an explicit database recommendation service. It would ask users to describe, in a few words, the topic they are interested in, and it would output a list of suggested databases to send their search to.

    That said, David’s comment is spot on. If the user types in subject-specific terms such as “toxicity”, we can make some inferences. If the user enters something vaguer, auto-selection of database targets produces false positives. Taking David’s query example of ‘california indians’, my system recommends a variety of subjects, including Education, Psychology, and History. This false-positive problem could be addressed algorithmically by dropping recommendations that are too dissimilar from each other; in other words, only providing recommendations when there is a clear topical signal. Again, this is something I would like to experiment with in the future. It’s unclear whether this approach will solve the problems John W. describes with metasearch, but we ought to experiment just in case.
