Search engines control the information we see and use. Their key component is a ranking algorithm that tries to determine the most relevant web pages for your query. How good are these algorithms? Publicly, there’s a lot of hype, while privately, all the big engines run proprietary quality evaluation efforts. But there’s virtually no real data out there for the rest of us.
Using Mechanical Turk, we can evaluate engine relevance. We tried an experiment where we took five hundred queries and ran them against the top 4 English language web search engines: Ask, Google, Live, and Yahoo. The queries were a random sample from a real-world set of search queries. We had annotators rate the relevance of the top five results for each engine. Our results:
Ask clearly performed the worst. The other three engines were in a statistical tie. Their ordering was Google, Yahoo, then Live, but the differences were miniscule: the top 3 engines all answer about 80% of queries effectively.
What do these results mean?
People often talk about Google as being the most relevant search engine, with the best algorithms and the like. This study finds little evidence to support that. Sure, our methods are preliminary and could be improved in any number of ways; we can probably shrink those error bars and find more statistical differences. However, it is the case that for 500 typical queries, a rough but pretty objective measurement of search quality found that Google, Live, and Yahoo all performed about the same.
Note that these results don’t speak to the entire user experience. To be able to compare between engines, we extracted only the core web results with their titles, urls, and snippets. But a search engine also includes much more: the presentation, branding, video and image results, ads, etc. We only tested the relevance of core web search.
Many more details below.
How we’re measuring search relevance
Evaluating search engine quality is a tricky task. Here’s our first pass on the methodology.
We take a set of queries and run them against several search engines, scraping their web interfaces. We then show the query and results to human raters, asking them how relevant each result is. It’s a blind test: they don’t know which engines the results came from.
Here is an example of what a rater sees:
The raters all come from Amazon Mechanical Turk, a distributed workforce. We submit the above query/result judgment surveys to the AMT service, and pay its users – “Turkers” – to do the relevance ratings. (If you want to learn more, try looking at the Dolores Labs FAQ.)
What queries are being used? We took a random sample from the AOL query log data set; these are actual queries that real users typed in to a search engine. The AOL data set is remarkable for being pretty much the only publicly available, real-world data on web search behavior. Of course, it’s infamous for very valid privacy issues. We’re only using the part of it that doesn’t involve personal information – the raw queries, without user identification. (This NYT article on the issue is interesting.)
What’s being measured? For each engine, we count the number of queries that had at least one “Highly Relevant” result within the first five results the engine returned. This is a version of the “precision at 5″ metric from information retrieval. There are, of course, many other methods to explore. We wanted a metric that was simple and easy to interpret.
How were raters’ judgments used? We had three raters per result, and basically took a simple majority vote. We didn’t attempt to model individual annotator biases.
Are these judgments trustworthy? The ratings are certainly noisy. And sure, the workers have little training and are (somewhat) anonymous to us. However, the relevance judgment task is fairly subjective and therefore inherently noisy. Further, it’s arguably better to use untrained annotators, since this more closely mimics normal search users. And finally, we’re finding some statistically significant, systematic differences between engines on a query set, with only extremely simple analysis – so something must be working right.
What’s the statistical methodology? As dead simple as possible: 95% confidence intervals on the graph, and engine comparisons via paired t-tests. These are all on that per-query precision-at-5 metric. We think that with larger scale experiments, more fine-grained breakdowns, survey design improvements, better analyses, etc., we can flesh out more differences.
What’s the “meta-engine upper bound”? That’s just how many queries had a “Highly Relevant” result on at least one engine. So hypothetically, if you were to combine all the engines and select the best results for the top, it would perform at this upper bound. This bound is overly high for a number of reasons (e.g. it’s artificially inflated by judgment noise and it assumes smart re-ranking); but it gives some idea how much the search engines could still improve.
Anyway, we’re thinking of doing more work along these lines if people are interested. There are certainly big improvements that could be made; we’d love any feedback you have.


david
04/04/08
I’m really glad to see greater discussion of methodology here. In general, when it comes to data without methodology addressed, I assume the absolute worst about it. Whether or not I read all the details (in this case I did), I still like to know that you aren’t afraid to share them.
I’d also be interested to see how different you got if you pulled, say, the 15th-20th results instead of 1-5. I feel like that deep you’d begin to see divergence in how relevant the results are. But then, given these findings, maybe not.
brendano
04/04/08
Glad you appreciate all the methodology, I was afraid people would get bored :) I usually don’t trust data either when it lacks discussion on where it came from; on the other hand, it’s such a pain to write it all up, especially for a more light-hearted experiment like the color wheel!
Daniel B
04/08/08
Nice idea to use mechanical turk for this test. I think though that the contents of the actual page being listed are far more important than the listing text shown in the search engine for determining relevancy. I know I have clicked on quite a few search results that appeared relevant, only to be taken to a ‘made for adsense’ page full of automatically generated / collated text. The opposite is also true where the listing doesn’t seem relevant but the iste is exactly what you are looking for.
If you ever do a version 2 then the users level of satisfaction with the page should be the key metric. I understand that could be difficult with users who aren’t actually looking for anything though.
Tim Converse
04/18/08
Entertaining and well-written, but I feel like I’ve seen it all before.
Jeff Dalton
06/17/08
My biggest suggestion is to work on the evaluation metric used. Precision @5 is the number of relevant retrieved / retrieved for the top five. Your metric of having at least one highly relevant in the top isn’t p@5 and seems easy to attain.
For the future, I would suggest using pairwise preference judgments as an alternative (Here or There: Preference Judgments for Relevance by Ben Carterette, et al.).
brendano
06/17/08
Jeff, thanks for the suggestion. I agree that the metric is extremely simple. I actually experimented a bit with other metrics that fit into this absolute judgment framework, including the count of high-relevance documents in the top five as you mentioned. The results are about the same. I think the raw AOL query set is just pretty easy for standard search engines today — lots of 1 or 2 word topic-oriented queries.
I skimmed through the Carterette paper and it’s interesting. My concern with pairwise setup is, in order to get comparability among query-result pairs, you need to get annotators to do an O(N^2) amount of work. (Unless you do something horribly complicated with partial orders.) The absolute judgment task scales linearly, of course. Given the AMT environment and a fixed budget, if I stay in the smaller-volume task, instead of spending a lot on a quadratic taskload, I can simply get a higher number of workers per result and boil out more noise. Of course, if it’s true the pairwise judgment task is easier — as the paper claims — that might make my spending more efficient. But since it’s polynomial, no matter the cost/benefit ratios, there has to be a tipping point where, for a given data set size, you’d always want to switch back to absolute judgments.
Absolute judgments are just so much easier to compute with — both for analysis and to use as machine learning training data. I really don’t want to have fancy utility inference or stopping rule schemes just to know the relative ranking of my data. (And I think real-valued scores will always become a necessity. Theoretical microeconomists have made boatloads of theorems about representing preferences by pairwise comparisons. It turns out that when you add enough rationality assumptions — e.g. the sort that are demanded of search engine ranking tasks anyways — then your fancy ordering can always be mapped back to real-valued utility function.)
I’d be most interested in a paper that compares real-valued scores derived from some sort of pairwise comparison task, versus absolute judgments, and is mindful of the cost tradeoffs in service of an actual goal, like ranking algorithm training.
Liam
08/26/08
Outstanding Brendan! Get you a case of beer for that one.
Vitaly
02/04/09
“Search engine relevance – an empirical test” – Just do search with a search engine, cut URL and paste it to an Insert Text or URL box in http://www.ClassEngine.com which a free online document/web content classification tool. You will see if returned categories are close to the search query.
csxk16
05/13/09
harb45 test test544343
q8qrr0
05/13/09
yugygu6756 tyu hffdrtd y guyg ug
0ajz9d
06/09/09
hjhggg6642 test test
angelacloulgE
08/02/10
Hello. Nice work. How can I subscribe to your site?