The CrowdFlower Blog

Bing is an Improvement over Live, but Still Not Google Quality: Evaluating Bing With Mechanical Turk

Microsoft’s new search engine, Bing, has recently gotten a lot of attention. Several people have already built tools to compare Google and Bing.

Since all the engines are fairly similar, it’s hard to separate true quality from our preconceptions. For example, one of Google’s internal tests is reported to have shown that “users still prefer the results with the Google logo, even if they’re not Google results.”

Are the new Bing search results really better than the old Live search results? Are they better than Google?

We took 100 random real-world queries and showed their results from each engine to workers on Mechanical Turk. For a single query, we showed the results from two engines side-by-side and asked workers to judge which result set was better. For each query, here’s the aggregate judgment from several workers:

Bing versus Google

Bing (Microsoft today) versus Live (Microsoft as of March)

Summary

We found that Google is preferred to Bing by a statistically significant margin (p < 0.04), though the difference is rather small: Google is preferred on 55 percent of the queries, and on average it scores two tenths of a standard deviation better than Bing (0.141 on a four-point scale).

On the other hand, we found that users preferred Bing's new results to the older Live search results 55% of the time. But this result wasn't statistically significant -- they're virtually tied in aggregate.

In conclusion, Bing's quality seems to be improving, but it hasn't yet caught up with Google. Of course, relevance is just one component of the search engine user experience. It's clear that all the major engines are quite close, and there exists a large set of queries where Bing significantly outperforms Google.

Details

First, we randomly sampled a query set from the leaked AOL queries, which are probably still the best, most representative data available on web search queries. We ran 100 of them on the old Microsoft Live search back in March for a previous project, and last week we scraped Google and Bing.

Note that our scrapes ignore the Google “One Box” results that you see above the search results for many queries that include news, pictures, etc. We threw out the results for several common navigational queries where Bing returns only one result (myspace, aol, etc.) — these probably don’t affect the results much since all the engines do very well.

There are many ways to evaluate relevance. For this experiment, we chose to show people the results of two engines at a time, side-by-side and unbranded. You can see exactly what the turkers saw here. We randomized the left and right engines.
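To illustrate the randomized, unbranded side-by-side setup, here is a minimal sketch. It is not the code we ran; the function names (`assign_sides`, `unflip`), the task dictionary layout, and the convention that a positive score favors the engine shown as "Engine A" are all assumptions made for the example.

```python
import random

def assign_sides(query, results_x, results_y, rng):
    """Build one unbranded comparison task (hypothetical layout).

    With probability 1/2 the two engines swap sides, so neither engine
    is systematically shown as "Engine A".
    """
    flipped = rng.random() < 0.5
    if flipped:
        engine_a, engine_b = results_y, results_x
    else:
        engine_a, engine_b = results_x, results_y
    return {
        "query": query,
        "engine_a": engine_a,
        "engine_b": engine_b,
        "flipped": flipped,  # kept private, never shown to workers
    }

def unflip(score, flipped):
    """Convert an "A versus B" score back to "X versus Y".

    Assumed convention: positive means Engine A was preferred.
    If the sides were swapped, Engine A was really engine Y,
    so the sign is reversed.
    """
    return -score if flipped else score

rng = random.Random(2009)  # fixed seed only to make the sketch repeatable
task = assign_sides("above ground pools", ["x1", "x2"], ["y1", "y2"], rng)
```

Recording the `flipped` flag is what lets the judgments be mapped back to the real engines after the workers have answered.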

The possible answers were “Engine A much better”, “Engine A slightly better”, etc. We averaged the results over 6-8 workers for every pair of engines, mapping responses to {-2, -1, +1, +2}.
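The scoring and aggregation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual analysis code: the two "Engine B" labels, the choice of which side gets the positive sign, and the function names are hypothetical. The significance check shown is a one-sample t statistic against a mean of zero, which is one plausible test for this kind of data.

```python
import math
import statistics

# Assumed label-to-score mapping; only the two "Engine A" labels and the
# set {-2, -1, +1, +2} come from the post. The sign convention is a guess.
SCORES = {
    "Engine A much better": 2,
    "Engine A slightly better": 1,
    "Engine B slightly better": -1,
    "Engine B much better": -2,
}

def query_score(responses):
    """Average the mapped scores from the workers who judged one query."""
    return statistics.mean(SCORES[r] for r in responses)

def t_statistic(per_query_scores):
    """One-sample t statistic for the null hypothesis that the mean is 0.

    Turning this into a p-value requires the t distribution with
    n - 1 degrees of freedom, which the standard library does not provide.
    """
    n = len(per_query_scores)
    mean = statistics.mean(per_query_scores)
    sd = statistics.stdev(per_query_scores)  # sample standard deviation
    return mean / (sd / math.sqrt(n))

# Toy data for two queries, invented purely for illustration.
toy = [
    query_score(["Engine A much better", "Engine B slightly better"]),
    query_score(["Engine A slightly better", "Engine A slightly better"]),
]
```

If the sides were randomized per task, each score would need its sign restored to a fixed engine order before averaging.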

A histogram of the raw judgments on Google versus Bing:

Thanks to Brendan for help on this post.

Updates:
Will points out that we should mention Brendan and I both worked at Powerset, which was acquired by Microsoft after we left.

Hang has a blog response and takes issue with our graphs; my response is in his comments: http://blog.figuringshitout.com/another-way-to-lie-with-statistics


Comments

  1. Perry

    Very interesting.

    Was Google always “Engine A”? I’m wondering if the results aren’t skewed toward whichever engine is shown first. It would be interesting to re-run the tests with the engine order reversed.


  2. I’d like to see this done again with a larger sample, more recent queries, and some evaluation of Turker quality. Of course, I don’t know where to get the query log data.

    Using queries that are 2 1/2 years old (e.g. [Spring 2006 Fashion]) is probably not representative for today’s searches.

    Did the Turkers have the option to say “neither”? I found that “neither” was my answer for most queries I ran. So I suspect many of the -1 and +1 results may be noise. For instance, looking at the presumably navigational query [fidelty.com], the Turkers overwhelmingly prefer Google, I slightly prefer Bing’s results, but it’s really really close (and likely depends on whether the subtitled link results match the user’s intent).

    How much agreement was there among the Turkers? Assuming the vertical axis in the first plot is the sum of the votes (what you call “aggregate”), there’s not much difference for most of the queries.

    We’ve found this kind of third-party evaluation difficult for query logs (we’ve tried it with several customers). Neither we nor the subject expert judges could tell for many cases what the query submitter’s intent was. For instance, what’s the user’s informational need when they issue queries in your evaluation set like [music], [natalie], [credit cards], or [above ground pools]?

    Instructions also matter (you didn’t show us those). Was it the overall results being judged in terms of which results page you’d rather see? Or which has the single result you’d most likely click? Which supplies most overall information about the topic in snippets? Did they click through to see the pages, or just judge the results pages?


  3. Perry – A vs B was randomized. So no order effects.


  4. It might be rational to prefer the results that have the Google logo in real-world tests, even if they’re exactly the same — Microsoft has a history of censoring their search results, so it might mean something different if you can’t find what you’re looking for in the Google results than if you can’t find it in the Microsoft results. In particular, if you can’t find it in the Google results, you can be pretty sure it’s not there or you’re using the wrong query, but if you can’t find it in the Microsoft results, they might just not want you to find it.


  5. Let me add the disclaimer you forgot to add to your study: you used to work at Powerset on technology later purchased by Microsoft and incorporated, in part, in Bing.com. I’m not claiming bias, but it’s still worth the disclaimer.


  6. Oh, and my own disclaimer, of course: I still work at Powerset, and am a Microsoft employee.


  7. Lukas: the first two graphs appear to be the result of people simply guessing largely at random. I generated a null hypothesis graph and I commented on this issue at my blog:

    http://blog.figuringshitout.com/another-way-to-lie-with-statistics

    Cheers
    Hang


  8. For a quick and simple way to “taste test” the difference yourself check out the amazing Blind Search – http://blindsearch.fejus.com/

    Adds Yahoo to the mix and removes the logos so you don’t know


  9. David

    Kragen – tinfoil hat much?


  10. Lukas, what test did you run to confirm the statistical significance?


  11. Xianhang: that’s what the null hypothesis tests are for. The first graph (bing vs google) has about p=0.04, which means that if people were guessing randomly, there’s less than a 4% chance we would have seen the sort of results we saw. 5% is the customary threshold for “statistical significance.” The second graph, as we stated, could have been due to chance (its p-value was higher).

    That’s why we say we think google is slightly better than bing, but it’s a little bit of a wash whether bing is better than live.

    Panos: i think it was a one-sample t-test of the differences, but i’m not sure


  12. err, i don’t mean differences, i mean the comparison responses (of course)


  13. Let me put it this way, even though I switched my default search engine to Bing, I consistently have to go back to Google to get what I want (after searching on Bing first).

    I would love for Bing to be better, for the simple reason that even Google doesn’t do the best job in search.

    But Bing is just not competing, IMO – look at the difference between these two queries:
    http://tinyurl.com/l52fkt and http://tinyurl.com/ltbyl9

    It looks to me that Google understands user intent better


  14. Do any of the results vary much by location, or previous search history? Google’s results seem to vary significantly by location, for example, which is frequently annoying, but I guess it must affect the results for the better if the search engines are doing it.


  15. Hi Anil,

    I had to write a post on this topic after reading yours. Here’s a link: http://blog.gadodia.net/bing-vs-google-no-fancy-analytics-pure-personal-experience/


  16. David

    @Offbeatmammal: The problem with Blind Search is that the search engine used to retrieve each column of results is clearly marked in the page’s source code. Anyone who can use “View Source” can check which column is which and vote accordingly.



  18. Very much a prompt reply :)


  19. Similar to the experiment using 100 random queries, here is another example of the same type, where users can plug in their queries and select the most relevant search engine themselves: bset.royans.net

    Google and Yahoo both seem to be much better than Bing, though Google is the leader by a long margin. What’s also interesting is that different search engines might be better for different types of content (or the difference could be based on location or language).


  20. Hi, I´ve contacted you by email some days ago and did not have any feedback. My email address is in the email box


  21. Bing is currently dropping sites and pages like there’s no tomorrow (just like MSN always has done). I’ve noticed this with several sites; they do come back, though, in the majority of cases.
    Bing is a joke.. the all new search engine, yet behaves exactly the same as MSN and returns the same results.


  22. I’d like to know if someone has already carried out a blind test on search engine results.

    It could be very interesting to ask users which results are considered best without knowing the name of the search engine that produced that particular SERP. This could avoid the risk that users’ opinions could be influenced by brand related ‘noise’.

    Obviously the results could be compared only on a semantic basis, but I think that the statistical reliability could be significantly better.


©2011 CrowdFlower