<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.3.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>
<channel>
	<title>Comments on: Search engine relevance - an empirical test</title>
	<link>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/</link>
	<description></description>
	<pubDate>Sat, 06 Sep 2008 04:41:11 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.3.3</generator>
		<item>
		<title>By: Liam</title>
		<link>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-457</link>
		<dc:creator>Liam</dc:creator>
		<pubDate>Tue, 26 Aug 2008 15:57:44 +0000</pubDate>
		<guid>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-457</guid>
		<description>Outstanding Brendan! Get you a case of beer for that one.</description>
		<content:encoded><![CDATA[<p>Outstanding Brendan! Get you a case of beer for that one.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: brendano</title>
		<link>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-337</link>
		<dc:creator>brendano</dc:creator>
		<pubDate>Tue, 17 Jun 2008 17:23:46 +0000</pubDate>
		<guid>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-337</guid>
		<description>Jeff, thanks for the suggestion.  I agree that the metric is extremely simple.  I actually experimented a bit with other metrics that fit into this absolute judgment framework, including the count of high-relevance documents in the top five as you mentioned.  The results are about the same.  I think the raw AOL query set is just pretty easy for standard search engines today -- lots of 1 or 2 word topic-oriented queries.

I skimmed through the Carterette paper and it's interesting.  My concern with the pairwise setup is that, in order to get comparability among query-result pairs, you need to get annotators to do an O(N^2) amount of work.  (Unless you do something horribly complicated with partial orders.)  The absolute judgment task scales linearly, of course.  Given the AMT environment and a fixed budget, if I stay in the smaller-volume task, instead of spending a lot on a quadratic taskload, I can simply get a higher number of workers per result and boil out more noise.  Of course, if it's true the pairwise judgment task is easier -- as the paper claims -- that might make my spending more efficient.  But since the pairwise workload grows quadratically, no matter the cost/benefit ratios, there has to be a tipping point: a data set size beyond which you'd always want to switch back to absolute judgments.

Absolute judgments are just so much easier to compute with -- both for analysis and to use as machine learning training data.  I really don't want to have fancy utility inference or stopping rule schemes just to know the relative ranking of my data.  (And I think real-valued scores will always become a necessity.  Theoretical microeconomists have made boatloads of theorems about representing preferences by pairwise comparisons.  It turns out that when you add enough rationality assumptions -- e.g. the sort that are demanded of search engine ranking tasks anyways -- then your fancy ordering can always be mapped back to a real-valued utility function.)

I'd be most interested in a paper that compares real-valued scores derived from some sort of pairwise comparison task, versus absolute judgments, and is mindful of the cost tradeoffs in service of an actual goal, like ranking algorithm training.</description>
		<content:encoded><![CDATA[<p>Jeff, thanks for the suggestion.  I agree that the metric is extremely simple.  I actually experimented a bit with other metrics that fit into this absolute judgment framework, including the count of high-relevance documents in the top five as you mentioned.  The results are about the same.  I think the raw AOL query set is just pretty easy for standard search engines today &#8212; lots of 1 or 2 word topic-oriented queries.</p>
<p>I skimmed through the Carterette paper and it&#8217;s interesting.  My concern with the pairwise setup is that, in order to get comparability among query-result pairs, you need to get annotators to do an O(N^2) amount of work.  (Unless you do something horribly complicated with partial orders.)  The absolute judgment task scales linearly, of course.  Given the AMT environment and a fixed budget, if I stay in the smaller-volume task, instead of spending a lot on a quadratic taskload, I can simply get a higher number of workers per result and boil out more noise.  Of course, if it&#8217;s true the pairwise judgment task is easier &#8212; as the paper claims &#8212; that might make my spending more efficient.  But since the pairwise workload grows quadratically, no matter the cost/benefit ratios, there has to be a tipping point: a data set size beyond which you&#8217;d always want to switch back to absolute judgments.</p>
<p>Absolute judgments are just so much easier to compute with &#8212; both for analysis and to use as machine learning training data.  I really don&#8217;t want to have fancy utility inference or stopping rule schemes just to know the relative ranking of my data.  (And I think real-valued scores will always become a necessity.  Theoretical microeconomists have made boatloads of theorems about representing preferences by pairwise comparisons.  It turns out that when you add enough rationality assumptions &#8212; e.g. the sort that are demanded of search engine ranking tasks anyways &#8212; then your fancy ordering can always be mapped back to a real-valued utility function.)</p>
<p>I&#8217;d be most interested in a paper that compares real-valued scores derived from some sort of pairwise comparison task, versus absolute judgments, and is mindful of the cost tradeoffs in service of an actual goal, like ranking algorithm training.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jeff Dalton</title>
		<link>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-331</link>
		<dc:creator>Jeff Dalton</dc:creator>
		<pubDate>Tue, 17 Jun 2008 12:55:15 +0000</pubDate>
		<guid>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-331</guid>
		<description>My biggest suggestion is to work on the evaluation metric used.  Precision @5 is the number of relevant retrieved / retrieved for the top five.  Your metric of having at least one highly relevant result in the top five isn't p@5 and seems easy to attain.  

For the future, I would suggest using pairwise preference judgments as an alternative (Here or There: Preference Judgments for Relevance by Ben Carterette, et al.).</description>
		<content:encoded><![CDATA[<p>My biggest suggestion is to work on the evaluation metric used.  Precision @5 is the number of relevant retrieved / retrieved for the top five.  Your metric of having at least one highly relevant result in the top five isn&#8217;t p@5 and seems easy to attain.  </p>
<p>For the future, I would suggest using pairwise preference judgments as an alternative (Here or There: Preference Judgments for Relevance by Ben Carterette, et al.).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tim Converse</title>
		<link>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-164</link>
		<dc:creator>Tim Converse</dc:creator>
		<pubDate>Fri, 18 Apr 2008 08:56:49 +0000</pubDate>
		<guid>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-164</guid>
		<description>Entertaining and well-written, but I feel like I've seen it all before.</description>
		<content:encoded><![CDATA[<p>Entertaining and well-written, but I feel like I&#8217;ve seen it all before.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel B</title>
		<link>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-147</link>
		<dc:creator>Daniel B</dc:creator>
		<pubDate>Tue, 08 Apr 2008 00:29:07 +0000</pubDate>
		<guid>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-147</guid>
		<description>Nice idea to use Mechanical Turk for this test. I think though that the contents of the actual page being listed are far more important than the listing text shown in the search engine for determining relevancy. I know I have clicked on quite a few search results that appeared relevant, only to be taken to a 'made for adsense' page full of automatically generated / collated text. The opposite is also true, where the listing doesn't seem relevant but the site is exactly what you are looking for.

If you ever do a version 2 then the user's level of satisfaction with the page should be the key metric.  I understand that could be difficult with users who aren't actually looking for anything though.</description>
		<content:encoded><![CDATA[<p>Nice idea to use Mechanical Turk for this test. I think though that the contents of the actual page being listed are far more important than the listing text shown in the search engine for determining relevancy. I know I have clicked on quite a few search results that appeared relevant, only to be taken to a &#8216;made for adsense&#8217; page full of automatically generated / collated text. The opposite is also true, where the listing doesn&#8217;t seem relevant but the site is exactly what you are looking for.</p>
<p>If you ever do a version 2 then the user&#8217;s level of satisfaction with the page should be the key metric.  I understand that could be difficult with users who aren&#8217;t actually looking for anything though.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: brendano</title>
		<link>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-136</link>
		<dc:creator>brendano</dc:creator>
		<pubDate>Fri, 04 Apr 2008 03:59:08 +0000</pubDate>
		<guid>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-136</guid>
		<description>Glad you appreciate all the methodology, I was afraid people would get bored :)  I usually don't trust data either when it lacks discussion on where it came from; on the other hand, it's such a pain to write it all up, especially for a more light-hearted experiment like the color wheel!</description>
		<content:encoded><![CDATA[<p>Glad you appreciate all the methodology, I was afraid people would get bored :)  I usually don&#8217;t trust data either when it lacks discussion on where it came from; on the other hand, it&#8217;s such a pain to write it all up, especially for a more light-hearted experiment like the color wheel!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: david</title>
		<link>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-135</link>
		<dc:creator>david</dc:creator>
		<pubDate>Fri, 04 Apr 2008 01:05:13 +0000</pubDate>
		<guid>http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/#comment-135</guid>
		<description>I'm really glad to see greater discussion of methodology here. In general, when it comes to data without methodology addressed, I assume the absolute worst about it. Whether or not I read all the details (in this case I did), I still like to know that you aren't afraid to share them.

I'd also be interested to see how different the results would be if you pulled, say, the 15th-20th results instead of 1-5. I feel like that deep you'd begin to see divergence in how relevant the results are. But then, given these findings, maybe not.</description>
		<content:encoded><![CDATA[<p>I&#8217;m really glad to see greater discussion of methodology here. In general, when it comes to data without methodology addressed, I assume the absolute worst about it. Whether or not I read all the details (in this case I did), I still like to know that you aren&#8217;t afraid to share them.</p>
<p>I&#8217;d also be interested to see how different the results would be if you pulled, say, the 15th-20th results instead of 1-5. I feel like that deep you&#8217;d begin to see divergence in how relevant the results are. But then, given these findings, maybe not.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
