Crowdsourcing to find media bias: Hillary vs. Obama
As anyone who follows political races knows, different sources can report the same event in very different ways. We took nearly six thousand articles about Clinton and Obama from the past month and sent them to Mechanical Turk to be classified as favorable or unfavorable to the respective candidate. We scraped the articles from Google News, restricted to several sources, and threw in front-page headlines from Digg.
Here is the graph for favorability scores, aggregated by source. We found that Digg was far and away the most favorable for Obama.
The next graph tracks overall news favorability by date. To provide some context, we compared it with the change in Obama stock on the Intrade prediction market.
More details after the jump:
We created our data set by doing two separate searches, one for “Barack Obama” and one for “Hillary Clinton”. This did a pretty good job of ensuring that each result from Google News or Digg’s search facility showed how the article related to the given candidate. For each article, we showed the headline, search result snippet, and link to several Turkers. They reported whether it was positive, neutral, or negative toward the candidate.
The favorability metric was created by averaging the ratings across articles. Pro-Obama and anti-Hillary articles were both worth 1 point; anti-Obama and pro-Hillary both worth -1, and neutrals 0.
Therefore, if all articles are either positive towards Obama or negative towards Hillary, the rating is +100%; and vice-versa for -100%.
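The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the script we ran: the names `POINTS`, `favorability`, and `favorability_by_source` are hypothetical, the rows are made up, and treating every individual rating as one equally weighted row is an assumption.

```python
from collections import defaultdict

# Sign convention from the post: positive numbers favor Obama.
# Keys are (candidate the article was found under, Turker rating).
POINTS = {
    ("obama", "positive"): 1,
    ("clinton", "negative"): 1,
    ("obama", "negative"): -1,
    ("clinton", "positive"): -1,
    ("obama", "neutral"): 0,
    ("clinton", "neutral"): 0,
}

def favorability(pairs):
    """Average the points over (candidate, rating) pairs; result lies in [-1, 1]."""
    if not pairs:
        raise ValueError("need at least one rating")
    return sum(POINTS[p] for p in pairs) / len(pairs)

def favorability_by_source(rows):
    """rows: iterable of (source, candidate, rating) tuples."""
    grouped = defaultdict(list)
    for source, candidate, rating in rows:
        grouped[source].append((candidate, rating))
    return {source: favorability(pairs) for source, pairs in grouped.items()}

# Made-up rows for illustration only; not the study's data.
rows = [
    ("source_a", "obama", "positive"),
    ("source_a", "clinton", "negative"),
    ("source_b", "obama", "negative"),
    ("source_b", "clinton", "neutral"),
]
scores = favorability_by_source(rows)
print(scores)  # source_a scores 1.0 (+100%), source_b scores -0.5 (-50%)
```

Multiplying a score by 100 gives the percentage scale used in the graphs.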
The data is very noisy. The question of favorability is extremely tricky: it includes a combination of expectations, sentiment, and the objective events a newspaper chooses to report. All of these are hard to reliably assess or even define. (And whether anything you measure constitutes “media bias” is another complicated question!)
Despite all this philosophical intractability, the data must be showing something real, because we have a statistically sound result: the difference between Digg and the others was statistically significant (t-test, p<.001). The differences within the mainstream media were not statistically significant.
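For readers who want to try this kind of comparison themselves, here is a minimal sketch of a two-sample t statistic in plain Python. It uses Welch's unequal-variance form, which is an assumption made for this example and not necessarily the variant behind the result above. The function name `welch_t` and both samples are made up for illustration.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    na, nb = len(a), len(b)
    # Squared standard error contributed by each sample.
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Made-up per-article scores (+1, 0, -1); not the study's data.
digg_like = [1, 1, 0, 1, 1, 0, 1, 1]
other_like = [0, -1, 1, 0, -1, 0, 1, -1]

t, df = welch_t(digg_like, other_like)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Turning t and df into a p-value needs the t distribution's CDF, which the standard library does not provide; a statistics package or a printed t table covers that step.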



March 27th, 2008 at 5:19 am
why is obama the positive and hillary the negative on your bar graph? hrm.
March 27th, 2008 at 6:16 am
It is a fact that Obama is more popular in the younger demographic, while Hillary has the edge in the older demographic. Obama is also more popular among university-educated and white-collar workers, while Hillary is more popular in the working class. Both of the Obama-leaning factors are present on Digg: it skews young, and it skews toward college students and future white-collar technology professionals. I suggest looking at working-class outlets frequented by older Americans to see a similar pro-Clinton stance.
March 28th, 2008 at 4:23 pm
Following John’s comment above: there may be a hidden bias in using Mechanical Turk for this sentiment analysis. The demographics of the Turkers (young, college educated) seem to fit the profile of Obama fans better. So, everything else being equal, I would expect the articles to be marked as pro-Obama.
I do not question the predictive power of the news articles, though. I have done this at a much larger scale, and indeed it works, even with automatic sentiment analysis techniques. (For more details, look at http://behind-the-enemy-lines.blogspot.com/2007/12/prediction-markets-are-not-efficient.html)
March 28th, 2008 at 6:53 pm
Yes, I’d expect the Turker population to have a bias compared to a more representative population. But I think the news source comparison still holds, because the judgments were blind and done by the same annotator population.
March 28th, 2008 at 9:31 pm
Correct, the study correctly measures the relative bias of each outlet. The only issue is where to put the “0” on the y-axis that measures “favorability” for Obama.