Wisdom of small crowds, part 1: how to aggregate Turker judgments for classification (the threshold calibration trick)
Monday, June 16th, 2008 by Brendan O'Connor
[ This article is part of a series, Wisdom of Small Crowds, which focuses on crowdsourcing methodology for Amazon Mechanical Turk-like systems. ]
We use Turkers to classify all sorts of data, by having several workers render judgments on each item. But what should we do when they disagree? Like any other human behavior, Turker judgments are noisy: sometimes there are mistakes, and sometimes the task is genuinely difficult or subjective, and there is no “right” answer. Once we have a bunch of Turker judgments, we need to aggregate them — that is, use some sort of voting mechanism — to give as accurate a classification as possible. It turns out that one simple trick, threshold calibration, can substantially improve accuracy, and can be tuned to the specifics of the problem.
Here’s an example. A recent client of ours had a de-duping task: given a pair of similar articles, the task was to decide if they were “about the same topic” or “about different topics”. This is just a binary classification problem; call these labels “YES” and “NO”. To figure out how well Turkers could perform the task, we had our client provide us with a gold standard data set. That is, for 135 examples, their experts did the task themselves and provided “gold” ground truth labels.
We used a very high number of workers per example (about 20). For all 135 examples in the gold standard, the following graph plots them vertically by their “Turker confidence in YES” — that’s just the percentage of votes for “YES” among the 20 or so judgments for that particular example. I’ve also colored each example with the experts’ gold label. You can see that this simple Turker data provides some statistical separation between the classes.
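To make "Turker confidence in YES" concrete, here is a minimal Python sketch of how it could be computed from raw judgments. The input format (a list of `(item_id, label)` pairs), the function name `turker_confidence`, and the sample votes are all assumptions made for illustration; they are not the client's data or our production code.

```python
from collections import defaultdict

def turker_confidence(judgments):
    """Return {item_id: fraction of YES votes} from (item_id, label) pairs.

    Hypothetical helper: the (item_id, label) input format is an assumption.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for item_id, label in judgments:
        total[item_id] += 1
        if label == "YES":
            yes[item_id] += 1
    return {item_id: yes[item_id] / total[item_id] for item_id in total}

# Made-up votes for two hypothetical article pairs (4 judgments each).
votes = [
    ("pair-1", "YES"), ("pair-1", "YES"), ("pair-1", "NO"), ("pair-1", "YES"),
    ("pair-2", "NO"), ("pair-2", "NO"), ("pair-2", "YES"), ("pair-2", "NO"),
]
confidence = turker_confidence(votes)
print(confidence)  # pair-1 gets 3 of 4 YES votes, pair-2 gets 1 of 4
```

With about 20 judgments per example instead of 4, the same fraction gives the vertical position of each point in the graph.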
This graph also shows how to create a classifier from Turker votes. We have to choose a confidence threshold for our classifier’s decision: above the threshold, say “YES”, and below say “NO”. Unfortunately, Turkers aren’t perfect at modeling the experts: anywhere we place the threshold, errors occur. However, some thresholds are better than others. The threshold with the best accuracy is at 73% confidence — that is, a 73% super-majority voting rule — and it classifies instances correctly 90% of the time. Furthermore, we can tune for different types of errors. If we are particularly concerned with avoiding false positive errors, we can set a higher, more conservative threshold; or, if we want to find as many “YES” instances as possible, we can set a lower, more liberal threshold.
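The search for the best threshold can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact procedure we ran: the function names are hypothetical, the eight data points are made up, and treating an item exactly at the threshold as "YES" is an arbitrary tie-breaking choice.

```python
def accuracy_at(threshold, confidences, gold):
    """Fraction of items where (confidence >= threshold) matches the gold label."""
    correct = sum((c >= threshold) == g for c, g in zip(confidences, gold))
    return correct / len(gold)

def best_threshold(confidences, gold):
    """Try each observed confidence as a cutoff; return (threshold, accuracy).

    The extra candidate 1.01 stands for "always say NO". On ties, the
    lowest of the tied thresholds is returned.
    """
    candidates = sorted(set(confidences)) + [1.01]
    scored = [(t, accuracy_at(t, confidences, gold)) for t in candidates]
    return max(scored, key=lambda pair: pair[1])

# Made-up toy data: Turker confidence in YES, and the experts' gold label
# (True means the experts said YES).
confidences = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.40, 0.20]
gold = [True, True, True, False, False, True, False, False]

print(accuracy_at(0.5, confidences, gold))  # simple majority vote: 0.75
print(best_threshold(confidences, gold))    # (0.8, 0.875)
```

On this toy data a super-majority rule beats simple majority voting, which mirrors the pattern in the real gold set. Note that the threshold is chosen and evaluated on the same examples here, so the reported accuracy is optimistic; holding out part of the gold set would give a fairer estimate.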
Here’s another chart that more carefully details the tradeoffs between true and false positives vs. true and false negatives. For a particular decision threshold, it shows how it divides up the instances into the confusion matrix’s 4 categories of correct and incorrect decisions.
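The numbers behind a chart like this are just confusion-matrix counts computed at each threshold. Here is a minimal Python sketch, again with made-up data and a hypothetical function name; items exactly at the threshold are counted as "YES" by assumption.

```python
def confusion_counts(threshold, confidences, gold):
    """Count true/false positives and negatives for one decision threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for c, g in zip(confidences, gold):
        predicted_yes = c >= threshold
        if predicted_yes and g:
            counts["TP"] += 1      # Turkers say YES, experts say YES
        elif predicted_yes and not g:
            counts["FP"] += 1      # Turkers say YES, experts say NO
        elif g:
            counts["FN"] += 1      # Turkers say NO, experts say YES
        else:
            counts["TN"] += 1      # Turkers say NO, experts say NO
    return counts

# Made-up toy data: Turker confidence in YES and the experts' gold label.
confidences = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.40, 0.20]
gold = [True, True, True, False, False, True, False, False]

for t in (0.5, 0.8):
    print(t, confusion_counts(t, confidences, gold))
# 0.5 {'TP': 4, 'FP': 2, 'FN': 0, 'TN': 2}
# 0.8 {'TP': 3, 'FP': 0, 'FN': 1, 'TN': 4}
```

Raising the threshold from 0.5 to 0.8 trades the two false positives for one false negative, which is the kind of tradeoff the chart makes visible.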

A final note on why threshold calibration is important: for this task, the Turkers were considerably more liberal than the experts at deciding what a “YES” example was. Experts marked only 36% of examples as “YES”, whereas a simple Turker majority voting rule marks 57% that way. This is because the experts understood the full implications of the decision, which were substantial: various entries in their database and website would be merged, and users would be confused if they were exposed to a bad merge. False positives had a very high cost. The prompt for Turkers, by contrast, was fairly vague. (In our experience, good task design is a huge factor in getting better Turker accuracy.) However, since Turker decisions noisily correlate with the experts' decisions, moving the decision threshold can help accuracy. Here’s the threshold vs. accuracy graph:
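When one kind of error is much more expensive than the other, as false positives were for this client, a variation on maximizing accuracy is to pick the threshold with the lowest total error cost. The Python sketch below is an illustration only: the cost values, the function name, and the data are made up, and we did not necessarily tune the client's threshold this way.

```python
def cheapest_threshold(confidences, gold, fp_cost, fn_cost):
    """Return (threshold, total_cost) minimizing fp_cost*FP + fn_cost*FN.

    Items with confidence >= threshold are classified YES (an assumed
    tie-breaking rule). The candidate 1.01 stands for "always say NO".
    On ties, the lowest of the tied thresholds is returned.
    """
    def total_cost(threshold):
        fp = sum(1 for c, g in zip(confidences, gold) if c >= threshold and not g)
        fn = sum(1 for c, g in zip(confidences, gold) if c < threshold and g)
        return fp_cost * fp + fn_cost * fn

    candidates = sorted(set(confidences)) + [1.01]
    return min(((t, total_cost(t)) for t in candidates), key=lambda pair: pair[1])

# Made-up toy data: one gold-NO item (0.85) gets a high Turker confidence.
confidences = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.20]
gold = [True, True, False, True, True, True, False, False]

print(cheapest_threshold(confidences, gold, fp_cost=1, fn_cost=1))  # (0.6, 1)
print(cheapest_threshold(confidences, gold, fp_cost=5, fn_cost=1))  # (0.9, 3)
```

With equal costs the best cutoff sits at 0.6 and accepts one bad merge. Once a false positive is assumed to cost five times as much as a false negative, the cheapest cutoff moves up to 0.9, giving up three true "YES" items to avoid that merge.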
Statistical analysis of Turker data can substantially improve accuracy, even with something as simple as choosing the best decision threshold. This blog post only scratches the surface; there are a few more useful things to consider. Stay tuned for Part 2 and hopefully many more!
A few more notes on Turker voting and threshold calibration:






