Posts about ‘Wisdom of Small Crowds’


Wisdom of small crowds, part 3: another worker visualization

Thursday, August 7th, 2008 by Brendan O'Connor

This is a follow-up to the previous post on individual workloads and rates. Here are the submission times and durations for every worker on the same graph. Each worker is one horizontal line. Each assignment starts at a dot, and its duration is shown by the line segment extending to the right.

submission-durations-wide1.png

The particular data set isn’t the same as in the previous post, but it comes from a similar task and exhibits a similar structure. Worker rates differ substantially. Some workers do a few HITs, but others work on as many as are available. Some work rapidly with breaks (workers 19, 36). Some assignment durations are as long as 5-10 minutes (workers 13, 37). Some work very intermittently (worker 29).

This view makes the parallelism of AMT apparent. At any vertical timeslice you can see how many workers are active at that time. The entire job ends on the right side when the available HITs run out.
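To make the "vertical timeslice" idea concrete, here is a minimal Python sketch that counts which workers have an assignment in progress at a given moment. The record layout (worker id, start time, duration in seconds) and all the values are assumptions invented for illustration; they are not the data behind the graph, and this is not the code that produced it.

```python
from typing import Iterable, Set, Tuple

# Hypothetical records: (worker_id, start_seconds, duration_seconds).
# Made-up values for illustration only.
assignments = [
    (13, 0.0, 300.0),   # one long assignment
    (19, 10.0, 20.0),   # rapid work with a short break in between
    (19, 35.0, 15.0),
    (36, 40.0, 25.0),
    (29, 500.0, 30.0),  # an intermittent worker
]

def active_workers(records: Iterable[Tuple[int, float, float]], t: float) -> Set[int]:
    """Return the ids of workers with an assignment in progress at time t."""
    return {w for (w, start, dur) in records if start <= t < start + dur}

# Number of line segments a vertical line at t = 45 seconds would cross.
print(len(active_workers(assignments, 45.0)))
```

Sweeping `t` across the whole job would give the number of concurrently active workers over time.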

[ This article is part of a series, Wisdom of Small Crowds, on crowdsourcing methodology. ]

Wisdom of small crowds, part 2: individual workloads and rates

Tuesday, August 5th, 2008 by Brendan O'Connor

[ Update: see also another visualization of this. ]

AMT’s great new interface makes it easy to download completion times for individual worker assignments. Therefore, it’s easy to visualize :) For a recent small job we did (250 HITs, 5 workers per HIT), here’s a graph of completion times per worker, over the entire 15-minute duration of the job. Each assignment is a single point, graphed by when it was done versus how long it took.

completion-times.png

(more…)

Wisdom of small crowds, part 1: how to aggregate Turker judgments for classification (the threshold calibration trick)

Monday, June 16th, 2008 by Brendan O'Connor

[ This article is part of a series, Wisdom of Small Crowds, which focuses on crowdsourcing methodology for Amazon Mechanical Turk-like systems. ]

We use Turkers to classify all sorts of data, by having several workers render judgments on each item. But what should we do when they disagree? Like any other human behavior, Turker judgments are noisy: sometimes there are mistakes, and sometimes the task is genuinely difficult or subjective, and there is no “right” answer. Once we have a bunch of Turker judgments, we need to aggregate them — that is, use some sort of voting mechanism — to give as accurate a classification as possible. It turns out that one simple trick, threshold calibration, can substantially improve accuracy, and can be tuned to the specifics of the problem.

Here’s an example. A recent client of ours had a de-duping task: given a pair of similar articles, the task was to decide if they were “about the same topic” or “about different topics”. This is just a binary classification problem; call these labels “YES” and “NO”. To figure out how well Turkers could perform the task, we had our client provide us with a gold standard data set. That is, for 135 examples, their experts did the task themselves and provided “gold” ground truth labels.

We used a very high number of workers per example (about 20). For all 135 examples in the gold standard, the following graph plots them vertically by their “Turker confidence in YES” — that’s just the percentage of votes for “YES” among the 20 or so judgments for that particular example. I’ve also colored each example with the experts’ gold label. You can see that this simple Turker data provides some statistical separation between the classes.
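As a rough sketch of the "Turker confidence in YES" computation described above, the following Python computes the fraction of "YES" votes per item. The vote format (item id, label) and the example votes are hypothetical, chosen only to show the arithmetic.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical raw judgments: (item_id, label). Not the client's data.
votes = [
    ("a", "YES"), ("a", "YES"), ("a", "NO"), ("a", "YES"),
    ("b", "NO"), ("b", "NO"), ("b", "YES"), ("b", "NO"),
]

def yes_confidence(judgments: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each item to the fraction of its votes that were "YES"."""
    counts = defaultdict(lambda: [0, 0])  # item -> [yes_votes, total_votes]
    for item, label in judgments:
        counts[item][1] += 1
        if label == "YES":
            counts[item][0] += 1
    return {item: yes / total for item, (yes, total) in counts.items()}

print(yes_confidence(votes))
```

With 20 or so judgments per item instead of 4, the same function would yield the values plotted on the vertical axis.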

Test set separation by Turker ensemble binary classifier

This graph also shows how to create a classifier from Turker votes. We have to choose a confidence threshold for our classifier’s decision: above the threshold, say “YES”, and below say “NO”. Unfortunately, Turkers aren’t perfect at modeling the experts: anywhere we place the threshold, errors occur. However, some thresholds are better than others. The threshold with the best accuracy is at 73% confidence — that is, a 73% super-majority voting rule — and it classifies instances correctly 90% of the time. Furthermore, we can tune for different types of errors. If we are particularly concerned with avoiding false positive errors, we can set a higher, more conservative threshold; or, if we want to find as many “YES” instances as possible, we can set a lower, more liberal threshold.
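Threshold calibration can be sketched in a few lines of Python: try each candidate threshold and keep the one with the best accuracy against the gold labels. The confidences and gold labels below are invented for illustration. Treating a confidence exactly equal to the threshold as "YES" is also an assumption, since the post does not say how ties are handled.

```python
from typing import Dict

# Hypothetical per-item confidences and expert gold labels (illustration only).
confidences = {"a": 0.9, "b": 0.8, "c": 0.6, "d": 0.55, "e": 0.3, "f": 0.1}
gold = {"a": "YES", "b": "YES", "c": "NO", "d": "NO", "e": "NO", "f": "NO"}

def accuracy_at(threshold: float, conf: Dict[str, float], gold: Dict[str, str]) -> float:
    """Accuracy of the rule: say "YES" when confidence >= threshold (assumed tie rule)."""
    correct = 0
    for item, c in conf.items():
        predicted = "YES" if c >= threshold else "NO"
        correct += predicted == gold[item]
    return correct / len(conf)

def best_threshold(conf: Dict[str, float], gold: Dict[str, str]) -> float:
    """Pick the observed confidence value that maximizes accuracy on the gold set."""
    candidates = sorted(set(conf.values()))
    return max(candidates, key=lambda t: accuracy_at(t, conf, gold))

t = best_threshold(confidences, gold)
print(t, accuracy_at(t, confidences, gold))
print(accuracy_at(0.5, confidences, gold))  # simple majority vote, for comparison
```

In this toy data the calibrated threshold beats simple majority voting, mirroring the effect described above. In practice the threshold should be chosen on one set of gold examples and checked on another, since picking the maximum on a small sample tends to overstate accuracy.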

Here’s another chart that more carefully details the tradeoffs between true and false positives vs. true and false negatives. For a particular decision threshold, it shows how it divides up the instances into the confusion matrix’s 4 categories of correct and incorrect decisions.

Classifier performance on gold standard at different thresholds
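The four confusion-matrix categories at a given threshold can be tallied as in the sketch below. As before, the data are hypothetical and the ">= threshold means YES" tie rule is an assumption, not something stated in the post.

```python
from typing import Dict

# Hypothetical per-item confidences and expert gold labels (illustration only).
confidences = {"a": 0.9, "b": 0.8, "c": 0.6, "d": 0.55, "e": 0.3, "f": 0.1}
gold = {"a": "YES", "b": "YES", "c": "NO", "d": "NO", "e": "NO", "f": "NO"}

def confusion_counts(threshold: float, conf: Dict[str, float],
                     gold: Dict[str, str]) -> Dict[str, int]:
    """Count true/false positives and negatives for the rule conf >= threshold."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for item, c in conf.items():
        predicted_yes = c >= threshold
        actual_yes = gold[item] == "YES"
        if predicted_yes:
            counts["TP" if actual_yes else "FP"] += 1
        else:
            counts["FN" if actual_yes else "TN"] += 1
    return counts

# A liberal threshold trades false positives for fewer false negatives,
# and a conservative one does the reverse.
for t in (0.5, 0.85):
    print(t, confusion_counts(t, confidences, gold))
```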

A final note on why threshold calibration is important: For this task, the Turkers were considerably more liberal than the experts at deciding what a “YES” example was — experts marked only 36% of examples as “YES”, whereas a simple Turker majority voting rule marks 57% that way. This is because the experts understood the full implications of the decision, which were substantial — various entries in their database and website would be merged, and users would be confused if they were exposed to a bad merge. False positives had a very high cost. The prompt for Turkers, by contrast, was fairly vague. (In our experience, we generally find that good task design is a huge factor in getting better Turker accuracy.) However, since Turker decisions noisily correlate with the experts, moving the decision threshold can help accuracy. Here’s the threshold vs. accuracy graph:

thresh-acc.png

Statistical analysis of Turker data can substantially improve accuracy, even with something as simple as choosing the best decision threshold. This blog post only scratches the surface; there are a few more useful things to consider. Stay tuned for Part 2 and hopefully many more!

A few more notes on Turker voting and threshold calibration:

(more…)