Posts about ‘Wisdom of Small Crowds’


EMNLP Slides

Thursday, November 6th, 2008 by Lukas Biewald

Rion Snow presented the paper, “Cheap and Fast - But is it Good?” at EMNLP last week.

Here are the slides from the talk:

Rls For Emnlp 2008

AMT is fast, cheap, and good for machine learning data

Tuesday, September 9th, 2008 by Brendan O'Connor

Update 9/19: Final PDF version has been uploaded. See also the comments below for updates — our released data is already being used by others!


We (Brendan O’Connor) recently teamed up with Rion Snow, Prof. Dan Jurafsky, and Prof. Andrew Ng from the Stanford AI Lab to try using Amazon Mechanical Turk to generate data sets for Machine Learning research. Many AI tasks require a large amount of training data, and to build natural language systems, researchers traditionally pay linguistic experts for millions of annotations. Search engine companies employ hundreds or thousands of annotators for their classification, ranking, and other statistically trained systems, but their data is private and is not available for research. AMT is a potential tool to create high quality data sets accessible to everyone.

We rigorously tested the quality of AMT responses for several classic human language problems, and found that it was as good as or better than the expert data that most researchers use. We wrote a paper, “Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks,” that will be presented at an upcoming conference, EMNLP-2008.

Our findings:

1. Turker-generated data is good. AMT makes it easy to ask many people for judgments, so for several tasks, we measured how well the averaged Turker judgments agree with the expert gold standard. With more judgments per example, accuracy increases. For comparison, on each graph the horizontal dotted line indicates the rate at which a single expert agrees with the gold standard. Enough non-experts can match, and often beat, a single expert’s reliability.

k-acc3.png

2. Turker-generated data is cheap and fast. We can collect thousands of labels per dollar and per hour.
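To make the first finding concrete, here is a minimal sketch of how accuracy against a gold standard could be computed as a function of the number of judgments per example. It assumes binary 0/1 labels and simply takes the first k judgments for each example; the function name and the toy data are made up for illustration and are not from the paper, which covers several different tasks.

```python
from typing import List


def accuracy_with_k(judgments: List[List[int]], gold: List[int], k: int) -> float:
    """Accuracy of the averaged first-k judgments against gold (binary 0/1 labels).

    The average is rounded to a label; ties (an average of exactly 0.5)
    count as 1, which is an arbitrary choice made for this sketch.
    """
    correct = 0
    for votes, truth in zip(judgments, gold):
        mean = sum(votes[:k]) / k
        predicted = 1 if mean >= 0.5 else 0
        correct += predicted == truth
    return correct / len(gold)


# Toy data (invented): three non-expert judgments per example.
toy_judgments = [[0, 1, 1], [0, 0, 1], [1, 1, 0], [1, 0, 0]]
toy_gold = [1, 0, 1, 0]

for k in (1, 2, 3):
    print(k, accuracy_with_k(toy_judgments, toy_gold, k))
```

On real data you would want to average over many random subsets of k workers rather than always taking the first k.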

(more…)

Wisdom of small crowds, part 3: another worker visualization

Thursday, August 7th, 2008 by Brendan O'Connor

This is a follow-up to the previous post on individual workloads and rates. Here are the submission times and durations for every worker on the same graph. Each worker is one horizontal line. Each assignment starts at a dot, and its duration is shown by the line segment extending to the right.

submission-durations-wide1.png

The particular data set isn’t the same as in the previous post, but was for a similar task and exhibits a similar structure. Worker rates differ substantially. Some workers do a few HITs, but others work on as many as are available. Some work rapidly with breaks (19, 36). Some assignment durations are as long as 5-10 minutes (13, 37). Some work very intermittently (29).

This view makes the parallelism of AMT apparent. At any vertical timeslice you can see how many workers are active at that time. The entire job ends on the right side when the available HITs run out.
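The "vertical timeslice" idea can also be computed directly. Here is a small sketch, assuming each assignment is stored as a (worker, start time, duration) tuple in seconds. That layout, the function name, and the toy data are assumptions for illustration, not the format AMT exports.

```python
from typing import List, Set, Tuple

# (worker id, start time in seconds, duration in seconds): a hypothetical layout.
Assignment = Tuple[str, float, float]


def active_workers(assignments: List[Assignment], t: float) -> Set[str]:
    """Workers with an assignment in progress at time t (start <= t < start + duration)."""
    return {
        worker
        for worker, start, duration in assignments
        if start <= t < start + duration
    }


# Toy data (invented): worker w1 takes a break, w2 works on one long assignment.
toy_assignments = [
    ("w1", 0, 30),
    ("w1", 40, 20),
    ("w2", 10, 60),
    ("w3", 65, 15),
]

for t in (20, 35, 66):
    print(t, sorted(active_workers(toy_assignments, t)))
```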

[ This article is part of a series, Wisdom of Small Crowds, on crowdsourcing methodology. ]

Wisdom of small crowds, part 2: individual workloads and rates

Tuesday, August 5th, 2008 by Brendan O'Connor

[ Update: see also another visualization of this. ]

AMT’s great new interface makes it easy to download completion times for individual worker assignments. Therefore, it’s easy to visualize :) For a recent small job we did (250 HITs, 5 workers per HIT), here’s a graph of completion times per worker, over the entire 15-minute duration of the job. Each assignment is a single point, graphed by when it was done versus how long it took.

completion-times.png

(more…)

Wisdom of small crowds, part 1: how to aggregate Turker judgments for classification (the threshold calibration trick)

Monday, June 16th, 2008 by Brendan O'Connor

[ This article is part of a series, Wisdom of Small Crowds, which focuses on crowdsourcing methodology for Amazon Mechanical Turk-like systems. ]

We use Turkers to classify all sorts of data, by having several workers render judgments on each item. But what should we do when they disagree? Like any other human behavior, Turker judgments are noisy: sometimes there are mistakes, and sometimes the task is genuinely difficult or subjective, and there is no “right” answer. Once we have a bunch of Turker judgments, we need to aggregate them — that is, use some sort of voting mechanism — to give as accurate a classification as possible. It turns out that one simple trick, threshold calibration, can substantially improve accuracy, and can be tuned to the specifics of the problem.

Here’s an example. A recent client of ours had a de-duping task: given a pair of similar articles, the task was to decide if they were “about the same topic” or “about different topics”. This is just a binary classification problem; call these labels “YES” and “NO”. To figure out how well Turkers could perform the task, we had our client provide us with a gold standard data set. That is, for 135 examples, their experts did the task themselves and provided “gold” ground truth labels.

We used a very high number of workers per example (about 20). For all 135 examples in the gold standard, the following graph plots them vertically by their “Turker confidence in YES” — that’s just the percentage of votes for “YES” among the 20 or so judgments for that particular example. I’ve also colored each example with the experts’ gold label. You can see that this simple Turker data provides some statistical separation between the classes.

Test set separation by Turker ensemble binary classifier

This graph also shows how to create a classifier from Turker votes. We have to choose a confidence threshold for our classifier’s decision: above the threshold, say “YES”, and below say “NO”. Unfortunately, Turkers aren’t perfect at modeling the experts: anywhere we place the threshold, errors occur. However, some thresholds are better than others. The threshold with the best accuracy is at 73% confidence — that is, a 73% super-majority voting rule — and it classifies instances correctly 90% of the time. Furthermore, we can tune for different types of errors. If we are particularly concerned with avoiding false positive errors, we can set a higher, more conservative threshold; or, if we want to find as many “YES” instances as possible, we can set a lower, more liberal threshold.
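Here is a minimal sketch of threshold calibration as described above: compute each example's "Turker confidence in YES", then try every observed confidence as a cutoff and keep the one with the best accuracy on the gold labels. The function names and toy votes are invented for illustration (this is not the client data), and the sketch assumes that a confidence exactly at the threshold counts as "YES".

```python
from typing import List, Tuple


def yes_confidence(votes: List[str]) -> float:
    """Fraction of votes that are "YES" for one example."""
    return sum(v == "YES" for v in votes) / len(votes)


def accuracy_at(threshold: float, confidences: List[float], gold: List[str]) -> float:
    """Accuracy of the rule: say "YES" when confidence >= threshold, else "NO"."""
    correct = 0
    for conf, label in zip(confidences, gold):
        predicted = "YES" if conf >= threshold else "NO"
        correct += predicted == label
    return correct / len(gold)


def best_threshold(confidences: List[float], gold: List[str]) -> Tuple[float, float]:
    """Return (threshold, accuracy) for the best cutoff among the observed confidences.

    A cutoff above 1.0 is included so that "always NO" is also considered.
    If several cutoffs tie, the lowest one is returned.
    """
    candidates = sorted(set(confidences)) + [1.01]
    scored = [(t, accuracy_at(t, confidences, gold)) for t in candidates]
    return max(scored, key=lambda pair: pair[1])


# Toy data (invented): four votes per example, Turkers more liberal than the experts.
toy_votes = [
    ["YES", "YES", "YES", "YES"],
    ["YES", "YES", "YES", "NO"],
    ["YES", "YES", "YES", "NO"],
    ["YES", "YES", "NO", "NO"],
    ["YES", "NO", "YES", "NO"],
    ["YES", "NO", "NO", "NO"],
]
toy_gold = ["YES", "YES", "NO", "NO", "NO", "NO"]

toy_conf = [yes_confidence(v) for v in toy_votes]
print("majority-rule accuracy:", accuracy_at(0.5, toy_conf, toy_gold))
print("best (threshold, accuracy):", best_threshold(toy_conf, toy_gold))
```

Picking the threshold on the same examples you report accuracy on is optimistic. With enough gold data, you would calibrate on one part and evaluate on another.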

Here’s another chart that more carefully details the tradeoffs between true and false positives vs. true and false negatives. For a particular decision threshold, it shows how it divides up the instances into the confusion matrix’s 4 categories of correct and incorrect decisions.

Classifier performance on gold standard at different thresholds
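The four confusion-matrix categories can be counted for any threshold with a few lines of code. This is only a sketch: the function name and the toy confidences are made up, and it assumes "YES" is the positive class and that a confidence exactly at the threshold counts as "YES".

```python
from typing import Dict, List


def confusion_counts(threshold: float, confidences: List[float], gold: List[str]) -> Dict[str, int]:
    """Count true/false positives and negatives for the rule confidence >= threshold."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for conf, label in zip(confidences, gold):
        predicted_yes = conf >= threshold
        actual_yes = label == "YES"
        if predicted_yes and actual_yes:
            counts["TP"] += 1
        elif predicted_yes and not actual_yes:
            counts["FP"] += 1
        elif not predicted_yes and actual_yes:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Toy data (invented): fraction of YES votes per example, with expert gold labels.
toy_conf = [1.0, 0.75, 0.75, 0.5, 0.5, 0.25]
toy_gold = ["YES", "YES", "NO", "NO", "NO", "NO"]

# A liberal threshold trades false negatives for false positives, and vice versa.
for threshold in (0.5, 0.75, 1.0):
    print(threshold, confusion_counts(threshold, toy_conf, toy_gold))
```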

A final note on why threshold calibration is important: For this task, the Turkers were considerably more liberal than the experts at deciding what a “YES” example was — experts marked only 36% of examples as “YES”, whereas a simple Turker majority voting rule marks 57% that way. This is because the experts understood the full implications of the decision, which were substantial — various entries in their database and website would be merged, and users would be confused if they were exposed to a bad merge. False positives had a very high cost. The prompt for Turkers, by contrast, was fairly vague. (In our experience, we generally find that good task design is a huge factor in getting better Turker accuracy.) However, since Turker decisions noisily correlate with the experts, moving the decision threshold can help accuracy. Here’s the threshold vs. accuracy graph:

thresh-acc.png
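One quick way to check whether Turkers are more liberal than the experts is to compare the share of examples marked “YES” under each rule. Here is a rough sketch with invented confidences; the function name is hypothetical, and a confidence exactly at the threshold counts as "YES".

```python
from typing import List


def yes_rate(confidences: List[float], threshold: float) -> float:
    """Share of examples labeled "YES" by the rule confidence >= threshold."""
    return sum(conf >= threshold for conf in confidences) / len(confidences)


# Toy data (invented): fraction of YES votes for ten examples.
toy_conf = [0.95, 0.8, 0.7, 0.6, 0.55, 0.3, 0.2, 0.1, 0.65, 0.4]

print("simple majority (0.5):", yes_rate(toy_conf, 0.5))
print("super-majority (0.73):", yes_rate(toy_conf, 0.73))
```

If the majority-rule rate is well above the experts' rate, that suggests a higher threshold is worth trying.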

Statistical analysis of Turker data can substantially improve accuracy, even with something as simple as choosing the best decision threshold. This blog post only scratches the surface; there are a few more useful things to consider. Stay tuned for Part 2 and hopefully many more!

A few more notes on Turker voting and threshold calibration:

(more…)