Update 9/19: Final PDF version has been uploaded. See also the comments below for updates — our released data is already being used by others!
We recently teamed up with Rion Snow, Prof. Dan Jurafsky, and Prof. Andrew Ng from the Stanford AI Lab to try using Amazon Mechanical Turk to generate data sets for Machine Learning research. Many AI tasks require a large amount of training data, and to build natural language systems, researchers traditionally pay linguistic experts for millions of annotations. Search engine companies employ hundreds or thousands of annotators for their classification, ranking, and other statistically trained systems, but their data is private and is not available for research. AMT is a potential tool to create high quality data sets accessible to everyone.
We rigorously tested the quality of AMT responses for several classic human language problems, and found that the quality was as good as or better than that of the expert data that most researchers use. We wrote a paper, “Cheap and Fast — But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks,” that will be presented at an upcoming conference, EMNLP-2008.
Our findings:
1. Turker-generated data is good. AMT makes it easy to ask many people for judgments, so for several tasks we measured how well the averaged Turker judgments match the expert gold standard. With more judgments per example, accuracy increases. For comparison, on each graph the horizontal dotted line indicates the rate at which a single expert agrees with the experts’ gold standard. With enough judgments, non-experts can match, and often beat, a single expert’s reliability.
2. Turker-generated data is cheap and fast. We can collect thousands of labels per dollar and per hour.
3. Expert data enhances individual Turker data. First off, individual workers have differing accuracy rates:
So we implemented a statistical technique where we test their accuracy on a portion of the experts’ gold standard data, then reweight votes by worker reliability. This yields higher aggregated accuracy. (Also see our related threshold calibration post.)
4. Turker data enhances NLP systems. For one of the tasks, predicting the emotions elicited by a newspaper headline, we wrote a simple machine-learned classifier and trained it on the Turker data. It easily outperforms one trained on expert data. (There’s a subtle effect here; see the paper for details.)
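The reweighting idea in finding 3 can be sketched in a few lines of Python. This is a minimal illustration, not the exact procedure from the paper: it assumes binary labels, estimates each worker's accuracy on a calibration subset of the gold standard with add-one smoothing, and weights each vote by the log-odds of that accuracy. The function names and the toy data are made up for the example.

```python
import math
from collections import defaultdict

def worker_accuracy(annotations, gold):
    """Add-one smoothed accuracy of each worker on items that have a gold label."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, worker, label in annotations:
        if item in gold:
            total[worker] += 1
            correct[worker] += int(label == gold[item])
    return {w: (correct[w] + 1) / (total[w] + 2) for w in total}

def weighted_vote(annotations, accuracy):
    """Sum log-odds weights per item; workers never calibrated get weight 0."""
    score = defaultdict(float)
    for item, worker, label in annotations:
        acc = accuracy.get(worker, 0.5)
        weight = math.log(acc / (1.0 - acc))
        score[item] += weight if label == 1 else -weight
    # Ties (score exactly 0) fall to label 0 in this sketch.
    return {item: int(s > 0) for item, s in score.items()}

# Made-up data: (item, worker, label). w1 is reliable, w2 and w3 are coin flips.
annotations = [
    ("a", "w1", 1), ("a", "w2", 1), ("a", "w3", 0),
    ("b", "w1", 0), ("b", "w2", 1), ("b", "w3", 0),
    ("c", "w1", 1), ("c", "w2", 0), ("c", "w3", 1),
    ("d", "w1", 0), ("d", "w2", 0), ("d", "w3", 1),
    ("e", "w1", 1), ("e", "w2", 0), ("e", "w3", 0),
]
gold = {"a": 1, "b": 0, "c": 1, "d": 0}  # calibration subset of expert labels

accuracy = worker_accuracy(annotations, gold)
labels = weighted_vote(annotations, accuracy)
print(labels["e"])  # naive majority voting would pick 0 for item "e"
```

On item "e" the two coin-flip workers outvote the reliable one under plain majority voting, while the weighted vote follows the reliable worker. A worker whose calibrated accuracy is below 50% gets a negative weight, so their votes are effectively flipped.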
We’ll update this blog post with a link to the final version of the paper in the coming weeks. Many thanks to our friend Rion, who spearheaded this collaboration. The current version of the paper is here:
[ This article is part of a series, Wisdom of Small Crowds, on crowdsourcing methodology. ]





Bob Carpenter
09/10/08
Awesomely cool paper.
We’ve been annotating data ourselves (at Alias-i) and arguing that it’s really not as hard as all the machine learning folks doing adaptation and semi-supervised learning would have us believe. Named entity data, in particular, is very easy to annotate.
I’m stunned by how cheap it was. 3500 annotations per US$? I kept looking for units in the tables like in financial reports (e.g. “all numbers in 1000s”). It looks like your programmer and task design time’s going to dominate the cost of just about any annotation task.
The tasks you chose are not ones where linguistic expertise seems important. So I’m guessing at least some of your effects stem more from worker carelessness, lack of calibration in their training, or some kind of unwritten training or collusion among the “expert” annotators. Some standards also undergo adjudication of votes, which brings annotators closer together; it’d be interesting to know if there was any of that in your “gold standard” data sets.
I’d love to see the results of using the Bayesian posterior category estimates in place of Dawid and Skene’s point-based EM results. Then again, that may not be necessary as you are just doing ML over the gold standard rather than applying Dawid and Skene’s inference mechanism for annotator sensitivity and specificity. (Laplace’s prior isn’t going to be optimal here where there’s clearly some prior information about annotator accuracies that could be used.)
At the very least, you’ll get posterior intervals on which to base expectation-based judgments of acceptance. It’d be nice to see if the predictions are better correlated with the gold standard than simple voting; I’m guessing they’ll be substantially more so. In particular, it’d be nice to see these results used in lieu of simple voting in the article de-duplication task mentioned in the previous blog post.
I’d be happy to run the data through the Bayesian model for you if you can share the data. I’d really like to see what happens if you estimate the annotator sensitivity and specificity without reference to the gold standard.
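For readers who want to try estimating sensitivity and specificity with no gold standard at all, here is a rough sketch. It is a simplified EM point-estimate routine in the spirit of Dawid and Skene for binary labels, not the Bayesian model described in this comment; the add-one smoothing, the initialisation from raw vote fractions, and all names are assumptions made for the example.

```python
import math
from collections import defaultdict

def sigmoid(d):
    """Numerically stable logistic function."""
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)

def em_binary(annotations, n_iter=50):
    """annotations: (item, worker, label) triples with label in {0, 1}."""
    by_item = defaultdict(list)
    for item, worker, label in annotations:
        by_item[item].append((worker, label))
    workers = {w for _, w, _ in annotations}
    # Start from the raw fraction of 1-votes as P(true label = 1).
    post = {i: sum(y for _, y in v) / len(v) for i, v in by_item.items()}
    sens, spec = {}, {}
    for _ in range(n_iter):
        # M-step: add-one smoothed prevalence, sensitivity, and specificity.
        prevalence = (sum(post.values()) + 1) / (len(post) + 2)
        tp = dict.fromkeys(workers, 1.0)
        tn = dict.fromkeys(workers, 1.0)
        pos = dict.fromkeys(workers, 2.0)
        neg = dict.fromkeys(workers, 2.0)
        for item, worker, label in annotations:
            pos[worker] += post[item]
            neg[worker] += 1 - post[item]
            if label == 1:
                tp[worker] += post[item]
            else:
                tn[worker] += 1 - post[item]
        sens = {w: tp[w] / pos[w] for w in workers}
        spec = {w: tn[w] / neg[w] for w in workers}
        # E-step: posterior log-odds that each item's true label is 1.
        for item, votes in by_item.items():
            d = math.log(prevalence / (1 - prevalence))
            for w, y in votes:
                if y == 1:
                    d += math.log(sens[w] / (1 - spec[w]))
                else:
                    d += math.log((1 - sens[w]) / spec[w])
            post[item] = sigmoid(d)
    return post, sens, spec

# Made-up data: four workers who are always right, one who is always wrong.
truth = [1, 0, 1, 0, 1, 0, 1, 0]
demo = [(i, f"g{j}", y) for i, y in enumerate(truth) for j in range(4)]
demo += [(i, "bad", 1 - y) for i, y in enumerate(truth)]
post, sens, spec = em_binary(demo)
```

On this toy data the posteriors move toward the majority's labels and the contrarian worker ends up with low estimated sensitivity and specificity. On real data EM can settle in poor local optima, so this is only a starting point.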
I liked the idea of measuring how many generic workers you needed to equal one “expert”. I’d like to see the worker accuracies compared pairwise. My guess is that there’s substantial variance among them. Just a table of the kappas, or even better, pairs of sensitivities/specificities taking one annotator as the gold standard. All the annotation data I’ve ever seen has been highly skewed with lots of inter-annotator variation in accuracy.
PS: That’s “Skene” not “Skeene” (your bib’s right but the inline citation in section 5.1 is a typo).
brendano
09/11/08
Thanks for all the comments!
Thanks a lot for the offer to run our data through your system (I’ve been half-successful getting BUGS going, but not entirely there yet!) — it’s all at http://ai.stanford.edu/~rion/annotations/ .
There are some odd details with how we did subsampling and cross-validation that didn’t fit into the paper, so I’ll spell them out here:
First of all, if you want to run any sort of worker modelling with the total 10 annotators per example, that’s easy, just use the data as it is.
However, as you know we were really interested in varying the #anno/example parameter, and simulated this effect by subsampling from that data. In the first part of the paper, which does simple averaging (for continuous problems) and simple majority voting (for categorical problems), this is straightforward. (In fact it goes over all possible permutations.)
But for worker modelling, subsampling is trickier. If you simply subsample down to k annos per example, this artificially decreases the number of workers shared across the train/test split. In real life, AMT works by having the requester specify a maximum number of assn/HIT. Workers come in, do as many HITs as they want, then stop. A HIT stops being offered in the work pool once it’s reached the maximum #assn/HIT. So if you decreased #assn/HIT for a new job, a single worker who was inclined to do, say, 50 examples would still do that many. But if you independently subsample down to k annotations per example, you see only (k/10)*50 annotations from that worker.
Anyway, maybe this is worrying too much now, but what I did was build up the entire anno subsample by iteratively sampling without replacement a single worker at a time; upon picking one, I added their annotations to the set of annos, and discarded per-example overflows. This should simulate a Turker choosing to do the experiment, and doing as much work as they can/want. This is in “anno_sample_via_workers()” in http://github.com/brendano/dlanalysis/tree/master/main.R . (All the code for the worker modelling experiments is in that codebase, but it’s rather messy and may not completely work at the moment.)
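The worker-at-a-time procedure described above might look roughly like this in Python. This is a hypothetical re-expression for illustration, not the R code in the linked repository; the function name, the (item, worker, label) tuple layout, and the seed handling are assumptions.

```python
import random
from collections import defaultdict

def sample_annotations_via_workers(annotations, k, seed=0):
    """Draw workers one at a time without replacement and keep each worker's
    annotations, discarding any that would push an example past k annotations."""
    rng = random.Random(seed)
    by_worker = defaultdict(list)
    for item, worker, label in annotations:
        by_worker[worker].append((item, worker, label))
    workers = sorted(by_worker)
    rng.shuffle(workers)  # a random order is equivalent to sampling without replacement
    per_item = defaultdict(int)
    sample = []
    for worker in workers:
        for item, w, label in by_worker[worker]:
            if per_item[item] < k:  # per-example overflow is dropped
                sample.append((item, w, label))
                per_item[item] += 1
    return sample

# Made-up data: 10 workers who each annotated the same 5 examples.
full = [(i, f"w{j}", 1) for j in range(10) for i in range(5)]
sub = sample_annotations_via_workers(full, k=3)
counts = defaultdict(int)
for item, _, _ in sub:
    counts[item] += 1
```

Unlike independent per-example subsampling, a worker who is drawn early keeps all of their annotations, so the number of examples per retained worker stays closer to what a real job with a lower cap would produce.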
*Then*, within this anno subsample, we do 20-fold cross-validation among *examples*. So sometimes workers are shared across the train/test split, sometimes they’re not. Worker modeling is only useful when a worker has done examples on both sides of the train/test split. This is not guaranteed because the worker/example contingency table is sparse: lots of workers only annotated several examples, and only a few workers did all. Quick’n’dirty graph of this for one of the affect tasks here and here.
I hope that sounded reasonable and hopefully can be ignored at first pass :)
Bob Carpenter
09/11/08
Thanks so much for sharing the data and the extra details of how the data was collected.
Alexander Sorokin
09/12/08
Hi! That’s great work! I’ve been doing similar things in the vision domain (http://vision.cs.uiuc.edu/annotation/). I had my findings published at the Internet Vision workshop at CVPR 08.
I think MTurk is a great and heavily under-utilized resource in the research community.
brendano
09/13/08
Alexander — wow, very exciting. I’ve definitely been thinking that vision ML data sets are a pretty obvious win — Turkers really like image problems, perhaps more so than text. We’ve done a number of image labelling tasks, and they are often even cheaper than linguistic tasks — I think because many of them are fun, and the cognitive load is often lower. (Human vision processing systems evolutionarily predate language processing; there’s been more time for optimization :) )
Bob Carpenter
09/15/08
I posted a blog entry with the analysis on our blog under the title
Dolores Labs Text Entailment Data from the Amazon Mechanical Turk.
The model does pretty well, especially given the four or five high volume noise generators among the Turkers.
From Panos’s post about why Turkers work, it’s perhaps not surprising that most folks bailed on what looked like a GRE verbal test after 20 examples.
brendano
09/19/08
CORRECTION to my comment #2 up there: I reviewed all the code, and it turns out that the fancy subsampling process was NOT used for the results in the paper. Just the simplest version. Oops. Sorry for any confusion.
brendano
09/19/08
Bob, sorry I didn’t respond earlier — this is really exciting. Congrats on getting the model to work so well.
Since we didn’t include the exact numbers for worker modelling/correction in the paper, to make this complete, here are exact numbers from both of our experiments so far (plus a new one I just did):
Accuracy rates at Turker ensembles matching the gold standard, RTE with 10 judgments/example:
89.7% – naïve voting
92.6% – hidden labels, inferred worker prior [MAP via Gibbs sampler] [link]
92.9% – hidden labels, uniform worker prior [MAP via Gibbs sampler] [link]
92.6% – known labels (LOO), add-1 worker prior [MAP via direct inference]
92.9% – known labels (LOO), non-bayesian: drop workers with <67% accuracy then naïve vote the rest
Yes, that last one is way less principled than the others, and wasn’t in the paper, though probably should have been. (My fault; I’ll blame a deadline rush I suppose.)
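To make the first and last rows concrete, here is a small sketch of naïve voting next to the "drop unreliable workers, then vote" heuristic with leave-one-out accuracy estimates. It is an illustration on made-up data, not the code behind the numbers above; the function names are hypothetical, and workers with a single annotation are simply dropped because they have no held-out accuracy.

```python
from collections import Counter, defaultdict

def majority_vote(annotations):
    """Plain majority vote per item; ties go to the label seen first."""
    votes = defaultdict(Counter)
    for item, worker, label in annotations:
        votes[item][label] += 1
    return {item: c.most_common(1)[0][0] for item, c in votes.items()}

def loo_threshold_vote(annotations, gold, threshold=0.67):
    """For each annotation, score its worker on that worker's other items,
    keep the annotation only if that accuracy meets the threshold, then vote."""
    correct, total = Counter(), Counter()
    for item, worker, label in annotations:
        total[worker] += 1
        correct[worker] += int(label == gold[item])
    kept = []
    for item, worker, label in annotations:
        n = total[worker] - 1                          # leave this item out
        c = correct[worker] - int(label == gold[item])
        if n > 0 and c / n >= threshold:
            kept.append((item, worker, label))
    return majority_vote(kept)  # items with no surviving votes are absent

def accuracy(pred, gold):
    return sum(pred.get(i) == y for i, y in gold.items()) / len(gold)

# Made-up data: two careful workers and three prolific workers who always say 1.
gold = {0: 1, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0}
demo = [(i, w, y) for i, y in gold.items() for w in ("g1", "g2")]
demo += [(i, w, 1) for i in gold for w in ("s1", "s2", "s3")]

naive_acc = accuracy(majority_vote(demo), gold)
filtered_acc = accuracy(loo_threshold_vote(demo, gold), gold)
```

In this toy setup the three always-yes workers outvote the careful ones under naïve voting, and filtering them out recovers every label, which mirrors the point about eliminating noisy, prolific workers.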
Anyway, I think it’s impressive that the hidden label model does exactly as well as any system using known labels.
I’d be interested to see how things do for smaller numbers of annotators per example, since there’s more headroom down there -- e.g. at 3 anno/example, naïve accuracy is only 80.5%.
I’m also wondering if a model with per-item difficulty will start becoming useful. When I do RTE myself, I feel there’s large variance in difficulty. Of course this highlights why the BUGS approach makes progress much easier...
[[
More on these new experiments. (1) The thresholding technique: In practice it will work a little less well, because I iterated the threshold parameter and took the best one, and the space was a little bumpy. Though no lower than 92% for all reasonable thresholds; what’s key is eliminating the several noisy+prolific workers. In fact, the 67% threshold drops nearly half of all judgments (!). Though in practice with the AMT feedback cycle you wouldn’t pay for all those bad judgments, just pretest and eliminate bad workers early on, and pay good workers for more work.
(2) Unlike in the paper, here I used leave-one-out instead of 20-fold cross-validation. (I have a faster implementation now; I think I'm starting to hit the pain points of R so went back to python ... http://github.com/brendano/dlanalysis/tree/master/workers.py )
]]
Tim Converse
09/25/08
This is a really nice paper – the section on judge bias was helpful for us at Powerset where we worry about some of the same AM Turk data-quality issues. Thanks for putting it out there.
Bob Carpenter
09/28/08
I do think there’s more headroom with fewer annos. What I’m interested in is getting the posterior estimate of accuracy so we can decide which items need more annotation.
I have exactly the same feeling about difficulty, but I haven’t been able to fit any of the logistic models with a latent difficulty predictor. The basic model is just p(anno[i,j]) = inverseLogit(accuracy[j] - difficulty[i]), just like the item response model. I’m going to send some mail to the epidemiologists fitting these models. Some of the more recent papers discuss the instability of the model fitting. Even with the true category known, I’m having trouble fitting the models.
The problem seems to be that there are two ways to account for variability in an annotation: either the item is difficult or the annotators are error-prone. If I crank prior variance on difficulty down close to zero, it fits, but there’s not much of a difficulty effect. If I let variance even approach 1, different chains don’t mix well at all and I just can’t infer the difficulties reliably. I’ve tried this with both real and simulated data.
I did manage to fit the binary mixture of easy/hard items, but that seems less relevant with more and more annotators and especially with very noisy annotators.
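The basic model in this comment can be written out as a tiny simulation. This is only a sketch: it reads p(anno[i,j]) as the probability that annotator j labels item i correctly, which is an assumption about the notation, and the parameter values are invented.

```python
import math
import random

def inverse_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_correct(accuracy, difficulty):
    """Probability of a correct annotation under the logistic model."""
    return inverse_logit(accuracy - difficulty)

def simulate(accuracies, difficulties, seed=0):
    """Return a matrix correct[i][j] of 0/1 outcomes for item i, annotator j."""
    rng = random.Random(seed)
    return [[int(rng.random() < p_correct(a, d)) for a in accuracies]
            for d in difficulties]

# Invented parameters: three annotators, four items of increasing difficulty.
accuracies = [2.0, 1.0, 0.0]
difficulties = [-1.0, 0.0, 1.0, 2.0]
outcomes = simulate(accuracies, difficulties)

# Only the difference accuracy[j] - difficulty[i] enters the likelihood, so
# shifting every accuracy and every difficulty by the same constant changes nothing.
shift = 2.0
same = all(
    abs(p_correct(a, d) - p_correct(a + shift, d + shift)) < 1e-12
    for a in accuracies for d in difficulties
)
```

The shift invariance is one reason the data alone cannot pin down the two sets of parameters, so priors or constraints have to do that work. Whether it explains the mixing problems described here is not something this sketch can show.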
Brendan O'Connor
10/22/08
Oh, I just realized I’ve used item difficulty modelling in a different context, to figure out the tradeoff between more annotations and accuracy under naive voting:
With a gold standard set of true examples, get a high #annos/example, and get per-item error rates (1-sensitivity) err_i. This is to make a high recall system, so “yes” decisions must be unanimous (single dissenter causes “no” decision), meaning that all annotators must make an error to get an aggregate error. Then for k annotations per example the expected number of errors is \sum_i (err_i)^k.
This can be viewed as a model where all workers have equal capability, but items have a difficulty parameter. It gives a more pessimistic error estimate than the most naive model, where all items have the same difficulty and the expected number of errors is n * (err)^k. I believe the estimates are different because hard items don’t get solved too well by throwing more and more annotators at them.
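A quick numeric check of the two estimates, with invented per-item error rates (nine easy items and one hard one). The function names are made up; the formulas are the ones given above, with the uniform model using the mean of the per-item rates as its single err.

```python
def expected_errors_per_item(err, k):
    """Sum over items of err_i ** k: every one of k annotators must err."""
    return sum(e ** k for e in err)

def expected_errors_uniform(err, k):
    """Same quantity if every item shared the mean error rate: n * err ** k."""
    mean_err = sum(err) / len(err)
    return len(err) * mean_err ** k

# Invented error rates: nine easy items and one hard one (mean is 0.1).
err = [0.01] * 9 + [0.91]

for k in (1, 2, 3, 5):
    print(k, expected_errors_per_item(err, k), expected_errors_uniform(err, k))
```

For k = 1 the two agree, and for larger k the per-item estimate is never smaller, since x^k is convex on [0, 1] and the mean of the powers is at least the power of the mean. In this example almost all of the remaining expected error comes from the single hard item, which extra annotators barely help with.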