“I know it when I see it.” — Justice Potter Stewart
We have been running Crowdsifter, our content moderation product backed by Amazon’s Mechanical Turk, for a while, and we wanted to share some quality metrics and some stats on how our system aggregates redundant results to improve those metrics.
Controlling for Worker Quality, Bias, and Item Difficulty
In the graph above we picked the best error rate for raw AMT with 1-11 workers and the best error rate that Crowdsifter provided on a porn judgment task with 2491 images (1006 porn, 1485 non-porn). The error rate is the rate at which wrong decisions are made. A wrong decision is any case where we label porn as not porn or non-porn as porn. The experiment includes images that were labeled as ambiguous, which is why the error rates shown seem so high.
Using Crowdsifter with an average of 3.93 workers per image, we achieve the same minimum error rate as majority voting in raw AMT with 9 workers per image. We do this by controlling for worker quality: we keep track of every worker’s judgments, and when we have an “expert”-evaluated gold standard of what is pornographic, we can tell which workers are doing a good job and which are doing a bad one. On non-gold-standard images we weight each worker’s judgment by how much we trust it to reflect our standard of porn. Without these controls, majority voting in raw AMT is vulnerable to the many scammers that lurk there.
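To make the idea concrete, here is a minimal sketch of gold-standard-weighted voting. It is not Crowdsifter’s actual algorithm: the function names, the smoothing prior, and the log-odds weighting are all assumptions chosen for illustration.

```python
import math

def worker_accuracy(gold_answers, worker_answers, prior_correct=1.0, prior_total=2.0):
    """Smoothed share of gold-standard items a worker labeled correctly.

    gold_answers and worker_answers map item ids to labels. The prior
    (an assumption) keeps a worker with few gold items near 0.5.
    """
    scored = [item for item in worker_answers if item in gold_answers]
    correct = sum(worker_answers[item] == gold_answers[item] for item in scored)
    return (correct + prior_correct) / (len(scored) + prior_total)

def weighted_vote(judgments, accuracy):
    """Combine (worker_id, label) pairs, labels being "porn" or "not_porn".

    Each vote counts with the log-odds of the worker's estimated accuracy,
    so a worker at 0.5 (no better than chance) has zero influence.
    """
    score = 0.0
    for worker, label in judgments:
        acc = min(max(accuracy.get(worker, 0.5), 0.01), 0.99)  # avoid infinite weights
        weight = math.log(acc / (1.0 - acc))
        score += weight if label == "porn" else -weight
    # Ties fall to "not_porn" here; a stricter site might prefer the opposite.
    return "porn" if score > 0 else "not_porn"
```

With weights like these, one reliable worker can outvote two workers who click “not porn” on everything, which is exactly the case where plain majority voting fails.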
For images where obscenity is particularly ambiguous, we can allocate more workers, which gives us a better sampling of whether an image is obscene. Other images don’t need many judges to accurately determine if they are pornographic. We can determine which images are easily classifiable by sampling a small group of workers and checking whether they all agree. Because using too many judges per image can become prohibitively costly, it is important to allocate workers dynamically in this way. Raw AMT is wasteful, applying many judgments to easy items while not using enough judgments for hard items.
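The allocation scheme described above could look roughly like the sketch below. The batch sizes, the 0.8 agreement cutoff, and the `get_judgment` callable are hypothetical; the post does not give Crowdsifter’s actual stopping rule.

```python
def judge_adaptively(get_judgment, initial=3, batch=2, max_workers=11, agreement=0.8):
    """Collect judgments until workers agree strongly or the budget runs out.

    get_judgment is a hypothetical callable that returns one new worker's
    label (True for porn). Returns (is_porn, number_of_workers_used).
    """
    votes = [get_judgment() for _ in range(initial)]
    while True:
        porn_share = sum(votes) / len(votes)
        majority_share = max(porn_share, 1.0 - porn_share)
        # With 3 initial workers and a 0.8 cutoff, only a unanimous
        # first group stops immediately; any split asks for more workers.
        if majority_share >= agreement or len(votes) >= max_workers:
            return porn_share >= 0.5, len(votes)
        remaining = max_workers - len(votes)
        votes.extend(get_judgment() for _ in range(min(batch, remaining)))
```

An easy image costs three judgments, while a contested one keeps drawing workers in small batches up to the cap.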
Better Measures
The raw error rate includes both images incorrectly labeled as porn, and incorrectly labeled as non-porn. In content moderation we want to minimize our porn miss rate (also known as false negative rate) because we don’t want to let any porn onto our site. The graph for the porn miss rate corresponding with the above graph is shown below.
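For reference, the three rates discussed in this post can be computed as in the sketch below. The function name and the boolean encoding (True meaning porn) are assumptions for illustration.

```python
def moderation_rates(truth, predicted):
    """Compute error rates from equal-length sequences of booleans.

    True means porn. Both classes must be present in truth, otherwise
    one of the divisions below has a zero denominator.
    """
    pairs = list(zip(truth, predicted))
    missed_porn = sum(t and not p for t, p in pairs)    # porn labeled non-porn
    flagged_clean = sum(p and not t for t, p in pairs)  # non-porn labeled porn
    porn_total = sum(truth)
    clean_total = len(pairs) - porn_total
    return {
        "error_rate": (missed_porn + flagged_clean) / len(pairs),
        "porn_miss_rate": missed_porn / porn_total,
        "non_porn_miss_rate": flagged_clean / clean_total,
    }
```

Note that the raw error rate mixes the two kinds of mistake, so the same error rate can hide very different porn miss rates.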
The most important number is the porn miss rate, and ours is close to the rates of 9 to 11 workers per image on AMT, even though we are using fewer than half that number of workers, meaning we significantly cut our costs.
Adjusting Thresholds
We can adjust certain thresholds to lower the porn miss rate, but pushed too far this risks labeling all our images as porn, so nothing would make it onto our site. Adjusting the threshold to minimize the porn miss rate while maintaining an acceptable non-porn miss rate is a task Crowdsifter can readily handle.
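As a generic illustration of this trade-off (not Crowdsifter’s implementation), suppose each image has an aggregated porn score between 0 and 1 and is labeled porn when the score reaches a threshold. The scores, function names, and candidate thresholds below are hypothetical.

```python
def rates_at_threshold(scores, truth, threshold):
    """Return (porn_miss_rate, non_porn_miss_rate) at a given threshold.

    An image is labeled porn when its score is at or above the threshold.
    truth holds True for porn; both classes must be present.
    """
    missed = sum(t and s < threshold for s, t in zip(scores, truth))
    flagged = sum((not t) and s >= threshold for s, t in zip(scores, truth))
    porn_total = sum(truth)
    clean_total = len(truth) - porn_total
    return missed / porn_total, flagged / clean_total

def lowest_miss_threshold(scores, truth, max_non_porn_miss, candidates):
    """Pick the candidate threshold with the lowest porn miss rate among
    those whose non-porn miss rate stays acceptable (None if none do)."""
    best = None
    for threshold in candidates:
        porn_miss, non_porn_miss = rates_at_threshold(scores, truth, threshold)
        if non_porn_miss <= max_non_porn_miss:
            if best is None or porn_miss < best[1]:
                best = (threshold, porn_miss, non_porn_miss)
    return best
```

A threshold of 0 drives the porn miss rate to zero but flags every clean image, which is the degenerate case described above; the cap on the non-porn miss rate rules it out.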
We’ll save what we can do with threshold adjustment for a later blog post.
-John
Thanks to Brendan for help in this post.


Josh
07/10/09
While I have played with MT, I’m no expert – but what stops one from implementing the same algorithm atop MT? Via the API you can keep track of who each turker is, and can control against a known good sample. Or is there some additional secret sauce that is not implementable using the Turk API?
lukas
07/10/09
Josh – Great question.
1) It’s not trivial to implement an algorithm using the turk API where different images are shown with different frequencies. It’s also not trivial to hide in gold standard data that immediately calls out worker performance.
2) Finding an optimal weighting algorithm and worker quality calculation is not trivial.
3) We have historical data on every worker that has done this task, so we know which workers are higher quality and lower quality and we know who is scamming us.
The result is more than double the efficiency of aggregating turk results without these benefits. So not only do you NOT have to deal with the turk API, and build your own reporting tools, but your images will get the same quality of moderation in half the time.
My goal with this post was less to brag about the quality of our system and more because I thought our intern John had done a great job of systematically measuring our quality and comparing to a baseline. I thought it would be interesting to show blog readers how we do our quality measurements and the way we think about the quality problem with multiple annotators.
brendano
07/11/09
Josh, on implementing with the standard AMT API — it makes very strong assumptions that you want the same number of annotations per item. Furthermore, it has a static publishing model so you can’t make dynamic decisions which items should get more; and you need your own gold testing system; and etc etc like lukas said
Bob Carpenter
07/13/09
I’m confused by the comments. Isn’t Crowdsifter implemented on top of Mechanical Turk? That’s what the post says in the very first line after the quote. I understand why you can’t just use their bulk task mechanisms, but at some level, everything bottoms out in their web API.
My second question is: why is this task so hard? Are there really so many borderline cases that a 5% false negative is reasonable? Maybe you need Justice Potter Stewart as an annotator.
My third question is that if the task is as hard as all that, what’s your agreement among the people doing the “gold standard” annotations? Or do you just annotate easy cases in your gold standard? That’d probably work just as well for adjusting for annotator accuracy in predictions.
Patrick Perry
07/14/09
This is interesting stuff. Clearly, you guys have thought about the problem a lot and are doing a great job.
I suspect these measurements are biased, though I can’t figure out in what direction. In the real world, the number of porn images is nowhere near 40%. I don’t know what the actual number is, but I would expect less than 0.01% (you guys probably have a better idea about this than I do). Porn images are rare. I can see two potential consequences of this.
Scenario 1: workers almost always click “not porn”, and fatigue sets in. They stop paying attention to the task and their accuracy goes down.
Scenario 2: because porn is so rare and so different from the other images, it really stands out. Workers do not have to think much or pay much attention to get a very good accuracy rate.
Do you guys have any guesses as to which of these happens? I would be very interested in seeing how the accuracy changes as you vary the proportion of porn images in the training set.
Patrick Perry
07/14/09
I should add that: a) it would be nice to see some plots with standard error estimates, and b) it may not be feasible to get a good estimate when you have less than 1% porn in the training set
Pelez
07/20/09
Time to rename the blog and give it a name related to domains :) Maybe that’s enough about them?