Wisdom of small crowds, part 3: another worker visualization
August 7th, 2008 by Brendan O'Connor
This is a follow-up to the previous post on individual workloads and rates. Here are the submission times and durations for every worker on the same graph. Each worker is one horizontal line. Each assignment starts at a dot, and its duration is the line segment extending to the right.
The particular data set isn’t the same as in the previous post, but it was for a similar task and exhibits a similar structure. Worker rates differ substantially. Some workers do a few HITs, but others work on as many as are available. Some work rapidly with breaks (workers 19, 36). Some assignment durations are as long as 5-10 minutes (workers 13, 37). Some work very intermittently (worker 29).
This view makes the parallelism of AMT apparent. At any vertical timeslice you can see how many workers are active at that time. The entire job ends on the right side when the available HITs run out.
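The "vertical timeslice" reading can also be done numerically. Here is a minimal sketch in R, assuming a data frame with one row per assignment; the toy data and the names `start`, `secs`, `end`, and `active_workers` are made up for illustration and are not from the real data set.

```r
# Hypothetical toy data: one row per assignment (column names are assumptions).
a <- data.frame(
  WorkerId = c("w1", "w1", "w2", "w3"),
  start    = as.POSIXct(c("2008-08-07 10:00:00", "2008-08-07 10:05:00",
                          "2008-08-07 10:01:00", "2008-08-07 10:04:00"),
                        tz = "UTC"),
  secs     = c(120, 60, 600, 30)   # assignment durations in seconds
)
a$end <- a$start + a$secs          # POSIXct + numeric adds seconds

# Number of distinct workers with an assignment open at time t,
# i.e. how many horizontal segments a vertical line at t would cross.
active_workers <- function(a, t) {
  open <- a$start <= t & t <= a$end
  length(unique(a$WorkerId[open]))
}

n_active <- active_workers(a, as.POSIXct("2008-08-07 10:04:10", tz = "UTC"))
print(n_active)
```

Calling `active_workers` over a sequence of times would give the parallelism as a curve rather than something you read off the plot by eye.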
[ This article is part of a series, Wisdom of Small Crowds, on crowdsourcing methodology. ]




August 14th, 2008 at 2:20 pm
Excellent demonstration of worker times (both this one and the previous post).
Have you thought of examining more closely the HITs that tend to take longer to complete than the rest? I am wondering if they are “more difficult” than the rest, or if they fall into some specific category.
August 14th, 2008 at 2:44 pm
Also, can you post the code for these visualizations? They are pretty cool and very revealing at the same time.
August 14th, 2008 at 7:21 pm
Thanks! The code is all in R and pretty minimal, so I’ll just put it inline here.
This is working off of the new CSV format from the new AMT interface, which has one row per assignment:
a = read.csv("amt_assignments_file.csv")
# some lame datetime cleanups - amazon uses strftime("%c"), totally dumb...
lame_convert <- function(x) strptime(x, "%A %b %d %T")
for (c in c('AcceptTime','AutoApprovalTime','CreationTime','Expiration','SubmitTime'))
  a[,c] = as.POSIXct( lame_convert(a[,c]) )

Here’s the parallelism plot:
worker_parallelism_plot <- function(a, w_pos=NULL, ...) {
  if (is.null(w_pos)) {
    w_starts = (dfagg(a,a$WorkerId, function(x) min(x$SubmitTime - x$WorkTimeInSeconds)))
    w_pos = rank(w_starts, ties='first')
  }
  plot(a$SubmitTime - a$WorkTimeInSeconds, w_pos[a$WorkerId], type='p', ...)
  segments(a$SubmitTime - a$WorkTimeInSeconds, w_pos[a$WorkerId], a$SubmitTime, w_pos[a$WorkerId])
  # text(sort(w_starts), 1:length(w_pos), sprintf("%s", 1:length(w_pos)), pos=2)
}

The one-box-per-worker plot in the other post is just
August 14th, 2008 at 7:24 pm
oops, that code uses a function (“dfagg”) from the utility file http://github.com/brendano/dlanalysis/tree/master/util.R
August 14th, 2008 at 9:54 pm
Oh, as for HITs that take longer: I haven’t looked at that too much. I’ve only done this timing analysis for a task that’s really easy for all HITs.