Our color names data set is online
I just packaged and released the data set for our color names experiment. It has 10,000 color/label pairs.
This is the download link. Read on for more details:
I tried to generate the color patches in a way that would yield interesting colors, which is of course incredibly subjective. My main concern was to eliminate muddy dark grays, which are very common when sampling uniformly over standard RGB values. (Perhaps I went too far; see the big donut hole in the color wheel plots.) So the color patches were sampled from HSV, with hue sampled uniformly but saturation and value biased high (drawn from a normal distribution). The exact code and parameters for this are included in the download.
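For readers who want a feel for this kind of sampling without opening the download, here is a minimal Python sketch. It is not the released code: the means and standard deviations are placeholder assumptions, and clipping out-of-range draws to [0, 1] is also an assumption (the real code may handle them differently).

```python
import colorsys
import random


def sample_patch(rng, sat_mean=0.8, sat_sd=0.2, val_mean=0.8, val_sd=0.2):
    """Return one (r, g, b) tuple of ints in 0..255.

    Hue is uniform over the full circle. Saturation and value are drawn
    from normal distributions centered high and clipped to [0, 1].
    The default means and standard deviations are placeholders, NOT the
    parameters used for the released data set.
    """
    hue = rng.random()
    sat = min(1.0, max(0.0, rng.gauss(sat_mean, sat_sd)))
    val = min(1.0, max(0.0, rng.gauss(val_mean, val_sd)))
    r, g, b = colorsys.hsv_to_rgb(hue, sat, val)
    return tuple(round(c * 255) for c in (r, g, b))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
patches = [sample_patch(rng) for _ in range(10)]
for patch in patches:
    print(patch)
```

Because saturation and value rarely come out low, dark and gray colors are rare, which is consistent with the hole in the middle of the plots.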
The plots in the post and the explorer look like a color wheel with hue as the angle. But actually they’re from running PCA over the RGB values, using the first two principal components as x and y. This was a very arbitrary decision, but seemed to make a nice visual effect. There are many other reasonable ways to plot the data.
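The projection described above can be sketched in plain Python. This is an illustration only: it finds the two components with power iteration to avoid any dependencies, whereas the R routines in the download presumably use a library PCA, and the function name `pca_2d` and the sample colors are made up for the example.

```python
def pca_2d(rgb_rows, iters=500):
    """Project RGB triples onto their first two principal components.

    Returns (pc1, pc2, points), where points holds one (x, y) pair per row.
    """
    n = len(rgb_rows)
    means = [sum(row[j] for row in rgb_rows) / n for j in range(3)]
    centered = [[row[j] - means[j] for j in range(3)] for row in rgb_rows]
    # 3x3 sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(3)]
           for i in range(3)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def leading_direction(orthogonal_to):
        # Power iteration, re-orthogonalized at every step against the
        # components that were already found.
        v = [1.0, 0.5, 0.25]  # arbitrary non-zero starting vector
        for _ in range(iters):
            w = [dot(cov[i], v) for i in range(3)]
            for u in orthogonal_to:
                proj = dot(w, u)
                w = [wi - proj * ui for wi, ui in zip(w, u)]
            norm = dot(w, w) ** 0.5
            if norm == 0.0:
                raise ValueError("data has fewer than two directions of variation")
            v = [wi / norm for wi in w]
        return v

    pc1 = leading_direction([])
    pc2 = leading_direction([pc1])
    points = [(dot(r, pc1), dot(r, pc2)) for r in centered]
    return pc1, pc2, points


# Made-up colors, only to show the call; not rows from data.csv.
colors = [(250, 20, 30), (240, 200, 40), (30, 210, 60), (40, 60, 230), (200, 50, 220)]
pc1, pc2, points = pca_2d(colors)
for color, (x, y) in zip(colors, points):
    print(color, round(x, 1), round(y, 1))
```

One plausible reason the result looks like a color wheel: when most colors are bright and saturated, the overall light-to-dark direction carries relatively little variance, so the top two components mostly capture hue. When those two components have nearly equal variance, their orientation within the plane is essentially arbitrary.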
The data includes anonymized worker identities. (The Mechanical Turk service makes all workers anonymous, but we anonymized yet again for releasing the data set.) You can see that certain workers did a large number of annotations. We have no demographic information for this one, sorry.
The files are:
- data.csv, which contains the color/label pairs along with RGB and HSV representations of each color.
- R.R, which has some routines that were used to generate and plot the data. It has examples of how to read and use the data, if you like to use R.
- html.rb, whose write_html() creates the explorer.
- sample-hit.html, one of the web forms used for data collection. There were 1000 forms with 10 colors each. Each form (“HIT”) was filled out by exactly one annotator, though individual annotators sometimes did multiple forms if they wanted to.
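If you would rather not use R, a few lines of Python are enough to see how many annotations each worker contributed. This is only a sketch: the column name `worker_id` and the inline sample rows are assumptions made up for illustration, so check the actual header of data.csv before running it on the real file.

```python
import csv
import io
from collections import Counter


def annotations_per_worker(csv_file, worker_column="worker_id"):
    """Count rows per worker in an open CSV file with a header row.

    worker_column is a hypothetical name; replace it with whatever the
    anonymized worker column is actually called in data.csv.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[worker_column] for row in reader)


# Invented rows standing in for the real file.
sample = io.StringIO(
    "worker_id,label,r,g,b\n"
    "w1,red,250,10,20\n"
    "w2,sky blue,90,170,240\n"
    "w1,lime green,120,230,40\n"
    "w3,purple,150,40,200\n"
    "w1,orange,250,150,20\n"
)
counts = annotations_per_worker(sample)
for worker, n in counts.most_common():
    print(worker, n)
```

For the real data, pass `open("data.csv", newline="")` instead of the inline sample.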
Let us know if this is useful, if you have any questions, or if you find something wrong with the download; either email us or leave a comment here. And if you do anything cool with this data, we’d really love to hear about it.

March 18th, 2008 at 8:56 pm
Great idea and a nice start, however the color names are vague, poorly spelt and inconsistent. It’s pretty unusable in its current state.
March 18th, 2008 at 10:42 pm
Do a filter for “red” in the explorer view. “Hulkcredible.”
March 19th, 2008 at 9:23 pm
bill, try ^red$ to get just ‘red’.
March 22nd, 2008 at 11:31 pm
Here’s a somewhat cleaner (labels are spell checked, lowercase) version of the CSV file: http://drop.io/doloreslabscolorsdata2
April 7th, 2008 at 11:52 pm
1) I love the project! Great work.
2) I imagine it’s safe to assume that you don’t know the sex of your workers? I’d be very interested to know how each sex rated colors differently. If you only looked at males would you see more “blue”, “green”, “red”, etc. I notice that, relatively speaking, few people just said “red”. I blame the lipstick industry.
April 13th, 2008 at 11:37 pm
Does not mean sh*t. I run 58 mturk accounts with automated scripts. Did not even see the colors once I figured out it was in order red/purple/blue/green/yellow tones…Heh…. weekend scientists!
April 14th, 2008 at 8:38 pm
In case anyone believes above comment, there was no pattern to the solicited colors. We reviewed that data and found virtually no cases of scamming (you can check for yourself in our color explorer).
This does illustrate why the reputation/scam detection system we’ve built on top of turk is important and useful to customers. Love the term “weekend scientist” :).
May 25th, 2008 at 4:27 am
Here’s a 3D visualizer for the data and some screenshots:
http://www.box.net/shared/qbunsqy0og
The script requires Python and the Panda3D engine. It uses the modified data set (data2.csv) posted by Nick. Use the left mouse button to pan and the right mouse button to zoom. It’s pretty easy to change the scale and orientation of the axes in the code. Improvements are welcome.