The CrowdFlower Blog

Our color names data set is online

I just packaged and released the data set for our color names experiment. It has 10,000 color/label pairs.

This is the download link. Read on for more details:

I tried to generate the color patches in a way that yields interesting colors, which is of course incredibly subjective. My main concern was to eliminate muddy dark grays, which are very common when sampling uniformly over standard RGB values. (Perhaps I went too far; see the big donut hole in the color wheel plots.) So the color patches were sampled from HSV, with uniform sampling over hue but with saturation and value biased high (drawn from a normal distribution). The exact code and parameters for this are included in the download.
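For readers who want a feel for that sampling scheme without opening the download, here is a minimal Python sketch. It is not the released code: the mean of 0.8, the standard deviation of 0.2, and the choice to redraw out-of-range values (rather than clip them) are all assumptions made for illustration, and the function names are hypothetical. The real parameters are in R.R.

```python
import colorsys
import random


def sample_biased_high(rng, mean=0.8, sd=0.2):
    """Draw from a normal distribution, redrawing until the value lies in [0, 1].

    mean and sd are illustrative guesses, not the parameters from the download.
    """
    while True:
        x = rng.gauss(mean, sd)
        if 0.0 <= x <= 1.0:
            return x


def sample_color(rng):
    """Return an (r, g, b) tuple of 0-255 ints.

    Hue is uniform on [0, 1); saturation and value are biased high.
    """
    h = rng.random()
    s = sample_biased_high(rng)
    v = sample_biased_high(rng)
    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    return tuple(round(c * 255) for c in (r, g, b))


rng = random.Random(0)  # fixed seed so the run is repeatable
patches = [sample_color(rng) for _ in range(1000)]
```

Because value is biased high, the brightest channel of each patch tends to sit near the top of the 0-255 range, which is what pushes the dark grays out of the sample.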

The plots in the post and the explorer look like a color wheel with hue as the angle. But actually they’re from running PCA over the RGB values, using the first two principal components as x and y. This was a very arbitrary decision, but seemed to make a nice visual effect. There are many other reasonable ways to plot the data.
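As a rough illustration of that projection, the sketch below runs PCA on a handful of RGB triples in plain Python and returns 2D coordinates. It is not the R code from the download: it finds the two leading components by power iteration with deflation, which is one simple way to do it for a 3x3 covariance matrix, and the function names and toy data are made up for the example.

```python
import math
import random


def leading_eigenpair(matrix, rng, iterations=200):
    """Power iteration: leading eigenvector and eigenvalue of a symmetric PSD matrix."""
    n = len(matrix)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]  # random start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # matrix is zero along v; nothing more to find
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * matrix[i][j] * v[j] for i in range(n) for j in range(n))
    return v, eigenvalue


def pca_2d(rows, seed=0):
    """Project rows onto their first two principal components.

    Returns (coords, pc1, pc2), where coords is a list of (x, y) pairs.
    """
    rng = random.Random(seed)
    n, dims = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(dims)]
    centered = [[r[k] - means[k] for k in range(dims)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    pc1, var1 = leading_eigenpair(cov, rng)
    # Deflation: subtract the first component so the next one becomes the leader.
    deflated = [[cov[i][j] - var1 * pc1[i] * pc1[j] for j in range(dims)]
                for i in range(dims)]
    pc2, _ = leading_eigenpair(deflated, rng)
    coords = [(sum(c[k] * pc1[k] for k in range(dims)),
               sum(c[k] * pc2[k] for k in range(dims))) for c in centered]
    return coords, pc1, pc2


# Toy data: most variance along the red=green diagonal, a little along blue.
rgb = [(0, 0, 0), (0, 0, 2), (10, 10, 0), (10, 10, 2), (20, 20, 0), (20, 20, 2)]
coords, pc1, pc2 = pca_2d(rgb)
```

The sign of each component is arbitrary, so a plot made this way may come out mirrored relative to one made with another PCA routine.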

The data includes an anonymized identifier for each worker. (The Mechanical Turk service makes all workers anonymous, but we anonymized again before releasing the data set.) You can see that certain workers did a large number of annotations. We have no demographic information for this experiment, sorry.
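If you would rather check the per-worker counts in Python than in R, something like the following should work. The column name worker_id is an assumption; look at the header row of data.csv and pass the real name. The sample rows here are invented just to show the shape of the result.

```python
import csv
import io
from collections import Counter


def annotations_per_worker(csv_file, worker_column="worker_id"):
    """Count rows per worker in an open CSV file that has a header row.

    worker_column is a hypothetical name; use the one in the real header.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[worker_column] for row in reader)


# Invented sample standing in for data.csv.
sample = io.StringIO(
    "worker_id,label\n"
    "w1,red\n"
    "w1,teal\n"
    "w2,mauve\n"
)
counts = annotations_per_worker(sample)
print(counts.most_common(1))  # [('w1', 2)]
```

For the real file, open it with open("data.csv", newline="") and pass the file object in place of sample.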

The files are:

  • data.csv, which contains the color/label pairs, along with RGB and HSV representations of each color.
  • R.R, which has some routines that were used to generate and plot the data. It has examples of how to read and use the data, if you like to use R.
  • html.rb, whose write_html() generates the explorer.
  • sample-hit.html, one of the web forms used for data collection. There were 1000 forms with 10 colors each. For each single form (“HIT”), exactly one annotator filled it out. Individual annotators sometimes did multiple forms if they wanted to.

Let us know if this is useful, if you have any questions, or find something wrong with the download — either email or leave a comment here. And if you do anything cool with this data, we’d really love to hear about it.

-Brendan


Comments

  1. Chris

    Great idea and a nice start, however the color names are vague, poorly spelt and inconsistent. It’s pretty unusable in its current state.


  2. bill

    Do a filter for “red” in the explorer view. “Hulkcredible.”


  3. David

    bill, try ^red$ to get just ‘red’.


  4. Nick

    Here’s a somewhat cleaner (labels are spell checked, lowercase) version of the CSV file: http://drop.io/doloreslabscolorsdata2


  5. Brandon

    1) I love the project! Great work.

    2) I imagine it’s safe to assume that you don’t know the sex of your workers? I’d be very interested to know how each sex rated colors differently. If you only looked at males would you see more “blue”, “green”, “red”, etc. I notice that, relatively speaking, few people just said “red”. I blame the lipstick industry.


  6. mturker

    Does not mean sh*t. I run 58 mturk accounts with automated scripts. Did not even see the colors once I figured out it was in order red/purple/blue/green/yellow tones…Heh…. weekend scientists!


  7. In case anyone believes above comment, there was no pattern to the solicited colors. We reviewed that data and found virtually no cases of scamming (you can check for yourself in our color explorer).

    This does illustrate why the reputation/scam detection system we’ve built on top of turk is important and useful to customers. Love the term “weekend scientist” :).


  8. Here’s a 3D visualizer for the data and some screenshots:

    http://www.box.net/shared/qbunsqy0og

    The script requires Python and the Panda3D engine. It uses the modified data set (data2.csv) posted by Nick. Use the left mouse button to pan and the right mouse button to zoom. It’s pretty easy to change the scale and orientation of the axes in the code. Improvements are welcome.


  9. This post is awesome, nice work!


  10. Hi guys.

    What’s the licence on the data, please? (CC0 or CC-BY would be good :-) ).

    And on the code (GPL v3+ or MIT would be good :-) ).

    Thanks.


  11. Excellent questions. We don’t have formal licensing on these. What do you advise?


  12. Jan Margeta

    Hi, wonderful experiment and a fantastic collection of data!
    It would be interesting to also analyse the rejected “scammer” data. Maybe some of them are not intentional scammers and some interesting patterns exist.

    Have you decided the license of the data and code already? Licenses proposed by Rob Myers seem to be quite friendly.

    Cheers!

