What makes a bad survey question and why does it matter? I thought I’d use my first blog posts as Dolores Labs’s friendly neighborhood social scientist to talk a little bit about question design since it’s a relevant, but often overlooked, area of crowdsourcing work.
You can ask “the crowd” all kinds of questions, but if you don’t stop to think about the best way to ask your question, you’re likely to get unexpected and unreliable results. You might call it the GIGO theory of research design.
To demonstrate the point, I decided to recreate some classic survey design experiments and distribute them to the workers in Crowdflower’s labor pools. For the experiments, every worker saw only one version of the questions and the tasks were posted using exactly the same title, description, and pricing. One hundred workers did each version of each question and I threw out the data from a handful of workers who failed a simple attention test question. The results are actual answers from actual people.
An Example: Response Scales
The rest of this post focuses on one example question that involved a response scale and a test to see how altering the scale would affect people’s answers. Here are two versions of the same question that I posted to Crowdflower:
Low Scale Version:
About how many hours do you spend online per day?
(a) 0 – 1 hour
(b) 1 – 2 hours
(c) 2 – 3 hours
(d) More than 3 hours
High Scale Version:
About how many hours do you spend online per day?
(a) 0 – 3 hours
(b) 3 – 6 hours
(c) 6 – 9 hours
(d) More than 9 hours
Notice that both versions can accommodate any answer and that the only difference is in the range of the scale items. You can give an accurate response to either question and neither version explicitly pushes you to give any answer over another.
So what did people say? Here’s a pair of histograms breaking the responses up by the two versions of the question:

I didn’t label the height of the bars because the results are almost useless in this form. The only conclusion we can draw is that a lot of people in the Crowdflower worker pool tend to spend more than three hours per day online (whoa, no way…).
At the same time, it seems like the workers might have given low answers more frequently in response to the low scale (check out how big the first three blue bars are compared to just the first orange bar).
To look at that comparison more closely, let’s break the answers into two categories for each scale: (1) the percentage of responses that were less than 3 hours, and (2) the percentage of responses that were more than 3 hours.

The difference between the height of the orange points (high scale) is much bigger than the corresponding difference between the height of the blue points (low scale). In other words, people who saw the high scale were much more likely to say they spent more than 3 hours online. In case you’re a stats nerd, the Chi-square test showed that this variation was significant with a p-value < 0.001, so the difference was almost certainly not due to chance.
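For anyone who wants to try that kind of test themselves, here is a minimal sketch of a Pearson chi-square test on a 2x2 table in plain Python. The counts are hypothetical placeholders, not the data from this experiment, and the function name `chi_square_2x2` is just an illustration. The sketch uses no continuity correction, which is an assumption on my part about how such a test might be run.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if scale version and answer were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# HYPOTHETICAL counts, not the post's data.
# Rows: low scale, high scale. Columns: under 3 hours, more than 3 hours.
counts = [[45, 50], [15, 80]]
chi2, p = chi_square_2x2(counts)
print(f"chi-square = {chi2:.2f}, p = {p:.6f}")
```

With real data you would swap in the actual number of workers in each cell after dropping the ones who failed the attention check.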
But maybe collapsing the responses like this is a little too coarse and you'd still like to see how the variation worked across the scale as a whole. With that in mind, Lukas suggested another way to look at the effects – a comparison of the cumulative percentage of responses – and the differences are even more clear.

That gap between the blue and the orange line at “Less than 3 hours” – the one level that was measured explicitly on both scales – is huge!
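The cumulative comparison is easy to reproduce. Below is a rough sketch, again with hypothetical counts rather than the real responses, that turns per-option counts into cumulative percentages and reads off the gap at the one boundary both scales share.

```python
from itertools import accumulate

def cumulative_percentages(counts):
    """Running share of responses, in percent, across ordered answer options."""
    total = sum(counts)
    return [100 * running / total for running in accumulate(counts)]

# HYPOTHETICAL counts for options (a) through (d), not the post's data.
low_counts = [10, 15, 20, 50]    # 0-1, 1-2, 2-3, more than 3 hours
high_counts = [15, 40, 25, 15]   # 0-3, 3-6, 6-9, more than 9 hours

low_cum = cumulative_percentages(low_counts)
high_cum = cumulative_percentages(high_counts)

# "Less than 3 hours" is the end of option (c) on the low scale
# and the end of option (a) on the high scale.
gap = low_cum[2] - high_cum[0]
print(f"Under 3 hours: low scale {low_cum[2]:.1f}%, high scale {high_cum[0]:.1f}%, gap {gap:.1f} points")
```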
Explaining the Gap
If you’re thinking that the differences between the scales alone can’t explain why all of these results are so skewed, that’s a good thought. However, the fact that this was a randomized experiment on a relatively homogeneous group of people makes it very unlikely that anything else explains the difference. Just to be sure, I did some other tests and found no significant differences between the sets of respondents that saw the low and high scales in terms of gender, country of origin, and the amount of time they took to complete the survey. So it seems like the scale is indeed the most likely culprit.
But what explains why scale questions can bias people’s responses so heavily? Survey researchers call this kind of behavior satisficing – it happens when people taking a survey use cognitive shortcuts to answer questions. In the case of questions about personal behaviors that we’re not used to quantifying (like the time we spend online), we tend to shape our responses based on what we perceive as “normal.” If you don’t know what normal is in advance, you define it based on the midpoint of the answer range. Since respondents didn’t really differentiate between the answer options, they were more likely to have their responses shaped by the scale itself.
These results illustrate a sticky problem: it’s possible that a survey question that is distributed, understood, and analyzed perfectly could give you completely inaccurate results if the scale is poorly designed.
Okay, It’s Broken. Now How Do I Fix It?
So what are you supposed to do in order to figure out which scale is more accurate? One of the best ways to mitigate the problem is to do some open-ended research on your respondent population so that you can get a good sense of a reasonable range of responses. Then you can re-center your response scale around that distribution.
To try this out, I ran the survey yet again with the same question, except that this time I left the “hours online” question open-ended, allowing Crowdflower workers to type in their responses. Here’s a density plot of those responses with the minimum, maximum, and mean responses highlighted (sparklines style):

While the distribution is skewed and has something of a long-ish tail, the mean (6.53 hours per day), median (6 hours per day), and mode (5 hours per day) are all close to the midpoint of the high scale in my original questions. Therefore, the responses from the high scale were probably a more accurate reflection of the workers’ judgments.
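Here is one way the re-centering step could look in code. This is only a sketch under my own assumptions: the responses are made up, the helper `centered_scale` is hypothetical, and putting the median at the middle boundary of an equal-width scale is just one reasonable reading of "re-center the scale around the distribution".

```python
import statistics

def centered_scale(center, n_options=4):
    """Build equal-width options whose middle boundary sits at `center`.

    Assumes an even number of options so the scale has a middle boundary.
    """
    width = center / (n_options / 2)
    bounds = [width * i for i in range(n_options)]
    labels = [f"{bounds[i]:g} - {bounds[i + 1]:g} hours" for i in range(n_options - 1)]
    labels.append(f"More than {bounds[-1]:g} hours")  # open-ended top option
    return labels

# HYPOTHETICAL open-ended answers in hours per day, not the post's data.
responses = [2, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8, 9, 10, 12, 14]

mean = statistics.mean(responses)
median = statistics.median(responses)
mode = statistics.mode(responses)
print(f"mean = {mean:.2f}, median = {median:g}, mode = {mode}")

scale = centered_scale(median)
for letter, label in zip("abcd", scale):
    print(f"({letter}) {label}")
```

The median is used as the center here because it is less sensitive to the long tail than the mean, but that is a design choice rather than a rule.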
Keep in mind, this technique provides no guarantee that the workers have accurate knowledge of how many hours they spend online – it’s turtles all the way down. I’d be willing to bet that their best guesses are pretty good, but if a big policy decision were riding on this question, I’d try to supplement my little survey with some other data sources. No matter what, there’s no perfect solution.
So what?
The point of all this has not been to undermine survey research, but to illustrate some of the problems that can happen if you’re not careful with things like scale design, as well as to present some strategies for solving those problems. As crowdsourcing becomes a mainstream tool in a range of academic and commercial fields, survey and questionnaire design techniques are also becoming more widely applicable. Nevertheless, people don’t usually encounter this kind of stuff outside of research methodology textbooks and the polling season of an election year.
I have a few more examples from these same experiments that I hope to follow up with in more posts soon. Meanwhile, leave a comment or email me at aaron [at] doloreslabs [dot] com with questions, comments, corrections and requests for data/code. All of these plots were created using R.
Pete Michaud
12/16/09
Interesting work. I’m about to implement product satisfaction surveys across multiple markets, and this was good food for thought. Keep it up 8)
michael
12/21/09
Why use a scale at all? I would make those types of questions always open ended. Anyone who takes the survey has to think about how many hours they spend online anyway. That’s the first step. The second is fitting their estimate in one of the categories. Seems like unnecessary work for the participants.
(Well, actually I would also make it two questions. One for weekdays, another for weekends. Otherwise the participants might be forced to do averaging in their heads – something which our programs are much better at.)
Pretty much the only time where I would use a scale in a case where the participant could also just give a number is income. And that’s only because asking directly for the income is considered very impolite (in Germany). A few years ago you would have to add age to that, but that has already changed and the thing with the income will, too.
aaron
12/21/09
Thanks for the feedback, everyone!
Michael – your comment illustrates a great point that deserves a lot more discussion: scale response questions may not be the best way to get the information you want on a given topic. That said, there are a few reasons I can think of why you still might want to use scales in certain situations:
I think these are all great reasons why the particulars of the study, the population, and the topics of interest should drive the research design process. Every type of question has characteristics that may look like limitations in a given set of circumstances, but strengths in another.
Bill Petti
12/22/09
One major advantage that comes immediately to mind is that scale questions don’t require analysts to spend additional time coding answers before commencing with their analysis. While open-ended questions may avoid the issue of satisficing (which I am not convinced they do–respondents could easily reference their own subjective scale or notions), they do place an additional burden on the analyst. For short, small-n surveys this isn’t that big of an issue. However, once you start scaling up in terms of n and the number of questions it can become problematic. Once you get into coding there are all sorts of issues that can arise (issues of subjectivity and bias, data entry errors, etc). Some crowdsourcing applications like Crowdflower may provide a convenient and reliable platform for coding, but at some level researchers will always have to make an intelligent trade-off between scale and open-ended questions.
michael
12/23/09
Would there really be that many problems with questions like this:
How much time do you spend online per weekday?
___ hours and ___ minutes online on weekdays
And how much time do you spend online per day on weekends?
___ hours and ___ minutes online per day on weekends
I really can’t imagine that. Throw up an error if the participant enters any letters. If the value is bigger than 24 hours or 60 minutes you can throw it out right away. (You have to be a bit careful with the minutes because from my experience some like to leave the hours blank and enter something like 90 minutes or 120 minutes.) All an easy enough fix.
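As a rough illustration of those checks, here is a minimal sketch in Python. The function names, the exact rules (blank fields count as zero, minutes of 60 or more are accepted only when hours is blank or zero), and the 5-to-2 weekday weighting are assumptions made for the example, not anything specified above.

```python
def parse_time_online(hours_text, minutes_text):
    """Return minutes online per day, or raise ValueError for unusable input."""
    def to_int(text, field):
        text = text.strip()
        if text == "":
            return 0  # assumption: a blank field counts as zero
        if not text.isdecimal():
            raise ValueError(f"{field} must be a whole number, got {text!r}")
        return int(text)

    hours = to_int(hours_text, "hours")
    minutes = to_int(minutes_text, "minutes")

    # Some respondents leave hours blank and type e.g. "90" minutes, so only
    # reject large minute values when hours were also given.
    if hours > 0 and minutes >= 60:
        raise ValueError("minutes must be below 60 when hours are given")
    total = hours * 60 + minutes
    if total > 24 * 60:
        raise ValueError("a day has only 24 hours")
    return total

def daily_average(weekday_minutes, weekend_minutes):
    """Average minutes per day over a week of 5 weekdays and 2 weekend days."""
    return (5 * weekday_minutes + 2 * weekend_minutes) / 7

weekday = parse_time_online("2", "30")   # 2 hours 30 minutes
weekend = parse_time_online("", "90")    # hours left blank, 90 minutes typed
print(f"weekday = {weekday} min, weekend = {weekend} min")
print(f"average per day = {daily_average(weekday, weekend):.1f} min")
```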
If those open ended questions get any more complicated you will get into trouble, that’s for sure. But as long as it’s as easy as this, I doubt there will be many problems. And from what I know – talking to other communication studies researchers (who ask these types of questions all the time) – you actually get very accurate time estimates. Not even asking for minutes is a fruitless exercise. Measurements of time spent online or in front of the TV match up pretty well to the self reported data (i.e. less than one hour difference, hence minutes matter, but you could probably leave them out if you wanted to make it easier).
Still, I can see your point about complexity. But that also depends on the respondents. I’ve seen people answering this question and some think long and hard and really try to come up with an accurate estimate. Others will just pull a number out. I would guess that scales are easier for the second group. But not exactly for the first. Scales might be even harder for them: they can’t just write their estimate in but have to fit it into categories.
(I like your point that scales help quantify things. But open ended questions can help there, too, albeit in a bit more subtle and maybe a bit less helpful way. When asking how often the participants go to the cinema or opera the researcher should first get a feeling for the average in the population and then decide whether to ask for visits per week, month or year.)
Mikael M
01/07/10
Interesting experiment! How much did it cost?
Eliot
01/08/10
Norbert Schwarz (survey research expert at Univ. of Michigan) has published a number of scientific papers documenting this effect. People do use the response scale as a cue indicating what is typical or normal, and place themselves accordingly. The same effect holds for relatively objective questions (hours per day online, number of headaches in an average month) and subjective questions (how satisfied are you with your life).
Ben Hyde
01/08/10
I always harbored a suspicion that survey respondents apply one of four heuristics: pick randomly, pick the min, pick the max, or take the cognitively expensive option of being authentic.
Then there is the eternal authentic/strategic problem. If the survey respondent knows that his votes have an effect then he spends his tokens very differently. I had a small survey, designed to pick a winning paper, blow up on me once because one of the respondents was brilliantly strategic – turning all the dials to max for his preference and to min for the others.
The error I find most frustrating though is when the survey designer lacks a model of what the distribution is, so his multiple choice is unable to capture a good signal for the parameters of the actual distribution, e.g. is the distribution Zipf or normal?