A Question of Asking Questions

OVERVIEW

Research Background

When we design a survey to measure user feedback, a common approach is to ask users to respond to questions with a numerical scale such as 1-5 or 1-7. However, there seems to be no universally accepted standard in selecting the number of points used on a scale. Whether an arbitrary selection of scale points will influence the findings of a study remains unknown.

Research Question

Given the same survey questions, will scales with different points such as 1-5, 1-7, 1-9, 0-10, and 0-100 produce different study results?

Target User

Survey takers.

METHOD

Research Method

A two-wave longitudinal study (Time 1 and Time 2) was conducted with thousands of participants. Two types of stimulus messages were created for this research: four high quality ads and four low quality ads. The four high quality ads were created by a professional graphic designer and featured four different products: jeans, stereo speakers, toothpaste, and paper tissues. The four low quality ads were identical to the high quality ads, except that each contained three typos.

This is the high quality ad for paper tissues.

This ad has no typos.

This is the low quality ad for paper tissues.

There are three typos in this ad, including “softmess,” “youy,” and “combie.”

In the study, participants were randomly assigned to evaluate either high quality ads or low quality ads. Their feedback on the ads was measured with one of the five scales (1-5, 1-7, 1-9, 0-10, or 0-100).

Reason for Selecting the Longitudinal Study Method

This longitudinal method was chosen because a user’s feedback could be measured twice in the research (Time 1 vs. Time 2). If a scale used to measure user feedback were reliable, it should show a significant difference when a user saw different ads in Time 1 and Time 2. On the other hand, it should show no significant difference when the user saw the same ads in both Time 1 and Time 2.

Research Participants

Participants were recruited from Amazon Mechanical Turk. There were 2,610 participants in the Time 1 study. Four weeks later, all of them were invited to participate in the Time 2 study, and 1,356 accepted the invitation and completed it.

FINDINGS

Research Findings

A “top 2 box” analysis was performed to see if users rated high quality ads more favorably than low quality ads. The results showed an expected pattern in general (i.e., users favored high quality ads), but the pattern was not clear for the 0-100 scale.

This finding suggests that a “top 2 box” analysis may not be appropriate for longer scales such as 0-100.
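To make the point about longer scales concrete, here is a minimal sketch of a top 2 box score, assuming the usual definition: the share of responses that fall on the two highest points of the scale. The function name `top_two_box` and the ratings are hypothetical illustrations, not the study's code or data.

```python
def top_two_box(ratings, scale_max):
    """Share of ratings on the two highest points of an integer scale."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    # The "top 2 box" is scale_max and scale_max - 1.
    in_top_two = sum(1 for r in ratings if r >= scale_max - 1)
    return in_top_two / len(ratings)


# Hypothetical ratings (NOT data from the study), roughly the same
# opinions expressed on a 1-5 scale and on a 0-100 scale.
five_point = [5, 4, 3, 4, 2]
hundred_point = [100, 80, 60, 85, 30]

print(top_two_box(five_point, 5))       # 0.6
print(top_two_box(hundred_point, 100))  # 0.2
```

On a 1-5 scale the top two boxes cover 40% of the available points, whereas on a 0-100 scale they cover only 99 and 100, so favorable ratings such as 80 or 85 are not counted. This may be one reason the pattern was less clear for the 0-100 scale.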

To compare scores on different scales, all scores were re-scaled to 0-100 by using the formula [(rating - 1)/(number of response categories - 1)] * 100. For example, 2 on a 1-5 scale will be 25 on a 0-100 scale because [(2 - 1)/(5 - 1)] * 100 = 25. When high quality ads were compared to low quality ads, no matter which scale was used, the scores for high quality ads were higher than those for low quality ads (although the p-values in these comparisons were not all < .05).
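The re-scaling step can be sketched as follows. This is an illustrative helper, not the authors' code. For scales that start at 1, it is the same as the formula above, because the number of response categories minus 1 equals the scale maximum minus the scale minimum. Subtracting the scale minimum (0) rather than 1 for the 0-10 and 0-100 scales is an assumption made here so that the lowest rating maps to 0. The function name `rescale_to_100` is hypothetical.

```python
def rescale_to_100(rating, scale_min, scale_max):
    """Map a rating on [scale_min, scale_max] linearly onto 0-100."""
    if not scale_min <= rating <= scale_max:
        raise ValueError("rating is outside the scale")
    # For 1-based scales, scale_max - scale_min equals
    # (number of response categories - 1), as in the formula in the text.
    # Using scale_min = 0 for zero-based scales is an assumption.
    return (rating - scale_min) / (scale_max - scale_min) * 100


print(rescale_to_100(2, 1, 5))   # 25.0, the worked example from the text
print(rescale_to_100(4, 1, 7))   # 50.0
print(rescale_to_100(5, 0, 10))  # 50.0
```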

Users in this research appeared to be most sensitive to paper tissues ads, possibly because the typos in the low quality tissues ad were easier to find, compared to those in other low quality ads.

Moreover, the difference in user feedback was calculated for those who completed both the Time 1 study and the Time 2 study. The repeated measures ANOVA results showed that participants provided more favorable feedback for high quality ads than for low quality ads, regardless of whether the same or different scales were used in Time 1 and Time 2.

The scores of Time 2 minus those of Time 1 should be positive if a user saw high quality ads in Time 2 and low quality ads in Time 1, and negative in the opposite case.

DELIVERABLES

Actionable Implications

Overall, all five scales tested in this research (1-5, 1-7, 1-9, 0-10, and 0-100) measured user feedback in a consistent way. In research practice, the number of scale points likely will not influence the findings of a study, provided that the p-value is not the only criterion used to judge whether a difference is significant. Although 5-point and 7-point scales seem popular among researchers, other scales may be acceptable because they likely will produce the same study results.

Research Publication

Cong Li and Khudejah Ali (2021), “Measuring attitude toward the ad: A test of using arbitrary scales and ‘p < .05’ criterion?” International Journal of Market Research, 63(5), 620-634.