How do you work out what sample size to use for your survey? It is actually a complex calculation, and consequently, in my experience, people fall back on rules of thumb such as '10% of the population'. Such rules of thumb cannot hope to give an adequate estimate of the needed sample size, so people either under-sample or over-sample. Often the sample is far too small, leading to unsound decisions, or far too large, wasting effort and expense.
What sample size should you take? The answer is a balance between your intolerance for 'false positives' and 'false negatives'.
The diagram below answers the following commonly asked question:
I wish to sample my customers to see if they are satisfied with the service I provide. I think that at least 85% of them are at least 'very satisfied'. I can tolerate a 'false positive' (ie saying they were satisfied when they were not) no more than 5% of the time. On the other hand, if more than 85% really are at least 'very satisfied', I want to be at least 90% certain that I will find out.

Use the sliders to set up your own sampling plans. The calculator shows four graphs, arranged anti-clockwise from the left.
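For readers who want to see the arithmetic behind a plan like this, here is a minimal sketch (in Python) of the standard normal-approximation sample-size formula for a one-sided test of a proportion. The calculator's exact method may differ, and the alternative proportion of 0.90 below is my own illustrative assumption: 'more than 85%' must be pinned to a specific value before power can be computed.

```python
# Sketch of the sample-size calculation behind the question above, using the
# normal approximation to the binomial for a one-sided test of a proportion.
# Assumption (mine, not the calculator's): the alternative proportion is 0.90.
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_one_sided(p0, p1, alpha, power):
    """Smallest n (normal approximation) to test H0: pi <= p0 vs H1: pi > p0
    with Type I error rate alpha and the stated power when pi = p1."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # critical z for the test
    z_beta = NormalDist().inv_cdf(power)        # z matching the target power
    n = ((z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1)))
         / (p1 - p0)) ** 2
    return ceil(n)

# The worked example: pi0 = 0.85, alpha = 0.05, power = 0.90 at pi = 0.90.
print(sample_size_one_sided(0.85, 0.90, alpha=0.05, power=0.90))  # about 378
```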
To view the calculator, click the button below.
Questionnaire design and analysis

When people think of doing a survey like the one described above, they usually make up some type of satisfaction scale (say, from 1 to 6) and make one end of the scale (eg the 6) 'extremely satisfied' and the other end (eg the 1) 'extremely dissatisfied'; 5 and 2 are 'very satisfied' and 'very dissatisfied', and so on, as shown below. The intent is to provide a few split points in the continuum of opinion. I prefer to force people to make a choice one way or the other, so I don't give a 'fence-sitting' middle point.
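As a concrete illustration of the scoring, this small sketch dichotomises responses on such a forced six-point scale at 'very satisfied' (5) or above, which is the split the sampling question earlier is really asking about. The response data are invented for illustration.

```python
# Score a forced six-point satisfaction scale (no middle 'fence-sitting' point)
# by splitting at 'very satisfied' (5) or above. Responses here are made up.
responses = [6, 5, 3, 5, 6, 4, 5, 2, 6, 5]      # one rating per respondent

labels = {1: 'extremely dissatisfied', 2: 'very dissatisfied',
          3: 'dissatisfied', 4: 'satisfied',
          5: 'very satisfied', 6: 'extremely satisfied'}

satisfied = sum(1 for r in responses if r >= 5)  # 'very satisfied' or more
p = satisfied / len(responses)
print(f"{satisfied}/{len(responses)} rated {labels[5]!r} or higher (p = {p:.2f})")
```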
Sample Size Calculation in Experimental Design

Sampling Theory. In most situations in statistical analysis, we do not have access to an entire statistical population of interest, either because the population is too large, is not willing to be measured, or the measurement process is too expensive or time-consuming to allow more than a small segment of the population to be observed. As a result, we often make important decisions about a statistical population on the basis of a relatively small amount of sample data.

Typically, we take a sample and compute a quantity called a statistic in order to estimate some characteristic of a population called a parameter. For example, suppose a services manager at Telstra is interested in the proportion of Telstra customers who are currently 'very satisfied' or more with Telstra's level of service on a particular issue. Telstra's customer base is 1,500,000 in that state. In this case, the parameter of interest, which we might call π, is the proportion of customers in the entire population of Telstra customers in that state who are 'very satisfied' or more. The services manager is going to commission an opinion poll, in which a (hopefully) random sample of people will be asked whether or not they are satisfied with Telstra's service. The number (call it N) of people to be polled will be quite small relative to the size of the population. Once these people have been polled, the proportion of them who rate Telstra's service as 'very satisfied' or higher will be computed. This proportion, which is a statistic, can be called p.

One thing is virtually certain before the study is ever performed: p will not be equal to π! Because p involves "the luck of the draw," it will deviate from π. The amount by which p is wrong, i.e., the amount by which it deviates from π, is called sampling error. In any one sample it is virtually certain there will be some sampling error (except in some highly unusual circumstances), and we will never be certain exactly how large this error is. If we knew the amount of the sampling error, this would imply that we also knew the exact value of the parameter, in which case we would not need to be doing the opinion poll in the first place.

In general, the larger the sample size N, the smaller sampling error tends to be. (One can never be sure what will happen in a particular experiment, of course.) If we are to make accurate decisions about a parameter like π, we need an N large enough that sampling error will tend to be "reasonably small." If N is too small, there is not much point in gathering the data, because the results will tend to be too imprecise to be of much use. On the other hand, there is also a point of diminishing returns beyond which increasing N provides little benefit. Once N is "large enough" to produce a reasonable level of accuracy, making it larger simply wastes time and money. So some key decisions in planning any experiment are, "How precise will my parameter estimates tend to be if I select a particular sample size?" and "How big a sample do I need to attain a desirable level of precision?" The Sample Size Calculator above provides you with the statistical methods to answer these questions quickly, easily, and accurately.
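The effect of N on sampling error is easy to see by simulation. The sketch below (an illustration of the idea, not part of the calculator) draws repeated polls from a population whose true proportion π is fixed at 0.85, a value assumed here for continuity with the example, and shows p clustering more tightly around π as N grows, with clearly diminishing returns.

```python
# Simulate sampling error: draw repeated polls of increasing size N from a
# population with a known proportion PI, and watch |p - PI| shrink with N.
# PI and the sample sizes are illustrative assumptions.
import random

random.seed(1)
PI = 0.85                                   # the (normally unknown) parameter

for n in (50, 500, 5000):
    # Average sampling error |p - PI| over 500 simulated polls of size n.
    errors = [abs(sum(random.random() < PI for _ in range(n)) / n - PI)
              for _ in range(500)]
    print(f"N = {n:5d}: average sampling error = {sum(errors)/len(errors):.4f}")
```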
Hypothesis Testing. Suppose the services manager wants to show that more than 85% of customers are 'very satisfied' or more. Her question, in statistical terms, is: "Is π > .85?" In statistics, the following strategy is quite common. State as a "statistical null hypothesis" something that is the logical opposite of what you believe. Call this hypothesis H0. Gather data. Then, using statistical theory, show from the data that it is likely H0 is false, and should be rejected. By rejecting H0, you support what you actually believe. This kind of situation, which is typical in many fields of research, is called "Reject-Support testing" (RS testing), because rejecting the null hypothesis supports the experimenter's theory.

The null hypothesis is either true or false, and the statistical decision process is set up so that there are no "ties": the null hypothesis is either rejected or not rejected. Consequently, before undertaking the experiment, we can be certain that only four possible things can happen. These are summarized in the table below.

                        H0 is true           H0 is false
  Do not reject H0      Correct decision     Type II error (β)
  Reject H0             Type I error (α)     Correct decision
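To make the decision process concrete, the sketch below derives an exact-binomial rejection rule for the Telstra example. The sample size N = 378 is carried over from the earlier sketch and is an assumption, not a value taken from the calculator.

```python
# Reject-Support decision rule for H0: pi <= 0.85 vs H1: pi > 0.85.
# Find the smallest count c with P(X >= c | pi = 0.85) <= 0.05; rejecting H0
# whenever the observed count reaches c keeps the Type I error rate at or
# below .05. N = 378 is assumed from the earlier sample-size sketch.
from math import comb

def upper_tail(n, p, c):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

N, PI0, ALPHA = 378, 0.85, 0.05
c = next(c for c in range(N + 1) if upper_tail(N, PI0, c) <= ALPHA)
print(f"Reject H0 if at least {c} of {N} respondents are 'very satisfied'")
print(f"Actual Type I error rate: {upper_tail(N, PI0, c):.4f}")
```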
Note that there are two kinds of errors represented in the table. A Type I error represents, in a sense, a "false positive" for the researcher's theory. From society's standpoint, such false positives are particularly undesirable. They result in much wasted effort, especially when the false positive is interesting from a theoretical or political standpoint (or both) and as a result stimulates a substantial amount of research. Such follow-up research will usually not replicate the (incorrect) original work, and much confusion and frustration will result.

A Type II error is a tragedy from the researcher's standpoint, because a theory that is true is, by mistake, not confirmed. So, for example, if a drug designed to improve a medical condition is found (incorrectly) not to produce an improvement relative to a control group, a worthwhile therapy will be lost, at least temporarily, and an experimenter's worthwhile idea will be discounted. (In our example, the research might fail to identify that more than 85% of Telstra's customers were satisfied even though they were.)

Many statistics textbooks present a point of view that is common in the social sciences: that α, the Type I error rate, must be kept at or below .05, and that, if at all possible, β, the Type II error rate, must be kept low as well. "Statistical power," which is equal to 1 - β, must be kept correspondingly high. Ideally, power should be at least .80 to detect a reasonable departure from the null hypothesis. The conventions are, of course, much more rigid with respect to α than with respect to β. For example, in the social sciences α is seldom, if ever, allowed to stray above the magical .05 mark, and the statistically well-informed researcher makes it a top priority to keep α low. Ultimately, of course, everyone benefits if both error probabilities are kept low, but unfortunately there is often, in practice, a trade-off between the two types of error, as the sketch below illustrates.
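The trade-off is easy to demonstrate with the same exact-binomial machinery: holding N fixed and sliding the rejection threshold c shows α and β moving in opposite directions. As before, N = 378 and the alternative π = 0.90 are illustrative assumptions of mine, not outputs of the calculator.

```python
# The alpha-beta trade-off for the Telstra example: with N fixed, lowering the
# rejection threshold c shrinks beta (missed true effects) but inflates alpha
# (false positives), and vice versa. N and the alternative PI1 are assumptions.
from math import comb

def upper_tail(n, p, c):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

N, PI0, PI1 = 378, 0.85, 0.90
print(" c    alpha    beta   power")
for c in range(328, 339, 2):
    alpha = upper_tail(N, PI0, c)      # Type I error rate under H0 (pi = 0.85)
    power = upper_tail(N, PI1, c)      # power at the alternative pi = 0.90
    print(f"{c}  {alpha:.4f}  {1 - power:.4f}  {power:.4f}")
```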
References

StatSoft Electronic Textbook (for power analysis and sample size calculation).
Cooke, Craven & Clarke (1981), Basic Statistical Computing (for the binomial and normal probability and reverse normal algorithms).
Abramowitz & Stegun (1972), Handbook of Mathematical Functions.
Conover (1971), Practical Nonparametric Statistics.

Copyright © 2000- netgm pty ltd. All rights reserved.