Part IV: Sampling
…Upon the supposition of a certain determinate law according to which any event is to happen, we demonstrate that the ratio of the happenings will continually approach to that law as the experiments or observations are multiplied. Conversely, if from numberless observations we find the ratio of events to converge to a determinate quantity, …then we conclude that this ratio represents the determinate law according to which the event is to happen.
- Abraham de Moivre (1667-1754).
Definitions
Population: The entire group of objects about which information is wanted.
Parameter: A numerical characteristic of the population. It is a fixed number, but we usually do not know its value.
Unit: Any individual member of the population.
Sample: A part or subset of the population used to gain information about the whole.
Census: A sample consisting of the entire population.
Sampling Frame: The list of units from which the sample is chosen.
Variable: A characteristic of a unit, to be measured for those units in the sample.
Statistic: A numerical characteristic of the sample. The value of a statistic is known when we have taken a sample, but it changes from sample to sample.
Random Sampling
A simple random sample of size n is a sample of n units chosen in such a way that every collection of n units from the sampling frame has the same chance of being chosen. This is done using one of the following:

Physical Mixing

Random Number Tables

Random Number Generator (in Excel: Data tab – Data Analysis group – Random Number Generation)
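As an illustration of the random-number-generator approach, here is a minimal Python sketch. The sampling frame of unit IDs 1 to 500 and the function name simple_random_sample are hypothetical, chosen only for the example; the standard library's random.sample draws without replacement so that every subset of size n is equally likely.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame; every subset of size n is equally likely."""
    rng = random.Random(seed)  # seed is optional, only for reproducibility
    return rng.sample(list(frame), n)

# Hypothetical sampling frame: unit IDs 1..500 (an assumption for illustration)
frame = range(1, 501)
sample = simple_random_sample(frame, 10, seed=1)
print(sample)
```

Passing a seed makes the draw repeatable, which is convenient for checking work; omit it for a fresh sample each time.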
Sampling Distributions
Example: You have a box containing a large number of round beads, identical except for color. These beads are a population. The proportion of black beads in the box is p = 0.20. This number is a parameter (assume that I know the rest are white).
Say you reach in and take out 25 beads at a time. Assume this is a random sample of size 25 from the population, so each bead has an equal chance to be picked.

How many black beads do you expect to appear in the sample?

If you take many samples, do you expect ever to find a sample with 25 black beads? One with no black beads? One with as many as 15 black beads?

Is the sample proportion from one sample a good estimate of the population proportion? How likely is it to be wildly different from p?
You might expect that about 20% of your sample should be black, that is, about 5 black beads out of the 25. But you will not always get exactly 5 black beads. If you get 4 black beads, then your statistic (the sample proportion, 4/25 = 0.16) is still a good estimate of the parameter p = 0.20. But if you draw a sample with 15 black beads, then the sample proportion is 15/25 = 0.60 (a relatively bad estimate of p). How often will you get such poor estimates from a random sample?
Say you performed this experiment 200 times and recorded the results in the table below.
# of black beads in sample    Sample proportion    # of such samples    Fraction of such samples
0                             0.00                 3                    0.015
1                             0.04                 8                    0.040
2                             0.08                 12                   0.060
3                             0.12                 34                   0.170
4                             0.16                 40                   0.200
5                             0.20                 47                   0.235
6                             0.24                 24                   0.120
7                             0.28                 20                   0.100
8                             0.32                 9                    0.045
9                             0.36                 3                    0.015
The results are shown in this histogram:
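The bead experiment can also be simulated. The sketch below assumes the box is large enough that each of the 25 draws is effectively an independent trial with probability 0.20 of being black (a binomial model, which is an assumption rather than something stated above). It tallies 200 simulated samples as a text histogram and then computes the exact binomial probability of drawing 15 or more black beads. The counts will differ from the recorded table, since each run of the experiment is random.

```python
import random
from collections import Counter
from math import comb

P_BLACK = 0.20      # population proportion of black beads (the parameter p)
SAMPLE_SIZE = 25    # beads per sample
NUM_SAMPLES = 200   # number of repeated samples

def count_black(rng):
    """Number of black beads in one simulated sample (independent draws assumed)."""
    return sum(rng.random() < P_BLACK for _ in range(SAMPLE_SIZE))

rng = random.Random(0)
tally = Counter(count_black(rng) for _ in range(NUM_SAMPLES))

# Text histogram: one star per sample with that many black beads
for k in sorted(tally):
    print(f"{k:2d} black  proportion={k / SAMPLE_SIZE:.2f}  {tally[k]:3d}  {'*' * tally[k]}")

def binom_pmf(k, n=SAMPLE_SIZE, p=P_BLACK):
    """Binomial probability of exactly k black beads in n draws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

prob_15_or_more = sum(binom_pmf(k) for k in range(15, SAMPLE_SIZE + 1))
print(f"P(at least 15 black beads) = {prob_15_or_more:.2e}")
```

Under this model, samples as extreme as 15 black beads are possible but very rare, which is why none appear among the 200 recorded samples.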
Errors
Random Sampling Errors: these are the deviations between sample statistic and population parameter caused by chance in selecting a random sample.
Nonrandom Sampling Errors: These arise from improper sampling, and can lead to bias: the consistent, repeated divergence of the sample statistic from the population parameter in the same direction. Some examples are:
Convenience Sampling: Selection of whichever units of the population are easily accessible.
Voluntary Response: A common form of convenience sampling. The sample is chosen from those who respond to questions asked by mail or who call in during television broadcasts.
Nonsampling Errors: Examples include missing data, response errors, processing errors and the effect of the data-collection procedure.
Stratified Sampling
Consider the following situation.
Example: A university would like to know the attitudes of its students on the issue of whether a women's studies department should be opened. The university has 30,000 students, of whom 25,000 are men and 5,000 are women. A common method of sampling for this situation, stratified random sampling, works as follows:

Divide the sampling frame into groups of units, called strata. The strata are chosen because we have a special interest in these groups within the population or because the units in each stratum resemble each other.

Take a separate simple random sample in each stratum and combine these to make up the stratified random sample.
If we took a random sample of 200 students, we’d expect only about (200)(0.16667) = 33 to be women. To gain a more accurate picture of each group, we could take a random sample of 100 from each group. Suppose we do this and record the responses to the question: “Do you favor creation of a new degree program in Women's Studies?” In response, 74 women and 42 men say “Yes”. We estimate that 74/100 = 74% of women favor the new program and 42/100 = 42% of men favor it. To estimate the proportion of all students who favor the program, first we estimate how many students are in favor, as follows:
(0.74)(5,000) = 3,700 women, and
(0.42)(25,000) = 10,500 men.
Then, we estimate that 3,700 + 10,500 = 14,200 of the 30,000 students are in favor; that is
14,200/30,000 = 47.3%.
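The same weighting can be written as a short Python sketch. The function name stratified_proportion and the dictionary layout are hypothetical, chosen for illustration; the numbers are the ones from the example above.

```python
def stratified_proportion(strata):
    """Combine per-stratum sample proportions, weighting each by its stratum's population size.

    strata maps a stratum name to (population size, sample proportion in favor).
    """
    total_units = sum(size for size, _ in strata.values())
    estimated_yes = sum(size * proportion for size, proportion in strata.values())
    return estimated_yes / total_units

strata = {
    "women": (5_000, 74 / 100),   # 74 of 100 sampled women said "Yes"
    "men": (25_000, 42 / 100),    # 42 of 100 sampled men said "Yes"
}
overall = stratified_proportion(strata)
print(f"Estimated share in favor: {overall:.1%}")
```

Note that simply pooling the two samples (116 "Yes" out of 200, or 58%) would overstate support, because women make up half of the combined sample but only one sixth of the population.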
Methodology

Get a random sample.

Analyze the sample, i.e., calculate the mean, variance, etc.

Make inferences about the population from the sample.

Give (probabilistic) measures of how good the inference is.
Sampling Distribution of the Mean
There are many occasions in business when we want to know the mean of some random variable, but we find it too expensive or time-consuming to conduct an exhaustive census of all N members of the population. In this section we study how to make some useful inferences about a population mean from sample data.
We are interested in the population mean \(\mu\), but since we are using sample data, there will be some uncertainty about our estimate for \(\mu\). Therefore, we need both a measure of central tendency and a measure of dispersion to describe the behavior of sample means.
To estimate \(\mu\), we will use the sample mean \(\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i\). Why? Because:
\[ E(\bar{X}) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}(n\mu) = \mu. \]
Note: \(\bar{X}\) is a known quantity (as we've already taken our sample), but before the sampling, its value was unknown, hence random (as it depends on what sample we get). Whatever number we happen to see, its expected value is \(\mu\). This suggests \(\bar{X}\) is a “good” guess or estimate for \(\mu\).
We refer to this property (when an estimator’s expected error is zero) as “unbiased”; it is a desirable attribute of an estimator.
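If the \(X_i\) are independent and normally distributed with mean \(\mu\) and standard deviation \(\sigma\), then \(E(\bar{X}) = \mu\) and
\[ \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}. \]
Hence, \(\bar{X}\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\). The term \(\sigma/\sqrt{n}\) is often called the standard error of the mean, or simply the standard error.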
If an estimator has a low variance (another desirable attribute), we say that it is “efficient”. When estimating population means with sample means, we can make them more efficient by increasing the sample size.
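The effect of sample size on the spread of sample means can be checked by simulation. This is a rough sketch under stated assumptions: the population is taken to be normal with mean 0 and standard deviation 1, and the function name sd_of_sample_means and the choice of 2,000 replications are arbitrary illustrations.

```python
import random
import statistics

def sd_of_sample_means(n, mu=0.0, sigma=1.0, replications=2000, seed=0):
    """Standard deviation of the sample mean across many simulated samples of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(replications)
    ]
    return statistics.stdev(means)

# Compare the simulated spread with the theoretical standard error sigma / sqrt(n)
for n in (4, 16, 64):
    simulated = sd_of_sample_means(n)
    theoretical = 1.0 / n ** 0.5
    print(f"n={n:3d}  simulated={simulated:.3f}  theoretical={theoretical:.3f}")
```

Quadrupling the sample size should roughly halve the spread of the sample means, in line with the standard error formula.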
Example: Assume a bank has a large number of personal savings accounts. The mean amount in the accounts is $5,000 and the standard deviation is $1,000. For the sake of argument, let's call someone wealthy if they have more than $6,000 in their account. Then there is about a 16% chance that someone with an account at this bank is wealthy (to see this, calculate P(X > 6000) = P(Z > 1) = 16%).
Now consider picking a set of 5 accounts and taking an average of the amounts. What do you expect your sample average or mean to be? You expect it to be about $5,000. Now how likely is it that your average is on the wealthy side (\(\bar{X}\) > $6,000)?
It seems reasonable to think that the chances that the average is more than $6,000 are less than the chances that any individual account is more than $6,000. In fact, in order for an average of 5 accounts to be on the wealthy side you would probably need at least 2 or 3 of the accounts to be wealthy and that is not too likely. The fact is that the average of 5 accounts has a standard error of $1,000/\(\sqrt{5}\) = $447. Therefore $6,000 is 2.24 (that's [6000 - 5000]/447) standard errors above the mean. Hence, the chance that an average of 5 accounts comes up wealthy is
P(\(\bar{X}\) > 6000) = P(Z > 2.24) = 0.5 - 0.4875 = 0.0125 = 1.25%.
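The two probabilities in the bank example can be checked with Python's standard library. This sketch assumes, as the example does, that account balances are normally distributed; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 5000, 1000   # population mean and standard deviation of balances
n = 5                    # number of accounts averaged
threshold = 6000         # "wealthy" cutoff

# Chance that a single account exceeds the threshold
p_single = 1 - NormalDist(mu, sigma).cdf(threshold)

# Chance that the average of n accounts exceeds the threshold
standard_error = sigma / sqrt(n)
p_mean = 1 - NormalDist(mu, standard_error).cdf(threshold)

print(f"standard error = {standard_error:.0f}")
print(f"P(X > 6000)    = {p_single:.4f}")
print(f"P(Xbar > 6000) = {p_mean:.4f}")
```

The computed value for the average is slightly above the 1.25% obtained from the table, because the table lookup rounds the z-value to 2.24.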
If the \(X_i\)'s are not normally distributed, but n is large (\(n \ge 30\)), then using the CLT we can make the same assertion: \(\bar{X}\) is (approximately) normally distributed with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\).
Note: It is clearly unrealistic to assume, as we are doing, that we will be making inferences from a sample when the population parameters are known. In reality we will have sample data, and use it to make inferences about an unknown population. However, we can learn a great deal by studying the kinds of samples that result when the population is known.
Example: A marketing firm would like to know the number of hours per week that teenagers spend watching television in a particular city. Let the unknown mean be \(\mu\), that is, the (true) number of hours per week. Let N = size of population = total number of teenagers in the city. This is probably too large a population to perform a census.
We will sample (randomly) n members out of N (population size). We might telephone a sample of homes and collect data. Our data would look like X_{1}, X_{2}, ..., X_{n} where X_{i} = hours per week that teenager i reports having watched TV.
Suppose that the number of hours watched per week (by the teenagers of the city) is normally distributed with mean \(\mu = 15\) and standard deviation \(\sigma = 5\). What is the probability that, with a sample of 40 teenagers, we will find \(\bar{X}\) less than 14 (thereby underestimating the amount of time spent watching TV)?
Example: If the hours in fact average \(\mu = 16\) (with \(\sigma = 5\)), what is the probability that we find a sample mean greater than 16.5?
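Both television questions can be worked through with the sampling distribution of the mean. The sketch below is one way to do so in Python; it assumes the second example uses the same sample size of 40 as the first, which the text does not state explicitly. The helper name sample_mean_distribution is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_distribution(mu, sigma, n):
    """Normal distribution of the sample mean: mean mu, standard error sigma / sqrt(n)."""
    return NormalDist(mu, sigma / sqrt(n))

n = 40  # sample size (assumed to apply to both examples)

# First example: mu = 15, sigma = 5, probability the sample mean falls below 14
p_below_14 = sample_mean_distribution(15, 5, n).cdf(14)

# Second example: mu = 16, sigma = 5, probability the sample mean exceeds 16.5
p_above_16_5 = 1 - sample_mean_distribution(16, 5, n).cdf(16.5)

print(f"P(Xbar < 14)   = {p_below_14:.3f}")
print(f"P(Xbar > 16.5) = {p_above_16_5:.3f}")
```

In both cases the standard error is 5/\(\sqrt{40}\), about 0.79 hours, so a sample mean one hour below the true mean is a little more than one standard error away.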
Sampling Distribution of a Proportion
Example: Polling: Let N = size of population = total number of people who vote, and let n = number of people polled. Let p = actual proportion of people who vote for Obama. Let 1 - p = actual proportion of people who vote for Clinton. So, p (before the election) is an unknown population parameter that we would like to know.
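Collect data X_{1}, X_{2}, ..., X_{n} where X_{i} = preference of i^{th} person in the sample, either Obama or Clinton. It is easiest to let
\[ X_i = \begin{cases} 1 & \text{if the } i^{th} \text{ person prefers Obama} \\ 0 & \text{if the } i^{th} \text{ person prefers Clinton} \end{cases} \]
Therefore, the \(X_i\)'s are all Bernoulli random variables. Assuming our sample is random, then we know the distribution of \(\sum_{i=1}^{n} X_i\). It is Binomial with n trials and probability of success p. The problem is we do not know the value of p since this is what we are trying to estimate.
Recall: If each \(X_i\) is Bernoulli (with p as the probability) then \(\sum_{i=1}^{n} X_i\) is Binomial with n trials and probability p. We then know (from the binomial) that
\[ E\left(\sum_{i=1}^{n} X_i\right) = np \]
and that
\[ \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = np(1-p). \]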
Now, suppose that an independent sequence of binomial experiments is performed, in which there is a fixed number of trials (n) in each experiment. On each trial either event A occurs or does not occur, so we can look at the sequence of experiments as being independent trials with probability of success p = P(A).
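Let \(X_1, X_2, \ldots, X_n\) be the corresponding number of successes (0 or 1) on each trial of an experiment. The expected number of successes on any one trial is
\[ E(X_i) = p. \]
And the expected number of successes in any given experiment (a set of n trials) is:
\[ E\left(\sum_{i=1}^{n} X_i\right) = np. \]
Now, recall from the section on the sampling distribution of the mean that
\[ \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}. \]
For proportions, where each \(X_i\) is Bernoulli with variance \(\sigma^2 = p(1-p)\), this can be rewritten as
\[ \mathrm{Var}(\bar{X}) = \frac{p(1-p)}{n}. \]
This means that the sampling standard deviation of a proportion is given by:
\[ \sigma_{\bar{X}} = \sqrt{\frac{p(1-p)}{n}}. \]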
In the case of proportions, we sometimes write \(\hat{p}\) to designate \(\bar{X}\). (They are both used to represent the sample proportion, but we like to highlight when we are talking about a proportion and not just any old random variable.) So
\(E(\hat{p}) = p\) and \(\sigma_{\hat{p}} = \sqrt{p(1-p)/n}\).
We can see that on average (from sample to sample), \(\hat{p}\) will vary from the actual p with a standard deviation of \(\sqrt{p(1-p)/n}\).
The question remains: what is the distribution of \(\hat{p}\)? The Central Limit Theorem can be invoked here (as long as \(n \ge 30\)). We know, therefore, that \(\hat{p}\) is approximately normally distributed with mean p and standard error \(\sqrt{p(1-p)/n}\).
Example: If the true proportion of people voting for Obama is 38% (that's p), what is the probability of randomly sampling 100 people and getting a proportion of at least 44% for Obama? This means, given that the true p is 0.38, what is the chance of getting a sample of 100 people with at least 44 for Obama?
What about if we ask 1000 people?
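The polling questions can be answered with the normal approximation to the sampling distribution of the sample proportion. The sketch below uses that approximation without a continuity correction, so the results are approximate; the function name prob_sample_proportion_at_least is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_proportion_at_least(p, n, cutoff):
    """Approximate P(sample proportion >= cutoff) using a normal with mean p and SE sqrt(p(1-p)/n)."""
    standard_error = sqrt(p * (1 - p) / n)
    return 1 - NormalDist(p, standard_error).cdf(cutoff)

p_true = 0.38   # true proportion voting for Obama
cutoff = 0.44   # sample proportion we ask about

prob_n100 = prob_sample_proportion_at_least(p_true, 100, cutoff)
prob_n1000 = prob_sample_proportion_at_least(p_true, 1000, cutoff)

print(f"n = 100:  P(p_hat >= 0.44) = {prob_n100:.3f}")
print(f"n = 1000: P(p_hat >= 0.44) = {prob_n1000:.6f}")
```

With 100 people the standard error is about 0.049, so 0.44 is only a little more than one standard error above 0.38. With 1,000 people the standard error shrinks to about 0.015, and the same sample proportion becomes extremely unlikely.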
To recap:

If the \(X_i\)'s are normally distributed (or \(n \ge 30\)), then \(\bar{X}\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\).

If, for a proportion, \(n \ge 30\) then \(\hat{p}\) is normally distributed with mean p and standard deviation \(\sqrt{p(1-p)/n}\).
Confidence Intervals
Given \(\bar{X}\) (or \(\hat{p}\)), is there an error term we can associate with this value? That is, is there an interval we can say contains with high probability the true value of \(\mu\) or p?
For example, if we find 520 voters out of 1000 for Obama, is it possible to say with a high degree of certainty that the true percentage of voters who will vote for Obama is 52% plus or minus 2%? That is, his support is almost sure to be between 50% and 54%?
This interval (50%, 54%), or 52% \(\pm\) 2%, is a confidence interval. The confidence level (usually denoted \(1 - \alpha\)) is the probability that an interval constructed in this way contains the true population parameter.
Usually, \(\alpha\) (alpha) is chosen to be either 10, 5 or 1 percent, so the confidence levels used are usually either 90%, 95% or 99%.
For Normal Distributions or Large Samples
Usage: When the underlying distribution is normal with a known standard deviation, or when the sample is “large”, that is, \(n \ge 30\).
Example: A market researcher wants to estimate the mean number of years of school completed by residents of a particular neighborhood. A simple random sample of 90 residents is taken, the mean number of years of school completed being 8.4 and the sample standard deviation being 1.8.
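Let's try to make probabilistic assessments of our estimate of \(\mu\) (the true average number of years of school completed). Clearly our best estimate of \(\mu\) at this point is \(\bar{X}\) = 8.4 years. But to account for variability in our sample, we will give a confidence interval for \(\mu\). Since we have a large sample, according to the CLT, \(Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\) is (approximately) a standard normal random variable (if we had a small sample then we would need to assume that the original \(X_i\)'s were normal). Therefore:
\[ P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95. \]
Rearranging this, we get:
\[ P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95. \]
If the standard deviation \(\sigma\) is known, then we can calculate this confidence interval. However, in real life the standard deviation is usually unknown. If we have a large sample then s (the sample's standard deviation) is very close to \(\sigma\) anyway, so we can use s (a known quantity) instead of \(\sigma\) (unknown). In this case, a 95% confidence interval for \(\mu\) is
\[ \bar{X} \pm 1.96\frac{s}{\sqrt{n}}. \]
Plugging in the numbers, we get the following interval:
\[ 8.4 \pm 1.96\frac{1.8}{\sqrt{90}} = 8.4 \pm 1.96(0.1897) = 8.4 \pm 0.372. \]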
We can say with 95% confidence that the actual mean number of years is 8.4 plus or minus 0.372 years, or between 8.028 and 8.772 years.
We have derived an interval (based on a random quantity \(\bar{X}\)) which should contain the unknown \(\mu\) with probability 95%. This is our 95% confidence interval. How would you construct a 90% confidence interval? An 80% confidence interval?
Example: What is a 90% confidence interval for the average number of years? Do you expect this interval to be wider or narrower than the 95% confidence interval? Again s = 1.8, n = 90 and \(\bar{X}\) = 8.4. For 90% confidence we need to replace the 1.96 with 1.645:
\[ 8.4 \pm 1.645\frac{1.8}{\sqrt{90}} = 8.4 \pm 1.645(0.1897) = 8.4 \pm 0.312. \]
Therefore we are 90% confident that the true \(\mu\) is between 8.088 and 8.712.
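The large-sample interval for a mean can be computed for any confidence level by looking up the appropriate z-value. The following is a minimal sketch using Python's standard library; the function name mean_confidence_interval is hypothetical, and it assumes a large sample so that s can stand in for the unknown population standard deviation.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(x_bar, s, n, confidence=0.95):
    """Large-sample confidence interval for a mean: x_bar +/- z * s / sqrt(n)."""
    # z-value leaving (1 - confidence) / 2 in each tail, e.g. about 1.96 for 95%
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Years-of-school example: x_bar = 8.4, s = 1.8, n = 90
for level in (0.95, 0.90, 0.80):
    low, high = mean_confidence_interval(8.4, 1.8, 90, confidence=level)
    print(f"{level:.0%} interval: ({low:.3f}, {high:.3f})")
```

Lowering the confidence level shrinks the z-value, so the 90% and 80% intervals are progressively narrower than the 95% interval.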
For Proportions
Usage: When the underlying distribution is Binomial with unknown p, and when the sample is large (\(n \ge 30\)).
Recall \(\hat{p}\) = sample proportion, and p = population proportion. By the Central Limit Theorem, \(\hat{p}\) is (approximately) normal with mean p and standard deviation \(\sqrt{p(1-p)/n}\).
To build a 95% confidence interval for p, we might try to mimic what we've done before, i.e., a 95% confidence interval for p is as follows:
\[ \hat{p} \pm 1.96\sqrt{\frac{p(1-p)}{n}}. \]
What is wrong with this? The error term contains p, and p is what we are trying to find!
Simple solution: (But it is approximate.) In the error term, use \(\hat{p}\) instead of p, i.e., approximate the standard deviation of \(\hat{p}\) by \(\sqrt{\hat{p}(1-\hat{p})/n}\). An approximate 95% confidence interval for the true p is then:
\[ \hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \]
Example: p = proportion of voters who would vote for Obama. We ask 300 people and find \(\hat{p}\) = 0.44 (that is, 132 people for Obama). So a 95% confidence interval for p is:
\[ 0.44 \pm 1.96\sqrt{\frac{(0.44)(0.56)}{300}} = 0.44 \pm 1.96(0.0287) = 0.44 \pm 0.056 = (0.384, 0.496). \]

A report on this poll might say “Obama's percentage support is 44% with a margin of error of plus or minus 5.6%.”
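The approximate interval for a proportion follows the same pattern as the interval for a mean. Here is a small Python sketch of the calculation; the function name proportion_confidence_interval is hypothetical, and the method is the approximation described above, with the sample proportion substituted for p in the error term.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Approximate large-sample interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin

# Poll example: 132 of 300 people for Obama
p_hat, margin = proportion_confidence_interval(132, 300)
print(f"p_hat = {p_hat:.2f}, margin of error = {margin:.3f}")
print(f"95% interval: ({p_hat - margin:.3f}, {p_hat + margin:.3f})")
```

The margin of error depends on the sample size through the square root, so cutting it in half would require roughly four times as many respondents.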
Normal Distributions with Small Samples
