Confidence Intervals
A graphic of a Normal distribution with several areas (probabilities) marked.
Last chapter, we covered the Central Limit Theorem and sampling distributions. I noted that these two topics were extremely important for the future of the course. Here, you see why.
Example 0
Let us return to the ignition switches example from the last part. We were given that the lengths of these switches have mean μ = 10.45 cm and standard deviation σ = 0.02 cm. Now, I draw a random sample of size n = 100 from a pile of these switches, measuring the length of each. According to the Central Limit Theorem (CLT), we know that the distribution of the sample mean is
$$ \overline{X} \sim N \left(\mu=10.45;\ \sigma=\frac{0.02}{\sqrt{100}} \right) $$
Note that the mean of the sample means is the same as that of the population. Also note that the standard deviation of the sample means is that of the population divided by the square root of the sample size, $\sigma/\sqrt{n}$. This is what the CLT says.
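To make the CLT statement concrete, here is a minimal Python sketch that builds the sampling distribution of the sample mean from the values given in Example 0. The variable names are my own, and reporting the central 95% range of the sample mean is an illustrative choice, not something the example requires.

```python
import math
from statistics import NormalDist

mu = 10.45      # population mean (cm), as given in Example 0
sigma = 0.02    # population standard deviation (cm), as given
n = 100         # sample size

# Standard deviation of the sample mean: sigma / sqrt(n)
standard_error = sigma / math.sqrt(n)

# Sampling distribution of the sample mean according to the CLT
sampling_dist = NormalDist(mu=mu, sigma=standard_error)

# Central 95% range of the sample mean (illustrative choice of level)
low = sampling_dist.inv_cdf(0.025)
high = sampling_dist.inv_cdf(0.975)

print(f"standard error = {standard_error:.4f} cm")
print(f"95% of sample means fall between {low:.5f} and {high:.5f} cm")
```

Notice how narrow that range is compared with the spread of individual switches: averaging 100 measurements shrinks the standard deviation by a factor of 10.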
That was last chapter. Recall that I mentioned that this was rather backwards. Note that μ and σ are both parameters of the population. The reality is that we never know these values. In statistics, we work with samples to draw conclusions about the population. That is the key to understanding the purpose of statistics. We have a sample. We want to learn about the population.
Towards Reality
We have a sample. We want to learn about the population.
In reality, we do not know μ, but we would like an estimate of it. We would also like to know how precise our estimate is. Our estimate of μ is called the point estimate, and the precision is indicated by the margin of error. A point estimate for the population mean is the sample mean. (A point estimate for the population proportion is the sample proportion.) The margin of error formula should look familiar (see last week for the connection, specifically the z-score). If $E$ is the margin of error, then
$$ E = Z \frac{\sigma}{\sqrt{n}} $$
In this formula, $\sigma$ is the standard deviation of the population, $n$ is the sample size from which you are estimating the population mean, and $Z$ is a scaling factor that depends on the distribution (standard Normal) and the level of certainty required ($1-\alpha$). The confidence interval for the population mean, $\mu$, is just
$$ \bar{x} \pm E $$
The computer can take care of calculating the interval. You need to be able to make it do so, and you need to be able to interpret the interval. For the following box, let us restrict ourselves to the 95% confidence interval. This interval is very common in research. A 95% confidence interval is the dual of an $\alpha=0.05$ Type I Error rate, which will be discussed next week.
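The scaling factor $Z$ in the margin of error depends only on the confidence level $1-\alpha$. As a rough sketch of how the computer finds it, the following Python uses the standard library's `NormalDist`. The helper name `z_critical` and the choice of levels to print are my own assumptions for illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value Z for a given confidence level (1 - alpha)."""
    alpha = 1 - confidence
    # Leave alpha/2 in each tail, so look up the (1 - alpha/2) quantile
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: Z = {z_critical(level):.3f}")
```

A higher confidence level gives a larger $Z$, and thus a wider interval for the same $\sigma$ and $n$.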
What is a confidence interval?
First, what it is not.
The 95% confidence interval is not the interval in which the true population mean has a 95% probability of being.
The main reason that the confidence interval is not a probability is that the above wording implies that the population mean varies. It does not. The population mean is a fixed number that we do not know, but would like to learn about.
Now for what it is.
The 95% confidence interval is the interval which contains the true population mean 95% of the time, when the experiment (sampling) is repeated.
This means that if we repeatedly sample and calculate the confidence interval, approximately 95% of those intervals will contain the population mean.
The difference is subtle and technical (and not too important, I think). Without discussion of statistical theory, the difference will make little sense. Just be able to repeat the difference. The big difference is that saying the confidence interval contains the population mean with a certain probability suggests that the population mean varies. It does not.
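The "repeated sampling" reading can be checked by simulation. The sketch below is only an illustration under assumptions of my own: the population is taken to be Normal, the Example 0 values play the role of the truth we normally would not know, and the seed and number of repetitions are arbitrary.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(1)    # fixed seed so the run is repeatable (arbitrary choice)

TRUE_MU = 10.45   # the "unknown" truth, known here only because we simulate
SIGMA = 0.02      # population standard deviation (cm)
N = 100           # size of each sample
REPEATS = 2000    # number of times the experiment is repeated

z = NormalDist().inv_cdf(0.975)       # critical value for a 95% interval
margin = z * SIGMA / math.sqrt(N)     # margin of error, same for every sample

hits = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MU, SIGMA) for _ in range(N)]
    xbar = fmean(sample)
    # The interval moves from sample to sample; the true mean does not
    if xbar - margin <= TRUE_MU <= xbar + margin:
        hits += 1

coverage = hits / REPEATS
print(f"{coverage:.1%} of the {REPEATS} intervals contain the true mean")
```

The printed fraction should land near 95%. What varies across repetitions is the interval, never the population mean.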
The More Realistic Example
The Crystal Motors GM dealership in Brooklin, NY (1950). Photo courtesy the Library of Congress.
I do not know the population average ($\mu$) for the length of the ignition switches. To estimate it, I collect a sample of size $n=100$. The mean of that sample is $\bar{x} = 14.5023$ cm. We know the standard deviation of the population is $\sigma = 0.02$ cm. Thus, we are 95% confident that the true population mean is in the interval with endpoints
$$ \begin{align} \bar{x} & \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \\[1em] 14.5023 & \pm 1.96 \frac{0.02}{\sqrt{100}} \\[1em] 14.5023 & \pm 0.00392 \end{align} $$
Thus, the endpoints of the 95% confidence interval are 14.49838 and 14.50622. In other words: We are 95% confident that the population mean ignition switch length is between 14.49838 and 14.50622 cm.
- Note that these endpoints are dependent on the sample we take; thus, they are random variables themselves. If we repeat this experiment a gazillion times, the interval will contain the true population mean 95% of the time.
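- Also note that the value 1.96 is the 97.5th percentile of the standard Normal distribution, which leaves 2.5 percent in the upper tail (and, by symmetry, −1.96 leaves 2.5 percent in the lower tail); 2.5 percent doubled is the α = 0.05 Type I Error rate. This is where the 95% comes from.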
- Finally, notice that we are starting without knowing the population mean. We know, however, the sample mean. Reality dictates that we never know the population mean; we always know the sample mean. We only care about the population mean.
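The calculation above can be handed to the computer. Here is a minimal Python sketch using the numbers from this example; the function name `z_interval` and its parameter names are hypothetical, chosen only for this illustration.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)         # margin of error E
    return xbar - margin, xbar + margin

lower, upper = z_interval(xbar=14.5023, sigma=0.02, n=100)
print(f"95% CI: ({lower:.5f}, {upper:.5f}) cm")
```

The computer uses a slightly more precise critical value than 1.96, but to five decimal places the endpoints agree with the hand calculation.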
The Realistic Example
Four men in automobile, on snow-covered street, on the New York to San Francisco leg of the 1908 New York to Paris automobile race. Photo courtesy the Library of Congress.
I do not know the population average ($\mu$) for the length of the ignition switches and I want to estimate it. In the previous example, we knew the population standard deviation, σ. However, if we do not know the population mean, how in the world can we know the population standard deviation? We cannot.
So what do we use in its place? The sample standard deviation, s.
There is a consequence for doing this, however. Because of that additional estimation, we can no longer use the Normal distribution. We must use the t distribution.
So, let us estimate the population mean correctly. I collect a sample of size $n=100$. The mean of that sample is $\bar{x} = 14.5023$ cm. Also, the standard deviation of that sample is $s = 0.017$ cm. Thus, we are 95% confident that the true population mean is in the interval with endpoints
$$ \begin{align} \bar{x} & \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}} \\[1em] 14.5023 & \pm t_{0.05/2,\ 100-1} \frac{0.017}{\sqrt{100}} \\[1em] 14.5023 & \pm 1.984 \frac{0.017}{\sqrt{100}} \\[1em] 14.5023 & \pm 0.0033728 \end{align} $$
Thus, the endpoints of the 95% confidence interval are 14.49893 and 14.50567. In other words: We are 95% confident that the population mean ignition switch length is between 14.49893 and 14.50567 cm.
- Note that, as before, these endpoints are dependent on the sample we take; thus, they are random variables themselves. If we repeat this experiment a gazillion times, the interval will contain the true population mean 95% of the time.
- Also note that the value 1.984 is the 97.5th percentile of the t distribution with 99 degrees of freedom, which leaves 2.5 percent in the upper tail; 2.5 percent doubled is the α = 0.05 Type I Error rate. This is where the 95% comes from.
- Finally, notice that we are starting without knowing the population mean or the population standard deviation. We know, however, the sample mean and sample standard deviation. We are using this sample to better understand the population.
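The t-based interval can be computed the same way. Python's standard library has no t distribution, so the sketch below finds the critical value numerically (Simpson's rule for the area under the density, then bisection). This is a dependency-free illustration of my own, not the way statistical software does it; in practice a routine such as `scipy.stats.t.ppf` would normally be used. All function names here are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] and symmetry about zero."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def t_quantile(p, df):
    """Invert t_cdf by bisection on a wide bracket."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_interval(xbar, s, n, confidence=0.95):
    """Confidence interval for the mean using the sample standard deviation."""
    alpha = 1 - confidence
    t_crit = t_quantile(1 - alpha / 2, n - 1)   # n - 1 degrees of freedom
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

t_crit = t_quantile(0.975, 99)
lower, upper = t_interval(xbar=14.5023, s=0.017, n=100)
print(f"t critical value = {t_crit:.3f}")
print(f"95% CI: ({lower:.5f}, {upper:.5f}) cm")
```

The critical value comes out slightly larger than the Normal's 1.96, which is the price paid for estimating σ with s.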