This discussion starter gives you three things. First (and most importantly), it explains confidence intervals using the terms from the previous chapter (sampling distributions). Confidence intervals and p-values (next chapter) form the backbone of statistical inference. Without understanding them, one cannot really understand statistics. Second, it offers some videos explaining the theory of confidence intervals. Third, it gives videos showing how to calculate confidence intervals for the population mean and the population proportion.

Confidence Intervals

[Normal graphic]

A graphic of a Normal distribution with several areas (probabilities) marked.

Last chapter, we covered the Central Limit Theorem and sampling distributions. I noted that these two topics were extremely important for the future of the course. Here, you see why.

Example 0

Let us return to the ignition switches example from the last part. We were given that the lengths of these switches have mean μ = 10.45 cm and standard deviation σ = 0.02 cm. Now, I draw a random sample of size n = 100 from a pile of these switches, measuring the length of each. According to the Central Limit Theorem (CLT), we know that the distribution of the sample mean is

$$ \overline{X} \sim N \left(\mu=10.45;\ \sigma=\frac{0.02}{\sqrt{100}} \right) $$

Note that the mean of the sample means is the same as that of the population. Also note that the standard deviation of the sample means is that of the population divided by the square root of the sample size, $\sigma/\sqrt{n}$. This is what the CLT says.
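If you would like to see this claim rather than take it on faith, here is a minimal simulation sketch. It assumes the switch lengths themselves are Normally distributed (an assumption made only for the simulation), and the names `simulate_sample_means`, `reps`, and `seed` are my own, not anything from the course software.

```python
import math
import random
import statistics

MU = 10.45      # population mean (cm), from the example
SIGMA = 0.02    # population standard deviation (cm), from the example
N = 100         # sample size, from the example

# Standard deviation of the sample mean predicted by the CLT.
standard_error = SIGMA / math.sqrt(N)


def simulate_sample_means(reps, seed=1):
    """Draw `reps` samples of size N and return the mean of each sample.

    Assumption: individual lengths are Normal(MU, SIGMA).
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(N))
        for _ in range(reps)
    ]


sample_means = simulate_sample_means(reps=2000)
print(f"CLT standard error:      {standard_error:.5f}")
print(f"mean of sample means:    {statistics.fmean(sample_means):.5f}")
print(f"std dev of sample means: {statistics.stdev(sample_means):.5f}")
```

The last two printed values should land close to 10.45 and 0.002, respectively, though they will not match exactly because the simulation is itself a sample.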

That was last chapter. Recall that I mentioned that this was rather backwards. Note that μ and σ are both parameters of the population. The reality is that we never know these values. In statistics, we work with samples to draw conclusions about the population. That is the key to understanding the purpose of statistics. We have a sample. We want to learn about the population.

Towards Reality

We have a sample. We want to learn about the population.

In reality, we do not know μ, but we would like an estimate of it. We would also like to know how precise our estimate is. Our estimate of μ is called the point estimate, and the precision is indicated by the margin of error. A point estimate for the population mean is the sample mean. (A point estimate for the population proportion is the sample proportion.) The margin of error formula should look familiar (see last week for the connection, specifically the z-score). If $E$ is the margin of error, then

$$ E = Z \frac{\sigma}{\sqrt{n}} $$

In this formula, $\sigma$ is the standard deviation of the population, $n$ is the size of the sample from which you are estimating the population mean, and $Z$ is the critical value of the standard Normal distribution for the required confidence level ($1-\alpha$). It is written $Z_{\alpha/2}$ in the worked example later on; for 95% confidence it is about 1.96. The confidence interval for the population mean, $\mu$, is just

$$ \bar{x} \pm E $$
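To make the role of $Z$ concrete, the sketch below computes it and the margin of error for a few confidence levels. It uses Python's standard-library `statistics.NormalDist`; the function names `z_critical` and `margin_of_error` are mine, chosen for illustration, and the values $\sigma = 0.02$ and $n = 100$ are borrowed from the ignition switch example.

```python
import math
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided standard Normal critical value for a level such as 0.95."""
    alpha = 1.0 - confidence
    # Leave alpha/2 in each tail, so look up the 1 - alpha/2 quantile.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def margin_of_error(confidence, sigma, n):
    """E = Z * sigma / sqrt(n), where sigma is the known population std dev."""
    return z_critical(confidence) * sigma / math.sqrt(n)


for level in (0.90, 0.95, 0.99):
    e = margin_of_error(level, sigma=0.02, n=100)
    print(f"{level:.0%} confidence: Z = {z_critical(level):.3f}, E = {e:.5f} cm")
```

Notice that asking for more confidence makes $Z$, and therefore the margin of error, larger. That is the price of certainty.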

The computer can take care of calculating the interval. You need to be able to make it do so, and you need to be able to interpret the interval. For the following box, let us restrict ourselves to the 95% confidence interval. This interval is very common in research. A 95% confidence interval is the dual of an $\alpha=0.05$ Type I Error rate, which will be discussed next week.

What is a confidence interval?

First, what it is not.

The 95% confidence interval is not the interval in which the true population mean has a 95% probability of being.

The main reason that the confidence interval is not a probability is that the above wording implies that the population mean varies. It does not. The population mean is a fixed number that we do not know, but would like to learn about.

Now for what it is.

The 95% confidence interval is the interval which contains the true population mean 95% of the time, when the experiment (sampling) is repeated.

This means that if we repeatedly sample and calculate the confidence interval, approximately 95% of those intervals will contain the population mean.
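The "repeated sampling" idea can be tried out directly. The following is a rough sketch, not a proof: it pretends we know the population ($\mu = 10.45$, $\sigma = 0.02$, Normally distributed lengths, all assumptions for the simulation), draws many samples of size 100, builds a 95% interval from each, and counts how often the fixed $\mu$ falls inside. The function name `coverage` and its parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

MU, SIGMA, N = 10.45, 0.02, 100    # population values from Example 0
Z = NormalDist().inv_cdf(0.975)    # about 1.96 for 95% confidence


def coverage(reps, seed=2):
    """Fraction of simulated 95% intervals that contain the fixed MU."""
    rng = random.Random(seed)
    margin = Z * SIGMA / math.sqrt(N)
    hits = 0
    for _ in range(reps):
        # A new sample gives a new sample mean, hence a new interval.
        xbar = fmean(rng.gauss(MU, SIGMA) for _ in range(N))
        if xbar - margin <= MU <= xbar + margin:
            hits += 1
    return hits / reps


rate = coverage(reps=2000)
print(f"{rate:.1%} of the intervals contain mu")
```

Note what varies in the loop: the interval moves from sample to sample, while `MU` never changes. The printed rate should be near 95%, not exactly 95%.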

The difference is subtle and technical (and not too important, I think). Without discussion of statistical theory, the difference will make little sense. Just be able to repeat the difference. The big difference is that saying the confidence interval contains the population mean with a certain probability suggests that the population mean varies. It does not.

The More Realistic Example

[A GM dealership]

The Crystal Motors GM dealership in Brooklyn, NY (1950). Photo courtesy the Library of Congress.

I do not know the population average ($\mu$) for the length of the ignition switches. To estimate it, I collect a sample of size $n=100$. The mean of that sample is $\bar{x} = 14.5023$ cm. We know the standard deviation of the population is $\sigma = 0.02$ cm. Thus, we are 95% confident that the true population mean is in the interval with endpoints

$$ \begin{align} \bar{x} & \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \\[1em] 14.5023 & \pm 1.96 \frac{0.02}{\sqrt{100}} \\[1em] 14.5023 & \pm 0.00392 \end{align} $$

Thus, the endpoints of the 95% confidence interval are 14.49838 and 14.50622. In other words: We are 95% confident that the population mean ignition switch length is between 14.49838 and 14.50622 cm.
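Since the computer is supposed to do this arithmetic for us, here is one way it might look. This is a small sketch using only the Python standard library; `z_interval` is a name I made up, not a function from any statistics package.

```python
import math
from statistics import NormalDist


def z_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for mu when the population sigma is known."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # Z_{alpha/2}
    margin = z * sigma / math.sqrt(n)             # margin of error E
    return xbar - margin, xbar + margin


# Values from the example: xbar = 14.5023 cm, sigma = 0.02 cm, n = 100.
low, high = z_interval(xbar=14.5023, sigma=0.02, n=100)
print(f"95% CI: ({low:.5f}, {high:.5f}) cm")
```

The code uses an unrounded critical value instead of 1.96, so the endpoints may differ from the hand calculation in the later decimal places.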

The Realistic Example

[On the NY-to-Paris Course]

Four men in automobile, on snow-covered street, on the New York to San Francisco leg of the 1908 New York to Paris automobile race. Photo courtesy the Library of Congress.

I do not know the population average ($\mu$) for the length of the ignition switches and I want to estimate it. In the previous example, we knew the population standard deviation, σ. However, if we do not know the population mean, how in the world can we know the population standard deviation? We cannot.

So what do we use in its place? The sample standard deviation, s.

There is a consequence of doing this, however. Because of that additional estimation, we can no longer use the Normal distribution. We must use the t distribution with $n-1$ degrees of freedom.

So, let us estimate the population mean correctly. I collect a sample of size $n=100$. The mean of that sample is $\bar{x} = 14.5023$ cm. Also, the standard deviation of that sample is $s = 0.017$ cm. Thus, we are 95% confident that the true population mean is in the interval with endpoints

$$ \begin{align} \bar{x} & \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}} \\[1em] 14.5023 & \pm t_{0.05/2,\ 100-1} \frac{0.017}{\sqrt{100}} \\[1em] 14.5023 & \pm 1.984 \frac{0.017}{\sqrt{100}} \\[1em] 14.5023 & \pm 0.0033728 \end{align} $$

Thus, the endpoints of the 95% confidence interval are 14.49893 and 14.50567. In other words: We are 95% confident that the population mean ignition switch length is between 14.49893 and 14.50567 cm.
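The same calculation can be sketched in code. Python's standard library has no t distribution, so in this sketch the critical value is passed in by hand, using the 1.984 quoted above for 99 degrees of freedom. The function name `t_interval` and its parameter `t_crit` are my own inventions for illustration.

```python
import math


def t_interval(xbar, s, n, t_crit):
    """Confidence interval for mu when sigma is estimated by s.

    t_crit must be the t critical value for n - 1 degrees of freedom.
    """
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin


# Values from the example: xbar = 14.5023 cm, s = 0.017 cm, n = 100.
# t_crit = 1.984 is the value used in the text. If SciPy is installed
# (an assumption), scipy.stats.t.ppf(0.975, df=99) gives the same quantity.
low, high = t_interval(xbar=14.5023, s=0.017, n=100, t_crit=1.984)
print(f"95% CI: ({low:.5f}, {high:.5f}) cm")
```

Passing the critical value as an argument keeps the sketch free of extra packages, at the cost of having to look the value up yourself.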

Useful Theory Videos

The following videos give more insight into the confidence interval. Since it is so important to statistics, I strongly suggest you watch these carefully.

In addition to these two videos, there is a large number of videos on YouTube for understanding confidence intervals. The following search link will take you to YouTube and provide you with a non-exhaustive list: confidence interval theory.

Useful Practical Videos

These videos focus on how to calculate confidence intervals for the population mean and the population proportion. They focus on the techniques, not the theory. Again, I strongly suggest you watch the theory videos, above.

One-Sample Population Mean

The following videos show how to calculate confidence intervals for the population mean.

One-Sample Population Proportion

The following videos show how to calculate confidence intervals for the population proportion.

That is it! Again, I strongly encourage you to watch the theory videos. Confidence intervals are one of the most important concepts in statistics. In the next module, we cover the other “most important concept” in statistics: the p-value.