The p-value

This mini-lecture gives the theory behind hypothesis tests, as well as links to other discussions.

The Three Hypotheses


A laboratory at Howard University (c. 1900). Photo courtesy the Library of Congress.

Scientists begin with their research hypothesis, a claim about the state of the world. For this week, the research hypothesis will be about a single population mean or proportion. There are three parts to the research hypothesis: the population parameter, the (in)equality sign, and the hypothesized value. For instance, this is a research hypothesis in words:

Last week, the average length of my tree sprouts was 35 inches. I hypothesize that they have, on average, grown since then.

Translated into symbols, this claim is

μ > 35 inches

The population parameter is μ, the population mean. The (in)equality sign is “greater than.” The hypothesized value is 35 inches.

The claim (research hypothesis) is what the scientist cares about… the only thing. However, because of the effects of probability, statisticians need two other hypotheses: the null hypothesis and the alternative hypothesis. The alternative hypothesis is either the research hypothesis or its logical opposite. If there is an “equals part” to the research hypothesis, then the alternative hypothesis is the opposite of the research hypothesis. If there is no equals part, then the alternative hypothesis is the research hypothesis.

The null hypothesis is always the logical opposite of the alternative hypothesis. From the Feynman video, we can interpret this as the claim dividing all possible worlds into two groups: one that agrees with the null hypothesis and one that agrees with the alternative hypothesis.

And, since the research hypothesis above uses the “>” sign, the alternative hypothesis is the same as the research hypothesis:

H_a : μ > 35 inches

Finally, since the alternative hypothesis is the logical opposite of the null hypothesis, the null hypothesis is

H₀ : μ ≤ 35 inches

Here is a really good table providing the other two hypotheses given the research hypothesis. Feel free to memorize it, or learn the patterns.

Table 1: A listing of the null and alternative hypotheses as determined by the research hypothesis.
Research Hypothesis	H₀	H_a	Tails
parameter < value	parameter ≥ value	parameter < value	left
parameter > value	parameter ≤ value	parameter > value	right
parameter ≠ value	parameter = value	parameter ≠ value	two

parameter ≤ value	parameter ≤ value	parameter > value	right
parameter ≥ value	parameter ≥ value	parameter < value	left
parameter = value	parameter = value	parameter ≠ value	two
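
If you would rather learn the pattern than memorize the table, here is a minimal Python sketch of Table 1 as a lookup. The function name `statistical_hypotheses` and the text codes for the signs (`"!="`, `"<="`, `">="`) are my own choices for illustration; they are not from the book or from any statistics package.

```python
def statistical_hypotheses(sign: str) -> dict:
    """Return the H0 sign, Ha sign, and tails for 'parameter <sign> value'."""
    # Each entry mirrors one row of Table 1:
    # research sign -> (H0 sign, Ha sign, tails)
    table_1 = {
        "<":  (">=", "<",  "left"),
        ">":  ("<=", ">",  "right"),
        "!=": ("=",  "!=", "two"),
        "<=": ("<=", ">",  "right"),
        ">=": (">=", "<",  "left"),
        "=":  ("=",  "!=", "two"),
    }
    h0_sign, ha_sign, tails = table_1[sign]
    return {"H0": h0_sign, "Ha": ha_sign, "tails": tails}


# The tree sprout claim (mu > 35 inches) has no equals part,
# so the alternative is the research hypothesis itself.
sprouts = statistical_hypotheses(">")
print(sprouts)
```

Notice that the null hypothesis always ends up with the equals part, whichever row you start from.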

The Logic of Hypothesis Testing

There is no “accept.” There is only “reject” and “do not reject.”

The logic of hypothesis testing is that we gather data (reality) to test whether the null hypothesis is reasonable. If not, then we reject it and conclude that the alternative hypothesis is correct. If it is reasonable in light of the data, then we cannot conclude the alternative hypothesis matches reality.

There are two ways of determining if the null hypothesis is reasonable. The first way is the critical value method (what the book calls the “classical” method). The second way is the p-value method. Let us look at these two methods in the next two sections.

The Critical Value Method

Originally, we only had the tables.

The critical value method has you calculate the value of the test statistic based on the observed data. You then compare that value to a critical value from the tables. If the observed value is more extreme than the critical value (in the direction of the alternative hypothesis), then you are in the “rejection region” and you reject the null hypothesis.

Compare the test statistic to the critical value.

We have already done this method without knowing it. This is how to use the confidence intervals to test hypotheses. If the hypothesized value is in the confidence interval, then the value is “reasonable.” If it is not in the confidence interval, then it is not reasonable. Note that the rejection region is everything outside the confidence interval.

Note that confidence intervals and two-tailed tests will give the same conclusion. However, we are able to test left- and right-tailed hypotheses, which we could not do with our confidence intervals.

The P-Value Method

Now, we have computers to do the work.

The critical value method was only able to tell you whether to reject the null hypothesis or not. The p-value method also lets you know how well the data support the null hypothesis.

Compare the p-value to alpha.

The definition of the p-value is “the probability of observing data this extreme, or more so, given that the null hypothesis is true.” It is a probability. It assumes the null hypothesis is true.

Here is the logic: If the p-value is too small, then data like ours would be unlikely if the null hypothesis were true. We did observe these data, so the null hypothesis is hard to believe.

If you prefer a “gut-level understanding” of the p-value, it is how well the data support the null hypothesis.

Example I: The Mean

In general, calculating statistics by hand is a waste of time. We should let the computer do it for us. However, let us do this toy example by hand; it may make some things manifest. Also, by default, I set my Type I Error rate at α=0.05. Other levels are acceptable, but this appears to be the “usual” value in most sciences.

As a scientist/researcher, I hypothesize that the average length of the produced ignition switches is 14.50 cm. With this, the research hypothesis is μ = 14.50 cm.

Now, according to Table 1 above, this means our two statistical hypotheses are

H₀ : μ = 14.50 cm
H₁ : μ ≠ 14.50 cm

Now, to test this, I collect data. I measure the lengths of 30 ignition switches. In that sample, the mean length was 14.41 cm, and its standard deviation was 0.2 cm. From this information, do the data support my hypothesis?

The Critical Value Method


A graph of the distribution of the test statistic if the null hypothesis is correct. The red region is the “rejection region.” Because what we observed is in the rejection region, we reject the null hypothesis. The blue region is the confidence interval.

The critical value method: This is a one-sample test on the population mean. It is a two-tailed test. From Table VI, the critical value is t(29, 0.025) = 2.045. Thus, if my test statistic is greater than 2.045 or less than -2.045, I am in the rejection region. The test statistic is

$$ \begin{align} t &= \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \\[1em] &= \frac{14.41 - 14.50}{0.2/\sqrt{30}} \\[1em] &= \frac{-0.09}{0.0365148372} \\[1em] & = -2.4648 \end{align} $$

The test statistic is -2.4648, which is more extreme than the critical value of -2.045, so we are in the rejection region. Thus, we reject the null hypothesis and conclude that the average length of ignition switches is not 14.50 cm.
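
For those who want to let the computer do the arithmetic, here is a short Python sketch of the same calculation. The variable names are mine, and the critical value 2.045 is simply typed in from the table lookup above rather than computed.

```python
import math

x_bar = 14.41   # sample mean length (cm)
mu_0 = 14.50    # hypothesized mean under the null (cm)
s = 0.2         # sample standard deviation (cm)
n = 30          # number of ignition switches measured

standard_error = s / math.sqrt(n)
t_stat = (x_bar - mu_0) / standard_error

# Two-tailed test: reject when the test statistic is farther
# from 0 than the critical value t(29, 0.025) = 2.045.
critical_value = 2.045
reject_null = abs(t_stat) > critical_value

print(f"t = {t_stat:.4f}, reject H0: {reject_null}")
```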

The p-value Method


A graph of the distribution of the test statistic if the null hypothesis is correct. The red region is the p-value, the probability of observing a test statistic (reality) this extreme, or more so, given the null hypothesis is true.

The p-value method: This is still a one-sample test on the population mean. It is a two-tailed test. The value of the test statistic is still -2.4648. Using the T distribution calculator in StatCrunch or in Excel or on our calculator, 2 × P[T < -|2.4648|] = 0.0199. Why did we double the probability? The test was a two-tailed test. Table 2, later on this page, lists the effect of tailedness on these calculations.

Since this p-value is less than the selected α=0.05, we reject the null hypothesis and conclude that the average ignition switch length is not 14.50 cm.
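
If you have neither StatCrunch nor Excel handy, the p-value can be approximated with nothing but Python's standard library. This is only a sketch: the helper names `t_pdf` and `two_tailed_t_p_value` are mine, and the area under the t curve is found by numerical integration (Simpson's rule), not by the exact routine a statistics package would use.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_t_p_value(t_stat: float, df: int, steps: int = 2000) -> float:
    """Approximate 2 * P[T < -|t_stat|]; steps must be even for Simpson's rule."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2   # odd interior points get 4, even get 2
        total += weight * t_pdf(i * h, df)
    central_half = total * h / 3     # P[0 < T < |t_stat|]
    # The t distribution is symmetric, so each tail is 0.5 minus that area.
    return 2 * (0.5 - central_half)


p_value = two_tailed_t_p_value(-2.4648, df=29)
print(f"p-value = {p_value:.4f}")
```

The result should agree with the 0.0199 above to the displayed precision.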

It is not surprising that we came to the same conclusion using the two methods. This is how it should be.

Overall

Note that in both methods we are comparing what we observe to the most extreme reasonable value we would expect if our null hypothesis is correct. In the critical value method, the test statistic is what we observed and the critical value is that extreme reasonable value. In the p-value method, the p-value is what we observe and the alpha level is the extreme reasonable value.

Example II


Two Spanish Reals, the Pieces of Eight of pirate fame. Photo courtesy the Classical Numismatic Group, Inc..

The first example dealt with a hypothesis test about the population mean. This example will deal with the population proportion. I hypothesize that my coin is fair. To test this, I flip it 1000 times and count the number of heads. There are 489 of them. Is this sufficient evidence that the coin is not fair?

We are given that the research hypothesis is p = 0.500. From Table 1 above, this gives

H₀ : p = 0.500
H₁ : p ≠ 0.500

This is a two-tailed test because the alternative uses the ≠ sign. This is a one-population test on the population proportion because there is only one p in the null hypothesis.

The Critical Value Method


A graph of the distribution of the test statistic if the null hypothesis is correct. The red region is the “rejection region.” Because what we observed is not in the rejection region, we cannot reject the null hypothesis. The blue region is the confidence interval.

Using the critical value method, we first need to calculate the test statistic. Then, we need to determine the critical value. Finally, we compare the test statistic to the critical value.

The formula for the test statistic is found on page 487. I calculate it to be z = -0.696.
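
In case you do not have the book at hand, here is the calculation written out. I am assuming the usual one-proportion z statistic, with the hypothesized proportion p₀ used in the standard error; it reproduces the value above.

$$ \begin{align} z &= \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} \\[1em] &= \frac{0.489 - 0.500}{\sqrt{0.500(0.500)/1000}} \\[1em] &= \frac{-0.011}{0.0158113883} \\[1em] & = -0.6957 \end{align} $$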

Now for the critical value. Using Table V, Z(0.025) = -1.96. Since this is a two-tailed test, the critical values are -1.96 and +1.96.

Using Excel to obtain the critical values is even easier. The code is =NORM.S.INV(0.025). Again, since this is a two-tailed test, we get the two critical values to be -1.96 and +1.96.

Finally, we compare the test statistic and the critical value. As this test statistic is not more extreme (farther from 0) than the critical value, we are not in the rejection region. We cannot reject the null hypothesis. We cannot conclude that the coin is not fair.
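
The same three steps can be sketched in Python, using `statistics.NormalDist` from the standard library in place of Table V or Excel. The variable names are mine, and I am assuming the usual one-proportion z statistic with the null value in the standard error.

```python
import math
from statistics import NormalDist

heads, flips = 489, 1000
p_0 = 0.500                      # hypothesized proportion under the null
p_hat = heads / flips            # observed proportion of heads

# Step 1: the test statistic
z_stat = (p_hat - p_0) / math.sqrt(p_0 * (1 - p_0) / flips)

# Step 2: the critical values for a two-tailed test at alpha = 0.05
alpha = 0.05
lower_cv = NormalDist().inv_cdf(alpha / 2)   # same role as =NORM.S.INV(0.025)
upper_cv = -lower_cv

# Step 3: compare
reject_null = z_stat < lower_cv or z_stat > upper_cv

print(f"z = {z_stat:.3f}, critical values = {lower_cv:.2f} and {upper_cv:.2f}")
print(f"reject H0: {reject_null}")
```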

The p-value Method


A graph of the distribution of the test statistic if the null hypothesis is correct. The red region is the p-value, the probability of observing a test statistic (reality) this extreme, or more so, given the null hypothesis is true.

Using the p-value method, we calculate the p-value as 2 × P[Z < -0.696] = 0.4866. As this value is not less than α=0.05, we cannot reject the null hypothesis. There is not enough evidence to conclude that the coin is unfair.

Note I: We multiplied the probability by 2 because this is a two-tailed test. Here is a table of p-value calculations based on the tailedness. Again, this would be something helpful to know.

Table 2: The effect of tailedness on the critical value decision and the p-value calculation.
In this table, TS is the value of the test statistic, CV is the value of the critical value, and D is the distribution of the statistic (t for means and Z for proportions).
H₁	Tails	Reject H₀ if	p-value
<	left	TS < CV (based on α)	P[D < TS]
>	right	TS > CV (based on α)	P[D > TS]
≠	two	|TS| > |CV| (based on α/2)	2 × P[D < -|TS|]
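
The p-value column of Table 2 can be sketched as one small Python function. The name `p_value` and the idea of passing in the distribution's CDF as `cdf` are my own choices for illustration. The two-tailed line assumes the distribution is symmetric about 0, which holds for both t and Z.

```python
from statistics import NormalDist
from typing import Callable


def p_value(test_statistic: float, tails: str,
            cdf: Callable[[float], float]) -> float:
    """p-value from Table 2, where cdf(x) returns P[D < x]."""
    if tails == "left":
        return cdf(test_statistic)            # P[D < TS]
    if tails == "right":
        return 1 - cdf(test_statistic)        # P[D > TS]
    if tails == "two":
        return 2 * cdf(-abs(test_statistic))  # 2 x P[D < -|TS|]
    raise ValueError("tails must be 'left', 'right', or 'two'")


# Usage: the coin example, with the test statistic kept to four decimals.
z_cdf = NormalDist().cdf
coin_p = p_value(-0.6957, "two", z_cdf)
print(f"p-value = {coin_p:.4f}")
```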

Note II: We did not conclude that the coin is fair. We can only conclude that we did not detect unfairness. This is a subtle, yet important, point. The coin could be weighted 50.001% heads and we would not be able to detect that level of unfairness. In fact, the confidence interval gives an entire set of reasonable values for p based on this data. Only one of them is “fair” — 50%. So, all we can do is conclude that there is not sufficient evidence to call the coin unfair.
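
To see that set of reasonable values, here is a Python sketch of a confidence interval for p from the coin data. I am assuming a 95% level (to match α = 0.05) and the usual large-sample interval built from the sample proportion; the variable names are mine.

```python
import math
from statistics import NormalDist

heads, flips = 489, 1000
p_hat = heads / flips

# Multiplier for a 95% interval, about 1.96
z_star = NormalDist().inv_cdf(0.975)

# Standard error based on the sample proportion (assumed large-sample form)
margin = z_star * math.sqrt(p_hat * (1 - p_hat) / flips)
ci_low, ci_high = p_hat - margin, p_hat + margin

print(f"95% CI for p: ({ci_low:.3f}, {ci_high:.3f})")
print(f"0.500 is inside the interval: {ci_low < 0.5 < ci_high}")
```

The fair value 0.500 is inside the interval, but so are many unfair values, which is exactly why we cannot conclude the coin is fair.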

Which Method to Use?

The two methods will give the same conclusion. The strength of the critical value method is that it is easier to do by hand. The strength of the p-value method is that the p-value gives information about how well the data support the null hypothesis. Since computers are used in statistical analysis, the second method is preferred.

Useful Theory Videos

The following videos give more insight into the confidence interval. Since it is so important to statistics, I strongly suggest you watch these carefully.

poysermath

And that is it. This mini-lecture covered some additional aspects of hypothesis testing. It started out introducing the research hypothesis and the two statistical hypotheses based on it, the null and the alternative.

It then gave a brief overview of the two methods for testing hypotheses: the traditional (critical value) method and the p-value method. In both cases, using technology to perform the calculations was an important step. So, use technology and smile!