Here, we cover simple linear regression. If you have not yet switched to using technology to do the calculations, you will almost certainly have to here. The calculations are sufficiently difficult that the probability of making a button-pushing error is very high. Use the computer to do mindless computation. Use your brain to interpret.

To illustrate the interpretation of simple linear regression, let us work through an example.

The Research Question

[selma lagerlöf]

The awesome Selma Lagerlöf, Nobel Prize laureate in Literature for Sweden (1909). Photo courtesy of Wikimedia.

I wonder the following:

Is there a relationship between the average amount of chocolate eaten in a country and the proportion of Nobel Prize laureates it has?

This is my research question because it frames my research and because it is a question. Note that I am asking about the relationship between two numeric variables. Looking for a significant relationship between two numeric variables requires either correlation analysis or linear regression. The former is useful if you only care about the existence of a relationship. The latter is useful if you want to better understand that relationship. Both can be used to answer the question, although linear regression is much more powerful. Also, both are done on a computer, so they are equally easy to do.

The Data

Now, how could I arrive at an answer to my research question? As always, I first need to collect data. Here is a table of the data I collected:

Table 1: Data for 23 countries regarding their chocolate consumption and the number of Nobel Prizes received. Both are adjusted for the size of the population.
Country          Nobel Prizes [per million]   Chocolate Consumption [bars per capita]
Australia        0.545                        60
Austria          2.433                        79
Belgium          0.862                        68
Brazil           0.005                        25
Bulgaria         0.142                        22
Denmark          2.526                        86
Estonia          0.000                        79
Finland          0.760                        70
France           0.899                        74
Germany          1.267                        114
Greece           0.186                        45
Hungary          0.904                        35
Italy            0.326                        33
Japan            0.149                        22
Lithuania        0.283                        61
Norway           2.337                        98
Poland           0.312                        45
Portugal         0.186                        45
Spain            0.170                        33
Sweden           3.185                        66
Switzerland      3.154                        108
United Kingdom   1.887                        103

Now that we have this data, what do we do with it?

The Graphics

Let us take a cue from high school math. Back in the good old days, when given data of this type (numeric vs. numeric), we created a scatter plot. Here is a scatter plot of the number of Nobel Prizes (per million) against the chocolate consumption (per capita).

Nobel Prize rate against the Chocolate Consumption rate

That is a lot of anonymous dots. Since each dot represents a country, it may make a better graphic if we replace each dot with the flag of the country. That places more context and information in the graphic. Here it is:

Nobel Prize rate against the Chocolate Consumption rate

Now that we have a scatter plot of the observations, let us (as we did in high school) draw a line that summarizes the relationship between these two variables in this sample. Let us call this a “line of best fit.” Were the line of best fit horizontal, there would be no evidence of a relationship between the two variables. If the line is slanted, there apparently is a relationship.

Here is a graph of the data, with flags to represent the countries, and a line of best fit.

Nobel Prize rate regressed on Chocolate Consumption rate

The Hypotheses

The hypothesis concerns one of two parameters: the correlation or the slope. Either can be used to test whether the two variables are unrelated. If we hypothesize about the correlation, the null hypothesis is ρ=0. If we hypothesize about the slope, the null hypothesis is β1=0.

The alternative hypothesis will be one of three claims: some relationship in either direction (≠), a positive relationship (>), or a negative relationship (<). Because the research question above asks only “is there a relationship,” the hypotheses will be

H0 : ρ = 0
H1 : ρ ≠ 0

This is equivalent to these hypotheses

H0 : β1 = 0
H1 : β1 ≠ 0

The symbol ρ is the Greek letter “rho.” The symbol β is the Greek letter “beta.” Once again, Greek letters are the population parameters that we care about. The Latin versions, r and b, are the sample statistics we use to learn about the population parameter.
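One way to see why the two sets of hypotheses are equivalent: the sample slope and the sample correlation are tied together by b1 = r × (sy / sx), where sx and sy are the sample standard deviations, so one is zero exactly when the other is. The following Python sketch checks this identity on the rows printed in Table 1. The variable names are illustrative and are not taken from the workbook.

```python
import math

# Rows of Table 1 as printed above.
# x = chocolate bars per capita, y = Nobel Prizes per million people.
x = [60, 79, 68, 25, 22, 86, 79, 70, 74, 114, 45,
     35, 33, 22, 61, 98, 45, 45, 33, 66, 108, 103]
y = [0.545, 2.433, 0.862, 0.005, 0.142, 2.526, 0.000, 0.760, 0.899, 1.267, 0.186,
     0.904, 0.326, 0.149, 0.283, 2.337, 0.312, 0.186, 0.170, 3.185, 3.154, 1.887]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)    # sample correlation
b1 = sxy / sxx                    # sample slope
s_x = math.sqrt(sxx / (n - 1))    # sample standard deviation of x
s_y = math.sqrt(syy / (n - 1))    # sample standard deviation of y

print(f"r = {r:.4f}, b1 = {b1:.4f}, r * s_y / s_x = {r * s_y / s_x:.4f}")
```

The last two printed numbers agree, which is the identity in action: since sx and sy are positive, r and b1 always share the same sign.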

The Regression Line

I have added the regression line to the graphic without showing the calculations. In fact, I let the computer do the calculations for me. Note that the regression line is slanted. Thus, there is (apparently) a relationship between the two variables. In fact, thanks to the computer, we know the formula for the line is

y = -0.5496 + 0.0255 x

Here, y is the dependent variable (Nobel Prizes per million people) and x is the independent variable (Chocolate Consumption in bars per person). We know this formula because the statistical program gave it to us. Do not waste your time doing these calculations by hand.
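For readers who prefer a script to a spreadsheet, here is a minimal Python sketch of what "letting the computer do it" can look like. It applies the usual least-squares formulas to the rows printed in Table 1. The function name fit_line is made up for this illustration. Because it uses only the rows printed in the table, its output may differ slightly from the workbook's figures.

```python
# Rows of Table 1 as printed above.
chocolate = [60, 79, 68, 25, 22, 86, 79, 70, 74, 114, 45,
             35, 33, 22, 61, 98, 45, 45, 33, 66, 108, 103]
nobel = [0.545, 2.433, 0.862, 0.005, 0.142, 2.526, 0.000, 0.760, 0.899, 1.267, 0.186,
         0.904, 0.326, 0.149, 0.283, 2.337, 0.312, 0.186, 0.170, 3.185, 3.154, 1.887]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return intercept, slope


b0, b1 = fit_line(chocolate, nobel)
print(f"y = {b0:.4f} + {b1:.4f} x")
```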

The Statistics I

Alright. We have done some good things here. Mainly, we found the regression line. Now, we will determine if that relationship (slope) we found is statistically significant. Were the relationship not statistically significant, we could not conclude that the slope is non-zero (that there is a relationship). If it is statistically significant, then we have strong evidence that the two variables are related.

The formulas are in the book. The work should be done using technology. TI calculators and Excel can give the line’s equation and some tests of significance. In Excel, the menu trail is DATA | Data Analysis | Regression. Once you have done that, you will get output that looks like that in the LinReg Results sheet in this workbook.

As always with hypothesis tests, the basic rule is to look for, and interpret, the p-values.

Look for the p-values.

P-values are always interpreted the same way: If p-value ≤ α, then you reject the null hypothesis. You have sufficient evidence that the relationship is not null (zero). You have evidence that there is a relationship. In this example, the p-value related to the slope is 0.000293429. Since this is less than α = 0.05, we reject the null hypothesis. We detected a (non-zero) relationship between these two variables.
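The decision rule is mechanical enough to write as a one-line function. This is only a sketch of the rule stated above; the function name reject_null is invented for the illustration.

```python
def reject_null(p_value, alpha=0.05):
    """Return True when the p-value is small enough to reject the null hypothesis."""
    return p_value <= alpha


p_slope = 0.000293429  # p-value for the slope, as reported above
print(reject_null(p_slope))  # True
```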

In the same area on the sheet, Excel also provides a 95% confidence interval for that estimate. We are 95% confident that each increase of 1 chocolate bar per person corresponds to an increase of between 0.013 and 0.038 Nobel Prizes per million people.
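That interval can be reconstructed by hand as estimate ± (critical value) × (standard error). The Python sketch below does so using the slope and its standard error from the regression table (0.0059, discussed later in this example). The critical value 2.080 is an assumption: it is the two-sided 95% value of the t distribution with 21 degrees of freedom, taking the 23 countries in the table caption minus the 2 estimated coefficients.

```python
b1 = 0.0255     # slope estimate reported above
s_b1 = 0.0059   # standard error of the slope, from the regression table
t_crit = 2.080  # ASSUMPTION: 95% two-sided t critical value with 21 df (23 - 2)

margin = t_crit * s_b1
lower, upper = b1 - margin, b1 + margin
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
# 95% CI for the slope: (0.013, 0.038)
```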

Predicting Denmark

Note this important thing: We can predict the number of Nobel Prizes [per million people] if we know the number of chocolate bars [per capita] eaten in the country. For instance, we know that Danes eat, on average, 86 chocolate bars per year. Plugging this number in for x in the line’s formula gives our estimate of Nobel Prizes awarded per 1,000,000 Danes:

y = -0.5496 + 0.0255 (x)
  = -0.5496 + 0.0255 (86)
  = -0.5496 + 2.193  
  = 1.6434      

Thus, this model predicts that Denmark earned 1.6434 Nobel Prizes per million people. Excel gives a more precise answer: 1.6419. Looking at the “Predictions” sheet in this workbook provides the 95% confidence interval for this estimate. We are 95% confident that the average number of Nobel Prizes per million citizens is between 1.20 and 2.09 when the chocolate consumption is 86 bars per capita. Additionally, we are 95% confident that the actual number of Nobel Prizes per million citizens is between 0 and 3.31 when the chocolate consumption is 86 bars per capita.
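The plug-in calculation for Denmark is easy to wrap in a small function so that it can be reused for any country. This Python sketch uses the rounded coefficients reported above; the function name is illustrative.

```python
B0 = -0.5496  # intercept of the reported regression line
B1 = 0.0255   # slope of the reported regression line


def predict_nobel_rate(bars_per_capita):
    """Predicted Nobel Prizes per million people for a given chocolate consumption."""
    return B0 + B1 * bars_per_capita


print(f"{predict_nobel_rate(86):.4f}")  # 1.6434
```

Using the rounded coefficients gives 1.6434 rather than Excel's 1.6419, because Excel carries more decimal places for the intercept and slope.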

The first interval is the confidence interval. It contains all reasonable values for the mean (expected value). The second interval is the prediction interval. It contains all reasonable values for the observation.
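The two intervals differ only by an extra "1 +" under a square root, which accounts for the scatter of individual countries around the line. The Python sketch below computes both from the rows printed in Table 1 using the standard textbook formulas. The function name intervals_at is invented here, and the critical value 2.086 is an assumption (the two-sided 95% t value with 20 degrees of freedom, matching the 22 rows used). Because it uses only the rows printed in the table, its output may differ slightly from the workbook's figures.

```python
import math

# Rows of Table 1 as printed above.
x = [60, 79, 68, 25, 22, 86, 79, 70, 74, 114, 45,
     35, 33, 22, 61, 98, 45, 45, 33, 66, 108, 103]
y = [0.545, 2.433, 0.862, 0.005, 0.142, 2.526, 0.000, 0.760, 0.899, 1.267, 0.186,
     0.904, 0.326, 0.149, 0.283, 2.337, 0.312, 0.186, 0.170, 3.185, 3.154, 1.887]


def intervals_at(x0, x, y, t_crit):
    """Return (prediction, confidence interval, prediction interval) at x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Standard error of the estimate: typical size of a residual.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x0
    spread = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = t_crit * s_e * math.sqrt(spread)      # for the mean response
    pi_half = t_crit * s_e * math.sqrt(1 + spread)  # for a single observation
    return (y_hat,
            (y_hat - ci_half, y_hat + ci_half),
            (y_hat - pi_half, y_hat + pi_half))


# ASSUMPTION: 2.086 is the 95% two-sided t critical value with 20 df (22 rows - 2).
y_hat, conf_int, pred_int = intervals_at(86, x, y, t_crit=2.086)
print(f"prediction: {y_hat:.2f}")
print(f"confidence interval: ({conf_int[0]:.2f}, {conf_int[1]:.2f})")
print(f"prediction interval: ({pred_int[0]:.2f}, {pred_int[1]:.2f})")
```

The prediction interval is always the wider of the two. Its computed lower bound can fall below zero; since a prize rate cannot be negative, such a bound is sensibly reported as 0.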

The Statistics II

There are some other statistics you will need for this course. These are estimates of three different standard errors: se, sb0, and sb1. The first is the standard error of the estimate. The second is the standard error of the constant term, b0. The third is the standard error of the slope (effect) term, b1. The last two are used to calculate the test statistic (t) for the two coefficients, which means they are used to calculate the respective p-values. The first is used to help calculate the latter two.
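To make the dependence concrete, here is a Python sketch that computes all three standard errors from the rows printed in Table 1, using the standard textbook formulas. Note how s_e appears in the formulas for both s_b0 and s_b1. The variable names are illustrative, and because only the printed rows are used, the results may differ slightly from the workbook's figures.

```python
import math

# Rows of Table 1 as printed above.
x = [60, 79, 68, 25, 22, 86, 79, 70, 74, 114, 45,
     35, 33, 22, 61, 98, 45, 45, 33, 66, 108, 103]
y = [0.545, 2.433, 0.862, 0.005, 0.142, 2.526, 0.000, 0.760, 0.899, 1.267, 0.186,
     0.904, 0.326, 0.149, 0.283, 2.337, 0.312, 0.186, 0.170, 3.185, 3.154, 1.887]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate: based on the residuals, with n - 2 degrees of freedom.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))

# Standard errors of the two coefficients, both built from s_e.
s_b1 = s_e / math.sqrt(sxx)
s_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / sxx)

print(f"s_e = {s_e:.4f}, s_b0 = {s_b0:.4f}, s_b1 = {s_b1:.4f}")
```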

From the regression table, the standard error of the estimate is se = 0.7710. The standard error for the intercept is sb0 = 0.3976. The standard error for the slope is sb1 = 0.0059. Note that the test statistics (T-Stat) equal the parameter estimates divided by their standard errors.
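That last relationship is a single division per coefficient. As a quick check, this Python sketch recomputes both test statistics from the rounded values quoted in this example; small differences from Excel's T-Stat column would come from that rounding.

```python
# Estimates and standard errors as quoted in this example (rounded).
b0, s_b0 = -0.5496, 0.3976  # intercept
b1, s_b1 = 0.0255, 0.0059   # slope

t_intercept = b0 / s_b0
t_slope = b1 / s_b1

print(f"t for the intercept: {t_intercept:.2f}")  # -1.38
print(f"t for the slope:     {t_slope:.2f}")      # 4.32
```

The slope's t statistic is far from zero, which is consistent with the very small p-value reported for the slope earlier.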

The Conclusion

Note that there are three things happening here. First, you (had the computer) fit the data with a regression line. Second, you used that line to predict y-values given x-values. Third, you used the computer output to tell you whether the relationship (slope) was statistically significant, that is, whether it was significantly different from zero.

The Videos

Excel

In Excel, here are two videos to show you the calculations:

In addition to these videos, there is a large number of videos on YouTube for doing simple linear regression in Excel: Linear Regression in Excel.

And that brings us to the end of this extended example. In this example, we saw how to use software to calculate the line of best fit (regression line) for a given set of data. These data must consist of two numeric variables, each measured on the same population.