This mini-lecture gives you three things. First and second, it shows you how to calculate the sample statistics of Sections 3.1 and 3.2. Third, it offers some videos showing how to calculate them using StatCrunch and Excel. Again, business-track students should be able to use both pieces of software; others, just StatCrunch.

As usual, this page should serve simply as a supplement to and not as a replacement for taking good notes, reading the assigned sections in the text, and watching the assigned videos. Before you start this lecture, make sure you know the classes of variables and why that classification matters.

Measures of Center

Among other things, the science of data analysis focuses on summarizing the data so that the relationships are manifest. If we reduce our measurements to a single value, we would want that single value to represent the entire variable as much as possible. That “typical” value we call the measure of center. Different types of variables allow us to perform different mathematical operations on them. Thus, different types of variables can have different measures of center.

Categorical (Qualitative) Variables

[eyeball]

A part of an eyeball. The colored part of the eye is called the iris after the Greek messenger goddess (before Hermes took her role… or became her).

Recall from above that the only mathematical operation that makes sense for categorical variables is counting. Thus, the measure of center can only depend on counting. The appropriate measure of center for categorical variables is the mode. The mode is the value that occurs most often in the data.

Let us suppose that my variable is eye color, as defined above. I sample 15 people and measure each person’s eye color. Here are the values I measured:

Blue   Blue   Brown  Brown  Brown
Green  Brown  Other  Green  Brown
Brown  Brown  Brown  Blue   Brown

The value “Brown” occurs most frequently, 9 times. Thus, the mode of this data is “Brown”.

A set of data can have one mode, more than one mode, or no mode. By convention, if no data value occurs more than once, then the data have no mode.
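For readers who want to check the count outside StatCrunch or Excel, here is a minimal Python sketch. Python is not one of the course tools; it is used here only as an illustration, and the names `eye_colors` and `modes` are made up for this example. The sketch tallies the eye-color sample above and reports every value tied for the highest count, returning no mode when no value repeats.

```python
from collections import Counter

# The 15 eye colors from the sample above, read row by row.
eye_colors = [
    "Blue", "Blue", "Brown", "Brown", "Brown",
    "Green", "Brown", "Other", "Green", "Brown",
    "Brown", "Brown", "Brown", "Blue", "Brown",
]


def modes(values):
    """Return all values tied for the highest count.

    Following the convention in the text, return an empty list
    when no value occurs more than once.
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []
    return [value for value, count in counts.items() if count == highest]


print(Counter(eye_colors))  # tally of each color
print(modes(eye_colors))    # the mode(s) of the sample
```

Returning a list rather than a single value is a design choice that lets the same function report one mode, several modes, or none.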

Numeric (Quantitative) Variables

Recall that if a variable is quantitative, then both subtraction and addition make sense. The measure of center for quantitative variables takes advantage of that. The problem is that there are two possible measures of center that rely on these operations: the mean and the median. Both can be used. However, experience has taught us that when the data are heavily skewed, the median better represents the typical value; when the data are symmetric, the mean does.

The mean has a nice formula (page 135). The population mean is symbolized by the Greek lowercase letter mu, $\mu$. The sample mean, what we usually work with in statistics, is symbolized by a Roman letter with a bar on top. For instance, if the variable is symbolized by x, then the (sample) mean of the variable is symbolized by $\bar{x}$ (pronounced “x-bar”).
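For reference, the standard formula for the sample mean of $n$ observations $x_1, x_2, \ldots, x_n$ is given here. The indexing notation is an assumption and may differ slightly from the presentation in the text.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

In words: add up all of the values and divide by how many there are. The population mean $\mu$ is computed the same way, using every member of the population instead of a sample.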

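The median does not have a nice formula (page 135). To calculate the median of a sample, rank the data from smallest to largest. The median is the middle value of the ranked data (or the average of the two middle values when the sample size is even), so it divides the data into two halves.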
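The ranking procedure can be sketched in a few lines of Python. This is an illustration only, and the function name `median_by_ranking` is hypothetical. It sorts the values and then takes the middle one, or averages the two middle ones when the sample size is even.

```python
def median_by_ranking(values):
    """Median found by ranking the data and taking the middle."""
    ranked = sorted(values)  # rank from smallest to largest
    n = len(ranked)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd sample size: a single middle value splits the data in half.
        return ranked[middle]
    # Even sample size: average the two middle values.
    return (ranked[middle - 1] + ranked[middle]) / 2


print(median_by_ranking([3, 1, 2]))     # odd number of values
print(median_by_ranking([4, 1, 3, 2]))  # even number of values
```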

Example:

Let us suppose that my variable is height. I sample 15 people and measure each person’s height in inches. Here are the values I measured:

70  68  71  62  66
65  73  76  60  70
66  68  66  69  71

A histogram of the data does not suggest the data are heavily skewed, so the mean should be used. The mean of the sample is $\bar{x} \approx 68.0667$. For the record, the median is 68. That the two are very close suggests the data are not heavily skewed.
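As a cross-check on these two numbers, the following sketch recomputes them with Python's standard `statistics` module. Python is not one of the course tools, and the variable names are chosen only for this illustration.

```python
import statistics

# Heights in inches for the 15 people sampled above, read row by row.
heights = [
    70, 68, 71, 62, 66,
    65, 73, 76, 60, 70,
    66, 68, 66, 69, 71,
]

sample_mean = statistics.mean(heights)      # sum of the values divided by 15
sample_median = statistics.median(heights)  # middle value of the ranked data

print(f"mean   = {sample_mean:.5f}")
print(f"median = {sample_median}")
```

The values sum to 1021, so the mean is 1021/15, and the eighth of the 15 ranked values is the median.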

To get the histogram in StatCrunch, after you type the data into the first column (var1), go to Graphics | Histogram. Select var1 and Create Graph!. To get the mean (and median), click on Stat | Summary Stats | Columns. Then select var1 and press Calculate.

Measures of Spread

[dartboards]

Variance and Bias. The optimal situation is when there is neither variance nor bias. Unfortunately, in this world, both exist. Statisticians seek to understand both.

There is no “typical” appropriate measure of spread for categorical data. This is because spread is harder to define when there are no numbers involved. Numeric variables, however, have measures of spread.

The appropriate measure of spread depends only on the appropriate measure of center. For the mode, there is no measure of spread. For the median, the appropriate measure of spread is the interquartile range (IQR), which is Q3 − Q1. For the mean, the appropriate measure of spread is the standard deviation or the variance. As with the measure of center, you need to determine which to use. Once you have done that, let the computer do its job.

If we only have one number to summarize our variable, we should use the appropriate measure of center. If we have the luxury of a second number, that should be the measure of spread. The measure of spread shows how well that measure of center represents each data value. If the spread is large, then the center does a poor job of representing a lot of the data. If the spread is zero, we can reduce all of the data to that measure of center with no loss of information.

Using the above data, the (sample) standard deviation is 4.113856, the IQR is 5, and the (sample) variance is 16.92381.
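These three values can be reproduced with Python's standard `statistics` module, as sketched here for illustration only. One assumption matters: software packages use different conventions for quartiles. The default (`"exclusive"`) method of `statistics.quantiles` agrees with the IQR quoted above for these data, but other methods or other programs may report slightly different quartiles.

```python
import statistics

# Heights in inches for the 15 people sampled above, read row by row.
heights = [
    70, 68, 71, 62, 66,
    65, 73, 76, 60, 70,
    66, 68, 66, 69, 71,
]

sample_sd = statistics.stdev(heights)      # sample standard deviation
sample_var = statistics.variance(heights)  # sample variance

# With n=4, quantiles() returns the three cut points [Q1, median, Q3].
q1, _, q3 = statistics.quantiles(heights, n=4)
iqr = q3 - q1

print(f"standard deviation = {sample_sd:.6f}")
print(f"variance           = {sample_var:.5f}")
print(f"IQR                = {iqr}")
```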

To get the standard deviation, variance, and interquartile range, click on Stat | Summary Stats | Columns. Select var1. Press Next>. From the list on the left, select the statistics you need. In this program, the sample variance is “Variance,” the sample standard deviation is “Std.Dev.,” and the population variance is “Unadj. Variance.” Similarly, the population standard deviation is “Unadj. Std.Dev.”
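The sample and population versions differ only in the divisor: the sample variance divides the sum of squared deviations by $n - 1$, while the population variance divides by $n$. The sketch below, in Python and for illustration only, computes both by hand for the height data and compares them with the standard library. The mapping to the StatCrunch labels in the comments follows the description above and is not taken from StatCrunch's documentation.

```python
import math
import statistics

# Heights in inches for the 15 people sampled above, read row by row.
heights = [
    70, 68, 71, 62, 66,
    65, 73, 76, 60, 70,
    66, 68, 66, 69, 71,
]

n = len(heights)
mean = sum(heights) / n
sum_sq_dev = sum((x - mean) ** 2 for x in heights)  # squared deviations

sample_variance = sum_sq_dev / (n - 1)  # labeled "Variance" above
population_variance = sum_sq_dev / n    # labeled "Unadj. Variance" above

sample_sd = math.sqrt(sample_variance)          # "Std.Dev."
population_sd = math.sqrt(population_variance)  # "Unadj. Std.Dev."

# The hand calculations should match the library functions.
print(sample_variance, statistics.variance(heights))
print(population_variance, statistics.pvariance(heights))
print(sample_sd, population_sd)
```

Because $n - 1$ is smaller than $n$, the sample versions are always at least as large as the unadjusted ones.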

Again, once you have figured out how the calculations are done by hand (see the Project Scarlet link), I strongly urge you to use available technology to do the calculations.

Useful Technology Videos

These two videos show how to perform the sample statistics calculations discussed above. They are specifically for Microsoft Excel.

In addition to these two videos, there are many videos on YouTube for calculating sample statistics in Excel. The following search link will take you to YouTube and provide you with a non-exhaustive list: Sample Statistics in Excel.

That is it. In this mini-lecture, we looked at the sample statistics and how to calculate them by hand and by machine. Get used to performing these calculations by machine. Becoming familiar with using technology will become more and more important as the term continues. Take time now to learn how to use Excel. You will be better off in the future.

There is no “Check It” section for this mini-lecture, because Project Scarlet serves in that role for these concepts. Visit that site for necessary practice.