Understanding Mean and Variance

Two central ideas in statistics are mean and variance. Before analyzing our data, we will first explore these topics to gain a better understanding of what they are and what they tell us.

The Mean

What is the mean? Think of the mean like a balancing point for your data. Imagine each data point is like a weighted circle that you will add to a platform like the one below. The platform is like a number line where you will plot your data points, and the triangle is the fulcrum (where the platform sits and pivots).

Watch what happens when we add a data point to the platform.

The platform is no longer balanced. The weight of the yellow circle throws it off balance.
How can we make this balanced again?

If I have another data point I can place on the other side, I can balance it out.

But what if my data stacks up like this picture below? How can I make this balanced if I plotted all of my data points and have none left to balance it out?

You might be tempted to rearrange the circles and move them around to make the platform balanced. If these circles are my data points and the platform is a number line, I can't move them around because that will change the value of my data points. What else can I do to balance the platform?

I need to shift the fulcrum to the point of balance.

As you add data to the number line platform, the fulcrum must shift to keep the platform straight (balanced). Finding the mean of your data set is like finding this point of balance. Wherever the fulcrum ends up sitting under your number line to make the platform balanced, that value is the mean of your data set.

Let's try it now with some real data.
Here is a set of 5 data points.
We will place the data points on a number line and find the balance point so our number line stays level.

Now, if I want to balance this number line on a fulcrum, what number would the triangle need to be under?
How can we figure this out? Remember when we had one circle on one end of the platform, and we balanced it by placing a circle on the other end? Why did this work in balancing the platform?

If the total distance from the data points to the fulcrum is the same on both sides, then our platform will balance. So let's figure out where the fulcrum needs to be for our data to have this equal total distance on either side.

Looking at the data above, notice that we have 5 data points. Three are grouped closely together and two are further apart. Where do you think the balance point might be so that the total distance to the data points is the same on both sides?
Let's try the midpoint first and see what we get. If I put the fulcrum under 5, the yellow circle is zero distance from the balancing point. The red circle is 4 units away. That is the only data point on that side. Are the orange, purple, and green circles 4 units away from 5 altogether?
The green circle is 2 units away, the purple circle is 3 units away, and the orange circle is 4 units away. That's 9 units on the left side and only 4 on the right, so the number line would be unbalanced. If I have too much on the left side and not enough on the right side, which way do I need to shift the fulcrum?
Let's shift it to 4 and see what happens. Now I have two data points on the right and three data points on the left. Let's see what the distances are.
The red circle is 5 units away and the yellow circle is one unit away. That is a total of 6 units on the right.
The green circle is one unit away, the purple circle is 2 units away, and the orange circle is 3 units away. That is also a total of 6!
So if we place the fulcrum under the 4, the number line is balanced. This balancing point represents the mean of our data set. So the mean of this data is 4.
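
If you would like to check this balancing act on a computer, here is a small Python sketch. The values for orange, purple, and green (1, 2, 3) are the ones used in this lesson; the values for yellow and red (5 and 9) are read off the distances described above, so treat them as an assumption. The function name `distances` is just made up for this illustration.

```python
# Data points from the example: orange, purple, green, yellow, red.
# Yellow = 5 and red = 9 are inferred from the distances in the text.
data = [1, 2, 3, 5, 9]

def distances(points, fulcrum):
    """Return (total distance left of the fulcrum, total distance right of it)."""
    left = sum(fulcrum - x for x in points if x < fulcrum)
    right = sum(x - fulcrum for x in points if x > fulcrum)
    return left, right

print(distances(data, 5))  # (9, 4): too much on the left, not balanced
print(distances(data, 4))  # (6, 6): both sides match, balanced
```

The fulcrum is balanced exactly when the two totals match, which is what happened when we moved it from 5 to 4.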

***Notice that finding the mean did not put the fulcrum in the middle of the number line, and there is also not an equal number of circles on the left and on the right. The number line is still balanced because the total distance of the circles from the fulcrum is the same on the left and on the right.

Now, that wasn't so hard to figure out, but we only had 5 data points. Can you imagine if we had 50, or 100, or 10,000? Would you want to try to find the balancing point on a number line with that many circles? Me neither.
So instead, we can use a formula to calculate the mean. Can you guess how we might be able to calculate the mean for our data? If you think about the value of each data point (orange = 1, purple = 2, green = 3, etc.), and what we did with these values in the example above, you might be able to figure it out. I wanted to find the central value, or the average value, of this set of numbers.

So, if we add up all the values of our data points and divide by the total number of data points, we can easily calculate the mean.
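
As a quick sketch of that rule in Python, using the same five values as the number line example (again assuming yellow = 5 and red = 9, and with `mean` as a made-up helper name):

```python
data = [1, 2, 3, 5, 9]

def mean(values):
    """Add up all the values and divide by how many there are."""
    return sum(values) / len(values)

print(mean(data))  # 4.0, the same balance point we found by shifting the fulcrum
```

The same function works whether the list has 5 values or 10,000, which is the whole point of using a formula.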

When working with a large data set like the one in this study, it can be useful to represent the entire data set with a single value that describes its typical, or average, value. In statistics, that single value is called a measure of central tendency, and the mean is one such measure. You will see in a later part of this investigation that in some cases, the median is a better way to represent the central tendency of your data. This distinction will help you determine what type of statistical test to run to analyze your data. So central tendency is an important topic in statistics. For this study, we will be focusing on the mean.

Let's investigate the mean further with the TI-Nspire. If you have a TI-Nspire calculator, or the TI-Nspire app or software, get that out now. If not, then get a ruler, pencil, and some quarters and you can still follow along as we work through this part of the investigation.
Have your app or software open, or your calculator plugged in to your computer, and then click here to download this investigation to your TI-Nspire. (This is a modified activity from Texas Instruments for the TI-Nspire.)

You can see that the first page of the investigation explains that we will be exploring the balance point of quarters taped to a ruler. This concept is very similar to our previous investigation when we balanced the number line with the data circles.

Click on page 1.2 to continue. Notice that you now have a ruler (the number line) with quarters (the circles) taped to it. Is the ruler balanced on the fulcrum? Why or why not?
**If you are working with a pencil, ruler, and quarters, tape the quarters onto your ruler at the positions indicated below (1 inch, 2 inches, 3 inches, 6 inches, and 8 inches). Place the pencil under the ruler at 7 inches. Is the ruler balanced?

Check your answer by scanning the QR code below.

Where would the fulcrum need to be to balance the ruler? Think about which direction the fulcrum would need to shift.
Then click on the fulcrum and drag it to see if you can balance the ruler.
What number is the fulcrum under when the ruler is balanced?
**Think about which direction your pencil needs to roll to balance the ruler. Move the pencil in that direction and see where the pencil needs to be to balance the ruler. What inch measurement is the pencil under when the ruler is balanced?

When you finish exploring this on the software or with your own ruler, watch the video below to check your answer.

Let's try one more. Click through to page 2.2.
I want you to do the same thing as before. Look at the ruler and decide whether or not it is balanced. If it is not, think about which way you need to move the fulcrum before you start moving it. Try to figure out what number the fulcrum would need to be under to balance the ruler. Then move the fulcrum to see if you are right.
**Re-arrange the quarters on the ruler to match the picture below. Having two circles on 10 means you would tape two quarters to the ruler at 10 inches. Then repeat the process as outlined above.

What number is the point of balance?

So when you calculate the mean of a data set, I want you to remember what you are doing with the data and what the mean represents.
Also remember, to calculate the mean without having to move a fulcrum to balance a number line, you add together all of the numbers in the data set and then divide the sum by the total count of numbers.

If you want to practice your skills in calculating the mean, try out the game below.

Variance

What is variance?

If you look at the pictures below, you can see that the data points in the first picture are much more spread out than the data points in the second picture.
Calculating variance is a way to measure this spread of data.

Variance is closely related to the mean of a data set. In fact, the first thing you need to do when calculating variance is to calculate the mean.

We have already calculated the mean of our data set and found it to be 4. The next step for variance will also be familiar to you because we used this method to calculate the mean when we were looking for the point of balance for the number line.
We need to find the distance from each data point to the mean. After we find this value, we will then square it.
Let's look at this in the form of an area model.
If we find the distance from our data point to the mean, and then create a square with those dimensions, then the size of the square represents how far away the data point is from the mean. For example, if a data point is three units away from the mean, then I would make a 3 unit by 3 unit square. If a data point is five units away from the mean, I would make a 5 unit by 5 unit square. This bigger square shows that this data point is further from the mean than the first data point.
Let's make these squares for our data points.

Ok, now that we have found the squares of all of our differences from the mean, we need to take those squares and build a rectangle, just like building with LEGO bricks. The base of our rectangle needs to be as many units long as we have data points. We have 5 data points, so the base is 5 units long. Let's stack our squares and see what happens.

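The height of our stacked rectangle is the value of the variance. The squares have a total area of 9 + 4 + 1 + 1 + 25 = 40 square units, so on a base of 5 units the rectangle is 8 units high.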
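
Here is one way the stacking could be sketched in Python. It assumes the same five data values as before (with yellow = 5 and red = 9 inferred from the number line example), and the variable names are only for illustration.

```python
data = [1, 2, 3, 5, 9]
mean = sum(data) / len(data)            # 4.0

# Side length of each square = distance from the data point to the mean.
sides = [abs(x - mean) for x in data]   # [3.0, 2.0, 1.0, 1.0, 5.0]

# Area of each square = side length squared.
areas = [s ** 2 for s in sides]         # [9.0, 4.0, 1.0, 1.0, 25.0]

total_area = sum(areas)                 # 40.0
base = len(data)                        # 5 units, one per data point
height = total_area / base              # 8.0, the height of the stacked rectangle

print(height)
```

Reshaping the squares into one rectangle keeps the total area the same, so dividing that area by the base gives the height without any actual stacking.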
Now, you might be thinking, this was easy to do because we had a square that was exactly 5 units long, and we had all these squares that neatly stacked on top to make a nice rectangle. Yes, I know. (It's not as easy as you think to get it to work out that way!) You also might be thinking that it would be much more difficult to calculate the variance this way if you had a larger data set, or one where the squares didn't neatly stack up. You could always think of these squares as being built with LEGO bricks that break apart and can be rearranged into a rectangle, but who has time for that, right? If I'm going to play with LEGO bricks, I'd rather be building the Millennium Falcon and not the variance of a data set...

It's a good thing we also have a formula to calculate the variance. I already told you that the first thing you have to do is calculate the mean. Do you remember the next part? Remember the squares.
We need to square each difference from the mean. Then we need to add the squares up and divide by the number of data points. Does this sound familiar? Essentially, you are finding the average (or the mean) of the squared differences.
When we stacked our squares on a base of length n (the number of data points), we were averaging the squared differences.
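
The formula described above can be written as a short Python function. This is only a sketch: `variance` is a made-up name, the data values assume yellow = 5 and red = 9 as in the number line example, and the function divides by n exactly as this lesson does. Python's built-in `statistics.pvariance` uses the same divide-by-n rule, so it makes a handy cross-check.

```python
from statistics import pvariance

def variance(values):
    """Average of the squared differences from the mean (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

data = [1, 2, 3, 5, 9]
print(variance(data))                      # 8.0
print(variance(data) == pvariance(data))   # True
```

Keep in mind that some calculators and spreadsheet functions divide by n - 1 instead of n, so a different tool may report a slightly different number for the same data.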

**Remember, this number tells us how spread out our data is, so if you have a larger variance, you have more spread in your data and your squares will stack higher. This makes sense since the squares are bigger when the data points are further away from the mean. Imagine stacking a bunch of red and orange squares. If your variance is small, then your data is clustered closer to the mean of the data set. When you stack the squares for this data set, the height would be much smaller since the squares have less area. Imagine stacking a bunch of green and yellow squares. They wouldn't stack very high because they are smaller.
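
To see the difference in numbers, here is a small Python comparison. The `spread_out` list is the data set from this lesson (assuming yellow = 5 and red = 9); the `clustered` list is made up for this illustration so that it has the same mean of 4 but sits much closer to it.

```python
from statistics import pvariance

spread_out = [1, 2, 3, 5, 9]   # the lesson's data, mean 4
clustered = [3, 4, 4, 4, 5]    # hypothetical data, also mean 4

print(float(pvariance(spread_out)))  # 8.0, big squares stack high
print(float(pvariance(clustered)))   # 0.4, small squares barely stack at all
```

Both lists balance at the same point, so the mean alone cannot tell them apart. The variance is what captures how tightly the values hug that balance point.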

Another way to think about variance is to visualize your data in a box and whisker plot. You might remember doing this in Algebra class. If your data have a larger variance, then the spread of your data will be greater. This means that the box and whisker plot representing your data will generally have a longer box and whiskers. If you have less spread in your data, the box and whiskers will generally be shorter. Click on the graph below to go to a box and whisker plot I have set up in Desmos.
You can change the values in the list of data (L1) to see the difference. Make the data very close together and see what your box and whisker plot looks like now. Change the data set to numbers that have a very large difference in value and see what the box and whisker plot looks like now.

Now that you've had a chance to play around with a box plot that I made, I want you to create your own. Click on the Box Plotter picture below to start the activity. Create a box plot that has a small variance. Remember, this will have a data set with numbers that are close to your mean. Then underneath that box plot, create another one with a large variance. This will have a data set with numbers further away from your mean. Compare the two box plots. What do you notice? Now, create a third box plot. Randomly input 9 numbers. Create the box plot. What do you notice? Does your data set have a large or small variance? How can you tell?
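
If you want to check your box plots with numbers, this Python sketch computes the five values a box plot is drawn from (minimum, first quartile, median, third quartile, maximum). Both data sets are made up for this illustration and each has 9 numbers with a mean of 50. The helper names are hypothetical, and the quartile rule used here (the `"inclusive"` method of `statistics.quantiles`) is only one of several conventions, so a plotting tool may place the edges of the box slightly differently.

```python
from statistics import pvariance, quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def box_length(values):
    """Length of the box: third quartile minus first quartile."""
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1

close = [48, 49, 49, 50, 50, 50, 51, 51, 52]   # hypothetical, small variance
wide = [10, 20, 30, 40, 50, 60, 70, 80, 90]    # hypothetical, large variance

for name, values in (("close", close), ("wide", wide)):
    print(name, five_number_summary(values), box_length(values), float(pvariance(values)))
```

For these two lists, the one with the larger variance also has the longer box and the longer whiskers, which matches what you should see when you plot them.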

Now that you understand the concepts behind mean and variance, let's move on to figuring out the mean and variance of the data set for this study.

Mean and Variance for Math Gifted and Talented Data

Now let's apply these same concepts to a much larger data set.
Below is the data for our study. Excel is an easy way to calculate the mean and variance of a large data set.
The first thing we need to do is arrange our data in the table. Notice that we have a state column, a male column and a female column. These numbers represent the percent in decimal form of males and females enrolled in the math gifted and talented programs in the United States in 2011-2012. At the bottom of these two columns, we calculate the mean by summing up all the numbers and dividing by the number of data points. In our case the number of data points (n) is 51 (50 states plus Washington D.C.). This is the average enrollment by gender in the United States. The fourth and fifth columns calculate how far away each state percentage is from the mean. Notice how some are positive and some are negative. If the value is greater than the mean, the state enrolls more than the national average. If this is true, then the deviation (how far away it is from the mean) is positive. If the value is less than the mean (the state enrolls less than the national average), then the deviation is negative.

You might have been wondering why we square the differences when calculating the variance. Can you answer that question now? If I were to add up the differences from the mean, what would happen with the positive and negative values?
Scan the QR code to check your answer.

The sixth and seventh columns show the square of the differences.
At the bottom of the chart, we sum up these squares of differences and divide by the number of data points to get the variance (the spread of our data).
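
The spreadsheet columns described above can also be sketched in Python. The rows below are placeholder values invented for this illustration, NOT the study's data, and the state labels and the `column_summary` helper are hypothetical. The real table has 51 rows.

```python
# Placeholder rows: (label, male share, female share) in decimal form.
# These numbers are made up; they are not the enrollment data from the study.
rows = [
    ("State A", 0.06, 0.07),
    ("State B", 0.04, 0.05),
    ("State C", 0.08, 0.09),
    ("State D", 0.02, 0.03),
]

def column_summary(values):
    """Return (mean, deviations, squared deviations, variance) for one column."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]   # positive above the mean, negative below
    squared = [d ** 2 for d in deviations]    # squaring removes the signs
    variance = sum(squared) / n               # divide by n, as in the chart
    return mean, deviations, squared, variance

male = [row[1] for row in rows]
female = [row[2] for row in rows]

for name, column in (("male", male), ("female", female)):
    mean, deviations, squared, variance = column_summary(column)
    print(name, round(mean, 4), round(variance, 6))
```

If you are working in Excel, the built-in AVERAGE and VAR.P functions do the same divide-by-n calculations; the cell ranges you give them depend on how your own sheet is laid out.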

You might have noticed that there is another calculation at the bottom of the chart - Standard Deviation. You might have also noticed that I designated the variance as s^2 and the standard deviation as s. Can you guess how we might calculate the standard deviation?

If we take the square root of the variance, we will get the standard deviation. Standard deviation measures the same thing that variance does, the spread of the data, but in the same units as the data instead of squared units. Finding the variance is a step toward getting the standard deviation. We needed to square our differences to take care of the problem of positive and negative numbers canceling out. So at the end, we take the square root to get the standard deviation.
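
Here is a short Python sketch of both ideas, the canceling and the square root, using the small five-point data set from the number line example (assuming yellow = 5 and red = 9) rather than the study data.

```python
import math

data = [1, 2, 3, 5, 9]
mean = sum(data) / len(data)                  # 4.0

deviations = [x - mean for x in data]         # [-3.0, -2.0, -1.0, 1.0, 5.0]
print(sum(deviations))                        # 0.0, positives and negatives cancel

variance = sum(d ** 2 for d in deviations) / len(data)   # 8.0
std_dev = math.sqrt(variance)                 # about 2.83

print(variance, round(std_dev, 2))
```

Adding the raw deviations always gives zero, which tells us nothing about spread. Squaring first and taking the square root at the end gets around that problem.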

Now that we have calculated the standard deviation of our data set, what does this mean?
This number is the value that represents one standard deviation away from the mean of our data set. Remember from our Excel spreadsheet that deviation is how far away a number is from the mean. By finding the standard deviation, we created a way to measure that distance in a consistent way for all of our data points.
So think back to the difference in size of our squares. If one data point is two standard deviations away from the mean and another data point is three standard deviations away from the mean, the second data point would be further away (has a larger square).
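
As a sketch of what "how many standard deviations away" looks like in practice, the Python below divides each data point's distance from the mean by the standard deviation. It uses the five-point example from earlier (assuming yellow = 5 and red = 9), and the function name is made up for this illustration.

```python
import math

data = [1, 2, 3, 5, 9]
mean = sum(data) / len(data)                                        # 4.0
std_dev = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data)) # about 2.83

def standard_deviations_from_mean(x):
    """Distance from the mean, measured in standard deviations (sign shows direction)."""
    return (x - mean) / std_dev

for x in data:
    print(x, round(standard_deviations_from_mean(x), 2))
# 9 comes out at about 1.77, the furthest from the mean (the biggest square)
# 5 comes out at about 0.35, one of the two closest (the smallest squares)
```

A negative result simply means the data point is below the mean; the size of the number is what tells you how far away it is.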

Below you can see the graph of a normal distribution of data. Most of the data points (about 68%) fall within one standard deviation of the mean. Almost all of the data points (about 95%) fall within two standard deviations of the mean. This is what you would expect to see if your data are normally distributed. This is sometimes called a bell curve because it looks like a bell.
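
If you are curious where "most" and "almost all" come from, Python's `statistics.NormalDist` can compute the share of a perfectly normal distribution that lies within a given number of standard deviations of the mean. This is a sketch for an idealized bell curve, not for our study data, and `share_within` is a made-up helper name.

```python
from statistics import NormalDist

# An idealized bell curve with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of the distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print(round(share_within(1), 4))  # 0.6827
print(round(share_within(2), 4))  # 0.9545
```

Real data rarely matches these shares exactly, so they work best as a rough guide for judging whether a data set looks approximately normal.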

Once we have our data in Excel, we can also easily create different graphs to represent our data.
Choosing which graph to use to best represent our data is sometimes a challenge. I can take data in an Excel spreadsheet and create over 20 different graphs, but that doesn't mean all of those graphs will give me something useful or meaningful to look at for my study. Make sure when you choose the graph you want to visualize your data, you understand what it is representing. Try to explain it to yourself or a friend to see if it matches up with what you want it to show.

The first graph I created is shown below. It is a stacked bar graph. This shows both gifted students and non-gifted students by gender for each state. By stacking the different representations on the same graph, you can compare the percentages by gender in each state.

Since we are interested in seeing the difference between male and female enrollment, the bar graph below shows male and female percentages next to each other by state. What trend do you see when you look at this graph?
Analyze the graph, think about your answer, then scan the QR code to check.

Does the graph show us if there is a statistically significant difference?
No, it just shows the percentages. We will have to run tests to see if the difference is significant. The graph does show us something interesting. I assumed there would be more males than females enrolled. From this graph, it looks like there were more females than males enrolled in GT math programs in 2011-12.

Below is a scatter plot of our data using Excel.
What do you think this is showing you about the spread in our data? Can you interpret this graph?
Scan the QR code below to check your answer.

So what does this mean as it relates to our study?

This graph shows how much variation there is between the states' enrollment of math gifted and talented students overall. It does not clearly show the variation in female and male enrollment for the nation. This graph does give me some information, but I want to adjust it to show the variation in male and female enrollment.

****Remember, when interpreting any graphical representation of data, make sure you understand what it is you are looking at. And, when you choose the graphs to analyze your data, make sure they show what you want them to show.

TinkerPlots is also a great way to graphically represent data. You can import your data directly from an Excel spreadsheet and create graphs like the ones below.

The graph below shows the percent in decimal form of the number of male gifted students in each state. The red line represents the mean of the data. This is another way to visually represent mean and variance. If you think about these rectangles as stacks of LEGO bricks, you can break up the tall ones at the red line and use the pieces to build the smaller rectangles up to the red line. When you are finished, you will have 51 rectangles that all have the same height as the red line. This is the mean of the data set. How much you have to break off or build up for each rectangle is its deviation, how far away it is from the mean, and the variance summarizes those deviations for the whole data set.
You can also see that there is a variation in the blue color of the rectangles. Those with less value are a lighter shade and those with a greater value are a darker shade. If you look at Florida, the percentage of male students enrolled is very close to the national average. So, if a rectangle is close to the same shade of blue as Florida, it is close to the mean. If it is a darker blue, it is greater than the mean (positive deviation on the Excel spreadsheet). If it is a lighter shade of blue, it is less than the mean (negative deviation on the Excel spreadsheet).
You can see that Maryland is a dark blue and much greater than the mean. Massachusetts is a very light blue and much less than the mean.

The graph below is the same as the graph above, but represents the percent in decimal form of the number of female gifted students in each state. Take a look again at Florida, Maryland, and Massachusetts. Does the data look the same or different for those states in this graph? Look at the two graphs and then scan the QR code below to check your answer.
What about your state? Does the data on the male graph and the female graph look about the same or are they different?

How about this graph below? What can you see in this representation of the data? Can you gain any useful insights for our study from this graph?
Scan the QR code below to check your answer.

Now that you understand mean and variance, and we have calculated those for our data set, let's move on to analyzing the data to see if we can reject or fail to reject the null hypothesis. Creating graphs has helped us visualize the data, and you may have an educated guess about the results. Running statistical tests on the data will either back up or refute those educated guesses. Before we move on, I want you to form your own opinion now on how you think our results will pan out. When you are ready, click on the button below to continue the investigation on the "analyzing the data" page.

Click here to continue the investigation with the 'Analyzing the data' page.