William Sealy Gosset was a statistician who worked at the Guinness brewery in the first part of the 20th century. Gosset was interested in developing methods for comparing the quality of beer from batch to batch. Due to the lengthy nature of the brewing process, quality control methods were by necessity limited to comparisons involving relatively small sample sizes. Thus, Gosset worked to develop statistical techniques for comparing the properties of things with small samples.

Gosset recognized that the statistical methods he developed had broad applicability across the sciences, but because the work had been performed while he was employed by Guinness, the company considered his work to be the property of the corporation. Thus, Gosset published his statistical method, based on something he called the "t-statistic", under the ridiculously modest pseudonym "Student". And today, Student's t-test is still regularly used in scientific hypothesis testing. This is just one example of the good things that have been derived from fermented malt beverages.

One form of the t-test is used to explore a relatively simple kind of question. It is used to test whether the average value of some property of two groups of things is the same (or different). So, for example, let's say we wanted to check whether the average stature of adult men in Texas and Wyoming is different. One way we could answer this question is to measure every man in each state and compare the averages, but obviously this would be an extremely impractical, if not impossible, approach to the problem.

Another tack we might take is to select 100 men from each state, measure their heights, and calculate averages. In almost every case, however, no matter which men we select, we will observe differences in those averages. For example, we might find that men in Wyoming average 69.23" in height, while men in Texas average 69.18", because, of course, everything is big in Wyoming. Given those values, however, the question will inevitably arise as to whether this difference is significant. The key word in that sentence is "significant". What does that mean?

Suppose we assume that the average stature of adult men in the two states is the same, and we draw a sample of 100 men from each. What is the probability that we would observe a difference in average stature at least this large? If the probability is really tiny, say less than 1%, then we could be very confident that this difference is real. If the probability is really high, say 90%, then the data give us no reason to think that average stature differs between the two populations as a whole. Student's t-test allows us to calculate this probability. It is a way of telling us whether an observed difference is meaningful. It provides us with a way to quantify certainty.
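To make that probability concrete, here is a rough sketch that estimates it by brute-force simulation rather than with Student's formula. It assumes both states share one population of heights, and the population mean (69.2") and standard deviation (3") are made-up illustrative numbers, not measurements. The function name `simulate_p_value` is likewise just a label for this example.

```python
import random
import statistics

def simulate_p_value(observed_diff, n=100, mean=69.2, sd=3.0,
                     trials=2000, seed=1):
    """Estimate how often two samples drawn from the SAME population
    differ in average by at least observed_diff.

    mean and sd are hypothetical values chosen for illustration only.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw n "Texans" and n "Wyomingites" from one shared population.
        texas = [rng.gauss(mean, sd) for _ in range(n)]
        wyoming = [rng.gauss(mean, sd) for _ in range(n)]
        diff = abs(statistics.fmean(texas) - statistics.fmean(wyoming))
        if diff >= observed_diff:
            hits += 1
    return hits / trials

# Difference between the two sample averages in the example: 69.23" vs 69.18".
p = simulate_p_value(observed_diff=69.23 - 69.18)
print(f"Estimated probability of a gap this large by chance alone: {p:.2f}")
```

Under these assumed numbers, a gap of a few hundredths of an inch turns up in the large majority of simulated draws, which is the sense in which such a difference is "not significant". The t-test reaches the same kind of answer analytically, without having to simulate anything.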

So what does this have to do with bowling? Well, from week to week, month to month, and year to year, we perceive differences in our bowling ability. Last year, my bowling ability seemed to improve dramatically. Six weeks into the season this year, I feel like my bowling has been anything but good. In fact, I feel like I have been bowling a lot worse. My perception is that I am a worse bowler this year than last, but sadly perception can be a sorry judge of reality.

I did a simple t-test. I compared my average game score for 93 games last season to my average score for the first 15 games of this season. I ended last season with an average just under 164. I have begun this season with an average of just under 160. Here's the question the t-test answers: if you begin with the assumption that there has really been no change in my bowling ability, what is the probability of observing a difference in average at least this large, given this number of games? According to William Sealy Gosset's method, the probability is about 53%. In science, we would say that this difference is not significant.
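For anyone who wants to see the mechanics, below is a minimal sketch of an equal-variance two-sample t-test computed from summary statistics. The post does not say which variant of the test was used, and it does not report the spread of the game scores, so the standard deviation of 25 pins is a placeholder assumption and the averages are rounded to 164 and 160. The p-value it prints will therefore not necessarily match the 53% quoted above. The helper names (`t_pdf`, `two_sided_p`, `pooled_t_test`) are made up for this example.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|), integrating the density over [0, |t|] with Simpson's rule."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

def pooled_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two-sample t-test from means, SDs, and sample sizes."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / std_err
    return t, df, two_sided_p(t, df)

ASSUMED_SD = 25.0  # hypothetical: the post does not give the score spread

# Last season: 93 games averaging about 164. This season: 15 games, about 160.
t, df, p = pooled_t_test(164, ASSUMED_SD, 93, 160, ASSUMED_SD, 15)
print(f"t = {t:.2f}, df = {df}, two-sided p = {p:.2f}")
```

The thing to notice is how much the small sample matters: with only 15 games this season, the standard error of the difference is several pins, so a 4-pin drop is well within what chance alone produces.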

In other words, my bowling ability this year seems to be pretty much equivalent to my bowling ability from the prior season. Sure, my average game scores are a few pins less, but my underlying skills pretty much seem to be where they were when we left off last year. So, I should quit freaking out about it, and you should do the same. Why are you freaking out about my bowling ability anyway?

What about the rest of the Movements? Well, the same goes for all of us. As you can see from the graph above, two of us have averaged a few pins less and the other two a few pins more. The average of the team as a whole is remarkably constant. Last year, we averaged 149.7 pins per game. This year, it's 149.6. None of these differences are significant.

I shouldn't really be surprised by these results, except that I thought that the lack of summer bowling might negatively impact my game. That does not appear to be the case. I should also note that this finding should not be interpreted to mean that our skill at bowling is not changing. I think it is, but that change can only be detected over much longer time scales. What never ceases to amaze, though, is the human capacity for seeing causality and difference where there is none. Deep in my gut, I have felt like my bowling has been worse, but my gut reaction couldn't have been more wrong.

Does the reverse also hold true? I feel like my bowling has improved from last season but it's probably just a figment of my imagination.

Yeah, me, too. I think I detected a small leak in my figment on Tuesday night...

Sure, the reverse could hold. It just depends on how big the difference is.

I think you aren't giving enough consideration to your gut feelings. There is probably a lot of information there that you are unable to make sense of, but your subconscious mind has a grasp of.

This is not the same as measuring height or comparing averages. In your t-test example you assume that the average stature of adult men in each state is the same, and from that calculate the probability that we would observe a difference. If the probability was 53% that the difference was not significant, then the probability was 47% the difference was significant. I don't know what science finds that difference insignificant, but if you are predicting rain, I'm taking an umbrella to work. To me, the most you were able to determine was that the data was not sufficient.

Your gut has a lot more data that is significant and has been evaluated and processed. Whether or not that gut feeling has evaluated and processed the data appropriately may depend on your psychological makeup. You may want to get an unbiased opinion for that. :-)