Thursday, January 05, 2012

Conference bowl records mean nothing: A guide for the innumerate

A lot of stupid people who work for prominent publications are saying a lot of really stupid things these days. They are arguing that the Big Ten is a horrible terrible no-good conference because its teams went 4-6 in bowl games and that's terrible and "can't be spun." They are arguing that the SEC is the greatest conference since the Council of Nicaea in 325 CE despite being 4-2 because shut up that's why and they're the ess-eee-see.

Now we like our statistics here, and what we like better than our statistics is fundamentally sound analysis of our statistics. So we're going to do a fundamentally sound analysis of conference bowl game records, and let you all in on the secret: they're meaningless.

The topic is statistics and probability, so I'm going to have to be careful and longwinded because our language is not designed to discuss these topics briefly. For our first analysis, we're going to concentrate on the Big Ten. We're going to propose a null hypothesis, and that hypothesis is that each of the bowl games the Big Ten teams played in was perfectly evenly matched. That is to say, each team had a 50% chance of winning its bowl game.

To see if the Big Ten was really bad, we're going to perform a standard statistical significance test. We would expect the conference to win half of its bowl games, and so we need to calculate the probability that they would win four games or fewer given the null hypothesis. This probability is called the p-value. For the B1G, it's the sum of the probabilities that they would win 0, 1, 2, 3, or 4 of their 10 games. Using basic combinatorics (count the ways to pick which games they win, out of 2^10 = 1024 equally likely outcomes), we can calculate this probability as (1+10+45+120+210)/1024 = 386/1024 = 37.7%. In order to reject the null hypothesis, we need the p-value to be no greater than 5%. So based on the 4-6 bowl record, we can conclude...nothing! Maybe the B1G is as good as everyone else, or maybe not. We can't say. We could modify the null hypothesis slightly and make each B1G team a slight favorite, and we still wouldn't be able to reject that hypothesis. So we don't know if the B1G is better, worse, or the same quality as the other conferences. There's not enough information.
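For anyone who wants to check the arithmetic, here is a minimal Python sketch of that p-value. The function name `lower_tail_p` is made up for illustration, and the 55% figure used for the "slight favorite" case is my own assumption, not a number from any real odds.

```python
from math import comb

def lower_tail_p(wins: int, games: int, p_win: float = 0.5) -> float:
    """P(X <= wins) when X ~ Binomial(games, p_win)."""
    return sum(
        comb(games, k) * p_win**k * (1 - p_win) ** (games - k)
        for k in range(wins + 1)
    )

# Null hypothesis: every game is a coin flip.
big_ten_p = lower_tail_p(4, 10)
print(f"Big Ten, 4-6, even games: p = {big_ten_p:.1%}")  # 37.7%

# Modified null (assumed value): each B1G team is a 55% favorite.
favorite_p = lower_tail_p(4, 10, p_win=0.55)
print(f"Big Ten, 4-6, 55% favorites: p = {favorite_p:.1%}")
```

Either way the p-value sits far above 5%, so neither version of the null hypothesis gets rejected.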

There's never enough information. If a conference plays in four or fewer bowl games, the p-value can never be less than 5% (the best case, a sweep of four games, has probability 1/16 = 6.25%), so we can rule out concluding anything useful about the quality of the Mountain West, WAC, Sun Belt, and Independents immediately. If a conference plays in 5-7 bowl games, they have to lose all of them to reject the null hypothesis in favor of the hypothesis that they suck, and they have to win all of them to reject the null hypothesis in favor of the hypothesis that they're really good.

If a conference plays 8 or 9 games, we can reject the null hypothesis in favor of the "they suck" hypothesis if they win zero games or one game, and we can reject it in favor of the "they kick ass" hypothesis if they lose zero games or one game. If a conference plays 10 games, the cutoff stays the same: winning two games or fewer has a probability of 56/1024 = 5.5%, which just misses the 5% threshold, so "they suck" only if they win one game or fewer, and "they kick ass" only if they win nine games or more.
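The cutoffs for each schedule length can be checked by brute force. This is a rough sketch under the same coin-flip null hypothesis; `max_wins_to_reject` is a name I invented for it. By symmetry, the same number works as the maximum number of losses for the "they kick ass" side.

```python
from math import comb

ALPHA = 0.05  # the usual significance threshold

def max_wins_to_reject(games, alpha=ALPHA):
    """Largest win total whose one-sided p-value P(X <= wins) is <= alpha.

    Returns None when even a winless record is not significant.
    """
    ways = 0      # number of outcomes with this many wins or fewer
    best = None
    for wins in range(games + 1):
        ways += comb(games, wins)
        if ways / 2**games <= alpha:
            best = wins
        else:
            break  # the tail only grows from here
    return best

for games in range(1, 11):
    print(f"{games:2d} games: reject at {max_wins_to_reject(games)} wins or fewer")
```

A result of `None` means no record at all can reject the null hypothesis for that many games.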

The Big East, MAC, and CUSA have each played four games so far and are 3-1, so each conference will finish with either a 3-2 or 4-1 record. Those cases give a p-value of at least 18.8%, so we learned nothing from the bowl games. The ACC went 2-6, which is bad, but not so bad that we can conclude that "they suck." The SEC will go either 4-3 or 5-2 - we're excluding LSU/Alabama because one of them has to win and one of them has to lose - and neither of those records proves "they kick ass." The Pac 12 is 2-5, which is bad, but not statistically significantly bad. Also, to be fair to them, they would have done better if they'd had USC in their bowl lineup instead of UCLA.
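The records in the paragraph above can be run through the same test. This sketch leans on the symmetry of 50/50 games: the chance of a record at least this lopsided in the same direction is the chance of getting at least max(wins, losses) heads in that many coin flips. The helper name `one_sided_p` is hypothetical, and for the conferences still playing I've plugged in their best possible finish.

```python
from math import comb

def one_sided_p(wins: int, losses: int) -> float:
    """One-sided p-value for a record under the coin-flip null hypothesis."""
    games = wins + losses
    extreme = max(wins, losses)
    return sum(comb(games, k) for k in range(extreme, games + 1)) / 2**games

records = {
    "Big East / MAC / CUSA (best case)": (4, 1),
    "ACC": (2, 6),
    "SEC (best case, excluding LSU/Alabama)": (5, 2),
    "Pac 12": (2, 5),
}

for name, (wins, losses) in records.items():
    p = one_sided_p(wins, losses)
    print(f"{name}: {wins}-{losses}, p = {p:.1%}")
```

None of these comes anywhere near 5%.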

That leaves one conference, the Big XII. They're currently 6-1, and they have the chance to go 7-1 if Kansas State beats Arkansas in the Cotton Bowl. If KSU wins, then the p-value for the Big XII (the probability of winning seven or eight games out of eight) will be (8+1)/256 = 9/256 = 3.5%. We may have significance. The Big XII can prove that "they kick ass!"

But, as Lee Corso would annoy you by saying, not so fast, my friend. The significance threshold of 5% only counts if we're doing one comparison. We've done eight comparisons here, and that increases the probability that at least one of our null hypotheses will be rejected by up to eightfold. So the correct thing to do here is a Bonferroni correction and reset our threshold for significance to 5% divided by 8, or 0.625%. Thus, even if KSU wins, the Big XII's p-value of 3.5% is nowhere near that threshold, and we still won't have enough data to conclude "the Big XII kicks ass."
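Here is a small sketch of the correction applied to the Big XII's hypothetical 7-1 finish. The variable names are mine, and the last calculation assumes the eight tests are independent, which is a simplification.

```python
from math import comb

ALPHA = 0.05
NUM_TESTS = 8  # one test per conference examined in this post

# Bonferroni: split the 5% budget evenly across all eight tests.
bonferroni_alpha = ALPHA / NUM_TESTS

# Big XII at a hypothetical 7-1: P(7 or 8 wins in 8 coin flips).
big_xii_p = sum(comb(8, k) for k in (7, 8)) / 2**8

# Chance that at least one of eight tests run at 5% rejects by luck alone,
# ASSUMING the tests are independent.
familywise = 1 - (1 - ALPHA) ** NUM_TESTS

print(f"corrected threshold: {bonferroni_alpha:.3%}")
print(f"Big XII p-value:     {big_xii_p:.3%}")
print(f"significant after correction? {big_xii_p <= bonferroni_alpha}")
print(f"chance of a fluke rejection without correction: {familywise:.1%}")
```

Without the correction, eight tests at 5% each would turn up at least one bogus "significant" result about a third of the time under that independence assumption, which is why the stricter threshold is needed.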

So there you have it. When pundidiots get behind their keyboards or in front of their TV cameras and say, "The B1G sucks because they went 4-6" or "The Big XII rules because they're 6-1," they are full of shit. There are 100 years of classical statistical analysis that can be used to demonstrate that they're full of shit because they're jumping to conclusions that can't be made from the data. The only way for a conference to prove its superiority or inferiority in bowl games is to play in a lot of them (at least eight) and win all of them or lose all of them. If anyone can provide an example of a conference going either 0-8 or 8-0 in bowl games, let me know about this rare occurrence where we learned something useful about conference strength from them.
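The "at least eight" figure falls out of the corrected threshold of 5%/8: a perfect record in n coin-flip games has probability 1/2^n, so we just need the smallest n that gets under the bar. A quick sketch, with a made-up function name:

```python
def min_games_for_sweep(threshold: float) -> int:
    """Smallest number of games where an unbeaten (or winless) record
    has probability at or below the threshold under 50/50 games."""
    games = 1
    while 1 / 2**games > threshold:
        games += 1
    return games

corrected = 0.05 / 8  # Bonferroni threshold for eight conferences
print(min_games_for_sweep(corrected))  # 8
print(min_games_for_sweep(0.05))       # 5, the uncorrected case
```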

Of course, if we really wanted to evaluate conference strength, we could go back to the regular season, look at how teams perform in non-conference games and how strong their non-conference schedules were, maybe factor in things like margin of victory, and work it all out using a fast computer...but no, we're not allowed to do that. That would be unsporting.

Note for experts: there are issues regarding the independence of the different tests we're running, and for a completely rigorous analysis we would probably need to break out heavier statistical machinery. But I just wanted to get the main point across, which is small sample size = no conclusions can be made.

2 comments:

Unknown said...

Not to mention that the Big Ten routinely plays virtual road games against higher-ranked teams. This year alone: Penn State, Wisconsin, and possibly MSU fit both categories, while Iowa, Nebraska, Michigan, Ohio State, Northwestern, Illinois fit at least one. Even Purdue played something approaching a road game, but they played Western.

David said...

No argument here. The only reason I didn't mention that was because it wasn't necessary to make my point. If you factor in the road games, 4-6 probably becomes better than expected.