Wednesday, December 07, 2011

Everyone else has a solution for the BCS, so why can't we?


NCAA institutions are a varied lot: both private and public, both large and small, both teaching-oriented and research-oriented, both colleges and universities. However, without doing the research, there is at least one thing that I am certain every single one of them has: a professor of statistics.

The Michigan Statistics Department has an awesome logo. That is all.

In this post we will discuss two of the many problems with the BCS. The first is that there are humans involved in the formula who are swayed by things such as "brand name," "reputation," and "whatever bullshit Gary Danielson spews during the fourth quarter of the SEC Championship game." Vishnu Parasuraman at Grantland proposes replacing the polls with a committee. The only effect that would have is that the biases of "brand name," "reputation," and "Gary Danielson" would be hidden in a board room in Indianapolis instead of being out in the open for us to mock. Lloyd Carr is the sort of person who'd be on the selection committee, and he's already a Harris Poll voter. You don't see the NCAA putting Ken Pomeroy on the March Madness committee.

This is not to say that computers are unbiased; the second problem with the BCS is that the computer rankings are hobbled and slanted by the silly biases of Jeff Sagarin, Peter Wolfe, Wesley Colley, Kenneth Massey, Richard Billingsley, and some dudes at the Seattle Times (a.k.a. Seattle's fourth best newspaper, behind the Stranger, the Weekly, and the Sinner). Only six. That's ridiculous.

My solution to these problems: no more committees, no more polls. Only computers. But not just six computers, that's not enough! In order to come up with a reasonable BCS computer ranking formula, we need hundreds of rankings. Every school in the NCAA should be invited to submit a team ranking system before the start of the season, and the official football rankings should be a combination of hundreds of rankings created by the smartest statistics students at all of the NCAA's member institutions.

Here's the plan. After the jump.


Part I: How to make an algorithm

0. Rankings will not be allowed to take into account conference affiliation, preseason rank, records from previous seasons, fanbase size, or other such irrelevant factors that are routinely applied in the current system. The only exception is that rankings can use Bayesian methods with a preseason prior distribution to develop early-season rankings, as long as the prior's influence has washed out by the end of the year. (This is what the Sagarin ratings do to get weekly rankings until all the teams are well-connected.)

1. The NCAA will provide a sample of the input data that each algorithm will receive, drawn from prior season data. In a change from the current computer rankings, the data set will include all play-by-play data.

2. Each member institution that cares to participate will submit a rating algorithm. Each algorithm will be open source and posted to an official NCAA website 60 days before the season starts, for analysis and/or derision.

3. Each algorithm will be required not to reward margins of victory greater than 20 points (to prevent running up the score) and to implement a "garbage time detector" specified by the NCAA to determine when a game is out of hand (a sketch of the margin cap appears after this list). Teams may receive higher rankings for reaching garbage time earlier in the game.

4. Each rating system will output a ranking, giving its highest ranked team a score of 1 and its lowest ranked team a score of 0. Algorithms may rank FCS and lower divisions if they desire, but the results must be scaled so that the lowest ranked FBS team has a score of 0.
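To make requirements 3 and 4 concrete, here is a minimal Python sketch of the output end of a submission. The actual rating logic is omitted; the margin cap and the 0-to-1 rescaling are the parts the NCAA would verify, and the team names and numbers are invented for illustration.

```python
def capped_margin(points_for, points_against, cap=20):
    """Margin of victory truncated at the 20-point cap (requirement 3),
    so a 45-point blowout counts the same as a 20-point win."""
    return max(-cap, min(cap, points_for - points_against))

def scale_to_unit_interval(raw_ratings, fbs_teams):
    """Rescale raw ratings so the top team scores 1 and the lowest FBS
    team scores 0 (requirement 4). Lower-division teams may still be
    rated; they simply land below zero on the same scale."""
    top = max(raw_ratings.values())
    fbs_floor = min(raw_ratings[t] for t in fbs_teams)
    return {team: (rating - fbs_floor) / (top - fbs_floor)
            for team, rating in raw_ratings.items()}

# Invented ratings for three hypothetical teams:
raw = {"Team A": 31.2, "Team B": 24.7, "Team C": 12.0}
print(scale_to_unit_interval(raw, fbs_teams=raw.keys()))
# {'Team A': 1.0, 'Team B': ~0.66, 'Team C': 0.0}
```

Anchoring the scale to the lowest FBS team is what lets an algorithm rate FCS teams too without distorting the FBS portion of the poll.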

Part II: Aggregating the algorithms

Once all of the individual rankings are tabulated, they will be aggregated into one absolute ranking. The procedure for this is as follows:

1. The scores for each team in each ranking will be averaged to get a mean score for each team.

2. Each ranking will be evaluated against the mean scores using a deviation statistic:

$$D_j = \sum_i \left| s_{ij} - \bar{s}_i \right|$$

where $s_{ij}$ is the score that ranking $j$ assigns to team $i$ and $\bar{s}_i$ is team $i$'s mean score from step 1.
The 10% of rankings with the worst (i.e., largest) values of the deviation statistic will be discarded. This step is done to eliminate extreme deviations from the group consensus (e.g., a ranking that votes New Mexico #1).

3. The scores for each team in the remaining 90% of the algorithms will be averaged, yielding a final mean score. The rankings will be calculated by ordering the teams according to the final mean score.
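Concretely, the whole aggregation could be as short as the sketch below. It assumes each submitted ranking arrives as a Python dict mapping team names to their 0-to-1 scores, and it uses the sum of absolute deviations from the mean scores as the deviation statistic, per the formula above.

```python
from statistics import mean

def aggregate(rankings):
    """rankings: list of dicts, each mapping team -> score in [0, 1]."""
    teams = list(rankings[0])

    # Step 1: mean score for each team across every submitted ranking.
    mean_score = {t: mean(r[t] for r in rankings) for t in teams}

    # Step 2: deviation statistic for each ranking -- the sum of absolute
    # differences between its scores and the mean scores.
    def deviation(r):
        return sum(abs(r[t] - mean_score[t]) for t in teams)

    # Discard the 10% of rankings with the largest deviation.
    n_drop = round(0.10 * len(rankings))
    kept = sorted(rankings, key=deviation)[:len(rankings) - n_drop]

    # Step 3: recompute the mean over the surviving rankings and order teams.
    final_score = {t: mean(r[t] for r in kept) for t in teams}
    return sorted(final_score.items(), key=lambda kv: kv[1], reverse=True)
```

With 13 submissions, round(0.10 * 13) works out to 1, so exactly one ranking is thrown out, which is what happens in the example below.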

Example: Suppose the Big Ten were to abolish its divisions and choose its championship game participants this way. Each of the 13 member schools of the CIC would submit a ranking of the 12 football-playing teams.

We calculate the deviation statistic for each school's submitted ranking, resulting in the following scores:

Chicago: 1.10
Illinois: 1.67
Indiana: 0.83
Iowa: 0.86
Michigan: 1.31
MSU: 1.07
Minnesota: 1.53
Nebraska: 1.70
Northwestern: 0.90
Ohio State: 1.25
Penn State: 0.83
Purdue: 1.73
Wisconsin: 0.97

We find that Purdue's ranking deviates the most from the mean score, so we discard it (with 13 rankings, the worst 10% rounds to just one) and calculate the final mean score from the remaining twelve rankings. The final ranking is thus #1 Wisconsin (0.96), #2 MSU (0.92), #3 Michigan (0.81), #4 Nebraska (0.72), #5 Penn State (0.61), #6 Ohio State (0.444), #7 Northwestern (0.443), #8 Iowa (0.33), #9 Purdue (0.26), #10 Illinois (0.22), #11 Minnesota (0.11), and #12 Indiana (0.01).

Part III: Answering your questions

Won't schools try to game the system to rank their own team as highly as possible?

Of course they will! Just like they do now in the Coaches' Poll. The first improvement over the current system is that every school will have the chance to game the system this way. The second improvement is that their ranking will be only one out of at least a hundred, so the biases of different schools (e.g., Michigan and Ohio State) should cancel each other out. Also, since a school's ranking won't count toward the final mean score if it deviates too far from the consensus, the programmers will have to be clever to boost their school's rank without getting their poll tossed. If a school figures out a way to do it, then it should be rewarded for being smart. And since the code is open source, every other school can pull the same trick the next year, giving the sneaky school at most a one-year advantage.

Won't this be incredibly confusing?

Yes and no. On the one hand, with so many different ranking systems in the formula, it will be very difficult to work out the implications of different game outcomes. From the point of view of the coaches and players, the only things they need to understand are that winning is better than losing, winning medium is better than winning small, and winning big isn't any better than winning medium. On the other hand, because the entire codebase will be open source, an enterprising analyst can study everyone's algorithm and figure out how and why different scenarios could play out. By giving everyone access to the ghost in the machine, the system will be easier to predict than one where we have to guess at whatever the voices in the heads of Harris Poll voters are saying. During the last week of the season, ESPN could give real-time poll updates as each game ends, which would be a much better use of its time than guessing at what the polls will say.

Won't this still cause problems if there is no clear #1 vs. #2 matchup?

Yes it will. But the same principle can be used to pick the teams in a playoff of any size - for example, six. In fact, modified versions of the same basic procedure could be used to pick the at-large teams for March Madness or the Frozen Four. All you'd have to do is change the football statistics used to make the individual rankings into basketball or hockey statistics.

You're very vague about this garbage time detector thing.

That's because while I know a fair amount about statistics in general, I'm not an expert in advanced football stats. The garbage time detector would probably be based on a win probability model. Garbage time could go into effect when a team has a 95% chance of winning the game, and would stay in effect as long as that team maintains a high probability of winning. This is the part of the process that still needs work.
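To make that a little more concrete, here is one way the detector could be specified, assuming the NCAA supplies a win probability model that emits the leading team's probability of winning after every play. The 95% trigger and the 85% "release" threshold below are placeholders, not settled numbers; the hysteresis just keeps a brief wobble from flipping a blowout back to "competitive."

```python
def garbage_time_flags(win_probabilities, enter=0.95, exit_=0.85):
    """Given the leading team's win probability after each play, return a
    parallel list of booleans marking which plays occur in garbage time.

    Garbage time begins once the win probability reaches `enter` and
    persists until it drops back below `exit_` (simple hysteresis), so a
    brief dip to 0.93 doesn't flip the game back to competitive."""
    in_garbage = False
    flags = []
    for wp in win_probabilities:
        if not in_garbage and wp >= enter:
            in_garbage = True
        elif in_garbage and wp < exit_:
            in_garbage = False
        flags.append(in_garbage)
    return flags

# A team that pulls away midway through the game:
print(garbage_time_flags([0.55, 0.70, 0.96, 0.97, 0.99, 0.93, 0.98]))
# [False, False, True, True, True, True, True]
```

Feeding this back into the rules from Part I could then just be a matter of recording the earliest play at which the flag turns on.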

Why would you eliminate polls that are far from the consensus? Maybe they're right and the majority is wrong.

Maybe so! But, if each ranking is independently trying to arrive at the true ordering of teams and each ranking has its own random biases, the "wisdom of crowds" should take over and the consensus ranking should be close to the true ranking. There may be a version of the central limit theorem out there you could use to actually prove this.
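You don't need a formal theorem to see the effect, though; a quick simulation makes the point. The sketch below assumes each school's submission is the true 0-to-1 score vector plus its own independent noise (the noise level is invented purely for illustration), then checks whether the averaged consensus recovers the true ordering better than any single submission does.

```python
import random
from statistics import mean

random.seed(0)
true_scores = [i / 11 for i in range(12)]    # 12 teams with true scores spread over [0, 1]
n_schools, noise = 100, 0.15                 # noise level invented for illustration

# Each school's submission: the truth plus its own independent random bias.
rankings = [[s + random.gauss(0, noise) for s in true_scores]
            for _ in range(n_schools)]

# The consensus score for each team is the average across all schools.
consensus = [mean(scores) for scores in zip(*rankings)]

true_order = sorted(range(12), key=lambda i: true_scores[i])
consensus_order = sorted(range(12), key=lambda i: consensus[i])
single_order = sorted(range(12), key=lambda i: rankings[0][i])

print(consensus_order == true_order)   # True with near certainty
print(single_order == true_order)      # often False: one noisy poll misorders teams
```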

However, just to make sure the consensus isn't wonky, we'll propose one last twist. There will be a prize for the "best" ranking. We'll evaluate the best ranking according to what I call the upset statistic:

$$U = \sum_{\substack{\text{games where} \\ R_{\text{winner}} > R_{\text{loser}}}} \left( R_{\text{winner}} - R_{\text{loser}} \right) \cdot \text{M.O.V.}$$
where M.O.V. is margin of victory. The intuition is that a good ranking is one with as few upsets of highly-ranked teams by lower-ranked teams as possible, and that if a really low-ranked team blows out a high-ranked team, there's probably something wrong with your ranking. This upset statistic is based on the integer-valued rank an algorithm gives a team, not the continuous number between zero and one, and is thus very difficult to optimize.
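Here is a sketch of how the prize could be scored, assuming final results arrive as (winner, loser, margin) tuples and each algorithm's output has already been converted to integer ranks with 1 as the best; the exact weighting is only a guess consistent with the formula above.

```python
def upset_statistic(ranks, results):
    """ranks: dict mapping team -> integer rank (1 = best).
    results: iterable of (winner, loser, margin_of_victory) tuples.

    Adds a penalty for every upset (a winner ranked below the loser),
    weighted by the size of the rank gap and by the margin of victory.
    Lower totals mean a better ranking."""
    penalty = 0
    for winner, loser, mov in results:
        rank_gap = ranks[winner] - ranks[loser]
        if rank_gap > 0:          # upset: the winner carried the worse rank
            penalty += rank_gap * mov
    return penalty

# Toy example: #5 blowing out #2 costs far more than #3 edging #1.
ranks = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
results = [("E", "B", 21), ("C", "A", 3), ("A", "D", 10)]
print(upset_statistic(ranks, results))   # (5-2)*21 + (3-1)*3 = 69
```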

The prize would be scholarship money for the winning school's ranking team, and could be either given out by the NCAA directly, or a corporate sponsor could get naming rights in return for ponying up the scholarship dough. I suggest we call it the "Dr. Pepper Ten Student Algorithm Poll." It's not for innumerates! Seriously though, don't drink Dr. Pepper Ten. Don't reward hyper-misogynist crap advertising.
