I have reached the correlation section in a course that I teach and I
hit upon the idea of using data from the weekly Bowl Championship
Series (BCS) rankings to illustrate different techniques for assessing
correlation.

For those not familiar with college football in the United States
(where "football" means American football, not the game called soccer
here and football in most other countries), I should explain that a
great many universities and colleges have football teams but each team
plays only 10-15 games per season, so not every team plays every other
team.  The game is so rough that it is not feasible to play more than
one game per week, and a national playoff after the regular season is
impractical: it would take too long, and the players are, in theory,
students first and athletes second.

In place of a national playoff there are various polls of coaches or
sports writers that purport to rank teams nationally.  Several
analysts also publish computer-based rankings that use complicated
formulas based on scores in individual games, strength of the
opponent, etc. to rank teams.

Rankings from two of the "human polls" (the Harris poll of sports
writers and the USA Today poll of the coaches) and from six of the
computer polls are combined to produce the official BCS ranking.  The
Wikipedia entry for "Bowl Championship Series" gives the history and
evolution of the actual formula that is currently used.

This season has been notable for the volatility of those rankings.
One is reminded of the biblical prophecy that "The first shall be last
and the last shall be first".

Another notable feature this year is the extent to which the
computer-based rankings and the rankings in the human polls disagree.
I enclose a listing of the top 25 teams and the components of the
rankings as of last Sunday (2007-10-21).  (Almost all college football
games are played on Saturdays and the rankings are published on
Sundays.)  The columns are

  Rec  - won-loss record
  Hvot - total number of Harris poll votes
  Hp   - proportion of maximum Harris poll votes
  HR   - rank in the Harris poll (smaller is better)
  Uvot, Up, UR - same for the USA Today poll
  Cavg - average score (actually a trimmed mean) on the computer-based
         rankings (larger is better)
  BCS  - BCS score, the average of Hp, Up and Cavg
  Pre  - BCS rank in the previous week
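
To make the definition of the BCS column concrete, here is a minimal
sketch in Python (chosen only for illustration; the name bcs_score is
made up) that recomputes the score for three teams from the rounded
Hp, Up and Cavg values in the listing.  Because the inputs are rounded,
agreement with the listed BCS value can only be expected to about three
decimal places.

```python
# (Hp, Up, Cavg, listed BCS) copied from the listing for three teams.
rows = {
    "Ohio St.":       (0.99895, 0.9987, 0.93, 0.9759),
    "Boston College": (0.93895, 0.9413, 0.97, 0.9501),
    "LSU":            (0.89474, 0.8793, 0.96, 0.9114),
}


def bcs_score(hp, up, cavg):
    """Plain average of the two poll proportions and the computer average."""
    return (hp + up + cavg) / 3


for team, (hp, up, cavg, listed) in rows.items():
    computed = bcs_score(hp, up, cavg)
    print(f"{team:15s} computed {computed:.4f}   listed {listed:.4f}")
```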

As I understand it, the votes in the Harris and USA Today polls are
calculated by asking each voter to list their top 25 teams, then
awarding 25 points for a team ranked 1, 24 points for a team ranked 2,
etc. on each ballot and calculating the total.  Apparently there are
now 114 Harris poll participants and 60 USA Today poll participants,
giving maximum possible scores of 2850 and 1500, respectively.
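
The point scheme as I understand it can be sketched as follows.  This is
an illustration under that understanding, not the polls' official
procedure; the function names and the three-ballot example are
hypothetical, while the voter counts and the Ohio St. totals come from
the text and the listing.

```python
def ballot_points(rank, ballot_size=25):
    """Points from one ballot: 25 for rank 1, 24 for rank 2, ..., 1 for rank 25.

    A team left off the ballot (rank None or out of range) gets 0 points.
    """
    if rank is None or not 1 <= rank <= ballot_size:
        return 0
    return ballot_size + 1 - rank


def max_votes(n_voters, ballot_size=25):
    """Largest possible total: every voter ranks the team first."""
    return n_voters * ballot_size


harris_max = max_votes(114)   # Harris poll participants
usa_max = max_votes(60)       # USA Today poll participants

# Hypothetical example: three ballots rank a team 1st, 2nd and not at all.
example_total = sum(ballot_points(r) for r in (1, 2, None))

# Proportions for Ohio St. from the vote totals in the listing.
ohio_hp = 2847 / harris_max
ohio_up = 1498 / usa_max

print(harris_max, usa_max, example_total)
print(round(ohio_hp, 5), round(ohio_up, 4))
```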

The Cavg column is calculated from six computer scores of 0 to 25
(larger is better), dropping the largest and smallest.  The remaining
four scores sum to a raw score out of 100, and that proportion is
reported as Cavg.
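
A minimal sketch of that trimmed calculation, assuming the description
above is accurate.  The function name and both sets of six scores are
invented for illustration; the actual computer scores are not in the
listing.

```python
def computer_average(scores):
    """Drop the largest and smallest of six 0-25 scores, then report the
    sum of the remaining four as a proportion of 100."""
    if len(scores) != 6:
        raise ValueError("expected six computer scores")
    kept = sorted(scores)[1:-1]   # discard one minimum and one maximum
    return sum(kept) / 100


# Hypothetical team ranked near the top by all six computers.
strong = [25, 24, 23, 22, 25, 20]

# Hypothetical team listed by only one computer: its single non-zero
# score is the maximum, so it is dropped and the result is zero.
fringe = [0, 0, 0, 0, 0, 10]

print(computer_average(strong), computer_average(fringe))
```

Under this rule a team that appears in only one of the six computer
rankings still ends up with a Cavg of zero, which is one way zeros can
accumulate in that column.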

The data frame is available (for a little while) as
http://www.stat.wisc.edu/~bates/BCS.rda

The raw scores and the rankings from the Harris and USA Today polls
are in fairly good agreement but the Cavg scores are very different.
Although scatterplots will show this, I feel that correlation measures
may be thrown off by the large number of zeros in the Cavg scores.
What would be the preferred way of measuring correlation in such a case?
What would be a good graphical presentation showing the lack of
agreement of the various components of the BCS score?

                 Rec Hvot Uvot Cavg Pre      Hp HR     Up UR    BCS
Ohio St.       (8-0) 2847 1498 0.93   2 0.99895  1 0.9987  1 0.9759
Boston College (7-0) 2676 1412 0.97  23 0.93895  2 0.9413  2 0.9501
LSU            (7-1) 2550 1319 0.96  31 0.89474  3 0.8793  3 0.9114
Arizona St.    (7-0) 2003 1089 0.86  35 0.70281  8 0.7260  7 0.7629
Oregon         (6-1) 2281 1225 0.67   3 0.80035  5 0.8167  5 0.7623
Oklahoma       (7-1) 2521 1306 0.51  32 0.88456  4 0.8707  4 0.7551
West Virginia  (6-1) 2157 1134 0.61  36 0.75684  6 0.7560  6 0.7076
Virginia Tech  (6-1) 1831 1052 0.69   4 0.64246 10 0.7013  9 0.6779
Kansas         (7-0) 1671  911 0.75   6 0.58632 11 0.6073 10 0.6479
South Florida  (6-1) 1627  813 0.81  13 0.57088 12 0.5420 12 0.6410
Florida        (5-2) 1867  906 0.61   8 0.65509  9 0.6040 11 0.6230
USC            (6-1) 2100 1060 0.17   7 0.73684  7 0.7067  8 0.5378
Missouri       (6-1) 1568  790 0.53   9 0.55018 13 0.5267 13 0.5356
Kentucky       (6-2) 1156  604 0.55  34 0.40561 15 0.4027 15 0.4528
Virginia       (7-1)  650  466 0.76  12 0.22807 20 0.3107 18 0.4329
South Carolina (6-2) 1031  474 0.39  33 0.36175 17 0.3160 17 0.3559
Hawaii         (7-0) 1265  617 0.00  11 0.44386 14 0.4113 14 0.2851
Georgia        (5-2)  711  402 0.23  14 0.24947 19 0.2680 19 0.2492
Texas          (6-2) 1054  527 0.00  15 0.36982 16 0.3513 16 0.2404
Michigan       (6-2)  643  325 0.26  18 0.22561 21 0.2167 21 0.2341
California     (5-2)  873  397 0.02   5 0.30632 18 0.2647 20 0.1970
Auburn         (5-3)  333  179 0.33  10 0.11684 23 0.1193 23 0.1887
Connecticut    (6-1)   80   75 0.33  29 0.02807 29 0.0500 28 0.1360
Alabama        (6-2)  322  177 0.15  27 0.11298 24 0.1180 24 0.1270
Penn St.       (6-2)  404  294 0.01  20 0.14175 22 0.1960 22 0.1159
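
To make the question concrete, here is a minimal sketch in Python
(standard library only) that compares Pearson's r with two rank-based
measures, Spearman's rho using average ranks for ties and Kendall's
tau-b, on the Hp, Up and Cavg columns typed in from the listing above.
It is not offered as an answer, only as a baseline for discussion; the
function names are my own, and only the top 25 teams shown here are
used.

```python
from itertools import combinations
from math import sqrt

hp = [0.99895, 0.93895, 0.89474, 0.70281, 0.80035, 0.88456, 0.75684,
      0.64246, 0.58632, 0.57088, 0.65509, 0.73684, 0.55018, 0.40561,
      0.22807, 0.36175, 0.44386, 0.24947, 0.36982, 0.22561, 0.30632,
      0.11684, 0.02807, 0.11298, 0.14175]
up = [0.9987, 0.9413, 0.8793, 0.7260, 0.8167, 0.8707, 0.7560, 0.7013,
      0.6073, 0.5420, 0.6040, 0.7067, 0.5267, 0.4027, 0.3107, 0.3160,
      0.4113, 0.2680, 0.3513, 0.2167, 0.2647, 0.1193, 0.0500, 0.1180,
      0.1960]
cavg = [0.93, 0.97, 0.96, 0.86, 0.67, 0.51, 0.61, 0.69, 0.75, 0.81,
        0.61, 0.17, 0.53, 0.55, 0.76, 0.39, 0.00, 0.23, 0.00, 0.26,
        0.02, 0.33, 0.33, 0.15, 0.01]


def average_ranks(values):
    """Ranks from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Ordinary product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, which adjusts the denominator for tied pairs."""
    conc = disc = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sx = (x1 > x2) - (x1 < x2)
        sy = (y1 > y2) - (y1 < y2)
        if sx == 0 and sy == 0:
            continue              # tied in both; counts toward neither term
        if sx == 0:
            only_x_tied += 1
        elif sy == 0:
            only_y_tied += 1
        elif sx == sy:
            conc += 1
        else:
            disc += 1
    return (conc - disc) / sqrt((conc + disc + only_x_tied)
                                * (conc + disc + only_y_tied))


for label, a, b in [("Hp vs Up", hp, up), ("Hp vs Cavg", hp, cavg),
                    ("Up vs Cavg", up, cavg)]:
    print(f"{label:11s} pearson {pearson(a, b):.3f}  "
          f"spearman {spearman(a, b):.3f}  tau-b {kendall_tau_b(a, b):.3f}")
```

The rank-based measures depend only on the ordering, so a block of
zeros enters as a set of tied ranks rather than as a cluster of
identical values at one end of the scale.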