The rather heated discussion on the median has a historic quality. It is 
not quite as old as the arguments about how many angels can sit on the 
head of a pin, but not too far off. Perhaps I can cool things down by
providing some perspective.

There are, to the eyes of someone who has been teaching and working with 
this stuff for nearly 4 decades, three issues that are causing conflict:

1) The question of what we actually want "MEDIAN" to compute.
2) The question of how to express this clearly.
3) The question of how to actually do it.

Question (1) concerns the allowed values for the quantity returned by a 
function of type "quantile", the general term for a median, percentile, 
or any other quantity that returns the "value in the scale of the data 
with rank k from the minimum (or, alternatively, from the maximum".

We can insist on returning values from within the original data set, in 
which case we must sometimes choose between two values, or we can 
specify a rule for interpolation.

The most common rule for the median is to interpolate when the number of 
elements, n, is even, leading to the "definition" that the median is the 
average of the middle ranked values for this case, while it takes the 
value of the middle rank datum when n is odd. I believe all the 
spreadsheets use this rule. Percentiles are usually (but not always) 
computed using interpolation, and I have seen some variety in the values 
given for quartiles in different statistical packages. Rob Hyndman 
published a quite good paper in The American Statistician about 8-10 
years ago on the subject.

John Tukey (stem and leaf graphs, box and whisker plots, five number 
summaries, Exploratory Data Analysis, etc.) liked to have measures that 
belonged to the data set. So he defined Depth as the rank of an 
observation from the nearest "end" (highest or lowest) and Tukey 
boxplots use Hinges rather than Quartiles (which are generally 
interpolated, but not always by consistent formulas).

See http://macnash.admin.uottawa.ca/files/stembox.txt for a bit of a 
discussion on Depths etc.

Ultimately, this question boils down to where to cut to divide 4 candies 
among 5 children. No matter what you do, things get ugly. For large n, 
whether you use interpolation, and what formula for interpolating, 
becomes less important.

When we come to (2), I am surprised nobody is using RANK rather than 
SORT. To compute quantiles, one does need to rank the data in some way, 
but we don't actually need to sort it. There were a number of articles 
about 30 years ago on ranking vs sorting, and if anyone is interested I 
can try to dig some up. If I were writing the MEDIAN definition for the 
TC, I would change it in one word to:

MEDIAN logically _ranks_ the numbers (lowest to highest).  If given an 
odd number of values, MEDIAN returns the middle value. If given an even
number of values, MEDIAN returns the arithmetic average of the two
middle values.

In other words, I'll live with the interpolation. Actually, I believe 
Tukey (someone correct me if I'm wrong, my copy of EDA is in another 
office) defined median as the value with maximum depth from either end 
or the arithmetic average of the two values having maximal depth if the 
value is not unique.

Question (3) is one for the programmers and algorithm performance 
people. I think a fast ranking algorithm would be very useful to improve 
performance in a lot of spreadsheet features, not just the median. But 
it should be an algorithm that is quite "clean" so that application to 
more than just a column or row range doesn't end up causing all kinds of 
potential for bugs. For the sake of development and testing, it may be 
useful to use compile or even execution switches based on some stored 
control parameter, so that informed users can try out different ranking 
algorithms. Note that this may be helpful for increasing sort 
performance on very large sheets.

Note that RANK() gives the rank of a number in a set. We actually want 
the number that is ranked at some level.

No doubt the discussion will continue.

JN

_______________________________________________
gnumeric-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/gnumeric-list

Reply via email to