Dear Prof. Nash,

Your reply was truly informative. I had overlooked ranking and deeply 
regret this. I can definitely live with the modified wording of the 
definition, using "MEDIAN logically _ranks_ the numbers (lowest to 
highest)." instead of sorting.

I do not know if you participate in the OASIS draft preparation, but it 
would be a huge gain for the project if you could review the draft document.

I need to correct some of my initial claims about the median algorithm 
which I will briefly discuss below.

2. I noticed some errors in my reasoning about the median algorithm. 
While an element can in principle be inserted into a sorted, n-sized 
collection in O(log n) time, that is not possible with a plain array; 
it requires more complex data structures (such as balanced trees), 
which would introduce a penalty for every other operation on the data 
and clearly backfires. With a plain array, on the other hand, every 
insert needs a memmove() to shift part of the array, which is quite 
slow on the x86 architecture (vector architectures and DSP chips offer 
block memory moves, but to my knowledge nothing similar has been 
implemented on x86 - at least I could not find any literature on it). 
So IF one uses this insert method (moving parts of the array), the 
total cost unfortunately approaches O(n^2). I am really saddened that 
the x86 architecture has not moved any further in the last dozen years.
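To make the cost concrete, here is a minimal Python sketch (my own 
illustration, not spreadsheet code): bisect.insort finds the insertion 
point by binary search in O(log n), but each insert still shifts the 
tail of the array - the memmove cost - so n inserts cost O(n^2) overall.

```python
import bisect

def running_medians(values):
    """Maintain a plain sorted array; each insort does an O(log n)
    binary search but an O(n) element shift (the memmove), so n
    inserts cost O(n^2) in total."""
    sorted_vals = []
    out = []
    for v in values:
        bisect.insort(sorted_vals, v)  # position: O(log n); shift: O(n)
        m = len(sorted_vals)
        if m % 2:                      # odd count: middle element
            out.append(sorted_vals[m // 2])
        else:                          # even count: average the two middles
            out.append((sorted_vals[m // 2 - 1] + sorted_vals[m // 2]) / 2)
    return out

print(running_medians([5, 1, 3, 2, 4]))  # -> [5, 3.0, 3, 2.5, 3]
```

The shifting is hidden inside insort, which is why the approach looks 
logarithmic on paper but behaves quadratically on large inputs.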

3. Nevertheless, I tested various algorithms this weekend, and non-sort 
algorithms really are 20-50 times faster than Quick Sort (see 
QuickSelect, http://ndevilla.free.fr/median/median/node20.html ). The 
odd thing was that while implementing my own algorithm (I coded some 
conditions wrong and thereby broke it), I could still compute an almost 
accurate median on large sets of random numbers at lightning speed: 
often exact, and always within +/- 0.5 of the true median, even though 
the broken algorithm pseudorandomly dropped about half of the elements 
without doing any comparison on them. This leads me to conclude that, 
for large data sets with at least some randomness in their 
distribution, even faster algorithms can be devised.
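For reference, here is a small Python sketch of QuickSelect (a 
Hoare-partition variant; the function names are mine, not taken from 
the page linked above). It works on a copy, so the input stays 
unmodified, and runs in O(n) on average:

```python
import random

def quickselect(data, k):
    """Return the k-th smallest element (0-based) without fully
    sorting. Average O(n); operates on a copy of the input."""
    a = list(data)
    lo, hi = 0, len(a) - 1
    while True:
        if lo == hi:
            return a[lo]
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                  # Hoare partition around the pivot
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if k <= j:                     # k-th value lies in the left part
            hi = j
        elif k >= i:                   # k-th value lies in the right part
            lo = i
        else:                          # k landed on the pivot band
            return a[k]

def median(data):
    """Median via QuickSelect, interpolating for even-sized inputs."""
    n = len(data)
    if n % 2:
        return quickselect(data, n // 2)
    return (quickselect(data, n // 2 - 1) + quickselect(data, n // 2)) / 2

print(median([4, 1, 3, 2]))  # -> 2.5
```

Note the even case runs two selections; a production version would 
find both middle values in a single pass.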

The reason why I encourage alternative algorithms for computing the 
median stems both from their speed and from the way they work: some 
read the data set only once (e.g. QuickSelect), while others do NOT 
modify the input array. So there are many situations where they are 
preferable to a sorting algorithm. Even for a data set of moderate 
size, if your sheet computes 100 medians and you correct one element 
(thereby triggering a formula update), a quick algorithm makes a 
difference.

Kind regards,

Leonard Mada



Prof J C Nash wrote:
> The rather heated discussion on the median has a historic quality. It is 
> not quite as old as the arguments about how many angels can sit on the 
> head of a pin, but not too far off. Perhaps I can cool things down by
> providing some perspective.
>
> There are, to the eyes of someone who has been teaching and working with 
> this stuff for nearly 4 decades, three issues that are causing conflict:
>
> 1) The question of what we actually want "MEDIAN" to compute.
> 2) The question of how to express this clearly.
> 3) The question of how to actually do it.
>
> Question (1) concerns the allowed values for the quantity returned by a 
> function of type "quantile", the general term for a median, percentile, 
> or any other quantity that returns the "value in the scale of the data 
> with rank k from the minimum (or, alternatively, from the maximum)".
>
> We can insist on returning values from within the original data set, in 
> which case we must sometimes choose between two values, or we can 
> specify a rule for interpolation.
>
> The most common rule for the median is to interpolate when the number of 
> elements, n, is even, leading to the "definition" that the median is the 
> average of the middle ranked values for this case, while it takes the 
> value of the middle rank datum when n is odd. I believe all the 
> spreadsheets use this rule. Percentiles are usually (but not always) 
> computed using interpolation, and I have seen some variety in the values 
> given for quartiles in different statistical packages. Rob Hyndman 
> published a quite good paper in The American Statistician about 8-10 
> years ago on the subject.
>
> John Tukey (stem and leaf graphs, box and whisker plots, five number 
> summaries, Exploratory Data Analysis, etc.) liked to have measures that 
> belonged to the data set. So he defined Depth as the rank of an 
> observation from the nearest "end" (highest or lowest) and Tukey 
> boxplots use Hinges rather than Quartiles (which are generally 
> interpolated, but not always by consistent formulas).
>
> See http://macnash.admin.uottawa.ca/files/stembox.txt for a bit of a 
> discussion on Depths etc.
>
> Ultimately, this question boils down to where to cut to divide 4 candies 
> among 5 children. No matter what you do, things get ugly. For large n, 
> whether you use interpolation, and what formula for interpolating, 
> becomes less important.
>
> When we come to (2), I am surprised nobody is using RANK rather than 
> SORT. To compute quantiles, one does need to rank the data in some way, 
> but we don't actually need to sort it. There were a number of articles 
> about 30 years ago on ranking vs sorting, and if anyone is interested I 
> can try to dig some up. If I were writing the MEDIAN definition for the 
> TC, I would change it in one word to:
>
> MEDIAN logically _ranks_ the numbers (lowest to highest).  If given an 
> odd number of values, MEDIAN returns the middle value. If given an even
> number of values, MEDIAN returns the arithmetic average of the two
> middle values.
>
> In other words, I'll live with the interpolation. Actually, I believe 
> Tukey (someone correct me if I'm wrong, my copy of EDA is in another 
> office) defined median as the value with maximum depth from either end 
> or the arithmetic average of the two values having maximal depth if the 
> value is not unique.
>
> Question (3) is one for the programmers and algorithm performance 
> people. I think a fast ranking algorithm would be very useful to improve 
> performance in a lot of spreadsheet features, not just the median. But 
> it should be an algorithm that is quite "clean" so that application to 
> more than just a column or row range doesn't end up causing all kinds of 
> potential for bugs. For the sake of development and testing, it may be 
> useful to use compile or even execution switches based on some stored 
> control parameter, so that informed users can try out different ranking 
> algorithms. Note that this may be helpful for increasing sort 
> performance on very large sheets.
>
> Note that RANK() gives the rank of a number in a set. We actually want 
> the number that is ranked at some level.
>
> No doubt the discussion will continue.
>
> JN
>
> _______________________________________________
> gnumeric-list mailing list
> [email protected]
> http://mail.gnome.org/mailman/listinfo/gnumeric-list
>
>   
