Re: returning top two values

Rob Wed, 22 May 2002 04:17:00 -0700

I see at least three different techniques, each of which performs better for
different datasets. First, let's formalize our environment:

create table theTable (family int, member int, score int);
insert into theTable values (1,1,10), (1,2,15), (1,3,12), (1,4,17);
insert into theTable values (2,1,5),(2,2,7),(2,3,9),(2,4,10);
insert into theTable values (3,1,4),(3,2,8),(3,3,2);

You'd like to effectively select values based on top "nth order statistic",
or the nth biggest number, which is equivalent to MAX() for n==0 (or n==1,
if you think one-based). Unfortunately, I don't believe MySQL has a general
"nth order statistic" operation. What's more, you want to join the selected
order statistic back to its original data row, which MySQL can't do natively
even with MAX().

The first technique is the one that's been suggested: Run a simply query for
each family and limit yourself to two results:

select * from theTable where family=1 order by score desc limit 2;
select * from theTable where family=2 order by score desc limit 2;
select * from theTable where family=3 order by score desc limit 2;

This is efficient only if you've got a very small number of families and are
willing to code a loop into your interface language.

Alternatively, you can compute your order statistic explicitly by joining
the table against itself, and then filter based on that:

select max(curEntry.family) as family,max(curEntry.member) as
member,max(curEntry.score),count(*) as orderstatistic
from theTable as curEntry, theTable as greaterEntries
where curEntry.family = greaterEntries.family and curEntry.score <=
greaterEntries.score
group by curEntry.family, curEntry.member
having orderstatistic < 3
order by family, orderstatistic;

This manages to pack to whole logic into one query, and is very easy to
change to compute the top three, top four, etc. with no additional code.
Unfortunately, MySQL is unlikely to be able to optimize this to look at less
than the entire table regardless of what indices you provide. (A 'having'
clause usually implies that this is the case.) If you have a very large
number of members per family, this is clearly inefficient since you should
only really need to analyze O(number of families) rows in order to compute
your result.

Finally, you can use temporary tables to compute intermediary results, with
each step making use of MAX() in the hope that MySQL is able to optimize
MAX() by using an index. I'm not sure if MySQL does currently optimize this
or not (please post the answer, if anyone knows), but it's definitely doable
in theory if you've got the right indices lying around.
The SQL here would be along the lines of:

create temporary table topScores
select max(family) as family, max(score) as score
from theTable
group by theTable.family;

create temporary table topMembers
select max(theTable.family) as family,
max(theTable.member) as member,
max(theTable.score) as score
from theTable, topScores
where theTable.family = topScores.family
and theTable.score = topScores.score
group by theTable.family;

create temporary table sndScores
select max(theTable.family) as family,
max(theTable.score) as score
from theTable,topMembers
where theTable.family = topMembers.family
and theTable.member != topMembers.member
group by theTable.family;

create temporary table sndMembers
select max(theTable.family) as family,
max(theTable.member) as member,
max(theTable.score) as score
from theTable,sndScores,topMembers
where theTable.family = sndScores.family
and theTable.score = sndScores.score
and sndScores.family = topMembers.family
and theTable.member != topMembers.member
group by theTable.family;

insert into topMembers select * from sndMembers;

select * from topMembers order by family,score DESC;

drop table sndMembers;
drop table sndScores;
drop table topMembers;
drop table topScores;

The last temporary table, at the least, can be turned into a simple insert
in our limited example, but I've provided it as a distinct table for
clarity. This technique is certainly much more complex, but for very large
numbers of members per family it might be more efficient, and it still
requires no logic in the interface language: there is only one command which
returns a result set for processing, and that is the final result. This 
approach does require additional logic to extend to more than the top two
scores, which requires another pair of SQL statements per result. If you are
looking for order statistics on the order of the average number of members
per family, then the single-statement approach will almost certainly be a
better choice.

-Rob

(because I'm bored at work, that's why...)

On 22/5/02 at 10:36 am, R.Dobson <[EMAIL PROTECTED]> wrote:

> Hi,
> thanks for all of the replies to my query.
> I'm not sure that I explained my problem very well as the solutions 
> received are solutions to the problem I described, but not the one I 
> meant :-)  (I don't think anyway)
> 
> 
> I have a table in the format:
> 
> family | member | score
> --------------------------------
> 1          |       1      |    10
> 1          |       2      |    15
> 1          |       3      |    12
> 1          |       4      |    17
> 2          |       1      |     5
> 2          |       2      |     7
> 2          |       3      |     9
> 2          |       4      |    10
> 3          |       1      |     4
> 3          |       2      |     8
> 3          |       3      |     2
> 
> I want the top 2 highest scorers for each family as in:
> 
> family | member | score
> -------------------------------
> 1          |     4        |    17
> 1          |     2        |    15
> 2          |     4        |    10
> 2          |     3        |     9
> 3          |     2        |     8
> 3          |     1        |     4
> 
> 
> Thanks again,
> Rich
> 
> mysql

---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php

Re: returning top two values

Reply via email to