[HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

Andrew Gierth Wed, 17 Jul 2013 20:16:43 -0700

The spec defines two types of aggregate function classed as "ordered set
function", as follows:
 
1. An "inverse distribution function" taking one argument (which must be
   a grouped column or otherwise constant within groups) plus a sorted
   group with exactly one column:
 
   =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...
 
   The motivating example for this (and the only ones in the spec) are
   percentile_cont and percentile_disc, to return a percentile result
   from a continuous or discrete distribution. (Thus
   percentile_cont(0.5) within group (order by x) is the spec's version
   of a median(x) function.)
 
2. A "hypothetical set function" taking N arguments of arbitrary types
   (a la VARIADIC "any", rather than a fixed list) plus a sorted group
   with N columns of matching types:
 
   =# SELECT (func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...)) from ...
 
   (where typeof(p1)==typeof(q1) and so on, at least up to trivial
   conversions)
 
   The motivating example here is to be able to do rank(p1,p2,...) to
   return the rank that the specified values would have had if they were
   added to the group.
 
As usual, we do not want to constrain ourselves to supporting only the
specific cases in the spec, but would prefer a general solution.
 
We (meaning myself and Atri) have an implementation that basically
works, though it is not yet complete, but before taking it any further
we need to resolve the design question of how to represent these two
types of function in the system catalogs. The fact that there are in
effect two parts to the parameter list, which are either independent
(for inverse distribution funcs) or closely related (for hypothetical
set functions), doesn't seem to point to an obvious way to represent
this in pg_proc/pg_aggregate.
 
I'm not yet satisfied with the method used in our implementation, so
we're throwing this open for suggestions. We will post the
work-in-progress patch along with a description of its current
implementation shortly.
 
One of the major complications is that we ideally want to be able to do
polymorphism based on the type of the sorted group, specifically in
order to be able to do
 
percentile_disc(float8) within group (order by anyelement)
 
returning anyelement. (i.e. we should be able to get a discrete
percentile from any type that is orderable.) The question here is how to
resolve the return type both of the aggregate itself and of the finalfn.
 
We've also had an expression of interest in extending this to allow
percentile_disc(float8[]) and percentile_cont(float8[]) returning
arrays; e.g. percentile_cont(array[0, 0.25, 0.5, 0.75, 1]) to return an
array containing the bounds, median and quartiles in one go. This is an
extension to the spec but it seems sufficiently obviously useful to be
worth supporting.


Comments?

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

Reply via email to