On 03/26/2018 10:27 PM, Tom Lane wrote: > David Rowley <david.row...@2ndquadrant.com> writes: >> [ combinefn_for_string_and_array_aggs_v7.patch ] > > I spent a fair amount of time hacking on this with intent to commit, > but just as I was getting to code that I liked, I started to have second > thoughts about whether this is a good idea at all. I quote from the fine > manual: > > The aggregate functions array_agg, json_agg, jsonb_agg, > json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as > similar user-defined aggregate functions, produce meaningfully > different result values depending on the order of the input > values. This ordering is unspecified by default, but can be controlled > by writing an ORDER BY clause within the aggregate call, as shown in > Section 4.2.7. Alternatively, supplying the input values from a sorted > subquery will usually work ... > > I do not think it is accidental that these aggregates are exactly the ones > that do not have parallelism support today. Rather, that's because you > just about always have an interest in the order in which the inputs get > aggregated, which is something that parallel aggregation cannot support. >
I don't think that's quite true. I know plenty of people who do things like this: SELECT a, b, avg(c), sum(d), array_agg(e), array_agg(f), string_agg(g) FROM hugetable GROUP BY a,b HAVING avg(c) > 100.89; and then do some additional processing on the result in some way (subquery, matview, ...). They don't really care about ordering of values in the arrays, as long as orderings of all the arrays match. Currently queries like this can use parallelism at all, and the patch fixes that I believe. > I fear that what will happen, if we commit this, is that something like > 0.01% of the users of array_agg and string_agg will be pleased, another > maybe 20% will be unaffected because they wrote ORDER BY which prevents > parallel aggregation, and the remaining 80% will scream because we broke > their queries. Telling them they should've written ORDER BY isn't going > to cut it, IMO, when the benefit of that breakage will accrue only to some > very tiny fraction of use-cases. > Isn't the ordering unreliable *already*? It depends on ordering of tuples on the input. So if the table is scanned by index scan or sequential scan, that will affect the array_agg/string_agg results. If the input is a join, it's even more volatile. IMHO it's not like we're making the ordering unpredictable - it's been like that since forever. Also, how is this different from ORDER BY clause? If a user does not specify an ORDER BY clause, I don't think we'd care very much about changes to output ordering due to plan changes, for example. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services