On 03/26/2018 11:19 PM, Tom Lane wrote: > Tomas Vondra <tomas.von...@2ndquadrant.com> writes: >> On 03/26/2018 10:27 PM, Tom Lane wrote: >>> I fear that what will happen, if we commit this, is that something like >>> 0.01% of the users of array_agg and string_agg will be pleased, another >>> maybe 20% will be unaffected because they wrote ORDER BY which prevents >>> parallel aggregation, and the remaining 80% will scream because we broke >>> their queries. Telling them they should've written ORDER BY isn't going >>> to cut it, IMO, when the benefit of that breakage will accrue only to some >>> very tiny fraction of use-cases. > >> Isn't the ordering unreliable *already*? > > Not if the query is such that what gets chosen is, say, an indexscan > or mergejoin. It might be theoretically unreliable and yet work fine > for a given application. >
What about parallel versions of those plans, though? That is, what prevents the planner from generating plans like this Hash Aggregate -> Gather -> Partial Index Scan AFAIK that's a valid plan, and it does not have the stable ordering property of a non-parallel index scan. Similarly for Merge Join. > I might be too pessimistic about the fraction of users who are depending > on ordered input without having written anything that explicitly forces > that ... but I stand by the theory that it substantially exceeds the > fraction of users who could get any benefit. > Not sure, it's pretty much impossible to produce a number that would meaningfully represent the whole user base. All I can say is that the number of people I know, who would benefit from this is significant. > > Your own example of assuming that separate aggregates are computed > in the same order reinforces my point, I think. In principle, anybody > who's doing that should write > > array_agg(e order by x), > array_agg(f order by x), > string_agg(g order by x) > > because otherwise they shouldn't assume that; the manual certainly doesn't > promise it. But nobody does that in production, because if they did > they'd get killed by the fact that the sorts are all done independently. Yes, the manual does not promise that. But I find that expectation way more "natural" than deterministic ordering without ORDER BY clause. I expect each tuple to be processed "at once" - that it's added to all arrays in the same step, irrespectedly of the ordering of input tuples. > (We should improve that someday, but it hasn't been done yet.) So I think > there are an awful lot of people out there who are assuming more than a > lawyerly reading of the manual would allow. Their reaction to this will > be about like ours every time the GCC guys decide that some longstanding > behavior of C code isn't actually promised by the text of the C standard. > But we're not *already* providing reliable ordering without an ORDER BY clause (both for queries as a whole and array_agg/string_agg). It can produce arbitrary ordering, depending on what plan you happen to get. So I don't think we'd be changing any longstanding behavior, because it's already unreliable. If it was semi-reliable before, assuming the query only ever chose index scans or merge joins, it likely got broken by introduction of parallel versions of those plans. If the users disable parallel queries (to get the old plans), they're going to get non-parallel semi-reliable plans again. OTOH we are currently accumulating all the arrays in the same order, so the results will match. (At least that's what I believe is the current reliable behavior.) So following the GCC analogy, changing this seems much more akin to GCC people changing some longstanding behavior. Also, if the arrays were not accumulated in the same way reliably, wouldn't this mean all the discussion about reliable ordering with certain plans is pointless? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services