Moin,

On Wed, April 4, 2018 11:41 pm, David Rowley wrote:
> Hi Tomas,
>
> Thanks for taking another look.
>
> On 5 April 2018 at 07:12, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>> Other than that, the patch seems fine to me, and it's already marked as
>> RFC so I'll leave it at that.
>
> Thanks.
I have one more comment - sorry for not writing sooner, the flu got to me ...

Somewhere in the code there is a new allocation of memory when the string grows beyond the current size - and that doubles the size. This can lead to a lot of wasted space (think: constructing a string that is a bit over 1 GByte, which would presumably allocate 2 GByte). The same issue happens when each worker allocates 512 MByte for a 256 MByte + 1 result. IMHO a factor of 1.4 or 1.2 would work better here - not sure what the current standard for situations like this in PG is.

>> The last obstacle seems to be the argument about the risks of the patch
>> breaking queries of people relying on the ordering. Not sure what are
>> the right next steps in this regard ...
>
> yeah, seems like a bit of a stalemate.
>
> Personally, I think if the group of people Tom mentions do exist, then
> they've already been through some troubled times since Parallel Query
> was born. I'd hope they've already put up their defenses due to the
> advent of that feature. I stand by my thoughts that it's pretty
> bizarre to draw the line here when we've probably been causing these
> people issues for many years already. I said my piece on this already
> so likely not much point in going on about it. These people are also
> perfectly capable of sidestepping the whole issue by setting
> max_parallel_workers_per_gather TO 0.

Coming from the Perl side, this issue has popped up there a lot with the ordering of hash keys (e.g. "for my $key (keys %$hash) { ... }"): even though there was never a guarantee, the ordering often caught people by surprise. Mostly in test suites, though, because real output usually needs some ordering anyway. Changes were nevertheless made, and code had to be adapted. But the world didn't end, so to speak.

For PG, I think the benefits greatly outweigh the problems with the sort order - after all, if you have code that relies on an implicit order, you've already got a problem, no?
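To illustrate the growth-factor point from above, here is a minimal simulation - not PostgreSQL's actual buffer code, and the 1024-byte starting size is just an illustrative assumption:

```python
def peak_capacity(needed, factor, start=1024):
    """Grow a buffer by `factor` until it can hold `needed` bytes;
    return the final allocated capacity."""
    cap = start
    while cap < needed:
        cap = int(cap * factor)
    return cap

one_gb = 1024 ** 3

# Doubling: a result just over 1 GByte forces a 2 GByte allocation,
# wasting almost 1 GByte.
print(peak_capacity(one_gb + 1, 2.0))   # 2147483648, i.e. 2 GByte

# A smaller factor such as 1.2 overshoots by at most 20 percent,
# at the cost of more realloc/copy cycles while the string grows.
print(peak_capacity(one_gb + 1, 1.2))
```

The usual trade-off: a smaller factor wastes less memory at the end but does more reallocations (and copying) on the way there.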
So my favourite would be something along these lines:

* add the string_agg
* document it in the release notes, and document workarounds/solutions (add ORDER BY, disable workers etc.)
* if not already done, stress in the documentation that if you don't ORDER things, they might not come out in the order you expect - especially with parallel queries, new processing nodes etc.

Best wishes,

Tels

PS: We use string_agg() in a case where we first agg each row, then string_agg() all rows, and the resulting string is really huge. We did run into the "out of memory" problem, so we now use a LIMIT and assemble the resulting parts on the client side. My wish here would be to better know how large the LIMIT can be - I found it quite difficult to predict with how many rows PG runs out of memory for the string buffer, even though all rows have almost the same length as text. But that aside, getting the parts faster with parallel agg would be very cool, too.
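PPS: A rough model for the prediction problem in the PS, under the assumption that the buffer doubles as it grows: the final buffer can be up to ~2x the logical string size, and during the realloc copy the old and new buffers briefly coexist, so the safe budget is roughly half the available memory. The function name and the 2x rule of thumb here are my own assumptions, not anything PG guarantees:

```python
def estimate_rows_before_oom(avail_bytes, row_len, sep_len=1):
    """Crude upper bound on how many fixed-length rows string_agg()
    can absorb before a doubling buffer exceeds `avail_bytes`.
    Assumes peak usage of roughly 2x the logical string size and
    ignores per-row and executor overhead."""
    per_row = row_len + sep_len
    # peak ~ 2 * logical size  =>  logical size must stay <= avail / 2
    return (avail_bytes // 2) // per_row

# e.g. with 1 GByte usable memory and ~200-byte rows plus a 1-byte separator:
print(estimate_rows_before_oom(1024 ** 3, 200))
```

In practice one would leave further headroom, since the real per-row cost (tuple headers, detoasting, executor state) is larger than the bare text length.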