On Thu, Jun 25, 2020 at 01:17:56PM -0400, Bruce Momjian wrote:
On Thu, Jun 25, 2020 at 11:46:54AM -0400, Robert Haas wrote:
On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian <br...@momjian.us> wrote:
> I think my main point is that work_mem was not being honored for
> hash-agg before, but now that PG 13 can do it, we are again allowing
> work_mem not to apply in certain cases. I am wondering if our hard
> limit for work_mem is the issue, and we should make that more flexible
> for all uses.
I mean, that's pretty much what we're talking about here, isn't it? It
seems like in your previous two replies you were opposed to separating
the plan-type limit from the execution-time limit, but that idea is
precisely a way of being more flexible (and extending it to other plan
nodes is a way of making it more flexible for more use cases).
I think it is was Tom who was complaining about plan vs. execution time
control.
As I think you know, if you have a system where the workload varies a
lot, you may sometimes be using 0 copies of work_mem and at other
times 1000 or more copies, so the value has to be chosen
conservatively as a percentage of system memory, else you start
swapping or the OOM killer gets involved. On the other hand, some plan
nodes get a lot less efficient when the amount of memory available
falls below some threshold, so you can't just set this to a tiny value
and forget about it. Because the first problem is so bad, most people
set the value relatively conservatively and just live with the
performance consequences. But this also means that they have memory
left over most of the time, so the idea of letting a node burst above
its work_mem allocation when something unexpected happens isn't crazy:
as long as only a few nodes do that here and there, rather than, say,
all the nodes doing it all at the same time, it's actually fine. If we
had a smarter system that could dole out more work_mem to nodes that
would really benefit from it and less to nodes where it isn't likely
to make much difference, that would be similar in spirit but even
better.
I think the issue is that in PG 13 work_mem controls sorts and hashes
with a new hard limit for hash aggregation:
https://www.postgresql.org/docs/12/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash
tables are used in hash joins, hash-based aggregation, and hash-based
processing of IN subqueries.
In pre-PG 13, we "controlled" it by avoiding hash-based aggregation if
was expected it to exceed work_mem, but if we assumed it would be less
than work_mem and it was more, we exceeded work_mem allocation for that
node. In PG 13, we "limit" memory to work_mem and spill to disk if we
exceed it.
We should really have always documented that hash agg could exceed
work_mem for misestimation, and if we add a hash_agg work_mem
misestimation bypass setting we should document this setting in work_mem
as well.
I don't think that would change anything, really. For the users the
consequences would be still exactly the same, and they wouldn't even be
in position to check if they are affected.
So just documenting that hashagg does not respect work_mem at runtime
would be nice, but it would not make any difference for v13, just like
documenting a bug is not really the same thing as fixing it.
But then the question is why do we allow this bypass only for hash agg?
Should work_mem have a settings for ORDER BY, merge join, hash join, and
hash agg, e.g.:
work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB'
Yeah, crazy syntax, but you get the idea. I understand some nodes are
more sensitive to disk spill than others, so shouldn't we be controlling
this at the work_mem level, rather than for a specific node type like
hash agg? We could allow for misestimation over allocation of hash agg
work_mem by splitting up the hash agg values:
work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB
hash_agg_max=200MB'
but _avoiding_ hash agg if it is estimated to exceed work mem and spill
to disk is not something to logically control at the work mem level,
which leads so something like David Rowley suggested, but with different
names:
enable_hashagg = on | soft | avoid | off
where 'on' and 'off' are the current PG 13 behavior, 'soft' means to
treat work_mem as a soft limit and allow it to exceed work mem for
misestimation, and 'avoid' means to avoid hash agg if it is estimated to
exceed work mem. Both 'soft' and 'avoid' don't spill to disk.
David's original terms of "trynospill" and "neverspill" were focused on
spilling, not on its interaction with work_mem, and I found that
confusing.
Frankly, if it took me this long to get my head around this, I am
unclear how many people will understand this tuning feature enough to
actually use it.
Yeah. I agree with Andres we this may be a real issue, and that adding
some sort of "escape hatch" for v13 would be good. But I'm not convinced
adding a whole lot of new memory limits for every node that might spill
is the way to go. What exactly would be our tuning advice to users? Of
course, we could keep it set to work_mem by default, but we all know
engineers - we can't resist tuning a know when we get one.
I'm not saying it's not beneficial to use different limits for different
nodes. Some nodes are less sensitive to the size (e.g. sorting often
gets faster with smaller work_mem). But I think we should instead have a
per-session limit, and the planner should "distribute" the memory to
different nodes. It's a hard problem, of course.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services