On Sun, Jun 28, 2020 at 06:39:40PM -0700, Peter Geoghegan wrote:
On Sat, Jun 27, 2020 at 3:41 AM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
Well, there are multiple ideas discussed in this (sub)thread, one of
them being a per-query memory limit. That requires deciding how much
memory each node should get, which I think would need to be
cost-based.
A design like that probably makes sense. But it's way out of scope for
Postgres 13, and not something that should be discussed further on
this thread IMV.
100% agree
Of course, a simpler scheme like this would not require that. And maybe
introducing hash_mem is a good idea - I'm not particularly opposed to
that, actually. But I think we should not introduce separate memory
limits for each node type, which was also mentioned earlier.
I had imagined that hash_mem would apply to hash join and hash
aggregate only. A GUC that either represents a multiple of work_mem,
or an absolute work_mem-style KB value.
I'm not against having a hash_mem (and I'd vote for it to be a simple
work_mem-style value, not a multiple). Maybe we should have it; the
argument for allowing hashagg (and perhaps hashjoin) to use more memory
than some other nodes seems convincing to me.
I'm just not sure which of the problems mentioned in this thread it
actually addresses ...
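Just to be explicit about the two forms being discussed, I imagine
something like this (neither GUC exists yet, so the names and values
below are purely illustrative):

    -- variant 1: absolute, work_mem-style value for hash-based nodes
    SET work_mem = '64MB';
    SET hash_mem = '256MB';

    -- variant 2: a multiple of work_mem
    SET work_mem = '64MB';
    SET hash_mem_multiplier = 4.0;  -- hash-based nodes may use work_mem * 4

In both cases sorts, materialize etc. would keep using plain work_mem,
and only hash join / hash aggregate would see the higher limit.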
The problem of course is that hash_mem does not really solve the issue
discussed at the beginning of this thread, i.e. regressions due to
underestimates and unexpected spilling at execution time.
Like Andres, I suspect that that's a smaller problem in practice. A
hash aggregate that spills often has performance characteristics
somewhat like a group aggregate + sort, anyway. I'm worried about
cases where an *in-memory* hash aggregate is naturally far far faster
than other strategies, and yet we can't use it -- despite the fact
that Postgres 12 could "safely" do so. (It probably doesn't matter
whether the slow plan that you get in Postgres 13 is a hash aggregate
that spills, or something else -- this is not really a costing
problem.)
Not sure I follow. Which cases do you mean when you say that 12 could
safely do them, but 13 won't? I see the following two cases:
a) Planner in 12 and 13 disagree about whether the hash table will fit
into work_mem.
I don't quite see why this would be the case (given the same cardinality
estimates etc.), though. That is, if 12 says "will fit" I'd expect 13 to
end up with the same conclusion. But maybe 13 has higher per-tuple
overhead or something? I know we set aside some memory for BufFiles, but
not when we expect the whole hash table to fit into memory.
b) Planner believes the hash table will fit, due to underestimate.
On 12, we'd just let the hash table overflow, which may be a win when
there's enough RAM and the estimate is not "too wrong". But it may
easily end with a sad OOM.
On 13, we'll just start spilling. True, people tend to use conservative
work_mem values exactly because of cases like this (which is somewhat
futile, as the underestimate may be arbitrarily wrong) and also because
they don't know how many work_mem instances the query will use.
So yeah, I understand why people may not want to increase work_mem too
much, and maybe hash_mem would be a way to get the "no spill" behavior.
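To make the comparison concrete, this is roughly the experiment I have
in mind for both cases (the table name and sizes are made up, and the
exact EXPLAIN wording on 13 may differ):

    SET work_mem = '4MB';

    -- case (a): same stats on 12 and 13 - do the plans even differ?
    EXPLAIN SELECT a, count(*) FROM t GROUP BY a;

    -- case (b): assume the number of groups is badly underestimated, so
    -- both versions pick HashAggregate expecting it to fit in work_mem
    EXPLAIN ANALYZE SELECT a, b, count(*) FROM t GROUP BY a, b;
    -- 12: the hash table simply grows past work_mem (risk of OOM)
    -- 13: the node starts spilling once work_mem is exceeded, and
    --     EXPLAIN ANALYZE shows the extra batches / disk usage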
Besides, hash_mem *can* solve that problem to some extent. Other cases
(cases where the estimates are so bad that hash_mem won't help) seem
like less of a concern to me. To some extent, that's the price you pay
to avoid the risk of an OOM.
True, avoiding the risk of OOM has its cost.
The thread is getting a rather confusing mix of proposals for how to
fix that in v13 and proposals for how to improve our configuration of
memory limits :-(
As I said to Amit in my last message, I think that all of the ideas
that are worth pursuing involve giving hash aggregate nodes license to
use more memory than other nodes. One variant involves doing so only
at execution time, while the hash_mem idea involves formalizing and
documenting that hash-based nodes are special -- and taking that into
account during both planning and execution.
Understood. I mostly agree with this.
Interesting. What is not entirely clear to me is how these databases
decide how much memory each node should get during planning. With
separate work_mem-like settings it's fairly obvious, but how do they do
that with a global limit (either per-instance or per-query)?
I don't know, but that seems like a much more advanced way of
approaching the problem. It isn't in scope here.
+1
Perhaps I'm not considering some unintended consequence of the planner
giving hash-based nodes extra memory "for free" in the common case
where hash_mem exceeds work_mem (by 2x, say). But my guess is that
that won't be a significant problem that biases the planner in some
obviously undesirable way.
My concern is how much more difficult these proposals would make
reasoning about memory usage. Maybe not much, I'm not sure.
I certainly agree it may be beneficial to give more memory to hashagg at
the expense of other nodes.
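To give a trivial example of the reasoning I mean: today a
back-of-the-envelope worst case per backend is roughly

    (number of sort/hash nodes in the plan) x work_mem

while with a separate hash_mem (whatever form it takes, say 4x
work_mem) it becomes something like

    (sort nodes) x work_mem + (hash nodes) x hash_mem

Still simple enough, but it's one more variable for people who size
work_mem based on max_connections and the available RAM.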
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services