On Fri, Nov 17, 2017 at 12:48 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> I'd like to hear some opinions on the feasibility of this approach.
>
> There is indeed a big planning problem here, but Robert's sketch is
> missing an important component of it: work_mem is not an output of cost
> estimates, it is an *input*. For example, we can sort or hash-join in
> however much memory you want, but it's going to cost different amounts.
>
> I think what we're actually looking for is to find the breakpoints in
> the cost curve where it thinks it can switch to a different sorting
> or hashing model, and then to emit paths that assume work_mem just
> above each of those breakpoints. But the code isn't set up like that
> now, not as to either planning or execution.

It might not be that hard to come up with a model for determining which points on the curve are of interest. It seems easy to do this for sorting, because it's actually a very simple curve. Once you're in single pass territory, and provided you're still using at least a few tens of megabytes of work_mem, the availability of work_mem seems to only make a very small difference (temp file I/O is still essentially sequential, and the logarithmic growth in comparisons as more runs must be merged doesn't really bite you). Perhaps this also means that you can expect to get away with moderately bad estimates there.
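
To make that concrete, here is a toy standalone model of the curve. This is not tuplesort.c's actual cost model; the 256kB-per-run merge buffer and the run-size formula are simplifying assumptions of mine. It walks work_mem upward for a fixed input size and prints only the points where the number of merge passes changes:

/*
 * Toy model of the external sort cost curve as a function of work_mem.
 * This is NOT tuplesort.c's real cost model; the constants and formulas
 * are simplifying assumptions.  The point is only that the curve is flat
 * almost everywhere, and jumps at a handful of breakpoints: the switch
 * from internal to external sort, and each point where one more merge
 * pass becomes necessary.
 */
#include <math.h>
#include <stdio.h>

#define TAPE_BUFFER_KB	256		/* assumed per-run buffer during a merge */

/* Number of merge passes needed to sort input_kb using work_mem_kb */
static int
merge_passes(double input_kb, double work_mem_kb)
{
	double		nruns;
	double		fanin;

	if (input_kb <= work_mem_kb)
		return 0;				/* internal sort, no merge at all */

	nruns = ceil(input_kb / work_mem_kb);			/* initial run count */
	fanin = floor(work_mem_kb / TAPE_BUFFER_KB);	/* merge order */
	if (fanin < 2)
		fanin = 2;

	return (int) ceil(log(nruns) / log(fanin));
}

int
main(void)
{
	double		input_kb = 10.0 * 1024 * 1024;	/* 10GB of sort input */
	int			prev = -1;

	/* Walk work_mem upward and report only the breakpoints. */
	for (double wm = 1024; wm <= 2 * input_kb; wm *= 1.01)
	{
		int			passes = merge_passes(input_kb, wm);

		if (passes != prev)
		{
			printf("work_mem ~ %.0f kB -> %d merge pass(es)\n", wm, passes);
			prev = passes;
		}
	}
	return 0;
}

For a 10GB sort there are only a handful of such breakpoints, and beyond the first merge pass the curve is nearly flat, which is the sense in which I think the sort curve is "very simple".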

> Another problem with formulating it that way is that it suddenly puts
> a much higher premium on the planner's space estimates being right,
> which is something I don't have much faith in. For instance, if the
> planner thinks that 1000kB is just enough to hold a hash table, and
> then when we run it we find out that we need a bit more space than that,
> do we really want the executor to switch to a batched join? Probably not,
> especially not if having set the node's work_mem to 1010kB instead
> would've let it run to completion without batching. Addressing that
> discrepancy might be where we need the dynamic "emergency memory request"
> mechanism that Peter was postulating. But I'm not sure exactly how that
> works, because at the point where the executor realizes it's about to
> exceed the original space budget, it generally has little idea how much
> more it would need in order to avoid spilling the sort to disk or
> adding another round of batching.

I don't know either. I think it's reasonable to make it a goal for executor operations to have smooth cost functions, so that the risk of misestimation is managed well, and a further goal for them to be adaptive to misestimation in other ways. To a large degree, my abandoned "quicksort with spillover" design from a couple of years ago was written with this in mind (it avoided a sudden discontinuity in the cost function of sort nodes at the point where you must spill for the first time). Another example of an adaptive operation is "role reversal" for hash joins, where the executor flips the inner and outer sides during execution, at the point where it becomes clear that the optimizer had it backwards, estimation-wise. There are probably numerous other things like this that are possible... and maybe even worthwhile.

In summary, I agree that we're going to have big problems if the planner needs very accurate estimates to see a real benefit. It seems possible that most of the benefit of "fixing work_mem" comes from avoiding a woefully inadequate amount of memory in cases where batching was clearly always going to be necessary; there may be limited benefit to preventing batching in the first place. So while there could also be room for an "emergency memory request" mechanism, it's more of a nice-to-have (a toy sketch of what that decision might look like appears below).

> So it's really unclear to me what either the planner or executor API
> contracts for memory consumption ought to be if we're going to try to do
> this differently. I agree there's a lot of potential for improvement if
> we can find a better solution, but we're going to need to put some serious
> thought into it.

The devil is in the details, of course.
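
As a purely hypothetical illustration of that "emergency memory request" decision, here is a toy sketch for a hash join that is about to exceed its budget. None of these names or numbers exist in PostgreSQL today; the 10% slack threshold and the global headroom pool are assumptions made up for illustration, and the sketch glosses over the hard part Tom points out, which is that the executor may not really know how much more memory it needs at that moment.

/*
 * Purely hypothetical sketch of an "emergency memory request" decision
 * for a hash join that is about to exceed its memory budget.  Nothing
 * here exists in PostgreSQL; the names, the 10% slack threshold, and the
 * global headroom pool are assumptions made up for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct HashBudget
{
	double		budget_kb;		/* memory the plan assumed it could use */
	double		projected_kb;	/* what the hash table now appears to need */
	double		headroom_kb;	/* hypothetical global pool of spare memory */
} HashBudget;

/*
 * Decide between quietly growing the budget a little and doubling the
 * number of batches.  Returns true if the overshoot was absorbed.
 */
static bool
absorb_overshoot(HashBudget *hb)
{
	double		overshoot_kb = hb->projected_kb - hb->budget_kb;

	if (overshoot_kb <= 0)
		return true;			/* no overshoot after all */

	/*
	 * Absorb small overshoots (here, up to 10% of the budget) when the
	 * global pool can cover them; the cost of switching to a batched
	 * join dwarfs the cost of a few hundred extra kB.
	 */
	if (overshoot_kb <= hb->budget_kb * 0.10 &&
		overshoot_kb <= hb->headroom_kb)
	{
		hb->budget_kb += overshoot_kb;
		hb->headroom_kb -= overshoot_kb;
		return true;
	}

	return false;				/* caller should double nbatch instead */
}

int
main(void)
{
	/* Tom's example: the planner assumed 1000kB, we actually need ~1010kB */
	HashBudget	hb = {1000.0, 1010.0, 4096.0};

	if (absorb_overshoot(&hb))
		printf("grew budget to %.0f kB, no batching needed\n", hb.budget_kb);
	else
		printf("switching to (more) batches\n");
	return 0;
}

Tom's 1000kB-vs-1010kB case is exactly the kind of overshoot something like this would absorb without any batching.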

Vladimir said something about customer issues with sizing work_mem on Google's cloud database service, and it reminded me of my experiences with this while working at Heroku. I tended to hear few complaints about it, but then there'd sometimes be serious customer issues quite suddenly. My theory is that there can be a turning point where demands on work_mem increase, and suddenly there are more group aggregates than hash aggregates (and, to a lesser extent, fewer hash joins). The group aggregates are quite a bit slower than the hash aggregates they replace, while still using approximately the same amount of memory as before, so each one holds its memory for that much longer, and overall memory pressure rises significantly and quite suddenly.

I'm mostly concerned about avoiding instability like this. Users greatly value predictable performance.

--
Peter Geoghegan