On Fri, Nov 17, 2017 at 12:48 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> I'd like to hear some opinions on the feasibility of this approach.
>
> There is indeed a big planning problem here, but Robert's sketch is
> missing an important component of it: work_mem is not an output of cost
> estimates, it is an *input*. For example, we can sort or hash-join in
> however much memory you want, but it's going to cost different amounts.
>
> I think what we're actually looking for is to find the breakpoints in
> the cost curve where it thinks it can switch to a different sorting
> or hashing model, and then to emit paths that assume work_mem just
> above each of those breakpoints. But the code isn't set up like that
> now, not as to either planning or execution.

It might not be that hard to come up with a model for determining which points on the curve are of interest. It seems easy to do this for sorting, because it's actually a very simple curve. Once you're in single pass territory, and provided you're still using at least a few tens of megabytes of work_mem, the availability of work_mem seems to only make a very small difference (temp file I/O is still essentially sequential, and the logarithmic growth in comparisons as more runs must be merged doesn't really bite you). Perhaps this also means that you can expect to get away with moderately bad estimates there.
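
To make that concrete, here is a toy standalone model of the curve. This is not tuplesort.c's actual cost model; the 256kB-per-run merge buffer and the run-size formula are simplifying assumptions of mine. It walks work_mem upward for a fixed input size and prints only the points where the number of merge passes changes:

/*
 * Toy model of the external sort cost curve as a function of work_mem.
 * This is NOT tuplesort.c's real cost model; the constants and formulas
 * are simplifying assumptions.  The point is only that the curve is flat
 * almost everywhere, and jumps at a handful of breakpoints: the switch
 * from internal to external sort, and each point where one more merge
 * pass becomes necessary.
 */
#include <math.h>
#include <stdio.h>

#define TAPE_BUFFER_KB	256		/* assumed per-run buffer during a merge */

/* Number of merge passes needed to sort input_kb using work_mem_kb */
static int
merge_passes(double input_kb, double work_mem_kb)
{
	double		nruns;
	double		fanin;

	if (input_kb <= work_mem_kb)
		return 0;				/* internal sort, no merge at all */

	nruns = ceil(input_kb / work_mem_kb);			/* initial run count */
	fanin = floor(work_mem_kb / TAPE_BUFFER_KB);	/* merge order */
	if (fanin < 2)
		fanin = 2;

	return (int) ceil(log(nruns) / log(fanin));
}

int
main(void)
{
	double		input_kb = 10.0 * 1024 * 1024;	/* 10GB of sort input */
	int			prev = -1;

	/* Walk work_mem upward and report only the breakpoints. */
	for (double wm = 1024; wm <= 2 * input_kb; wm *= 1.01)
	{
		int			passes = merge_passes(input_kb, wm);

		if (passes != prev)
		{
			printf("work_mem ~ %.0f kB -> %d merge pass(es)\n", wm, passes);
			prev = passes;
		}
	}
	return 0;
}

For a 10GB sort there are only a handful of such breakpoints, and beyond the first merge pass the curve is nearly flat, which is the sense in which I think the sort curve is "very simple".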

> Another problem with formulating it that way is that it suddenly puts
> a much higher premium on the planner's space estimates being right,
> which is something I don't have much faith in. For instance, if the
> planner thinks that 1000kB is just enough to hold a hash table, and
> then when we run it we find out that we need a bit more space than that,
> do we really want the executor to switch to a batched join? Probably not,
> especially not if having set the node's work_mem to 1010kB instead
> would've let it run to completion without batching. Addressing that
> discrepancy might be where we need the dynamic "emergency memory request"
> mechanism that Peter was postulating. But I'm not sure exactly how that
> works, because at the point where the executor realizes it's about to
> exceed the original space budget, it generally has little idea how much
> more it would need in order to avoid spilling the sort to disk or
> adding another round of batching.

I don't know either. I think it's reasonable to make it a goal for executor operations to have smooth cost functions, so that the risk of misestimation is managed well, and a further goal for them to be adaptive to misestimation in other ways. To a large degree, my abandoned "quicksort with spillover" design from a couple of years ago was written with this in mind (it avoided a sudden discontinuity in the cost function of sort nodes at the point where you must spill for the first time). Another example of an adaptive operation is "role reversal" for hash joins, where the executor flips the inner and outer sides during execution, at the point where it becomes clear that the optimizer had it backwards, estimation-wise. There are probably numerous other things like this that are possible... and maybe even worthwhile.

In summary, I agree that we're going to have big problems if the planner needs very accurate estimates to see a real benefit. It seems possible that most of the benefit of "fixing work_mem" comes from avoiding a woefully inadequate amount of memory in cases where batching was clearly always going to be necessary; there may be limited benefit to preventing batching in the first place. So while there could also be room for an "emergency memory request" mechanism, it's more of a nice-to-have (a toy sketch of what that decision might look like appears below).

> So it's really unclear to me what either the planner or executor API
> contracts for memory consumption ought to be if we're going to try to do
> this differently. I agree there's a lot of potential for improvement if
> we can find a better solution, but we're going to need to put some serious
> thought into it.

The devil is in the details, of course.
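
As a purely hypothetical illustration of that "emergency memory request" decision, here is a toy sketch for a hash join that is about to exceed its budget. None of these names or numbers exist in PostgreSQL today; the 10% slack threshold and the global headroom pool are assumptions made up for illustration, and the sketch glosses over the hard part Tom points out, which is that the executor may not really know how much more memory it needs at that moment.

/*
 * Purely hypothetical sketch of an "emergency memory request" decision
 * for a hash join that is about to exceed its memory budget.  Nothing
 * here exists in PostgreSQL; the names, the 10% slack threshold, and the
 * global headroom pool are assumptions made up for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct HashBudget
{
	double		budget_kb;		/* memory the plan assumed it could use */
	double		projected_kb;	/* what the hash table now appears to need */
	double		headroom_kb;	/* hypothetical global pool of spare memory */
} HashBudget;

/*
 * Decide between quietly growing the budget a little and doubling the
 * number of batches.  Returns true if the overshoot was absorbed.
 */
static bool
absorb_overshoot(HashBudget *hb)
{
	double		overshoot_kb = hb->projected_kb - hb->budget_kb;

	if (overshoot_kb <= 0)
		return true;			/* no overshoot after all */

	/*
	 * Absorb small overshoots (here, up to 10% of the budget) when the
	 * global pool can cover them; the cost of switching to a batched
	 * join dwarfs the cost of a few hundred extra kB.
	 */
	if (overshoot_kb <= hb->budget_kb * 0.10 &&
		overshoot_kb <= hb->headroom_kb)
	{
		hb->budget_kb += overshoot_kb;
		hb->headroom_kb -= overshoot_kb;
		return true;
	}

	return false;				/* caller should double nbatch instead */
}

int
main(void)
{
	/* Tom's example: the planner assumed 1000kB, we actually need ~1010kB */
	HashBudget	hb = {1000.0, 1010.0, 4096.0};

	if (absorb_overshoot(&hb))
		printf("grew budget to %.0f kB, no batching needed\n", hb.budget_kb);
	else
		printf("switching to (more) batches\n");
	return 0;
}

Tom's 1000kB-vs-1010kB case is exactly the kind of overshoot something like this would absorb without any batching.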

Vladimir said something about customer issues with sizing work_mem on Google's cloud database service, and it reminded me of my experiences with this while working at Heroku. I tended to hear few complaints about it, but then there'd sometimes be serious customer issues quite suddenly. My theory is that there can be a turning point where demands on work_mem increase, and suddenly there are more group aggregates than hash aggregates (and, to a lesser extent, fewer hash joins). The group aggregates are quite a bit slower than the hash aggregates they replace, while still using approximately the same amount of memory as before, so each one holds its memory for that much longer, and overall memory pressure rises significantly and quite suddenly.

I'm mostly concerned about avoiding instability like this. Users greatly value predictable performance.

--
Peter Geoghegan