On Wed, Nov 15, 2017 at 1:06 PM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> In the old days, Oracle had only simple per-operation memory limits
> too, and that applied to every operation in every thread just like our
> work_mem. It's interesting that they had separate knobs for sort and
> hash though, and defaulted to giving hash twice as much.
It makes plenty of sense to give hash twice as much IMV. I'm interested in the *economics* of how we use memory -- I think we could do a lot better there. Memory capacities have certainly increased dramatically over the years since sort_mem became work_mem, but I suspect that users are (quite reasonably) still giving most of that over to shared_buffers/FS cache. And, storage/network bandwidth also increased dramatically during that time, so single-pass external sorts will continue to be a sensible thing to do frequently. Hashing is a different story, though -- you really do want to make sure that hash-based operations have access to more memory, where it can really be put to use.

Though I am primarily concerned about the fact that we don't give any weight to how sensitive hash-based operations are to having less memory, I don't really want to talk about band-aid measures like a hash_mem GUC (though that does have a certain appeal). I want to talk about moving past work_mem, towards a model where work_mem-like memory is treated as a shared resource, under a regime that intelligently weighs the effects of making more memory available to one plan, based on system-wide accounting and on the sensitivity of each memory-consuming operation to the availability of memory.

This thread is intended to get the ball rolling on that. It seems like we need something like this to get the full benefit of our numerous enhancements to sorting and hashing.

> With a whole-plan memory target, our planner would probably begin to
> plan join order differently to minimise the number of hash tables in
> memory at once, like other RDBMSs. Not sure how the plan-wide target
> should work though -- try many plans, giving different portions of
> budget to different subplans? That should work fine if you like
> O(please-melt-my-computer), especially if combined with a similar
> approach to choosing worker numbers. Some kind of feedback system?
> Seems like a different kind of planner, but I have no clue. If you
> have ideas/papers/references, it'd be great to see a new thread on
> that subject.

My first thought is that we might implement a model where little changes about work_mem itself -- it becomes a minimum for each work_mem-consuming operation. There could be an additional "emergency memory fund" that individual plan nodes can avail of during execution, if and when it looks like they'll fall underneath an allocation that allows the work_mem-consuming operation to perform "optimally" or "reasonably". This would happen at certain "penny-wise, pound foolish" points. There'd be big differences in how we do this for sorts as compared to hash joins, because the underlying cost functions for each look totally different.

There'd probably be a maximum amount of memory that each executor node could request from the emergency fund, such as a smallish multiple of work_mem (work_mem being the minimum budget it can *reliably* claim). A single hash join asking for the entire emergency fund for itself all at once seems excessive, and likely to create problems in other areas, so that should be impossible. And, it should be "hard" for a node to ask for and/or receive the absolute maximum it can get, because we want to keep that in reserve for cases that would otherwise truly be much slower. All in all, this approach shouldn't be too radical a departure from the work_mem model.

I admit that there are significant risks with this approach as a project.
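Just to make the general shape of what I have in mind a bit more concrete, here is a toy, standalone sketch of how an "emergency fund" grant might behave. Everything in it (the function name, the fund size, the 4x-work_mem per-node cap) is made up purely for illustration -- it isn't a proposal for an actual API:

/*
 * Toy model of the "emergency memory fund" idea -- not real executor
 * code; every name and number here is made up.  The invariants it tries
 * to capture: each node can always rely on work_mem; extra grants come
 * out of a shared fund; no single grant can exceed a small multiple of
 * work_mem; and the closer a request gets to that ceiling, the less of
 * it we're willing to hand out.
 */
#include <stdio.h>
#include <stdint.h>

#define KB(x) ((int64_t) (x) * 1024)

static int64_t work_mem = KB(4 * 1024);         /* 4MB: the reliable minimum */
static int64_t emergency_fund = KB(64 * 1024);  /* shared pool; in reality system-wide */
static const int max_multiple = 4;              /* per-node ceiling: 4x work_mem */

/*
 * A node that predicts it will perform non-optimally with only work_mem
 * asks for "wanted" extra bytes.  Returns the extra bytes actually
 * granted, which may be less than asked for, or zero.
 */
static int64_t
request_extra_mem(int64_t wanted)
{
    int64_t ceiling = work_mem * (max_multiple - 1);    /* extra on top of work_mem */
    int64_t granted;

    if (wanted > ceiling)
        wanted = ceiling;       /* never hand the whole fund to one node */

    /*
     * Make it "hard" to get the absolute maximum: the second half of the
     * per-node ceiling is only granted at 50 cents on the dollar.
     */
    if (wanted > ceiling / 2)
        wanted = ceiling / 2 + (wanted - ceiling / 2) / 2;

    granted = (wanted < emergency_fund) ? wanted : emergency_fund;
    emergency_fund -= granted;
    return granted;
}

int
main(void)
{
    /* e.g. a hash join that predicts it needs 12MB to avoid batching */
    int64_t extra = request_extra_mem(KB(12 * 1024) - work_mem);

    printf("budget: %lld bytes (work_mem %lld + extra %lld), fund left: %lld\n",
           (long long) (work_mem + extra), (long long) work_mem,
           (long long) extra, (long long) emergency_fund);
    return 0;
}

The only points the sketch is meant to capture are that work_mem stays a reliable minimum, that no single node can drain the shared fund, and that grants get stingier as a request approaches the per-node ceiling. The real accounting would obviously have to be shared across backends and considerably smarter than this.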
It seems like there is a chance that the project won't be ambitious enough in the end, because there are so many competing considerations. At the same time, I cannot help but be concerned about how naive we are about how sorting and hashing respond to work_mem.

Anyway, here is how I see the extra/emergency requests working for specific operations:

* For sorts, a sort that would otherwise be non-optimal would ask for more memory at the point where it first looks like multiple merge passes will be required. As I pointed out in my Sort vs. Hash talk, that's actually pretty rare these days, because as work_mem doubles, the capacity to do everything in one pass quadruples (see the toy calculation further down). You should really never get multiple passes for an external sort -- the economics of doing it that way with any frequency are not likely to be sensible on modern machines.

* For hash join, the "emergency request" logic could be much more sophisticated, and would be much more likely to be used in practice. I think that we'd probably want to worry about the difference between performing a hash join entirely in memory versus having to do some amount of batching (unlike sorting). This would generally be more "eager" than the requests that tuplesort makes, because smaller differences matter much more, much sooner.

* For hash aggregates, we might have "overage requests" from the emergency fund, or something along those lines. The emergency fund might therefore be in deficit (negative) when hash aggregates misbehave, since it cannot "say no" to these requests (hash aggregate will not currently take no for an answer, since it has no emergency spill mechanism). This could limit the overall impact of hash aggregates going over budget, and might also provide a useful choke point for new alerting and monitoring systems that can hook into the "central memory management" logic. Hash aggregates might go over work_mem without it really mattering much of the time.

ISTM that maintenance_work_mem tasks face a dilemma similar to this "sort versus hash" one: the "CREATE INDEX versus VACUUM" dilemma. We should try to address that as part of this effort. (That dilemma is one reason why I wrote the autovacuum_work_mem patch -- which is not a million miles from the idea of a hash_mem GUC.)

To come up with a real proposal for treating local memory as a shared resource, I think that we need:

* To hear more ideas about keeping things in balance here. How clever should we be?

* To experimentally verify that the cost functions (latency as a function of memory) for things like sorting, merge join, hash join, and hash aggregate are what we think they are.

* To understand how this relates to admission control. The only obvious difference that I can think of is that admission control probably involves queuing when memory is very constrained, and isn't limited to caring about memory. I'm not trying to make swapping/OOM impossible here; I'm trying to make it easier to be a Postgres DBA sizing work_mem, so that DBAs don't have to be stingy with it. The work_mem sizing formulas we sometimes promote (based on max_connections) are probably very limiting in the real world.

* To better understand the role of the optimizer here. Can we get more hash aggregate plans in the first place without risking swapping/OOM? Is it okay that what I propose kind of works against that goal?

* To better understand the role of parallel query here.

* To try to find a way to assess what "better" looks like here, overall. What benchmark might we devise to tell whether a given strategy is good and robust?
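Coming back to the sort case for a moment, the "doubling work_mem quadruples single-pass capacity" claim is easy to see with a toy calculation. The assumptions here (quicksorted runs of about work_mem each, and 256kB of merge buffer space per input run) are purely illustrative -- they are not tuplesort's actual constants:

/*
 * Back-of-the-envelope model of single-pass external sort capacity.
 * Run size ~ work_mem, and the final merge needs ~256kB of buffer per
 * input run, so one-pass capacity ~ work_mem^2 / 256kB.
 */
#include <stdio.h>
#include <stdint.h>

#define KB(x) ((int64_t) (x) * 1024)
#define MB(x) ((int64_t) (x) * 1024 * 1024)

int
main(void)
{
    const int64_t per_run_buffer = KB(256);     /* assumed merge buffer per run */
    int64_t work_mem;

    for (work_mem = MB(4); work_mem <= MB(64); work_mem *= 2)
    {
        int64_t max_runs = work_mem / per_run_buffer;   /* merge fan-in */
        int64_t one_pass_cap = max_runs * work_mem;     /* ~ work_mem^2 / buffer */

        printf("work_mem %3lld MB -> fan-in of ~%lld runs -> ~%lld MB sortable in one merge pass\n",
               (long long) (work_mem / MB(1)),
               (long long) max_runs,
               (long long) (one_pass_cap / MB(1)));
    }
    return 0;
}

Each doubling of work_mem doubles both the size of each initial run and the number of runs a single merge pass can handle, so single-pass capacity grows quadratically (64MB, 256MB, 1GB, 4GB, 16GB in the output above). That's why multiple merge passes should hardly ever be needed on modern machines.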
I freely admit that my proposal is pretty hand-wavy at this point, but, as I said, I want to at least get the ball rolling.

--
Peter Geoghegan