I have been pondering how to deal with work_mem for a couple of months myself 
and had very similar thoughts.
As far as I can tell the problem goes beyond work_mem though:
1. There are several hash operations, such as set-ops, hashed subplans, and hash aggregates, which today do not spill at all. We have solved these partially so far and, once complete, we think the fixes can be pushed into community PG if there is desire for it.
2. We also worry about large objects, which can bloat a backend.
3. There are other random allocations I fear I simply don’t know about.
4. Operating systems are chronically poor at trading memory between processes, even after the memory is freed, unless it is returned to the OS in big contiguous chunks (a small illustration of this follows below).
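To illustrate point 4 with a self-contained toy (glibc-specific; mallinfo2() needs glibc 2.33+, and the sizes are arbitrary - this is just a demonstration of the effect, nothing PG-specific): freeing lots of small chunks leaves the heap fragmented and the memory stays with the process; only once a large contiguous run at the top of the heap is free (or the allocation was mmap’ed to begin with) does it actually go back to the OS.

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>                     /* malloc_trim(), mallinfo2() - glibc only */

int main(void)
{
    enum { N = 100000, SMALL = 1024 };  /* ~100 MB of small heap chunks */
    static void *chunks[N];

    for (int i = 0; i < N; i++)
        chunks[i] = malloc(SMALL);

    /* Free every other chunk: half the memory is "free", but it is
     * fragmented, so glibc keeps it and the process footprint stays high. */
    for (int i = 0; i < N; i += 2)
        free(chunks[i]);
    printf("heap after fragmented free: %zu bytes\n", mallinfo2().arena);

    /* Free the rest: now the top of the heap is one contiguous free run
     * and glibc can hand pages back, explicitly so with malloc_trim(). */
    for (int i = 1; i < N; i += 2)
        free(chunks[i]);
    malloc_trim(0);
    printf("heap after full free + trim: %zu bytes\n", mallinfo2().arena);

    return 0;
}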
Just as you have, we have also considered holistic provisioning of work_mem across all consumers, but we find that to be too complex. Having an “emergency fund” in shared memory is also an option, but I find it too limiting. This is also the approach that was taken at DB2 when I was there, and it proved cumbersome.
So I’m currently pressing forward with a much more fundamental approach: pushing the Top Transaction Context and its children into shared memory. To avoid fragmentation and serialization on latches I have defined the concept of a “context cluster”. The root of the cluster is the sole true allocator of memory; child contexts allocate their blocks as pallocs from the cluster root. Basically, memory management goes recursive and the children live within the root. The root (TopTransactionContext) allocates big blocks, e.g. 8 MB at a time. Within a transaction PG operates as usual, with freelists and all, turning over these same 8 MB or allocating more if needed. But at the end of every transaction big chunks of memory become available to share with other transactions again. The few places where we reparent contexts need to detect that this can’t be done between clusters or into/out of a cluster, and do deep copies if needed, but there are few of those. Come to think of it, all the cases I have encountered so far were SFDC-specific…
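To make the shape of this concrete, below is a minimal standalone sketch of what I mean by a cluster. All of the names in it (ClusterRoot, ClusterBlock, child_alloc, ...) are invented for illustration, plain malloc stands in for the shared pool, and oversize requests and per-context freelists are ignored; the real change of course lives inside mcxt.c/aset.c rather than on top of libc.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CLUSTER_BLOCK_SIZE  (8 * 1024 * 1024)   /* 8 MB blocks, as above */

typedef struct ClusterBlock
{
    struct ClusterBlock *next;
    size_t  used;                 /* bytes handed out from data[] */
    char    data[];               /* the 8 MB payload */
} ClusterBlock;

typedef struct ClusterRoot
{
    ClusterBlock *blocks;         /* chain of big blocks */
} ClusterRoot;

/* The cluster root is the sole true allocator: only it takes memory from
 * the shared pool (modelled here with malloc). */
static void *
cluster_root_alloc(ClusterRoot *root, size_t size)
{
    ClusterBlock *blk = root->blocks;

    size = (size + 7) & ~(size_t) 7;            /* keep allocations aligned */
    if (blk == NULL || blk->used + size > CLUSTER_BLOCK_SIZE)
    {
        blk = malloc(sizeof(ClusterBlock) + CLUSTER_BLOCK_SIZE);
        blk->used = 0;
        blk->next = root->blocks;
        root->blocks = blk;
    }
    blk->used += size;
    return blk->data + blk->used - size;
}

/* A child context never touches the pool: its blocks are themselves
 * allocations against the cluster root. */
typedef struct ChildContext
{
    ClusterRoot *root;
} ChildContext;

static void *
child_alloc(ChildContext *cxt, size_t size)
{
    /* The real thing keeps per-context freelists the way aset.c does;
     * the only point here is where the backing blocks come from. */
    return cluster_root_alloc(cxt->root, size);
}

/* End of transaction: the whole cluster is released and the big blocks
 * become available to other transactions again, in one piece. */
static void
cluster_root_reset(ClusterRoot *root)
{
    ClusterBlock *blk = root->blocks;

    while (blk != NULL)
    {
        ClusterBlock *next = blk->next;
        free(blk);
        blk = next;
    }
    root->blocks = NULL;
}

int main(void)
{
    ClusterRoot  xact  = { NULL };        /* stands in for TopTransactionContext */
    ChildContext child = { &xact };       /* some per-query child context */

    char *s = child_alloc(&child, 64);
    strcpy(s, "lives inside the root's 8 MB block");
    printf("%s\n", s);

    cluster_root_reset(&xact);            /* commit/abort returns big chunks */
    return 0;
}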
I’m also moving the e-state from the Portal Heap to the Top Transaction Context. At the end of the day the assumption is that most transactions only need one block from shared memory, and I can probably pin it to the backend, further reducing contention. If there is an out-of-memory situation - which should be very rare - there are multiple ways to deal with it. If there is no deadlock we can simply wait. If there is one, rolling back the transaction that encountered the OOM is the obvious - if not optimal - solution. Finding the biggest consumer and sending it a signal to back off would be another way to do it.
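As a sketch of that decision logic - every name and the stub helpers below are made up for illustration, none of this is actual patch code:

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

typedef int BackendId;

/* Stubs standing in for the real lock-graph check, condition-variable
 * wait, and cross-backend signalling. */
static bool waiting_would_deadlock(void)         { return false; }
static void pool_wait_for_space(size_t bytes)    { printf("waiting for %zu bytes\n", bytes); }
static BackendId largest_pool_consumer(void)     { return 42; }
static void ask_backend_to_back_off(BackendId b) { printf("signalling backend %d\n", b); }
static void abort_current_transaction(void)      { printf("rolling back own transaction\n"); }

/* Policy knob: on deadlock, prefer telling the biggest consumer to back
 * off / spill rather than rolling back ourselves. */
static const bool prefer_signalling_largest_consumer = false;

/* Called when the cluster root cannot obtain another block from the
 * shared transaction-memory pool. */
static void
handle_shared_pool_oom(size_t bytes)
{
    if (!waiting_would_deadlock())
    {
        /* No deadlock risk: simply wait for another transaction to end
         * and hand its big blocks back to the pool. */
        pool_wait_for_space(bytes);
    }
    else if (prefer_signalling_largest_consumer)
    {
        /* Alternative: find the biggest consumer and ask it to back off. */
        ask_backend_to_back_off(largest_pool_consumer());
    }
    else
    {
        /* The obvious, if not optimal, answer: roll back the transaction
         * that encountered the OOM. */
        abort_current_transaction();
    }
}

int main(void)
{
    handle_shared_pool_oom(8 * 1024 * 1024);
    return 0;
}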
My goal is to run a backend with 50-100 MB, with all local caches controlled for size. Transaction memory, with e-states included, should be sized at 8 MB per backend plus a fixed “spill” of some GB.
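To put illustrative numbers on that (made-up figures, not measurements): with, say, 500 backends that would be 500 x 8 MB = ~4 GB of pooled transaction memory plus perhaps a 2 GB spill reserve, while each backend itself stays within its 50-100 MB of size-controlled local caches.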
Yes, this is invasive and I’m sure to debug this for a while given my limited 
knowledge of the engine. I may yet fail spectacularly. On the other hand it’s 
conceptually pretty straight forward.
Cheers
Serge Rielau
SFDC
