I have been pondering how to deal with work_mem for a couple of months myself
and had very similar thoughts.
As far as I can tell the problem goes beyond work_mem though:

1. There are several hash operations, such as set-ops, hashed subplans, and hash aggregates, which today do not spill at all. We have solved them partially so far and, once complete, think the fixes can be pushed into community PG if there is desire for it (see the sketch after this list).
2. We also worry about large objects, which can bloat a backend.
3. There are other random allocations I fear I simply don’t know about.
4. OSes are chronically poor at trading memory between processes, even after the memory is freed, unless it is returned to the OS in big contiguous chunks.
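To illustrate what "not spilling" means in point 1, here is a tiny standalone C sketch. It is purely illustrative, none of it is PostgreSQL code, and the constants and spill_batch_to_disk are made up: the first loop is the shape of today's hash build, where memory grows unchecked, while the second enforces a work_mem-style budget by spilling.

#include <stdio.h>
#include <stddef.h>

#define TUPLE_SIZE   64            /* pretend each hashed tuple costs 64 bytes */
#define WORK_MEM     (4 * 1024)    /* tiny "work_mem" budget for the demo */

static size_t used = 0;            /* bytes currently held by the "hash table" */

static void spill_batch_to_disk(void)
{
    /* placeholder: a real implementation would write the current batch to a temp file */
    printf("spilling %zu bytes to disk\n", used);
    used = 0;
}

/* The problematic shape: memory use is unbounded and consults no budget. */
static void build_no_spill(int ntuples)
{
    for (int i = 0; i < ntuples; i++)
        used += TUPLE_SIZE;
    printf("no-spill build ended at %zu bytes\n", used);
}

/* The desired shape: stay within the budget by spilling before it is exceeded. */
static void build_with_budget(int ntuples)
{
    for (int i = 0; i < ntuples; i++)
    {
        if (used + TUPLE_SIZE > WORK_MEM)
            spill_batch_to_disk();
        used += TUPLE_SIZE;
    }
    printf("budgeted build ended at %zu bytes (cap %d)\n", used, WORK_MEM);
}

int main(void)
{
    build_no_spill(1000);
    used = 0;
    build_with_budget(1000);
    return 0;
}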
Just as you have, we have also considered holistic provisioning of work_mem across all consumers, but we find that to be too complex. Having an “emergency fund” in shared memory is also an option, but I find it too limiting. It is also the approach that was taken at DB2 when I was there, and it proved cumbersome.
So I’m currently pressing forward with a much more fundamental approach: pushing TopTransactionContext and its children into shared memory. To avoid fragmentation and serialization on latches I have defined the concept of a “context cluster”. The root of the cluster is the sole true allocator of memory; child contexts allocate their blocks as pallocs from the cluster root. Basically memory management goes recursive and the children live within the root. The root (TopTransactionContext) allocates big blocks, e.g. 8MB at a time. Within a transaction PG operates as usual, freelists and all, turning over these same 8MB or allocating more if needed. But at the end of every transaction big chunks of memory become available to share with other transactions again. A few places where we reparent contexts need to detect that this cannot be done between clusters (or into/out of a cluster) and do deep copies instead, but there are few of those. Come to think of it, all the cases I have encountered so far were SFDC-specific…
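To make the cluster idea concrete, here is a minimal standalone C sketch of the allocation path I have in mind, not the actual patch; all names (ClusterRoot, ChildContext, cluster_root_alloc, child_alloc_block, shared_memory_get_block) are hypothetical and real freelist/chunk bookkeeping is omitted:

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

#define BIG_BLOCK_SIZE (8 * 1024 * 1024)   /* the root grabs 8MB at a time */

typedef struct ClusterRoot
{
    char   *block;      /* current big block obtained from shared memory */
    size_t  used;       /* bytes handed out from the current block */
} ClusterRoot;

/* Stand-in for carving a big block out of shared memory. */
static char *shared_memory_get_block(size_t size)
{
    return malloc(size);    /* in the real design this would come from shmem */
}

/* Root-level allocation: bump-allocate within the current big block and
 * fetch another big block only when the current one is exhausted. */
static void *cluster_root_alloc(ClusterRoot *root, size_t size)
{
    if (root->block == NULL || root->used + size > BIG_BLOCK_SIZE)
    {
        root->block = shared_memory_get_block(BIG_BLOCK_SIZE);
        root->used = 0;
    }
    void *chunk = root->block + root->used;
    root->used += size;
    return chunk;
}

typedef struct ChildContext
{
    ClusterRoot *root;      /* children never touch shared memory directly */
} ChildContext;

/* A child context's block request becomes an allocation from the cluster
 * root, so the whole cluster turns over the same big blocks. */
static void *child_alloc_block(ChildContext *ctx, size_t size)
{
    return cluster_root_alloc(ctx->root, size);
}

int main(void)
{
    ClusterRoot  root  = { NULL, 0 };
    ChildContext child = { &root };

    void *a = child_alloc_block(&child, 1024);
    void *b = child_alloc_block(&child, 4096);
    printf("child allocations carved from the root's block: %p %p\n", a, b);

    /* At transaction end the root would hand its big blocks back to shared
     * memory in one shot, making them available to other backends. */
    return 0;
}

The point being that only the root ever talks to the shared allocator, so contention and fragmentation happen at 8MB granularity rather than per palloc.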
I’m also moving the EState from the portal heap to TopTransactionContext. At the end of the day the assumption is that most transactions only need one block from shared memory, and I can probably pin it to the backend, further reducing contention. If there is an out-of-memory situation - which should be very rare - there are multiple ways to deal with it. If there is no deadlock we can simply wait. If there is one, rolling back the transaction that encountered the OOM is the obvious - if not optimal - solution. Finding the biggest consumer and sending it a signal to back off would be another way to do it.
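Spelled out as a decision tree (again just a sketch with made-up names, not actual code):

#include <stdbool.h>
#include <stdio.h>

typedef enum OomAction
{
    OOM_WAIT_FOR_MEMORY,    /* no deadlock possible: simply wait */
    OOM_ROLLBACK_SELF,      /* the obvious, if not optimal, solution */
    OOM_SIGNAL_BIGGEST      /* ask the largest consumer to back off */
} OomAction;

/* Hypothetical policy knob: prefer squeezing the largest consumer over
 * rolling back the transaction that happened to hit the OOM. */
static bool prefer_signalling_biggest = false;

static OomAction choose_oom_action(bool waiting_would_deadlock)
{
    if (!waiting_would_deadlock)
        return OOM_WAIT_FOR_MEMORY;
    if (prefer_signalling_biggest)
        return OOM_SIGNAL_BIGGEST;
    return OOM_ROLLBACK_SELF;
}

int main(void)
{
    printf("no deadlock -> action %d\n", (int) choose_oom_action(false));
    printf("deadlock    -> action %d\n", (int) choose_oom_action(true));
    return 0;
}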
My goal is to run a backend with 50-100MB, with all local caches controlled for size. Transaction memory, EStates included, should be sized at 8MB per backend plus a fixed “spill” of some GB.
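As a back-of-the-envelope check (the 500 backends and the 2GB spill pool below are assumed figures, not numbers from above):

#include <stdio.h>

int main(void)
{
    const int    backends           = 500;          /* assumed fleet size */
    const double tx_mem_per_backend = 8.0;          /* MB, as stated above */
    const double spill_pool         = 2.0 * 1024;   /* MB, assumed 2GB "spill" */

    double total = backends * tx_mem_per_backend + spill_pool;
    printf("%d backends * %.0f MB + %.0f MB spill = %.0f MB (~%.1f GB)\n",
           backends, tx_mem_per_backend, spill_pool, total, total / 1024.0);
    return 0;
}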
Yes, this is invasive, and I’m sure I’ll be debugging it for a while given my limited knowledge of the engine. I may yet fail spectacularly. On the other hand it’s conceptually pretty straightforward.
Cheers,
Serge Rielau
SFDC