On Sat, Apr 26, 2025 at 5:07 PM Tomas Vondra <to...@vondra.me> wrote: > > Just for the record, it sounds quite unsafe to me too. I could > > credit it being all right to examine the process' MemoryContext data > > structures, but calling dsa_create() from CFI seems really insane. > > Way too many moving parts there. > > Maybe I'm oblivious to some very obvious issues, but why is this so > different from everything else that is already called from > ProcessInterrupts()? Perhaps dsa_create() is more complex compared to > the other stuff (haven't checked), but why would it be unsafe? > > The one risk I can think of is a risk of recursion - if any of the > functions called from ProcessGetMemoryContextInterrupt() does CFI, could > that be a problem if there's another pending signal. I see some other > handlers (e.g. ProcessParallelApplyMessages) handle this by explicitly > holding interrupts. Should ProcessGetMemoryContextInterrupt() do the > same thing? > > In any case, if DSA happens to not be the right way to transfer this, > what should we use instead? The only thing I can think of is some sort > of pre-allocated chunk of shared memory.
The big disadvantage of a pre-allocated chunk of shared memory is that it would necessarily be fixed size, and a memory context dump can be really big. Another alternative would be a temporary file. But none of that settles the question of safety. I think part of the reason why it's possible for us to have disagreements about whether this is safe is that we don't have any clear documentation of what you can assume to be true at a CHECK_FOR_INTERRUPTS(). It's not possible for there to be an LWLock held at that point, because we hold off interrupts when you acquire an LWLock and don't re-enable them until all LWLocks have been released. We can't be holding a spinlock, either, because we only insert CHECK_FOR_INTERRUPTS() at the top of loops and code that holds a spinlock is supposed to be straight-line code that never loops. But what else can you assume? Can you assume, for example, that there's a transaction? I doubt it. Can you assume that the transaction is non-aborted? I doubt that even more. There's no obvious-to-me reason why those things should be true. And in fact if you try this on a backend waiting in an aborted transaction, it breaks: robert.haas=# select pg_backend_pid(); pg_backend_pid ---------------- 19321 (1 row) robert.haas=# begin; BEGIN robert.haas=*# select 1/0; ERROR: division by zero And then from another session, run this command, using the PID from above: select * from pg_get_process_memory_contexts(19321, false, 1); Then you get: 2025-04-30 15:14:33.878 EDT [19321] ERROR: ResourceOwnerEnlarge called after release started 2025-04-30 15:14:33.878 EDT [19321] FATAL: terminating connection because protocol synchronization was lost I kind of doubt whether that's the only problem here, but it's *a* problem. -- Robert Haas EDB: http://www.enterprisedb.com