On 14/06/2024 14:37, Heikki Linnakangas wrote:
I was performing tests around multixid wraparound, when I ran into this
assertion:

TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: 
"../src/backend/utils/mmgr/mcxt.c", Line: 1353, PID: 920981
postgres: autovacuum worker template0(ExceptionalCondition+0x6e)[0x560a501e866e]
postgres: autovacuum worker template0(+0x5dce3d)[0x560a50217e3d]
postgres: autovacuum worker template0(ForwardSyncRequest+0x8e)[0x560a4ffec95e]
postgres: autovacuum worker template0(RegisterSyncRequest+0x2b)[0x560a50091eeb]
postgres: autovacuum worker template0(+0x187b0a)[0x560a4fdc2b0a]
postgres: autovacuum worker template0(SlruDeleteSegment+0x101)[0x560a4fdc2ab1]
postgres: autovacuum worker template0(TruncateMultiXact+0x2fb)[0x560a4fdbde1b]
postgres: autovacuum worker 
template0(vac_update_datfrozenxid+0x4b3)[0x560a4febd2f3]
postgres: autovacuum worker template0(+0x3adf66)[0x560a4ffe8f66]
postgres: autovacuum worker template0(AutoVacWorkerMain+0x3ed)[0x560a4ffe7c2d]
postgres: autovacuum worker template0(+0x3b1ead)[0x560a4ffecead]
postgres: autovacuum worker template0(+0x3b620e)[0x560a4fff120e]
postgres: autovacuum worker template0(+0x3b3fbb)[0x560a4ffeefbb]
postgres: autovacuum worker template0(+0x2f724e)[0x560a4ff3224e]
/lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f62cc642c8a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f62cc642d45]
postgres: autovacuum worker template0(_start+0x21)[0x560a4fd16f31]
2024-06-14 13:11:02.025 EEST [920971] LOG:  server process (PID 920981) was 
terminated by signal 6: Aborted
2024-06-14 13:11:02.025 EEST [920971] DETAIL:  Failed process was running: 
autovacuum: VACUUM pg_toast.pg_toast_13407 (to prevent wraparound)

The attached python script reproduces this pretty reliably. It's a
reduced version of a larger test script I was working on, it probably
could be simplified further for this particular issue.

Looking at the code, it's pretty clear how it happens:

1. TruncateMultiXact does START_CRIT_SECTION();

2. In the critical section, it calls PerformMembersTruncation() ->
SlruDeleteSegment() -> SlruInternalDeleteSegment() ->
RegisterSyncRequest() -> ForwardSyncRequest()

3. If the fsync request queue is full, it calls
CompactCheckpointerRequestQueue(), which calls palloc0. Pallocs are not
allowed in a critical section.

A straightforward fix is to add a check to
CompactCheckpointerRequestQueue() to bail out without compacting, if
it's called in a critical section. That would cover any other cases like
this, where RegisterSyncRequest() is called in a critical section. I
haven't tried searching if any more cases like this exist.

But wait there is more!

After applying that fix in CompactCheckpointerRequestQueue(), the test
script often gets stuck. There's a deadlock between the checkpointer,
and the autovacuum backend trimming the SLRUs:

1. TruncateMultiXact does this:

        MyProc->delayChkptFlags |= DELAY_CHKPT_START;

2. It then makes that call to PerformMembersTruncation() and
RegisterSyncRequest(). If it cannot queue the request, it sleeps a
little and retries. But the checkpointer is stuck waiting for the
autovacuum backend, because of delayChkptFlags, and will never clear the
queue.

To fix, I propose to add AbsorbSyncRequests() calls to the wait-loops in
CreateCheckPoint().


Attached patch fixes both of those issues.

Committed and backpatched down to v14. This particular scenario cannot happen in older versions because the RegisterFsync() on SLRU truncation was added in v14. In principle, I think older versions might have similar issues, but given that when assertions are disabled this is only a problem if you happen to run out of memory in the critical section, it doesn't seem worth backpatching further unless someone reports a concrete case.

--
Heikki Linnakangas
Neon (https://neon.tech)



Reply via email to