Bonjour Thomas, On Wed, Jan 15, 2025 at 6:21 AM Thomas Munro <thomas.mu...@gmail.com> wrote: > Right. Before commit 0da096d7 in v17, the recovery conflict code > running in a signal handler would have set ProcDiePending, so this > looks like an unintended regression due to that commit.
The issue is happening on instances with 14.13. Though I haven't tried to reproduce it yet on a version older than the current HEAD yet. > In your patch you have CHECK_FOR_INTERRUPTS(), but I think that means > it could also service other interrupts that we probably don't want to > introduce without more study, no? For example if > ProcessRecoveryConflictInterrupts() returns without throwing, and > another pending interrupt flag is also set. > > Thinking of other proc signals/interrupts, of course it's not really > acceptable that we don't process at least also global barriers here, > ie for a time controllable by a client, but let's call that an > independent problem. I think the minimal thing for this specific > problem might be to use ProcessRecoveryConflictInterrupts() directly. I think that's definitely happening, my first version used a pg_unreachable after the CHECK_FOR_INTERRUPTS() which was triggered. I've replaced it by a direct call to ProcessRecoveryConflictInterrupts(). > The read from the socket also seems to have a variant of the problem, > unless I'm missing something, except in that case I'm not sure it's > new. For sending you're proposing that our only choice is to kill the > session, which makes sense, but the equivalent read side stay-in-sync > strategy is to keep deferring until the client gets around to sending > a complete message, if I read that right. I'm not sure the read path has the same issue. That would require a replica to do a non command read but I don't think it's possible? A recovering instance can't run COPY, and is there another situation where a replica would do a blocking read without the DoingCommandRead flag? Also, the read side is also killing the session since DoingCommandRead will be true most of the time, falling through to the FATAL case. > Hmm, I wonder if we could write tests for both directions in > 031_recovery_conflict.pl, using a Perl to do raw pq protocol in TCP, > as seen in some other recent proposals... I might have a crack at > that soon unless you want to. I've tried adding a test for this. I've implemented some utility functions to start a session, read a payload and send a simple query using raw TCP. The implementation is definitely a bit rough as I'm far from being fluent in perl. The socket utility functions should probably be moved to a dedicated file and the Windows build is currently failing. However, the test does trigger the issue with the recovery blocked by the session blocked in ClientWrite.
v02-0001-Accept-recovery-conflict-interrupt-on-blocked-wr.patch
Description: Binary data