On Fri, Jun 14, 2024 at 3:46 PM Alexander Korotkov <aekorot...@gmail.com> wrote: > On Wed, Jun 12, 2024 at 11:36 AM Kartyshov Ivan > <i.kartys...@postgrespro.ru> wrote: > > > > Hi, Alexander, Here, I made some improvements according to your > > discussion with Heikki. > > > > On 2024-04-11 18:09, Alexander Korotkov wrote: > > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinn...@iki.fi> > > > wrote: > > >> In a nutshell, it's possible for the loop in WaitForLSN to exit > > >> without > > >> cleaning up the process from the heap. I was able to hit that by > > >> adding > > >> a delay after the addLSNWaiter() call: > > >> > > >> > TRAP: failed Assert("!procInfo->inHeap"), File: > > >> > "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152 > > >> > postgres: heikki postgres [local] > > >> > CALL(ExceptionalCondition+0xab)[0x55da1f68787b] > > >> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8] > > >> > postgres: heikki postgres [local] > > >> > CALL(WaitForLSN+0x139)[0x55da1f2052cc] > > >> > postgres: heikki postgres [local] > > >> > CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5] > > >> > postgres: heikki postgres [local] > > >> > CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a] > > >> > postgres: heikki postgres [local] > > >> > CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9] > > >> > > >> I think there's a similar race condition if the timeout is reached at > > >> the same time that the startup process wakes up the process. > > > > > > Thank you for catching this. I think WaitForLSN() just needs to call > > > deleteLSNWaiter() unconditionally after exit from the loop. > > > > Fix and add injection point test on this race condition. > > > > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinn...@iki.fi> > > > wrote: > > >> The docs could use some-copy-editing, but just to point out one issue: > > >> > > >> > There are also procedures to control the progress of recovery. > > >> > > >> That's copy-pasted from an earlier sentence at the table that lists > > >> functions like pg_promote(), pg_wal_replay_pause(), and > > >> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control > > >> the > > >> progress of recovery like those functions do, it only causes the > > >> calling > > >> backend to wait. > > > > Fix documentation and add extra tests on multi-standby replication > > and cascade replication. > > Thank you for the revised patch. > > I see couple of items which are not addressed in this revision. > * As Heikki pointed, that it's currently not possible in one round > trip to call call pg_wal_replay_wait() and do other job. The attached > patch addresses this. It milds the requirement. Now, it's not > necessary to be in atomic context. It's only necessary to have no > active snapshot. This new requirement works for me so far. I > appreciate a feedback on this. > * As Alvaro pointed, multiple waiters case isn't covered by the test > suite. That leads to no coverage of some code paths. The attached > patch addresses this by adding a test case with multiple waiters. > > The rest looks good to me.
Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl. 1. It's not clear why this test needs node_standby2 at all. It seems useless. 2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This location seems to be unachievable in this test. So, it's not clear what race condition this test could potentially detect. 3. I think it would make sense to check for the race condition reported by Heikki. That is to insert the injection point at the beginning of WaitLSNSetLatches(). Links. 1. https://www.postgresql.org/message-id/flat/CAPpHfdvGRssjqwX1%2Bidm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw%40mail.gmail.com#557900e860457a9e24256c93a2ad4920 ------ Regards, Alexander Korotkov Supabase