Re: VM corruption on standby

2025-09-11 Thread Alexander Korotkov
On Thu, Sep 11, 2025 at 1:59 AM Thomas Munro wrote: > On Thu, Sep 11, 2025 at 12:00 AM Andrey Borodin wrote: > > > On 10 Sep 2025, at 15:25, Alexander Korotkov wrote: > > > I believe we need some > > > general solution. We might have a special kind of condition variable, > > > a critical secti

Re: VM corruption on standby

2025-09-10 Thread Michael Paquier
On Thu, Sep 11, 2025 at 10:59:19AM +1200, Thomas Munro wrote: > FWIW I'm working on a patch set that kills all backends without > releasing any locks when the postmaster exits. Then CVs and other > latch-based stuff should be safe in this context. Work was > interrupted by a vacation but I hope

Re: VM corruption on standby

2025-09-10 Thread Thomas Munro
On Thu, Sep 11, 2025 at 12:00 AM Andrey Borodin wrote: > > On 10 Sep 2025, at 15:25, Alexander Korotkov wrote: > > I believe we need some > > general solution. We might have a special kind of condition variable, > > a critical section condition variable, where both waiting and > > signaling mus

Re: VM corruption on standby

2025-09-10 Thread Andrey Borodin
> On 10 Sep 2025, at 15:25, Alexander Korotkov wrote: > > I think the approach #2 is more appropriate for bc22dc0e0d, because in > the critical section we only wait for other processes also in the > critical section (so, there is no risk they will exit immediately > after postmaster death maki

Re: VM corruption on standby

2025-09-10 Thread Alexander Korotkov
On Wed, Sep 3, 2025 at 11:28 AM Alexander Korotkov wrote: > > On Wed, Sep 3, 2025 at 9:47 AM Andrey Borodin wrote: > > > > > On 3 Sep 2025, at 11:37, Alexander Korotkov wrote: > > > > > > Could you, please, recheck? > > > > That patch also adds CondVar sleep in critical section. That patch is ho

Re: VM corruption on standby

2025-09-03 Thread Alexander Korotkov
On Wed, Sep 3, 2025 at 9:47 AM Andrey Borodin wrote: > > > On 3 Sep 2025, at 11:37, Alexander Korotkov wrote: > > > > Could you, please, recheck? > > That patch also adds CondVar sleep in critical section. That patch is how we > understood that such sleep is dangerous. > > Actual patch to deteac

Re: VM corruption on standby

2025-09-02 Thread Andrey Borodin
> On 3 Sep 2025, at 11:37, Alexander Korotkov wrote: > > Could you, please, recheck? That patch also adds CondVar sleep in critical section. That patch is how we understood that such sleep is dangerous. Actual patch to detect a problem is much simpler: ``` diff --git a/src/backend/storage/

Re: VM corruption on standby

2025-09-02 Thread Alexander Korotkov
On Tue, Aug 12, 2025 at 8:38 AM Kirill Reshke wrote: > On Wed, 6 Aug 2025 at 20:00, Andrey Borodin wrote: > > > > Hi hackers! > > > > I was reviewing the patch about removing xl_heap_visible and found the VM\WAL machinery very interesting. > > At Yandex we had several incidents with corrupted VM

Re: VM corruption on standby

2025-08-28 Thread Nathan Bossart
On Mon, Aug 25, 2025 at 10:07:26AM -0500, Nathan Bossart wrote: > Now that this is reverted, can the related open item be marked as resolved? Since there has been no further discussion, I will go ahead and resolve the open item. -- nathan

Re: VM corruption on standby

2025-08-25 Thread Nathan Bossart
[RMT hat] On Thu, Aug 21, 2025 at 06:42:48PM -0400, Tom Lane wrote: > Alexander Korotkov writes: >> On Tue, Aug 19, 2025 at 10:50 PM Tom Lane wrote: >>> Therefore, I vote for reverting bc22dc0e0. Hopefully only >>> temporarily, but it's too late to figure out another way for v18, >>> and I don'

Re: VM corruption on standby

2025-08-21 Thread Thomas Munro
On Fri, Aug 22, 2025 at 10:27 AM Alexander Korotkov wrote: > And let's retry it for v19. +1 I'm hoping we can fix PM death handling soon, and then I assume this can go straight back in without modification. CVs are an essential low level synchronisation component that really should work in lots

Re: VM corruption on standby

2025-08-21 Thread Alexander Korotkov
Hi, Tom! On Tue, Aug 19, 2025 at 10:50 PM Tom Lane wrote: > I'm inclined to think that we do want to prohibit WaitEventSetWait > inside a critical section --- it just seems like a bad idea all > around, even without considering this specific failure mode. > Therefore, I vote for reverting bc22dc0

Re: VM corruption on standby

2025-08-21 Thread Tom Lane
Alexander Korotkov writes: > On Tue, Aug 19, 2025 at 10:50 PM Tom Lane wrote: >> Therefore, I vote for reverting bc22dc0e0. Hopefully only >> temporarily, but it's too late to figure out another way for v18, >> and I don't think that bc22dc0e0 is such an essential improvement >> that we can't af

Re: VM corruption on standby

2025-08-21 Thread Michael Paquier
On Fri, Aug 22, 2025 at 01:27:17AM +0300, Alexander Korotkov wrote: > I'm OK about this. Do you mind if I revert bc22dc0e0 myself? > And let's retry it for v19. Yes, agreed that it may be the best thing to do for v18 based on the information we have gathered until now. -- Michael

Re: VM corruption on standby

2025-08-20 Thread Thomas Munro
On Wed, Aug 20, 2025 at 3:47 PM Tom Lane wrote: > Having said that, we should in any case have a better story on > what WaitEventSetWait should do after detecting postmaster death. > So I'm all for trying to avoid the proc_exit path if we can > design a better answer. Yeah. I've posted a concept

Re: VM corruption on standby

2025-08-20 Thread Andrey Borodin
> On 20 Aug 2025, at 00:55, Tom Lane wrote: > > Andrey Borodin writes: >> I believe there is a bug with PageIsAllVisible(page) && >> visibilitymap_clear(). But I cannot prove it with an injection point test. >> Because injection points rely on CondVar, that per se creates corruption in >>

Re: VM corruption on standby

2025-08-20 Thread Michael Paquier
On Wed, Aug 20, 2025 at 09:14:04AM -0400, Andres Freund wrote: > On 2025-08-19 23:47:21 -0400, Tom Lane wrote: >> Hm. It still makes me mighty uncomfortable, because the point of a >> critical section is "crash the database if anything goes wrong during >> this bit". Waiting for another process -

Re: VM corruption on standby

2025-08-20 Thread Andres Freund
Hi, On 2025-08-19 23:47:21 -0400, Tom Lane wrote: > Thomas Munro writes: > > On Wed, Aug 20, 2025 at 7:50 AM Tom Lane wrote: > >> I'm inclined to think that we do want to prohibit WaitEventSetWait > >> inside a critical section --- it just seems like a bad idea all > >> around, even without cons

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Wed, 20 Aug 2025 at 00:50, Tom Lane wrote: > > Kirill Reshke writes: > > I revert this commit (there were conflicts but I resolved them) and > > added assert for crit sections in WaitEventSetWait. > > Your patch still contains some conflict markers :-(. Attached is > a corrected version, jus

Re: VM corruption on standby

2025-08-19 Thread Tom Lane
Thomas Munro writes: > On Wed, Aug 20, 2025 at 7:50 AM Tom Lane wrote: >> I'm inclined to think that we do want to prohibit WaitEventSetWait >> inside a critical section --- it just seems like a bad idea all >> around, even without considering this specific failure mode. > FWIW aio/README.md des

Re: VM corruption on standby

2025-08-19 Thread Thomas Munro
On Wed, Aug 20, 2025 at 11:59 AM Thomas Munro wrote: > they can't all be > blocked in sig_wait() unless there is already a deadlock. s/sig_wait()/sem_wait()/

Re: VM corruption on standby

2025-08-19 Thread Thomas Munro
On Wed, Aug 20, 2025 at 7:50 AM Tom Lane wrote: > I'm inclined to think that we do want to prohibit WaitEventSetWait > inside a critical section --- it just seems like a bad idea all > around, even without considering this specific failure mode. FWIW aio/README.md describes a case where we'd need

Re: VM corruption on standby

2025-08-19 Thread Michael Paquier
On Tue, Aug 19, 2025 at 03:55:41PM -0400, Tom Lane wrote: > Yeah, I was coming to similar conclusions in the reply I just sent: > we don't really want a policy that we can't put injection-point-based > delays inside critical sections. So that infrastructure is leaving > something to be desired. Y

Re: VM corruption on standby

2025-08-19 Thread Tom Lane
Andrey Borodin writes: > I believe there is a bug with PageIsAllVisible(page) && > visibilitymap_clear(). But I cannot prove it with an injection point test. > Because injection points rely on CondVar, that per se creates corruption in > critical section. So I'm reading this discussion and won

Re: VM corruption on standby

2025-08-19 Thread Tom Lane
Kirill Reshke writes: > On Tue, 19 Aug 2025 at 10:32, Thomas Munro wrote: >> Any idea involving deferring the handling of PM death from here >> doesn't seem right: you'd keep waiting for the CV, but the backend >> that would wake you might have exited. Yeah. Taking the check for PM death out of

Re: VM corruption on standby

2025-08-19 Thread Andrey Borodin
> On 19 Aug 2025, at 23:23, Kirill Reshke wrote: > >> We'd probably be best off to get back to the actual bug the >> thread started with, namely whether we aren't doing the wrong >> thing with VM-update order of operations. >> >>regards, tom lane > > My understanding

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 23:08, Tom Lane wrote: > > Kirill Reshke writes: > > On Tue, 19 Aug 2025 at 21:16, Yura Sokolov wrote: > >> `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in > >> WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS > >> SIGKILLED, and doesn

Re: VM corruption on standby

2025-08-19 Thread Tom Lane
Kirill Reshke writes: > On Tue, 19 Aug 2025 at 21:16, Yura Sokolov wrote: >> `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in >> WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS >> SIGKILLED, and doesn't lead to any problem, if postmaster is not SIGKILL-ed >>

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 20:24, Andres Freund wrote: > > Hi, > > On 2025-08-20 03:19:38 +1200, Thomas Munro wrote: > > On Wed, Aug 20, 2025 at 2:57 AM Andres Freund wrote: > > > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > > > On linux - the primary OS with OOM killer troubles - I'm pret

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 21:16, Yura Sokolov wrote: > > > `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in > WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS > SIGKILLED, and doesn't lead to any problem, if postmaster is not SIGKILL-ed > (since postmaster will S
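The rule quoted above, `if (CritSectionCount != 0) _exit(2) else proc_exit(1)`, can be sketched as a small decision function. This is only a toy model: `ExitAction`, `exit_action_on_postmaster_death` and the `crit_section_count` parameter are hypothetical stand-ins introduced here for illustration, not PostgreSQL's actual code, and whether this policy is the right fix is exactly what the thread debates.

```c
#include <assert.h>

/* Hypothetical names for illustration only; not PostgreSQL's API. */
typedef enum ExitAction
{
    EXIT_WITH_CLEANUP,  /* stands in for proc_exit(1): exit callbacks run */
    EXIT_IMMEDIATELY    /* stands in for _exit(2): no callbacks, locks stay held */
} ExitAction;

/*
 * Decide how a backend should leave after noticing postmaster death.
 * crit_section_count is a toy stand-in for the CritSectionCount counter
 * mentioned in the thread.
 */
ExitAction
exit_action_on_postmaster_death(unsigned int crit_section_count)
{
    /* Inside a critical section, skip all cleanup so no lock is released. */
    if (crit_section_count != 0)
        return EXIT_IMMEDIATELY;

    return EXIT_WITH_CLEANUP;
}
```

The point of the immediate path is that a process which dies without running its exit callbacks never releases the buffer content lock, so no other process can flush a half-updated page behind its back.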

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 21:16, Yura Sokolov wrote: > > That is not true. > elog(PANIC) doesn't clear LWLocks. And XLogWrite, which could be called > from AdvanceXLInsertBuffer, may call elog(PANIC) from several places. > > It doesn't lead to any error, because usually postmaster is alive and it

Re: VM corruption on standby

2025-08-19 Thread Yura Sokolov
19.08.2025 16:43, Kirill Reshke wrote: > On Tue, 19 Aug 2025 at 18:29, Yura Sokolov wrote: > > >> Latch and ConditionVariable (that uses Latch) are among basic >> synchronization primitives in PostgreSQL. > > Sure > >> Therefore they have to work correctly in any place: in critical section, in

Re: VM corruption on standby

2025-08-19 Thread Andres Freund
Hi, On 2025-08-20 03:19:38 +1200, Thomas Munro wrote: > On Wed, Aug 20, 2025 at 2:57 AM Andres Freund wrote: > > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > > On Linux - the primary OS with OOM killer troubles - I'm pretty sure > > > > lwlock > > > > waiters would get killed due t

Re: VM corruption on standby

2025-08-19 Thread Thomas Munro
On Wed, Aug 20, 2025 at 2:57 AM Andres Freund wrote: > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > On Linux - the primary OS with OOM killer troubles - I'm pretty sure > > > lwlock > > > waiters would get killed due to the postmaster death signal we've > > > configured > > > (c.f.

Re: VM corruption on standby

2025-08-19 Thread Andres Freund
Hi, On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > On Linux - the primary OS with OOM killer troubles - I'm pretty sure > > lwlock > > waiters would get killed due to the postmaster death signal we've configured > > (c.f. PostmasterDeathSignalInit()). > > No, that has a handler that ju

Re: VM corruption on standby

2025-08-19 Thread Thomas Munro
On Wed, Aug 20, 2025 at 1:56 AM Andres Freund wrote: > On 2025-08-19 02:13:43 -0400, Tom Lane wrote: > > > Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after > > > someone who holds x calls _exit()? > > > > If someone who holds x is killed by (say) the OOM killer, how do > > we

Re: VM corruption on standby

2025-08-19 Thread Andres Freund
Hi, On 2025-08-19 02:13:43 -0400, Tom Lane wrote: > Thomas Munro writes: > > On Tue, Aug 19, 2025 at 4:52 AM Tom Lane wrote: > >> But I'm of the opinion that proc_exit > >> is the wrong thing to use after seeing postmaster death, critical > >> section or no. We should assume that system integri

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 18:29, Yura Sokolov wrote: > Latch and ConditionVariable (that uses Latch) are among basic > synchronization primitives in PostgreSQL. Sure > Therefore they have to work correctly in any place: in critical section, in > wal logging, etc. No. Before bc22dc0e0ddc2dcb6043a

Re: VM corruption on standby

2025-08-19 Thread Yura Sokolov
19.08.2025 16:17, Kirill Reshke wrote: > On Tue, 19 Aug 2025 at 14:14, Kirill Reshke wrote: >> >> This thread is a candidate for [0] >> >> >> [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items >> > > Let me summarize this thread for ease of understanding of what's going on: > > Timelin

Re: VM corruption on standby

2025-08-19 Thread Yura Sokolov
19.08.2025 16:09, Andres Freund wrote: > Hi, > > On 2025-08-19 15:56:05 +0300, Yura Sokolov wrote: >> 09.08.2025 22:54, Kirill Reshke wrote: >>> On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev >>> wrote: >>> Perhaps there was a good reason to update the VM *before* creating WAL records

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 14:14, Kirill Reshke wrote: > > This thread is a candidate for [0] > > > [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items > Let me summarize this thread for ease of understanding of what's going on: Timeline: 1) Andrey Borodin sends a patch (on 6 Aug) claiming

Re: VM corruption on standby

2025-08-19 Thread Andres Freund
Hi, On 2025-08-19 15:56:05 +0300, Yura Sokolov wrote: > 09.08.2025 22:54, Kirill Reshke wrote: > > On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev > > wrote: > > > >> Perhaps there was a good > >> reason to update the VM *before* creating WAL records I'm unaware of. > > > > Looks like 503c730 in

Re: VM corruption on standby

2025-08-19 Thread Yura Sokolov
09.08.2025 22:54, Kirill Reshke wrote: > On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev > wrote: > >> Perhaps there was a good >> reason to update the VM *before* creating WAL records I'm unaware of. > > Looks like 503c730 intentionally does it this way; however, I have not > yet fully underst

Re: VM corruption on standby

2025-08-19 Thread Yura Sokolov
10.08.2025 08:45, Kirill Reshke wrote: > On Sun, 10 Aug 2025 at 01:55, Aleksander Alekseev > wrote: >> For this reason we have PageHeaderData.pd_lsn for instance - to make sure >> pages are evicted only *after* the record that changed it is written >> to disk (because WAL records can't be applied

Re: VM corruption on standby

2025-08-19 Thread Tomas Vondra
On 8/19/25 11:14, Kirill Reshke wrote: > This thread is a candidate for [0] > > > [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items > Added, with Alexander as an owner (assuming it really is caused by commit bc22dc0e0d). regards -

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
This thread is a candidate for [0] [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items Best regards, Kirill Reshke

Re: VM corruption on standby

2025-08-18 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 10:32, Thomas Munro wrote: > > I don't know if there are other ways that LWLockReleaseAll() can lead > to persistent corruption that won't be corrected by crash recovery, > but this one is probably new since the following commit, explaining > the failure to reproduce on v17

Re: VM corruption on standby

2025-08-18 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 11:13, Tom Lane wrote: > > Thomas Munro writes: > > On Tue, Aug 19, 2025 at 4:52 AM Tom Lane wrote: > >> But I'm of the opinion that proc_exit > >> is the wrong thing to use after seeing postmaster death, critical > >> section or no. We should assume that system integrity

Re: VM corruption on standby

2025-08-18 Thread Tom Lane
Thomas Munro writes: > On Tue, Aug 19, 2025 at 4:52 AM Tom Lane wrote: >> But I'm of the opinion that proc_exit >> is the wrong thing to use after seeing postmaster death, critical >> section or no. We should assume that system integrity is already >> compromised, and get out as fast as we can w

Re: VM corruption on standby

2025-08-18 Thread Kirill Reshke
Hi! Thank you for paying attention to this. On Tue, 19 Aug 2025 at 10:32, Thomas Munro wrote: > > On Tue, Aug 19, 2025 at 4:52 AM Tom Lane wrote: > > But I'm of the opinion that proc_exit > > is the wrong thing to use after seeing postmaster death, critical > > section or no. We should assume

Re: VM corruption on standby

2025-08-18 Thread Thomas Munro
On Tue, Aug 19, 2025 at 4:52 AM Tom Lane wrote: > But I'm of the opinion that proc_exit > is the wrong thing to use after seeing postmaster death, critical > section or no. We should assume that system integrity is already > compromised, and get out as fast as we can with as few side-effects > as

Re: VM corruption on standby

2025-08-18 Thread Tom Lane
Kirill Reshke writes: > On Sun, 17 Aug 2025 at 19:33, Tom Lane wrote: >> I do not like this patch one bit: it will replace one set of problems >> with another set, namely systems that fail to shut down. > I did not observe this during my by-hand testing. I am under the > impression that CRIT sec

Re: VM corruption on standby

2025-08-18 Thread Kirill Reshke
On Mon, 18 Aug 2025 at 13:15, I wrote: > > I do not like this patch one bit: it will replace one set of problems > > with another set, namely systems that fail to shut down. > > I did not observe this during my by-hand testing. I am sorry: I was wrong. This is exactly what happens in this test (mo

Re: VM corruption on standby

2025-08-18 Thread Kirill Reshke
Hi! Thank you for paying attention to this. On Sun, 17 Aug 2025 at 19:33, Tom Lane wrote: > > Kirill Reshke writes: > > [ v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patch ] > > I do not like this patch one bit: it will replace one set of problems > with another set, namely system

Re: VM corruption on standby

2025-08-17 Thread Andrey Borodin
> On 17 Aug 2025, at 17:33, Tom Lane wrote: > > So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the > places that are responding to postmaster death. +1. But should we _exit(2) only in critical section or always in case of postmaster death? Another question that was botherin

Re: VM corruption on standby

2025-08-17 Thread Tom Lane
Kirill Reshke writes: > [ v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patch ] I do not like this patch one bit: it will replace one set of problems with another set, namely systems that fail to shut down. I think the actual bug here is the use of proc_exit(1) after observing postma

Re: VM corruption on standby

2025-08-13 Thread Kirill Reshke
On Thu, 14 Aug 2025 at 10:41, Kirill Reshke wrote: > What I am trying to reproduce is the following: > > 1) Some process p1 locks some buffer (name it buf1), enters CRIT > section, calls MarkBufferDirty and hangs inside XLogInsert on CondVar > in (GetXLogBuffer -> AdvanceXLInsertBuffer). > 2) CHECKPOINT

Re: VM corruption on standby

2025-08-13 Thread Kirill Reshke
On Wed, 13 Aug 2025 at 16:15, I wrote: > I did not find any doc or other piece of information indicating > whether WaitEventSetWait and critical sections are allowed. But I do > think this is bad, because we do not process interrupts during > critical sections, so it is unclear to me why we shou

Re: VM corruption on standby

2025-08-13 Thread Kirill Reshke
On Tue, 12 Aug 2025 at 13:00, I wrote: > While this aims to find existing VM corruption (I mean, in PG <= 17), > this reproducer does not seem to work on pg17. At least, I did not > manage to reproduce this scenario on pg17. > > This makes me think this exact corruption may be pg18-only. Is it > p

Re: VM corruption on standby

2025-08-12 Thread Kirill Reshke
On Wed, 6 Aug 2025 at 20:00, Andrey Borodin wrote: > > Hi hackers! > > I was reviewing the patch about removing xl_heap_visible and found the VM\WAL > machinery very interesting. > At Yandex we had several incidents with corrupted VM and on pgconf.dev > colleagues from AWS confirmed that they sa

Re: VM corruption on standby

2025-08-11 Thread Kirill Reshke
On Tue, 12 Aug 2025 at 10:38, I wrote: > CHECKPOINT > somehow manages to flush the heap page when the instance is kill -9'ed. This corruption does not reproduce without a CHECKPOINT call; however, I do not see any suspicious syscall that the CHECKPOINT process makes. It does not write anything to disk here, isn

Re: VM corruption on standby

2025-08-11 Thread Kirill Reshke
On Wed, 6 Aug 2025 at 20:00, Andrey Borodin wrote: > > Hi hackers! > > I was reviewing the patch about removing xl_heap_visible and found the VM\WAL > machinery very interesting. > At Yandex we had several incidents with corrupted VM and on pgconf.dev > colleagues from AWS confirmed that they sa

Re: VM corruption on standby

2025-08-09 Thread Andrey Borodin
> On 9 Aug 2025, at 23:54, Aleksander Alekseev wrote: > > IMHO: logging the changes first, then allowing to evict the page. VM and BufferManager code does not allow flush of a buffer until changes are logged. The problem is that our crash-exiting system destroys locks that protect buffer fr

Re: VM corruption on standby

2025-08-09 Thread Kirill Reshke
On Sun, 10 Aug 2025 at 01:55, Aleksander Alekseev wrote: > For this reason we have PageHeaderData.pd_lsn for instance - to make sure > pages are evicted only *after* the record that changed it is written > to disk (because WAL records can't be applied to pages from the > future). We don't bump th
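The pd_lsn rule quoted above (a page may be written out only after the WAL record that changed it has reached disk) can be sketched as a single comparison. This is a minimal illustration under stated assumptions: `ToyLsn` and `toy_page_may_be_flushed` are hypothetical names made up here, and the real buffer manager logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy type; models a WAL position as a plain integer. */
typedef uint64_t ToyLsn;

/*
 * WAL-before-data check: the page carries the LSN of the last record that
 * modified it, and it may go to disk only once WAL is durable up to that LSN.
 */
bool
toy_page_may_be_flushed(ToyLsn page_lsn, ToyLsn wal_flushed_upto)
{
    return page_lsn <= wal_flushed_upto;
}
```

Note what this check cannot catch: if a page is modified in memory but its LSN has not been advanced yet, the comparison still passes, which is why the content lock has to stay held until the WAL record is inserted.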

Re: VM corruption on standby

2025-08-09 Thread Aleksander Alekseev
Hi Andrey, > 0. checkpointer is going to flush a heap buffer but waits on content lock > 1. client is resetting PD_ALL_VISIBLE from page > 2. postmaster is killed and commands client to go down > 3. client calls LWLockReleaseAll() at ProcKill() (?) > 4. checkpointer flushes buffer with reset PG_ALL
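The numbered sequence quoted above can be walked through with a toy state machine. Everything here is a simplified assumption for illustration: `ToyState` and `toy_race_leaves_unlogged_change` are hypothetical names, the lock is a plain boolean, and the model assumes (as discussed elsewhere in the thread) that the client has not yet written its WAL record when it exits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of one heap page and its protecting lock. */
typedef struct ToyState
{
    bool all_visible_in_memory; /* flag on the in-memory page */
    bool all_visible_on_disk;   /* flag on the on-disk copy */
    bool wal_record_written;    /* has the change been WAL-logged? */
    bool content_lock_held;     /* client holds the buffer content lock */
} ToyState;

/* Returns true if the on-disk page ends up changed with no WAL record. */
bool
toy_race_leaves_unlogged_change(void)
{
    ToyState s = { true, true, false, false };

    /* Step 1: client locks the page and clears the flag; WAL not yet written. */
    s.content_lock_held = true;
    s.all_visible_in_memory = false;

    /* Steps 2-3: postmaster dies; the exiting client releases all its locks. */
    s.content_lock_held = false;

    /* Steps 0 and 4: checkpointer was waiting on the lock and now flushes. */
    if (!s.content_lock_held)
        s.all_visible_on_disk = s.all_visible_in_memory;

    return !s.all_visible_on_disk && !s.wal_record_written;
}
```

In this model the function returns true: the cleared flag reaches disk although no WAL record describes it, so a standby replaying WAL never learns about the change.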

Re: VM corruption on standby

2025-08-09 Thread Kirill Reshke
On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev wrote: > Perhaps there was a good > reason to update the VM *before* creating WAL records I'm unaware of. Looks like 503c730 intentionally does it this way; however, I have not yet fully understood the reasoning behind it. -- Best regards, Kir

Re: VM corruption on standby

2025-08-09 Thread Andrey Borodin
> On 9 Aug 2025, at 18:28, Andrey Borodin wrote: > > Also I investigated that in a moment of kill -9 checkpointer flushes heap > page to disk despite content lock. I haven't found who released content lock > though. I've written this message and understood: it's LWLockReleaseAll(). 0. check

Re: VM corruption on standby

2025-08-09 Thread Andrey Borodin
> On 7 Aug 2025, at 19:36, Aleksander Alekseev wrote: > > Maybe I misunderstood the intent of the test. You understood precisely my intent of writing the test. But it fails not because of the bug I anticipated! So far I noticed that if I move injection point before PageClearAllVisible(BufferGetPa

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
Hi Andrey, > the test passes because you moved injection point to a very safe position > [...] > I want to emphasize that it seems to me that position of injection point is > not a hint, but rather coincidental. Well, I wouldn't say that the test passes merely because the location of the injecti

Re: VM corruption on standby

2025-08-07 Thread Andrey Borodin
> On 7 Aug 2025, at 17:09, Aleksander Alekseev wrote: > > If my understanding is correct, we should make a WAL record with the > XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within > the same critical section (in order to avoid race conditions within > the same backend). Wel

Re: VM corruption on standby

2025-08-07 Thread Andrey Borodin
> On 7 Aug 2025, at 18:54, Andrey Borodin wrote: > > moved injection point to a very safe position. BTW, your fix also fixes ALL_FROZEN stuff, just because WAL for heap insert is already emitted by the time of -9. I want to emphasize that it seems to me that position of injection point is n

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
Hi, > If my understanding is correct, we should make a WAL record with the > XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within > the same critical section [...] > > A draft patch is attached. It makes the test pass and doesn't seem to > break any other tests. > > Thoughts? In

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
> > This is a tricky bug. Do you also have a proposal of a particular fix? > > If my understanding is correct, we should make a WAL record with the > XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within > the same critical section (in order to avoid race conditions within > the sam

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
Hi again, > I meant instance, not backend. Sorry for confusion. It looks like I completely misunderstood what START_CRIT_SECTION() / END_CRIT_SECTION() are for here. Simply ignore this part :) Apologies for the noise.

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
Hi, > This is a tricky bug. Do you also have a proposal of a particular fix? If my understanding is correct, we should make a WAL record with the XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within the same critical section (in order to avoid race conditions within the same back

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
Hi Andrey, > I was reviewing the patch about removing xl_heap_visible and found the VM\WAL > machinery very interesting. > At Yandex we had several incidents with corrupted VM and on pgconf.dev > colleagues from AWS confirmed that they saw something similar too. > So I toyed around and accidenta

VM corruption on standby

2025-08-06 Thread Andrey Borodin
Hi hackers! I was reviewing the patch about removing xl_heap_visible and found the VM\WAL machinery very interesting. At Yandex we had several incidents with corrupted VM and on pgconf.dev colleagues from AWS confirmed that they saw something similar too. So I toyed around and accidentally wrote