Hi hackers!

I was reviewing the patch that removes xl_heap_visible and found the VM/WAL
machinery very interesting.
At Yandex we have had several incidents with a corrupted VM, and at pgconf.dev
colleagues from AWS confirmed that they have seen something similar too.
So I toyed around and accidentally wrote a test that reproduces $subj.

I think the corruption happens as follows:
0. we create a table with one frozen tuple
1. the next heap_insert() clears the VM bit and then stalls; nothing has been
WAL-logged yet (see the sketch after this list)
2. the VM buffer is flushed to disk by the checkpointer or bgwriter
3. the primary is killed with -9
now the page is ALL_VISIBLE/ALL_FROZEN on the standby, but the VM bits are
clear on the primary
4. a subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL
record
5. pg_visibility detects the corruption
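
For reference, the window in steps 1-3 corresponds roughly to this part of
heap_insert() (paraphrased from heapam.c, details elided, not verbatim):

    START_CRIT_SECTION();

    /* ... put the tuple on the heap page ... */

    if (PageIsAllVisible(BufferGetPage(buffer)))
    {
        all_visible_cleared = true;
        PageClearAllVisible(BufferGetPage(buffer));
        /* dirties the VM page without bumping its LSN */
        visibilitymap_clear(relation,
                            ItemPointerGetBlockNumber(&(heaptup->t_self)),
                            vmbuffer, VISIBILITYMAP_VALID_BITS);
    }

    /* step 1 stalls here: the VM buffer is already dirty and unlocked,
     * so the checkpointer/bgwriter can write it out (step 2) before
     * the insert is WAL-logged below */

    if (RelationNeedsWAL(relation))
        recptr = XLogInsert(RM_HEAP_ID, info);

    END_CRIT_SECTION();

As far as I can tell, since visibilitymap_clear() does not set the VM page's
LSN, nothing forces the WAL out before the dirty VM page can be written.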

Interestingly, in an off-list conversation Melanie explained to me how
ALL_VISIBLE is protected from this: WAL-logging depends on the PD_ALL_VISIBLE
heap page bit, not on the state of the VM. But for ALL_FROZEN this is not the
case:

    /* Clear only the all-frozen bit on visibility map if needed */
    if (PageIsAllVisible(page) &&
        visibilitymap_clear(relation, block, vmbuffer,
                            VISIBILITYMAP_ALL_FROZEN))
        cleared_all_frozen = true;  /* won't happen: the VM buffer was
                                     * flushed to disk before the crash */
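
Note the asymmetry: for ALL_VISIBLE the WAL flag is derived from
PageIsAllVisible(), i.e. from the heap page itself, while cleared_all_frozen
is derived from the return value of visibilitymap_clear(), i.e. from VM state
that may already be on disk. One possible direction, as an untested sketch
only:

    /* Hypothetical sketch: always log the clearing when the heap page
     * is all-visible, even if the primary's VM bit is already clear
     * because the VM page was flushed before the crash. */
    if (PageIsAllVisible(page))
    {
        visibilitymap_clear(relation, block, vmbuffer,
                            VISIBILITYMAP_ALL_FROZEN);
        cleared_all_frozen = true;
    }

Redundantly clearing an already-clear bit during replay should be harmless,
which is presumably what makes the PD_ALL_VISIBLE-based logic safe for
ALL_VISIBLE.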

Anyway, the test reproduces corruption of both bits, and it also reproduces
selecting deleted data on the standby.

The test is not intended to be committed once we fix the problem, so some
waits are simulated with sleep(1), and the test is placed in
modules/test_slru, where it was easier to write. But if we ever want
something like this, I can design a less hacky, and probably more generic,
version.

Thanks!


Best regards, Andrey Borodin.

Attachment: v1-0001-Corrupt-VM-on-standby.patch

