> On Mar 3, 2026, at 07:38, Melanie Plageman <[email protected]> wrote:
>
> On Fri, Feb 20, 2026 at 12:59 PM Andres Freund <[email protected]> wrote:
>>
>> On 2026-01-28 18:16:10 -0500, Melanie Plageman wrote:
>>
>>> I could see an argument for moving identify_and_fix_vm_corruption()
>>> out of the helper and into heap_page_prune_and_freeze() but then we'd
>>> have to move visibilitymap_get_status() out too. And that takes away a
>>> lot of the benefit of encapsulating all that logic.
>>
>> I was wondering about that option. Relatedly, I also was wondering if we
>> ought
>> to do identify_and_fix_vm_corruption() regardless of ->attempt_update_vm.
>
> Attached v35 does this. I always pin the vmbuffer if we are going to
> prune in heap_page_prune_opt(). In many cases, because it's saved in
> the scan descriptor, it won't actually need to take a new pin. During
> pruning, I check for VM corruption even if I am not considering
> setting the VM.
>
>>> Well, after this patch set, clearing the VM does happen before we emit
>>> WAL for pruning.
>>
>> That I think is a substantial improvement, the current (i.e. before your
>> series) placement really is pretty insane due to the guaranteed divergence it
>> causes.
>>
>> I wonder if we actually should just force an FPI whenever we detect such
>> corruption, that way it would reliably be fixed on the standby as well.
>
> Only problem is we would have to do an FPI of the VM page as well if
> we wanted the corruption to be reliably fixed on the standby.
>
>>> It wouldn't be hard to move the corruption fixups to the beginning of
>>> heap_page_prune_and_freeze() in the new code structure.
>>
>> As identify_and_fix_vm_corruption() needs lpdead_items, I'm not sure that's
>> true?
>>
>> I wonder if at least the warning for the "(PageIsAllVisible(heap_page) &&
>> nlpdead_items > 0)" test should be moved to
>> heap_prune_record_dead_or_unused(). That way the WARNING could include the
>> offset number and it'd also work in the mark_unused_now case.
>>
>> Perhaps it also should trigger for RECENTLY_DEAD, INSERT_IN_PROGRESS,
>> DELETE_IN_PROGRESS?
>>
>> At that point the !page_all_visible && vm_all_visible part could indeed be
>> moved to the start of heap_page_prune_and_freeze()
>
> I've done all this. There is a heap page/VM corruption check at the
> beginning of heap_page_prune_and_freeze(), and then corruption is checked
> for during pruning in the previously covered case (lpdead items) as well
> as the mark_unused_now case and
> RECENTLY_DEAD/INSERT_IN_PROGRESS/DELETE_IN_PROGRESS.
>
>>> Would it be worth it? What benefit would we get? Do you just feel that it
>>> should logically come first?
>>
>> One insanity is that right now we will process all frozen pages over and over
>> due to the skip pages threshold, wasting a *lot* of CPU and memory bandwidth.
>> It'd be quite defensible to just skip processing the page once we determined
>> it's already all frozen. But for that we'd probably want to do the
>> "page_all_visible && vm_all_visible" check before returning...
>
> I've added a fast path to bypass pruning/freezing when the page is
> already all-visible. And I check for page_all_visible && vm_all_visible
> beforehand. The one downside is that if a page is marked all-frozen but
> has dead tuples on it, we'll never get to fix that corruption nor clean
> up the dead tuples. But the fast path seems worth it to me.
>
>>>> Do we actually foresee a case where only one of HEAP_PAGE_PRUNE_FREEZE |
>>>> HEAP_PAGE_PRUNE_UPDATE_VM would be set?
>>>
>>> Yes, when setting the VM on-access, it is too expensive to call
>>> heap_prepare_freeze_tuple() on each tuple. I could work on trying to
>>> optimize it, but it isn't currently viable.
>>
>> Is it too expensive to do so even when we already decided to do some pruning?
>> I am not surprised it's too expensive when there's not even a dead tuple on
>> the page. But I am mildly surprised if it's too expensive to do when we'd
>> WAL
>> log anyway?
>
> It's not really possible in the current code structure to only call
> heap_prepare_freeze_tuple() when there are at least some prunable
> tuples. We go through the line pointers and record them as prunable at
> the same time we call heap_prepare_freeze_tuple(), so we won't know
> until we've examined all line pointers that there are no prunable
> tuples, at which point we will have called heap_prepare_freeze_tuple()
> for every tuple.
>
>>> I think using all_frozen_except_dead while maintaining
>>> visibility_cutoff_xid (in heap_prune_record_unchanged_lp_normal()) has
>>> the potential to be confusing, though. We'd need to keep updating
>>> visibility_cutoff_xid when all_visible is false but
>>> all_frozen_except_dead is true as well as when all_visible is true.
>>> And because we don't care about all_visible_except_dead, it gets even
>>> more confusing to make sure we are maintaining the right variables in
>>> the right situations.
>>
>> I suspect we should just track all of the horizons/cutoffs all the time. This
>> whole stuff about optimizing out a few conditional assignments complicates
>> the
>> code substantially and feels extremely error prone to me.
>
> I've done this in v35. I posted the freeze horizon tracking patch
> separately in [1] but it is in v35 as 0004. Tracking the newest live
> xid is in 0009. This also always tracks all_visible for all callers
> since I unconditionally pass the vmbuffer now. I still don't set the
> VM if the query is modifying the relation, though.
>
>> I probably complained about this before, and it's not this patch's fault, but
>> PruneState->{all_visible,all_frozen} are imo confusingly named, due to
>> sounding like they describe the current state, rather than the possible state
>> after pruning. It's not helped by this comment:
>>
>> * NOTE: all_visible and all_frozen initially don't include LP_DEAD
>> items.
>> * That's convenient for heap_page_prune_and_freeze() to use them to
>> * decide whether to opportunistically freeze the page or not. The
>> * all_visible and all_frozen values ultimately used to set the VM are
>> * adjusted to include LP_DEAD items after we determine whether or
>> not to
>> * opportunistically freeze.
>>
>> "all-visible ... are adjusted to include LP_DEAD" ... - just reading that
>> it's
>> hard to know what it means.
>
> 0003 does the rename.
>
>> The first thing to improve pruning performance that I would do is to
>> introduce
>> a fastpath for pages that a) are already frozen b) do not have dead items
>> (if
>> we're not freezing). Iterating through HOT chains is far from cheap, and if
>> all rows are live, there's not really a point in doing so. This is
>> particularly important for VACUUMs where we end up freezing a ton of pages
>> that
>> are already frozen, due to the silly skip_pages_threshold thing.
>
> 0007 adds a fast path.
>
>>> +static TransactionId
>>> +get_conflict_xid(bool do_prune, bool do_freeze, bool do_set_vm,
>>> +				 uint8 old_vmbits, uint8 new_vmbits,
>>> +				 TransactionId latest_xid_removed,
>>> +				 TransactionId frz_conflict_horizon,
>>> +				 TransactionId visibility_cutoff_xid)
>>> +{
>>> + TransactionId conflict_xid;
>>> +
>>> + /*
>>> + * We can omit the snapshot conflict horizon if we are not pruning or
>>> + * freezing any tuples and are setting an already all-visible page
>>> + * all-frozen in the VM.
>>
>> Maybe mention when this can happen, because it's not immediately obvious.
>
> I've added this to my TODO. I honestly can't think of a scenario where
> it can happen. But I remember spending quite a bit of time thinking
> about it on another occasion. The current code (in master) does
> specifically account for this scenario, which is why I kept the logic,
> but I'm not sure how it can happen.
>
> I made all the other changes to specific comments you mentioned in
> your mail but I won't bore you with itemization.
>
>>> if (do_set_vm)
>>> conflict_xid = visibility_cutoff_xid;
>>> else if (do_freeze)
>>> conflict_xid = frz_conflict_horizon;
>>> else
>>> conflict_xid = InvalidTransactionId;
>>
>> Could it be worth checking that if (do_set_vm && do_freeze) the
>> frz_conflict_horizon won't be "violated" by using visibility_cutoff_xid instead?
>
> Yes, as you mentioned off-list, this wasn't right. New code is like this
>
> TransactionId conflict_xid = InvalidTransactionId;
> ...
> if (do_set_vm)
>     conflict_xid = newest_live_xid;
> if (do_freeze && TransactionIdFollows(newest_frozen_xid, conflict_xid))
>     conflict_xid = newest_frozen_xid;
>
>>> From 8d350868206456f631883a40a955dff480e408d3 Mon Sep 17 00:00:00 2001
>>> From: Melanie Plageman <[email protected]>
>>> Date: Wed, 17 Dec 2025 16:51:05 -0500
>>> Subject: [PATCH v34 09/14] Use GlobalVisState in vacuum to determine page
>>> level visibility
>>>
>>> [...]
>>>
>>> Because comparing a transaction ID against GlobalVisState is more
>>> expensive than comparing against a single XID, we defer this check until
>>> after scanning all tuples on the page.
>>
>> Curious, is this a precaution or was this a measurable bottleneck?
>
> I did see GlobalVisTestXidMaybeRunning() in a profile I did when it
> was still called for every HEAPTUPLE_LIVE tuple in
> heap_prune_record_unchanged_lp_normal(), but I don't have the profile
> or test case around anymore.
>
> However, since I now unconditionally maintain the newest_live_xid,
> moving GlobalVisTestXidMaybeRunning() back into
> heap_prune_record_unchanged_lp_normal() wouldn't help us avoid any
> work. It would just make the values of prstate.set_all_visible and
> prstate.set_all_frozen more accurate sooner. But I don't think it's
> worth the extra function call since set_all_frozen and set_all_visible
> won't be totally "done" until after we decide whether or not to
> opportunistically freeze anyway.
>
>>> @@ -1077,6 +1078,24 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
>>> prune_freeze_plan(RelationGetRelid(params->relation),
>>> buffer, &prstate, off_loc);
>>>
>>> +	/*
>>> +	 * After processing all the live tuples on the page, if the newest xmin
>>> +	 * amongst them may be considered running by any snapshot, the page
>>> +	 * cannot be all-visible.
>>> +	 */
>>> +	if (prstate.all_visible &&
>>> +		TransactionIdIsNormal(prstate.visibility_cutoff_xid) &&
>>
>> Any reason to test IsNormal rather than just IsValid()? There should never
>> be
>> a reason it's a valid but not "normal" xid, right?
>
> Well the reason I did this was that the existing code in master
> tracking visibility_cutoff_xid only advances it if
> TransactionIdIsNormal(). I'm a bit confused about it too because it
> seems like we would still want to do it for bootstrap mode xids. But I
> see PageSetPrunable() only allows normal xids.
>
>>> @@ -1794,28 +1812,15 @@ heap_prune_record_unchanged_lp_normal(Page page, PruneState *prstate, OffsetNumb
>>> 				}
>>>
>>> 				/*
>>> -				 * The inserter definitely committed. But is it old enough
>>> -				 * that everyone sees it as committed? A FrozenTransactionId
>>> -				 * is seen as committed to everyone. Otherwise, we check if
>>> -				 * there is a snapshot that considers this xid to still be
>>> -				 * running, and if so, we don't consider the page all-visible.
>>> +				 * The inserter definitely committed. But we don't know if it
>>> +				 * is old enough that everyone sees it as committed. Later,
>>> +				 * after processing all the tuples on the page, we'll check if
>>> +				 * there is any snapshot that still considers the newest xid
>>> +				 * on the page to be running. If so, we don't consider the
>>> +				 * page all-visible.
>>> 				 */
>>> 				xmin = HeapTupleHeaderGetXmin(htup);
>>>
>>> -				/*
>>> -				 * For now always use prstate->cutoffs for this test, because
>>> -				 * we only update 'all_visible' and 'all_frozen' when freezing
>>> -				 * is requested. We could use GlobalVisTestIsRemovableXid
>>> -				 * instead, if a non-freezing caller wanted to set the VM bit.
>>> -				 */
>>> -				Assert(prstate->cutoffs);
>>> -				if (!TransactionIdPrecedes(xmin, prstate->cutoffs->OldestXmin))
>>> -				{
>>> -					prstate->all_visible = false;
>>> -					prstate->all_frozen = false;
>>> -					break;
>>> -				}
>>> -
>>> 				/* Track newest xmin on page. */
>>> 				if (TransactionIdFollows(xmin, prstate->visibility_cutoff_xid) &&
>>> 					TransactionIdIsNormal(xmin))
>>
>> Kinda wonder if this code should be in something like
>> heap_prune_record_freezable() or such, rather than be inside
>> heap_prune_record_unchanged_lp_normal().
>
> I played around with it, but it all felt a bit awkward. I wrote it
> down for a future enhancement idea.
>
>>> Subject: [PATCH v34 10/14] Unset all_visible sooner if not freezing
>>>
>>> In the prune/freeze path, we currently delay clearing all_visible and
>>> all_frozen in the presence of dead items to allow opportunistic
>>> freezing.
>>>
>>> However, if no freezing will be attempted, there’s no need to delay.
>>> Clearing the flags earlier avoids extra bookkeeping in
>>> heap_prune_record_unchanged_lp_normal(). This currently has no runtime
>>> effect because all callers that consider setting the VM also prepare
>>> freeze plans, but upcoming changes will allow on-access pruning to set
>>> the VM without freezing. The extra bookkeeping was noticeable in a
>>> profile of on-access VM setting.
>>
>> What workload was that?
>
> It was a select * offset all query with a few fat tuples on each page
> and none of them prunable. I'm planning on digging up the
> case/creating a new one to see if it is reproducible. This was with an
> older version of the code that had more conditionals as well. This
> commit is actually dropped in v35 because I now always keep
> newest_live_xid up-to-date (0009) which means unsetting
> set_all_visible sooner has no benefit.
>
>> Theoretically, even if we don't freeze, the page still may be all-visible or
>> all frozen after the removal of dead items, no? Practically that won't
>> happen,
>> because we don't remove dead items in any of the relevant paths, but from the
>> commit message and comments that's not entirely clear.
>
> Yea, it's clearer with the commit dropped.
>
>>> @@ -678,6 +678,12 @@ typedef struct EState
>>> *
>>> ExecDoInitialPruning() */
>>> const char *es_sourceText; /* Source text from QueryDesc */
>>>
>>> + /*
>>> + * RT indexes of relations modified by the query through a
>>> + * UPDATE/DELETE/INSERT/MERGE or targeted by a SELECT FOR UPDATE.
>>> + */
>>> + Bitmapset *es_modified_relids;
>>> +
>>
>> Other EState fields are initialized in CreateExecutorState, this isn't
>> afaict?
>
> Oops, yes. I based it on es_unpruned_relids which wasn't initialized
> there either. I've added a commit (0013) to initialize a few EState
> fields that weren't initialized in CreateExecutorState() as well.
>
>> Wonder if it's worth adding a crosscheck somewhere, verifying that if a
>> relation is modified, it's in es_modified_relids. Otherwise this could very
>> well silently get out of date.
>
> Done in v35 (0014).
>
>> Also, there's some overlap between the information collected this way, and
>> AcquireExecutorLocks(), ScanQueryForLocks(), which determine the needed lock
>> modes via rte->rellockmode.
>
> Those are in parser/planner, so it doesn't seem like a good fit. I
> populate es_modified_relids in the executor.
>
> I don't know exactly what the overlap would be between RTEs with an
> exclusive rellockmode and es_modified_relids. It seems like you could
> have RTEs which don't end up getting modified that have a lock level
> that would have made you think that they would be modified.
>
> But were you imagining a substitution or a cross-check?
>
>>> From 8205b2d7da0c3ad3cbc5cead336ced677996b37d Mon Sep 17 00:00:00 2001
>>> From: Melanie Plageman <[email protected]>
>>> Date: Wed, 3 Dec 2025 15:12:18 -0500
>>> Subject: [PATCH v34 12/14] Pass down information on table modification to
>>> scan
>>> node
>>
>> Perhaps worth splitting up, so the addition of the 0 flag is separate from
>> the read-only hint aspect.
>
> Done.
>
> [1]
> https://www.postgresql.org/message-id/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com
> <v35-0001-Move-commonly-used-context-into-PruneState-and-s.patch><v35-0002-Add-PageGetPruneXid-helper.patch><v35-0003-Rename-PruneState-all_visible-all_frozen.patch><v35-0004-Use-the-newest-to-be-frozen-xid-as-the-conflict-.patch><v35-0005-Save-vmbuffer-in-heap-specific-scan-descriptors-.patch><v35-0006-Fix-visibility-map-corruption-in-more-cases.patch><v35-0007-Add-pruning-fast-path-for-all-visible-and-all-fr.patch><v35-0008-Use-GlobalVisState-in-vacuum-to-determine-page-l.patch><v35-0009-Keep-newest-live-XID-up-to-date-even-if-page-not.patch><v35-0010-Eliminate-XLOG_HEAP2_VISIBLE-from-vacuum-phase-I.patch><v35-0011-Eliminate-XLOG_HEAP2_VISIBLE-from-empty-page-vac.patch><v35-0012-Remove-XLOG_HEAP2_VISIBLE-entirely.patch><v35-0013-Initialize-missing-fields-in-CreateExecutorState.patch><v35-0014-Track-which-relations-are-modified-by-a-query.patch><v35-0015-Make-begin_scan-functions-take-a-flags-argument.patch><v35-0016-Pass-down-information-on-table-modification-to-s.patch><v35-0017-Allow-on-access-pruning-to-set-pages-all-visible.patch><v35-0018-Set-pd_prune_xid-on-insert.patch>
1 - 0001
```
+prune_freeze_plan(PruneState *prstate, OffsetNumber *off_loc)
{
-	Page page = BufferGetPage(buffer);
-	BlockNumber blockno = BufferGetBlockNumber(buffer);
-	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+	Page page = prstate->page;
+	BlockNumber blockno = prstate->block;
+	OffsetNumber maxoff = PageGetMaxOffsetNumber(prstate->page);
```
As there is a local “page”, maybe just use the local one for
PageGetMaxOffsetNumber.
0002 looks good.
2 - 0003 - Does it make sense to also do the same renaming in PruneFreezeResult?
3 - 0004
```
-
-	/*
-	 * Calculate what the snapshot conflict horizon should be for a record
-	 * freezing tuples. We can use the visibility_cutoff_xid as our cutoff
-	 * for conflicts when the whole page is eligible to become all-frozen
-	 * in the VM once we're done with it. Otherwise, we generate a
-	 * conservative cutoff by stepping back from OldestXmin.
-	 */
-	if (prstate->set_all_frozen)
-		prstate->frz_conflict_horizon = prstate->visibility_cutoff_xid;
-	else
-	{
-		/* Avoids false conflicts when hot_standby_feedback in use */
-		prstate->frz_conflict_horizon = prstate->cutoffs->OldestXmin;
-		TransactionIdRetreat(prstate->frz_conflict_horizon);
-	}
+	Assert(TransactionIdPrecedesOrEquals(prstate->pagefrz.FreezePageConflictXid,
+										 prstate->cutoffs->OldestXmin));
```
At this point of the Assert, can prstate->pagefrz.FreezePageConflictXid be
InvalidTransactionId? My understanding is no; in that case, would it make
sense to also Assert(prstate->pagefrz.FreezePageConflictXid !=
InvalidTransactionId)? Otherwise, if prstate->pagefrz.FreezePageConflictXid
can still be InvalidTransactionId, then the Assert should be changed to
something like:

Assert(prstate->pagefrz.FreezePageConflictXid == InvalidTransactionId ||
       TransactionIdPrecedesOrEquals(prstate->pagefrz.FreezePageConflictXid,
                                     prstate->cutoffs->OldestXmin));
I will continue with 0005 tomorrow.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/