Hi,

Attaching v2, rebased on top of current master (v1 no longer applied
cleanly per cfbot).  Two improvements were also made since v1:

1. Test stability
   The TAP test (053_hs_feedback_catalog_xmin.pl) replaced a sleep(2)
   with a poll on pg_stat_replication.reply_time, so it deterministically
   waits for a fresh hot standby feedback round to reach the primary
   instead of relying on wall-clock time.  This removes the most obvious
   source of flakiness on slower CI runners.

2. Performance impact (pgbench)
   I was concerned about adding a UINT32_ACCESS_ONCE(proc->catalog_xmin)
   read and an extra TransactionIdOlder() call per PGPROC iteration in
   ComputeXidHorizons(), which is on a hot path.  Measured on a 104-core
   box, scale=30, 64 clients / 32 jobs / 60s, -M prepared, optimized
   build (CFLAGS=-O2, no --enable-cassert):

   workload                 baseline median   patched median   delta
   ----------------------   ---------------   --------------   -----
   pgbench -S (read-only)        1,596,196        1,617,083    +1.3%
   pgbench    (read-write)         129,550          132,024    +1.9%

   Numbers are medians of 3 warm runs after a 30s warmup, after I
   discarded the cold-start runs which were dominated by cache priming.
   The patched build comes out marginally above baseline, which is of
   course noise -- adding code does not make things faster -- but it
   confirms the change is below the measurement floor on this hardware.
   Happy to rerun on different shapes (more clients, smaller scale,
   different machine) if anyone wants to see specific numbers.

Summary of the change (unchanged in substance from v1):

  - proc.h: add catalog_xmin field to PGPROC (4 bytes)
  - proc.c: initialize it in InitProcess / InitAuxiliaryProcess
  - procarray.c: accumulate proc_catalog_xmin in ComputeXidHorizons()
    and apply it only to catalog_oldest_nonremovable and
    shared_oldest_nonremovable; include it in GetReplicationHorizons()
    so the catalog_xmin propagates correctly in cascading setups
  - walsender.c: in the no-slot path of ProcessStandbyHSFeedbackMessage(),
    set MyProc->xmin and MyProc->catalog_xmin separately instead of
    folding them together

Registered in the 2026-07 commitfest:
  https://commitfest.postgresql.org/patch/6861/

Adding Andres (who wrote much of the ProcArray horizon code) and Amit
Kapila (logical replication / slots), in case you have any thoughts on
the approach here -- in particular whether tracking catalog_xmin in
PGPROC is preferable to generalizing the ephemeral-slot idea hinted at
by the existing XXX comment in ProcessStandbyHSFeedbackMessage().

Regards,
Rui Zhao

Attachment: v2-0001-Separate-catalog_xmin-from-xmin-in-walsender-hot-.patch
Description: Binary data


> 2026年4月30日 16:29,Rui Zhao <[email protected]> 写道:
> 
> Hi hackers,
> 
> I'd like to propose a fix for a long-standing issue where hot standby
> feedback catalog_xmin incorrectly holds back vacuuming of user data
> tables on the primary when no physical replication slot is used.
> 
> == Problem ==
> 
> When a standby sends hot standby feedback to a primary without a
> physical replication slot, ProcessStandbyHSFeedbackMessage() takes
> min(feedbackCatalogXmin, feedbackXmin) and stores it into
> MyProc->xmin:
> 
>    if (TransactionIdIsNormal(feedbackCatalogXmin)
>        && TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
>        MyProc->xmin = feedbackCatalogXmin;
>    else
>        MyProc->xmin = feedbackXmin;
> 
> Since ComputeXidHorizons() treats proc->xmin uniformly for both data
> and catalog horizons, the catalog_xmin ends up holding back
> data_oldest_nonremovable, preventing vacuum from cleaning dead tuples
> in regular user tables.
> 
> The existing code even acknowledges this limitation:
> 
>    "We can only track the catalog xmin separately when using a slot,
>     so we store the least of the two provided when not using a slot."
> 
> == Why this matters ==
> 
> One might argue "just use a replication slot."  However, many
> production HA deployments intentionally avoid physical replication
> slots because of their lifecycle management complexity:
> 
> - When a primary fails, physical slots on the old primary are lost
>  and cannot be automatically migrated to the promoted standby.
> - Other standbys that were using slots on the old primary must
>  re-establish their slots on the new primary, potentially requiring
>  a fresh base backup.
> - Dangling slots from disconnected standbys can cause unbounded WAL
>  accumulation until manually dropped.
> 
> These deployments use wal_keep_size or WAL archiving for WAL
> retention, combined with hot_standby_feedback for visibility horizon
> management.  This is a legitimate production configuration -- for
> example, some HA frameworks (Patroni with certain configurations,
> custom HA scripts) operate this way.
> 
> The issue becomes severe when the standby also hosts a logical
> replication slot (e.g., for change data capture or logical replication
> to a downstream).  The logical slot's catalog_xmin can be very old
> (retained for logical decoding catalog access), and this old value
> gets propagated to the primary's walsender via hot standby feedback,
> blocking vacuum on ALL user data tables on the primary.  This leads
> to table bloat that is difficult to diagnose since the DBA may not
> realize the connection between a standby's logical slot and the
> primary's vacuum behavior.
> 
> == Fix ==
> 
> The patch adds a catalog_xmin field to PGPROC (4 bytes), so the
> walsender can track catalog_xmin separately from xmin even without a
> replication slot.  This mirrors how replication slots already separate
> slot->data.xmin from slot->data.catalog_xmin.
> 
> In ComputeXidHorizons(), the new proc_catalog_xmin is accumulated
> from PGPROC entries and applied only to catalog_oldest_nonremovable
> and shared_oldest_nonremovable -- exactly how slot_catalog_xmin is
> already handled.  It does NOT affect data_oldest_nonremovable.
> 
> GetReplicationHorizons() is updated to include proc_catalog_xmin in
> the catalog_xmin sent upstream, ensuring correct behavior in
> cascading standby configurations.
> 
> Changes summary:
> - proc.h: add catalog_xmin to PGPROC
> - proc.c: initialize catalog_xmin in InitProcess/InitAuxiliaryProcess
> - procarray.c: accumulate and apply proc_catalog_xmin in
>  ComputeXidHorizons(); include in GetReplicationHorizons()
> - walsender.c: set MyProc->xmin and MyProc->catalog_xmin separately
>  in the no-slot path of ProcessStandbyHSFeedbackMessage()
> 
> == Alternatives considered ==
> 
> 1. Generalize the ephemeral slot concept (as suggested by the existing
>   XXX comment): this would automatically create a temporary slot for
>   slot-less walsenders.  More invasive, requires slot allocation
>   (max_replication_slots), and adds slot lifecycle management.
> 
> 2. Simply ignore catalog_xmin in the no-slot path: simpler but loses
>   catalog protection for the standby's logical decoding.
> 
> The proposed approach is minimal, correct, and consistent with how
> slots already handle the separation.
> 
> == Testing ==
> 
> A new TAP test (053_hs_feedback_catalog_xmin.pl) verifies:
> 1. With hot_standby_feedback=on and no physical replication slot, when
>   the standby has a logical slot with an old catalog_xmin, VACUUM on
>   the primary can still clean dead tuples in user data tables.
> 2. The standby's logical slot catalog_xmin remains properly set,
>   confirming catalog protection is preserved.
> 
> Patch attached.
> 
> Regards,
> Rui Zhao
> 
> 

Attachment: v1-0001-Separate-catalog_xmin-from-xmin-in-walsender-hs-feedback.patch
Description: Binary data

> 
> 

Reply via email to