On 1/24/22 22:28, Bossart, Nathan wrote:
On 1/22/22, 4:43 PM, "Tomas Vondra" <tomas.von...@enterprisedb.com> wrote:
There's a bug in ProcArrayApplyRecoveryInfo, introduced by 8431e296ea,
which may cause failures when starting a replica, making it unusable.
The commit message for 8431e296ea is not very clear about what exactly
is being done and why, but the root cause is that at while processing
RUNNING_XACTS, the XIDs are sorted like this:

      /*
       * Sort the array so that we can add them safely into
       * KnownAssignedXids.
       */
      qsort(xids, nxids, sizeof(TransactionId), xidComparator);

where "safely" likely means "not violating the ordering expected by
KnownAssignedXidsAdd". Unfortunately, xidComparator compares the values
as plain uint32 values, while KnownAssignedXidsAdd actually calls
TransactionIdFollowsOrEquals() and compares the logical XIDs :-(

Wow, nice find.

This likely explains why we never got any reports about this - most
systems probably don't leave transactions running for this long, so the
probability is much lower. And replica restarts are generally not that
common events either.

I'm aware of one report with the same message [0], but I haven't read
closely enough to determine whether it is the same issue.  It looks
like that particular report was attributed to backup_label being
removed.


Yeah, I saw that thread too, and I don't think it's the same issue. As you say, it seems to be caused by the backup_label shenanigans, and there's also the RUNNING_XACTS message:

Sep 20 15:00:27 ... CONTEXT: xlog redo Standby/RUNNING_XACTS: nextXid 38585 latestCompletedXid 38571 oldestRunningXid 38572; 14 xacts: 38573 38575 38579 38578 38574 38581 38580 38576 38577 38572 38582 38584 38583 38583

The XIDs don't cross the 4B boundary at all, so this seems unrelated.


Attached patch is fixing this by just sorting the XIDs logically. The
xidComparator is meant for places that can't do logical ordering. But
these XIDs come from RUNNING_XACTS, so they actually come from the same
wraparound epoch (so sorting logically seems perfectly fine).

The patch looks reasonable to me.


Thanks!


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply via email to