On 24/03/2025 16:56, Tomas Vondra wrote:
On 3/23/25 17:43, Heikki Linnakangas wrote:
On 21/03/2025 17:16, Andres Freund wrote:
Am I right in understanding that the only scenario (when in
STANDBY_SNAPSHOT_READY), where ExpireOldKnownAssignedTransactionIds()
would
"legally" remove a transaction, rather than the commit / abort records
doing
so, is if the primary crash-restarted while transactions were ongoing?
Those transactions won't have a commit/abort records and thus won't
trigger
ExpireTreeKnownAssignedTransactionIds(), which otherwise would have
updated
->xactCompletionCount?
Correct.
When writing the snapshot caching patch, I tried to make sure that all
the
places that maintain ->latestCompletedXid also update
->xactCompletionCount. Afaict that's still the case. Which, I think,
means
that we're also missing calls to MaintainLatestCompletedXidRecovery()?
Yep, I was just looking at that too.
If latestCompletedXid is incorrect visibility determinations end up
wrong...
I think it happens to work, because the transactions are effectively
aborted. latestCompletedXid is used to initialize xmax in
GetSnapshotData. If, for example, latestCompletedXid is incorrectly set
to 1000 even though XID 1001 already aborted, a snapshot with xmax=1000
still correctly considers XID 1001 as "not visible". As soon as a
transaction commits, latestCompletedXid is fixed.
AFAICS we could skip updating latestCompletedXid on aborts altogether
and rename it to latestCommittedXid. But it's hardly worth optimizing
for aborts.
For the same reason, I believe the assertion failure we're discussing
here is also harmless. Even though the reused snapshot has a smaller
xmin than expected, all those transactions aborted and are thus not
visible anyway.
In any case, let's be tidy and fix both latestCompletedXid and
xactCompletionCount.
Thanks for looking into this and pushing the fix.
Would it make sense to add a comment documenting this reasoning about
not handling aborts? Otherwise someone will get to rediscover this in
the future ...
Well, we're currently _not_ relying on the reasoning about aborts not
changing visibility. So it's not like someone will forget the reasoning
and have a bug as a result. I guess someone could rediscover that
reasoning and think that they can skip advancing those counters on
aborts as an optimization, re-introducing the assertion failure. But I
believe that was not why we had this bug, it was just a simple omission.
So the comment would look like this:
"In principle, we would not need to advance xactCompletionCount and
latestCompletedXid because aborting transactions don't change
visibility. But that could make xmin within a transaction go backwards,
if a snapshot with an older xmin but same xactCompletionCount is reused,
triggering the assertion in GetSnapshotDataReuse()."
If we add that, I guess it should go into ProcArrayEndTransaction()
which currently just notes that it's used for commits and aborts
interchangeably. Do you think it's worth it? Want to propose a patch?
--
Heikki Linnakangas
Neon (https://neon.tech)