Hi,

I dug into this a bit. What appears to be happening is that the page containing the latest cutoff xid can be falsely reported as being in the future of the latest page number, because the latest_page_number of the Serial SLRU is only set when a page is initialized [1].
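
To make the failure mode concrete, here is a minimal, self-contained sketch of a circular "precedes" check applied to a stale latest page number. It is a simplified model and not the code in slru.c or predicate.c: ENTRIES_PER_PAGE, page_of(), precedes(), truncate_refused() and check_model() are hypothetical names chosen for illustration, and the page size is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: arbitrary entries-per-page value, not PostgreSQL's real one. */
#define ENTRIES_PER_PAGE 1024u

/* Hypothetical helper: map an xid to the page that would hold its entry. */
static uint32_t page_of(uint32_t xid)
{
    return xid / ENTRIES_PER_PAGE;
}

/* Circular comparison on 32-bit counters: true if a is logically before b. */
static bool precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Simplified safety check: truncation is refused (and "apparent wraparound"
 * would be logged) when the recorded latest page looks older than the
 * cutoff page.
 */
static bool truncate_refused(uint32_t latest_page, uint32_t cutoff_page)
{
    return precedes(latest_page, cutoff_page);
}

/* Walks through the stale case and the case with the latest page advanced. */
static void check_model(void)
{
    uint32_t latest_page = 0;               /* only set at page initialization */
    uint32_t cutoff_page = page_of(5000u);  /* the xid has since moved on */

    /* Stale latest page: the cutoff appears to be in the future. */
    assert(truncate_refused(latest_page, cutoff_page));

    /* Advancing the latest page alongside the xid removes the false alarm. */
    latest_page = cutoff_page;
    assert(!truncate_refused(latest_page, cutoff_page));
}
```

Calling check_model() from a small main() exercises both cases. In this model the refusal comes purely from the bookkeeping lagging behind, with no real wraparound involved.
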
Under the right conditions, such as in the repro, the serializable xid has moved past the latest page number, so at the next checkpoint, which triggers CheckPointPredicate, it appears that the SLRU has wrapped around.

It seems that what is needed here is to advance latest_page_number in SerialSetActiveSerXmin when we are using the SLRU. See below:

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 1af41213b4..6946ed21b4 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -992,6 +992,9 @@ SerialSetActiveSerXmin(TransactionId xid)
 
 	serialControl->tailXid = xid;
 
+	if (serialControl->headPage > 0)
+		SerialSlruCtl->shared->latest_page_number = SerialPage(xid);
+
 	LWLockRelease(SerialSLRULock);
 }

[1] https://github.com/postgres/postgres/blob/master/src/backend/access/transam/slru.c#L306

Regards,

Sami

From: "Imseih (AWS), Sami" <sims...@amazon.com>
Date: Tuesday, August 22, 2023 at 7:56 PM
To: "pgsql-hack...@postgresql.org" <pgsql-hack...@postgresql.org>
Subject: False "pg_serial": apparent wraparound” in logs

Hi,

I recently encountered a situation in the field in which the message “could not truncate directory "pg_serial": apparent wraparound” was logged even though there was no danger of wraparound. This was on a brand new cluster, and it took only a few minutes for the message to appear in the logs.

Reading up on the history of this error message, it appears that work was done to improve SLRU truncation and the associated wraparound log messages [1]. The attached repro on master still shows that this message can be logged incorrectly.

The repro runs updates with 90 threads in serializable mode and kicks off a “long running” select on the same table in serializable mode. As soon as the long running select commits, the next checkpoint fails to truncate the SLRU and logs the error message.

Besides the confusing log message, there may also be a risk of pg_serial becoming unnecessarily bloated and exhausting disk space.

Is this a bug?

[1] https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com

Regards,

Sami Imseih
Amazon Web Services (AWS)