On 2020/05/22 13:51, Thomas Munro wrote:
On Tue, May 5, 2020 at 9:51 AM Thomas Munro <thomas.mu...@gmail.com> wrote:
On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <bossa...@amazon.com> wrote:
I believe I've discovered a race condition between the startup and
checkpointer processes that can cause a CRC mismatch in the pg_control
file. If a cluster crashes at the right time, the following error
appears when you attempt to restart it:
    FATAL:  incorrect checksum in control file
This appears to be caused by some code paths in xlog_redo() that
update ControlFile without taking the ControlFileLock. The attached
patch seems to be sufficient to prevent the CRC mismatch in the
control file, but perhaps this is a symptom of a bigger problem with
concurrent modifications of ControlFile->checkPointCopy.nextFullXid.
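For illustration, the interlocking being described would look something like
this in the XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} replay paths of xlog_redo()
(just an abbreviated sketch, not the actual patch):

    CheckPoint  checkPoint;

    memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));

    /*
     * ControlFile->checkPointCopy always tracks the latest checkpoint's
     * nextFullXid.  Holding ControlFileLock here keeps the checkpointer
     * from checksumming and writing pg_control while this field is being
     * changed underneath it.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid;
    LWLockRelease(ControlFileLock);
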
This does indeed look pretty dodgy. CreateRestartPoint() running in
the checkpointer does UpdateControlFile() to compute a checksum and
write it out, but xlog_redo() processing
XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without
interlocking. It looks like the ancestors of that line were there
since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran
UpdateControlFile() directly in the startup process (immediately after
that update), so there was no interlocking problem. Then in cdd46c76548 (2009),
RecoveryRestartPoint() was split up so that CreateRestartPoint() ran
in another process.
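For reference, the checkpointer side in CreateRestartPoint() looks roughly
like this (abbreviated and paraphrased from memory, not quoted from xlog.c);
note that it does take ControlFileLock, which doesn't help as long as the
startup process modifies the struct without taking the lock:

    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
    {
        ControlFile->checkPoint = lastCheckPointRecPtr;
        ControlFile->checkPointCopy = lastCheckPoint;
        /* ... other fields ... */

        /*
         * UpdateControlFile() computes the CRC over the in-memory struct
         * and then writes it out.  If xlog_redo() changes
         * checkPointCopy.nextFullXid concurrently, the bytes written to
         * disk may not match the CRC just computed, and a later startup
         * fails with "incorrect checksum in control file".
         */
        UpdateControlFile();
    }
    LWLockRelease(ControlFileLock);
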
Here's a version with a commit message added. I'll push this to all
releases in a day or two if there are no objections.
+1 to push the patch.
Per my quick check, XLogReportParameters() seems to have a similar issue,
i.e., it updates the control file without taking ControlFileLock.
Maybe we should fix this at the same time?
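Presumably something along these lines (abbreviated sketch of
XLogReportParameters() with the field list shortened, not a tested patch):

    /*
     * Wrap both the update of the shared ControlFile fields and the
     * write-out in ControlFileLock, so the checkpointer cannot checksum
     * and write pg_control halfway through the update.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    ControlFile->MaxConnections = MaxConnections;
    ControlFile->wal_level = wal_level;
    /* ... the other parameters recorded in pg_control ... */
    UpdateControlFile();
    LWLockRelease(ControlFileLock);
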
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION