Newly promoted primary may leave an invalid checkpoint.

In CreateRestartPoint(), the control file is updated and old WAL segments are 
removed. In some situations, however, the control file is not updated but old 
WAL segments are still removed. This leaves the control file pointing at a 
checkpoint whose WAL no longer exists. The telltale log messages are 
"invalid primary checkpoint record" and "could not locate a valid 
checkpoint record".




The following timeline reproduces the situation above:

tl1: The standby begins to create a restart point (triggered by time or by WAL volume).

tl2: The standby is promoted and the control file state is updated to 
DB_IN_PRODUCTION. The in-progress restart point therefore does not update the 
control file (xlog.c:9690), but old WAL segments are still removed 
(xlog.c:9719).

tl3: The standby becomes primary. If the primary crashes before the next 
checkpoint completes (OOM in my case), it will keep crashing on every restart 
because of the invalid checkpoint.




The attached patch reproduces this problem with a standard PostgreSQL Perl 
(TAP) test. You can run it with:

./configure --enable-tap-tests; make -j; make -C src/test/recovery/ check PROVE_TESTS=t/027_invalid_checkpoint_after_promote.pl

The attached patch also fixes the problem by ensuring that old WAL segments 
are removed only after the control file has been updated.
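
To make the ordering issue concrete, here is a small standalone C model of 
the race and of the fix. It is only a sketch and not the code from xlog.c or 
from the patch: the Cluster struct and the functions update_control_file, 
remove_old_wal, restart_point_buggy, restart_point_fixed and 
checkpoint_is_valid are hypothetical names made up for illustration, and WAL 
positions are reduced to plain integers. The only assumption taken from the 
report is that the control file is skipped once the state is no longer 
archive recovery, while WAL removal happens regardless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the control file states involved. */
typedef enum { DB_IN_ARCHIVE_RECOVERY, DB_IN_PRODUCTION } DBState;

/* Hypothetical model of the on-disk state that matters here. */
typedef struct {
    DBState  state;           /* state recorded in the control file */
    uint64_t checkpoint_redo; /* redo position of the checkpoint in the control file */
    uint64_t oldest_wal;      /* oldest WAL position still present on disk */
} Cluster;

/* Update the control file only while still in archive recovery.
 * Returns true if the control file was actually updated. */
static bool update_control_file(Cluster *c, uint64_t restart_redo)
{
    if (c->state == DB_IN_ARCHIVE_RECOVERY && c->checkpoint_redo < restart_redo) {
        c->checkpoint_redo = restart_redo;
        return true;
    }
    return false;
}

/* Drop all WAL older than 'upto'. */
static void remove_old_wal(Cluster *c, uint64_t upto)
{
    assert(upto >= c->oldest_wal); /* removal never moves backwards */
    c->oldest_wal = upto;
}

/* Reported behaviour: WAL is removed even if the control file was skipped. */
static void restart_point_buggy(Cluster *c, uint64_t restart_redo)
{
    (void) update_control_file(c, restart_redo);
    remove_old_wal(c, restart_redo);
}

/* Proposed ordering: remove WAL only after the control file was updated. */
static void restart_point_fixed(Cluster *c, uint64_t restart_redo)
{
    if (update_control_file(c, restart_redo))
        remove_old_wal(c, restart_redo);
}

/* The recorded checkpoint is usable only if its WAL is still on disk. */
static bool checkpoint_is_valid(const Cluster *c)
{
    return c->checkpoint_redo >= c->oldest_wal;
}
```

In this model, a cluster that has already switched to DB_IN_PRODUCTION at tl2 
ends up with checkpoint_redo behind oldest_wal after restart_point_buggy, 
which corresponds to the invalid checkpoint seen at tl3. With 
restart_point_fixed the WAL is kept, so a crash before the next checkpoint 
can still be recovered from.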

Attachment: 0001-Fix-primary-crash-continually-with-invalid-checkpoint-after-promote.patch
