Hello hackers!

It seems bgwriter running on replicas was broken by commit 8d68ee6, and
as a result bgwriter never updates minRecoveryPoint in pg_control.
Please see a detailed explanation below.

2018-08-29 22:54 GMT+02:00 Michael Paquier <mich...@paquier.xyz>:

> This is not a solution in my opinion, as you could invalidate activities
> of backends connected to the database when the incorrect consistent
> point allows connections to come in too early.

That's true, but I still think it is better than aborting the startup
process...

> What happens if you replay with hot_standby = on up to the latest point,
> without any concurrent connections, then issue a checkpoint on the
> standby once you got to a point newer than the complain, and finally
> restart the standby with the bgworker?
>
> Another idea I have would be to make the standby promote, issue a
> checkpoint on it, and then use pg_rewind as a trick to update the
> control file to a point newer than the inconsistency.  As PG < 9.6.10
> could make the minimum recovery point go backwards, applying the upgrade
> after the consistent point got to an incorrect state would trigger the
> failure.

Well, all these actions would probably help to relieve the symptoms and
replay WAL up to the point where it becomes really consistent.

I spent a long time trying to figure out how it could happen that some
of the pages were written to disk but pg_control wasn't updated, and I
think I managed to put all the pieces of the puzzle into a nice picture:

static void
UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
{
    /* Quick check using our local copy of the variable */
    if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
        return;

    /*
     * An invalid minRecoveryPoint means that we need to recover all the WAL,
     * i.e., we're doing crash recovery.  We never modify the control file's
     * value in that case, so we can short-circuit future checks here too. The
     * local values of minRecoveryPoint and minRecoveryPointTLI should not be
     * updated until crash recovery finishes.
     */
    if (XLogRecPtrIsInvalid(minRecoveryPoint))
    {
        updateMinRecoveryPoint = false;
        return;
    }

This code was changed in commit 8d68ee6 and it broke bgwriter.
Now bgwriter never updates pg_control when it flushes dirty pages to
disk.

How it happens: when bgwriter starts, its local minRecoveryPoint is not
initialized. If I attach with gdb, it shows minRecoveryPoint = 0, i.e.
the value is invalid. As a result, the first call of
UpdateMinRecoveryPoint from bgwriter sets updateMinRecoveryPoint to
false, and every subsequent call returns from the function at the very
first if. Bgwriter itself never sets updateMinRecoveryPoint back to
true, and boom: we can get a lot of pages written to disk, but
minRecoveryPoint in pg_control won't be updated! If the replica happens
to crash under such conditions, it reaches consistency much earlier than
it should!

Regards,
--
Alexander Kukushkin
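
P.S. To make the sequence above easier to follow, here is a small
standalone model of it. This is only a sketch and not PostgreSQL code:
the names ModelProcess, ModelControlFile and
model_update_min_recovery_point are hypothetical, XLogRecPtr is reduced
to a plain integer, and locking, timelines and the actual control file
write are left out. It assumes, as observed in gdb, that a freshly
started bgwriter has a local minRecoveryPoint of 0, and that
updateMinRecoveryPoint starts out as true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the real types (assumption, not PostgreSQL code). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical model of the per-process (backend-local) variables. */
typedef struct ModelProcess
{
    XLogRecPtr  minRecoveryPoint;       /* local copy, 0 if never initialized */
    bool        updateMinRecoveryPoint; /* local flag */
} ModelProcess;

/* Hypothetical model of the shared value stored in pg_control. */
typedef struct ModelControlFile
{
    XLogRecPtr  minRecoveryPoint;
} ModelControlFile;

/*
 * Mirrors the two early exits of the excerpt quoted in this mail.
 * Everything after them is a simplification: the real function takes a
 * lock, re-reads the control file and writes it out.
 */
void
model_update_min_recovery_point(ModelProcess *proc, ModelControlFile *cf,
                                XLogRecPtr lsn, bool force)
{
    /* Quick check using the process-local copy of the variable */
    if (!proc->updateMinRecoveryPoint ||
        (!force && lsn <= proc->minRecoveryPoint))
        return;

    /* Invalid local value: treated as crash recovery, updates disabled */
    if (proc->minRecoveryPoint == InvalidXLogRecPtr)
    {
        proc->updateMinRecoveryPoint = false;
        return;
    }

    /* Simplified update path: only ever move the stored value forward */
    if (force || cf->minRecoveryPoint < lsn)
        cf->minRecoveryPoint = lsn;
    proc->minRecoveryPoint = cf->minRecoveryPoint;
}
```

Calling this with a process initialized as { InvalidXLogRecPtr, true }
(the bgwriter case) leaves the control file value untouched no matter
how many LSNs are passed in, because the first call clears the flag and
all later calls return at the first check. A process whose local copy
was initialized from the control file (the startup process case)
advances the stored value as expected.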