On Sun, Oct 2, 2016 at 8:52 AM, Michael Paquier <michael.paqu...@gmail.com> wrote: >> So, if I understand correctly, then we can mark the version posted by >> you upthread [1] which includes a test along with Kyotaro's fix can be >> marked as Ready for committer. If so, then please change the status >> of patch accordingly. > > Patch moved to next CF 2016-11, still with status "ready for committer". > > IMO, this thread has reached as conclusion to use this patch to > address the problem reported: > cab7npqtv5gmkqcndofgtgqoqxz2xlz4rrw247oqojzztvy6...@mail.gmail.com
I spent some time reading this thread today and trying to understand exactly what's going on here. Here's my attempt to restate the problem: 1. If a restartpoint flushes no dirty buffers, then the redo location advances but the minimum recovery point does not; therefore, the minimum recovery point can fall behind the redo location. That's correct behavior, because if the standby is subsequently restarted, it can correctly begin at the checkpoint record associated with the restartpoint and need not replay any further. 2. However, it causes a problem when a base backup with the "include WAL" option is taken from a standby because -- on standby servers only -- the end location for a backup as returned by do_pg_stop_backup() is the minimum recovery point. If this precedes the start point for the backup -- which is the restart point -- perform_base_backup() will conclude that no WAL files are required for the backup and fail an internal sanity check. 3. Removing the sanity check wouldn't help, because it can never be right to end up with zero WAL files as a result of a base backup. At a minimum, we need the WAL file which contains the checkpoint record from which the control file specifies that redo should start. Actually, I think that checkpoint record could be spread across multiple files: it might be big. We need them all, but the current code won't ensure that we get them. The consensus solution on this thread seems to be that we should have pg_do_stop_backup() return the last-replayed XLOG location as the backup end point. If the control file has been updated with a newer redo location, then the associated checkpoint record has surely been replayed, so we'll definitely have enough WAL to replay that checkpoint record (and we don't need to replay anything else, because we're supposing that this is the case where the minimum recovery point precedes the redo location). Although this will work, it might include more WAL in the backup than is strictly necessary. If the additional WAL replayed beyond the minimum recovery point does NOT include a checkpoint, we don't need any of it; if it does, we need only the portion up to and including the last checkpoint record, and not anything beyond that. I can think of two solutions that would be "tighter": 1. When performing a restartpoint, update the minimum recovery point to just beyond the checkpoint record. I think this can't hurt anyone who is actually restarting recovery on the same machine, because that's exactly the point where they are going to start recovery. A minimum recovery point which precedes the point at which they will start recovery is no better than one which is equal to the point at which they will start recovery. 2. In perform_base_backup(), if the endptr returned by do_pg_stop_backup() precedes the end of the checkpoint record returned by do_pg_start_backup(), use the latter value instead. Unfortunately, that's not so easy: we can't just say if (endptr < starptr) startptr = endptr; because startptr is the *start* of the checkpoint record, not the end. I suspect somebody could figure out a solution to this problem, though. If we decide we don't want to aim for one of these tighter solutions and just adopt the previously-discussed patch, then at the very least it needs better comments. The minimum comment change the patch makes right now doesn't give any explanation of why the new algorithm is better than the old, and that's not cool. Without that, the next person who comes along to maintain this code won't have any easy way of knowing that we grappled with this problem and what our conclusions were. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers