Greetings,

* Thomas Munro (thomas.mu...@gmail.com) wrote:
> 2.  You made a file system-level copy of a cluster that you shut down
> cleanly first, using cp, tar, scp, rsync, xmodem etc.  Now you start
> up the copy.  Its checkpoint is a forgery.  (Maybe our manual should
> mention this problem under "25.2. File System Level Backup" where it
> teaches you to rsync your cluster.)

Yes, it'd be good to get some updates to the backup documentation around
this which stresses in all cases that your backup utility should make
sure to fsync everything it restores.

> These are both scatter gun approaches that can sometimes do a lot of
> useless work, and I'd like to find a precise version that uses
> information we already have about what might be dirty according to the
> meaning of a checkpoint and a transaction log.  The attached patch
> does that as follows:
> 
> 1.  Sync the WAL using fsync(), to enforce the log-before-data rule.
> That's moved into the existing loop that scans the WAL files looking
> for temporary files to unlink.  (I suppose it should perhaps do the
> "presync" trick too.  Not done yet.)
> 
> 2.  While replaying the WAL, if we ever decide to skip a page because
> of its LSN, remember to fsync() the file in the next checkpoint anyway
> (because the data might be dirty in the kernel).   This way we sync
> all files that changed since the last checkpoint (even if we don't
> have to apply the change again).  (A more paranoid mode would mark the
> page dirty instead, so that we'll not only fsync() it, but we'll also
> write it out again.  This would defend against kernels that have
> writeback failure modes that include keeping changes but dropping
> their own dirty flag.  Not done here.)

Presuming that we do add to the documentation the language to document
what's assumed (and already done by modern backup tools) that they're
fsync'ing everything they're restoring, do we/can we have an option
which those tools could set that explicitly tells PG "everything in the
cluster has been fsync'd already, you don't need to do anything extra"?
Perhaps also/seperately one for WAL that's restored with restore command
if we think that's necessary?

Otherwise, just in general, agree with doing this to address the risks
discussed around regular crash recovery.  We have some pretty clear "if
the DB was doing recovery and was interrupted, you need to restore from
backup" messages today in xlog.c, and this patch didn't seem to change
that?  Am I missing something or isn't the idea here that these changes
would make it so you aren't going to end up with corruption in those
cases?  Specifically looking at-

xlog.c:6509-
        case DB_IN_CRASH_RECOVERY:
            ereport(LOG,
                    (errmsg("database system was interrupted while in recovery 
at %s",
                            str_time(ControlFile->time)),
                     errhint("This probably means that some data is corrupted 
and"
                             " you will have to use the last backup for 
recovery.")));
            break;

        case DB_IN_ARCHIVE_RECOVERY:
            ereport(LOG,
                    (errmsg("database system was interrupted while in recovery 
at log time %s",
                            str_time(ControlFile->checkPointCopy.time)),
                     errhint("If this has occurred more than once some data 
might be corrupted"
                             " and you might need to choose an earlier recovery 
target.")));
            break;

Thanks!

Stephen

Attachment: signature.asc
Description: PGP signature

Reply via email to