On Sun, Mar 4, 2018 at 3:49 PM, Stephen Frost <sfr...@snowman.net> wrote:
> Greetings Magnus, all,
>
> * Magnus Hagander (mag...@hagander.net) wrote:
> > I think it would also be a good idea to have this a three-mode setting,
> > with "no check", "check and warning", "check and error". Where "check and
> > error" should be the default, but you could turn off that in "save
> whatever
> > is left mode". But I think it's better if pg_basebackup simply fails on a
> > checksum error, because that will make it glaringly obvious that there
> is a
> > problem -- which is the main point of checksums in the first place. And
> > then an option to turn it off completely in cases where performance is
> the
> > thing.
>
> When we implemented page-level checksum checking in pgBackRest, David
> and I had a good long discussion about exactly this question of "warn"
> vs. "error" and came to a different conclusion- you want a backup to
> always back up as much as it can even in the face of corruption.  If the
> user has set up their backups in such a way that they don't see the
> warnings being thrown, it's a good bet they won't see failed backups
> happening either, in which case they might go from having "mostly" good
> backups to not having any.  Note that I *do* think a checksum failure
> should result in an non-zero exit-code result from pg_basebackup,
> indicating that there was something which went wrong.
>

I would argue that the likelihood of them seeing an error vs a warning is
orders of magnitude higher.

That said, if we want to exit pg_basebacukp with an exit code but still
complete the backup, that would also be a workable way I guess. But I
strongly feel that we should make pg_basebackup scream at the user and
actually exit with an error -- it's the exit with error that will cause
their cronjobs to fail, and thus somebody notice it.




> One difference is that with pgBackRest, we manage the backups and a
> backup with page-level checksums isn't considered "valid", so we won't
> expire old backups if a new backup has a checksum failure, but I'm not
> sure that's really enough to change my mind on if pg_basebackup should
> outright fail on a checksum error or if it should throw big warnings but
> still try to perform the backup.  If the admin sets things up in a way
> that a warning and error-exit code from pg_basebackup is ignored and
> they still expire out their old backups, then even having an actual
> error result wouldn't change that.
>

There is another important difference as well. In pgBackRest somebody will
have to explicitly enable checksum verification -- which already there
means that they are much more likely to actually check the logs from it.


As an admin, the first thing I would want in a checksum failure scenario
> is a backup of everything, even the blocks which failed (and then a
> report of which blocks failed...).  I'd rather we think about that
> use-case than the use-case where the admin sets up backups in such a way
> that they don't see warnings being thrown from the backup.
>

I agree. But I absolutely don't want my backup to be listed as successful,
because then I might expire old ones.

So sure, if we go with WARNING + exit with an errorcode, that is perhaps
the best combination of the two.

That said, it probably still makes sense to implement all modes. Or at
least to implement a "don't bother verifying the checksums" mode. This
mainly controls what the default would be.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ <http://www.hagander.net/>
 Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

Reply via email to