On Sun, Mar 4, 2018 at 3:49 PM, Stephen Frost <sfr...@snowman.net> wrote:
> Greetings Magnus, all, > > * Magnus Hagander (mag...@hagander.net) wrote: > > I think it would also be a good idea to have this a three-mode setting, > > with "no check", "check and warning", "check and error". Where "check and > > error" should be the default, but you could turn off that in "save > whatever > > is left mode". But I think it's better if pg_basebackup simply fails on a > > checksum error, because that will make it glaringly obvious that there > is a > > problem -- which is the main point of checksums in the first place. And > > then an option to turn it off completely in cases where performance is > the > > thing. > > When we implemented page-level checksum checking in pgBackRest, David > and I had a good long discussion about exactly this question of "warn" > vs. "error" and came to a different conclusion- you want a backup to > always back up as much as it can even in the face of corruption. If the > user has set up their backups in such a way that they don't see the > warnings being thrown, it's a good bet they won't see failed backups > happening either, in which case they might go from having "mostly" good > backups to not having any. Note that I *do* think a checksum failure > should result in an non-zero exit-code result from pg_basebackup, > indicating that there was something which went wrong. > I would argue that the likelihood of them seeing an error vs a warning is orders of magnitude higher. That said, if we want to exit pg_basebacukp with an exit code but still complete the backup, that would also be a workable way I guess. But I strongly feel that we should make pg_basebackup scream at the user and actually exit with an error -- it's the exit with error that will cause their cronjobs to fail, and thus somebody notice it. > One difference is that with pgBackRest, we manage the backups and a > backup with page-level checksums isn't considered "valid", so we won't > expire old backups if a new backup has a checksum failure, but I'm not > sure that's really enough to change my mind on if pg_basebackup should > outright fail on a checksum error or if it should throw big warnings but > still try to perform the backup. If the admin sets things up in a way > that a warning and error-exit code from pg_basebackup is ignored and > they still expire out their old backups, then even having an actual > error result wouldn't change that. > There is another important difference as well. In pgBackRest somebody will have to explicitly enable checksum verification -- which already there means that they are much more likely to actually check the logs from it. As an admin, the first thing I would want in a checksum failure scenario > is a backup of everything, even the blocks which failed (and then a > report of which blocks failed...). I'd rather we think about that > use-case than the use-case where the admin sets up backups in such a way > that they don't see warnings being thrown from the backup. > I agree. But I absolutely don't want my backup to be listed as successful, because then I might expire old ones. So sure, if we go with WARNING + exit with an errorcode, that is perhaps the best combination of the two. That said, it probably still makes sense to implement all modes. Or at least to implement a "don't bother verifying the checksums" mode. This mainly controls what the default would be. -- Magnus Hagander Me: https://www.hagander.net/ <http://www.hagander.net/> Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>