Hi Magnus,
thanks a lot for looking at my patch!
Am Donnerstag, den 22.03.2018, 15:07 +0100 schrieb Magnus Hagander:
> On Sat, Mar 17, 2018 at 10:34 PM, Michael Banck <michael.ba...@credativ.de>
> wrote:
> > On Fri, Mar 09, 2018 at 10:35:33PM +0100, Michael Banck wrote:
> > > Possibly open questions:
> > >
> > > 1. I have not so far changed the replication protocol to make verifying
> > > checksums optional. I can go about that next if the consensus is that we
> > > need such an option (and cannot just check it everytime)?
> >
> > I think most people (including those I had off-list discussions about
> > this with) were of the opinion that such an option should be there, so I
> > added an additional option NOVERIFY_CHECKSUMS to the BASE_BACKUP
> > replication command and also an option -k / --no-verify-checksums to
> > pg_basebackup to trigger this.
> >
> > Updated patch attached
> >
>
> Notes:
>
> + if (checksum_failure == true)
>
> Really, just if(checksum_failure)
>
> + errmsg("checksum mismatch during basebackup")));
>
> Should be "base backup" in messages.
I've changed both.
> +static const char *skip[] = {
>
> I think that needs a much better name than just "skip". Skip for what?
> In particular since we are just skipping it for checksums, and not for
> the actual basebackup, that name is actively misinforming.
I have copied that verbatim from the online checksum patch, but of
course this is in src/backend/replication and not src/bin so warrants
more scrutiny. If you plan to commit both for v11, it might make sense
to have that separated out to a more central place?
But I guess what we mean is a test for "is a heap file". Do you have a
good suggestion where it should end up so that pg_verify_checksums can
use it as well?
In the meantime, I've changed the skip[] array to no_heap_files[] and
the skipfile() function to is_heap_file(), also reversing the logic. If
it helps pg_verify_checksums, we could make is_not_a_heap_file()
instead.
> + filename = basename(pstrdup(readfilename));
> + if (!noverify_checksums && DataChecksumsEnabled() &&
> + !skipfile(filename) &&
> + (strncmp(readfilename, "./global/", 9) == 0 ||
> + strncmp(readfilename, "./base/", 7) == 0 ||
> + strncmp(readfilename, "/", 1) == 0))
> + verify_checksum = true;
>
> I would include the checks for global, base etc into the skipfile()
> function as well (also renamed).
Check. I had to change the way (the previous) skipfile() works a bit,
because it was expecting a filename as argument, while we check
pathnames in the above.
> + * Only check pages which have not been modified since the
> + * start of the base backup.
>
> I think this needs a description of why, as well (without having read
> this thread, this is a pretty subtle case).
I tried to expand on this some more.
> +system_or_bail 'dd', 'conv=notrunc', 'oflag=seek_bytes', 'seek=4000',
> 'bs=9', 'count=1', 'if=/dev/zero', "of=$pgdata/$pg_class";
>
> This part of the test will surely fail on Windows, not having a
> /dev/zero. Can we easily implement this part natively in perl perhaps?
Right, this was one of the open questions. I now came up with a perl 4-
liner that seems to do the trick, but I can't test it on Windows.
> Most of that stuff is trivial and can be cleaned up at commit time. Do
> you want to send an updated patch with a few of those fixes, or should
> I clean it?
I've attached a new patch, but I have not addressed the question whether
skipfile()/is_heap_file() should be moved somewhere else yet.
I found one more cosmetic issue: if there is an external tablespace, and
pg_basebackup encounters corruption, you would get the message
"pg_basebackup: changes to tablespace directories will not be undone"
from cleanup_directories_atexit(), which I now also suppress in case of
checksum failures.
> The test thing is a stopper until we figure that one out though. And
> while at it -- it seems we don't have any tests for the checksum
> feature in general. It would probably make sense to consider that at
> the same time as figuring out the right way to do this one.
I don't want to deflect work, but it seems to me the online checksums
patch would be in a better position to generally test checksums while
it's at it. Or did you mean something related to pg_basebackup?
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.ba...@credativ.de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 366b684293..ab9f2fb93a 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2382,7 +2382,7 @@ The commands accepted in replication mode are:
</varlistentry>
<varlistentry>
- <term><literal>BASE_BACKUP</literal> [ <literal>LABEL</literal> <replaceable>'label'</replaceable> ] [ <literal>PROGRESS</literal> ] [ <literal>FAST</literal> ] [ <literal>WAL</literal> ] [ <literal>NOWAIT</literal> ] [ <literal>MAX_RATE</literal> <replaceable>rate</replaceable> ] [ <literal>TABLESPACE_MAP</literal> ]
+ <term><literal>BASE_BACKUP</literal> [ <literal>LABEL</literal> <replaceable>'label'</replaceable> ] [ <literal>PROGRESS</literal> ] [ <literal>FAST</literal> ] [ <literal>WAL</literal> ] [ <literal>NOWAIT</literal> ] [ <literal>MAX_RATE</literal> <replaceable>rate</replaceable> ] [ <literal>TABLESPACE_MAP</literal> ] [ <literal>NOVERIFY_CHECKSUMS</literal> ]
<indexterm><primary>BASE_BACKUP</primary></indexterm>
</term>
<listitem>
@@ -2481,6 +2481,17 @@ The commands accepted in replication mode are:
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>NOVERIFY_CHECKSUMS</literal></term>
+ <listitem>
+ <para>
+ By default, checksums are verified during a base backup if they are
+ enabled. Specifying <literal>NOVERIFY_CHECKSUMS</literal> disables
+ this verification.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
<para>
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index e8921b1bb4..95045669c9 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -506,6 +506,22 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-k</option></term>
+ <term><option>--no-verify-checksums</option></term>
+ <listitem>
+ <para>
+ Disables verification of checksums, if they are enabled on the server
+ the base backup is taken from.
+ </para>
+ <para>
+ By default, checksums are verified and checksum failures will result in
+ a non-zero exit status. However, the base backup will not be removed in
+ this case, as if the <literal>--no-clean</literal> option was used.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-v</option></term>
<term><option>--verbose</option></term>
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 185f32a5f9..8ea6682a45 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -16,6 +16,7 @@
#include <unistd.h>
#include <time.h>
+#include "access/xlog.h"
#include "access/xlog_internal.h" /* for pg_start/stop_backup */
#include "catalog/catalog.h"
#include "catalog/pg_type.h"
@@ -30,6 +31,8 @@
#include "replication/basebackup.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
#include "storage/dsm_impl.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -97,6 +100,15 @@ static TimeOffset elapsed_min_unit;
/* The last check of the transfer rate. */
static TimestampTz throttled_last;
+/* The starting XLOG position of the base backup. */
+static XLogRecPtr startptr;
+
+/* Whether a checksum failure occured. */
+static bool checksum_failure;
+
+/* Do not verify checksums. */
+static bool noverify_checksums = false;
+
/*
* The contents of these directories are removed or recreated during server
* start so they are not included in backups. The directories themselves are
@@ -189,7 +201,6 @@ base_backup_cleanup(int code, Datum arg)
static void
perform_base_backup(basebackup_options *opt)
{
- XLogRecPtr startptr;
TimeLineID starttli;
XLogRecPtr endptr;
TimeLineID endtli;
@@ -205,6 +216,8 @@ perform_base_backup(basebackup_options *opt)
labelfile = makeStringInfo();
tblspc_map_file = makeStringInfo();
+ checksum_failure = false;
+
startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint, &starttli,
labelfile, &tablespaces,
tblspc_map_file,
@@ -563,6 +576,12 @@ perform_base_backup(basebackup_options *opt)
pq_putemptymessage('c');
}
SendXlogRecPtrResult(endptr, endtli);
+
+ if (checksum_failure)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("checksum mismatch during base backup")));
+
}
/*
@@ -592,6 +611,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_wal = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
+ bool o_noverify_checksums = false;
MemSet(opt, 0, sizeof(*opt));
foreach(lopt, options)
@@ -671,6 +691,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->sendtblspcmapfile = true;
o_tablespace_map = true;
}
+ else if (strcmp(defel->defname, "noverify_checksums") == 0)
+ {
+ if (o_noverify_checksums)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ noverify_checksums = true;
+ o_noverify_checksums = true;
+ }
else
elog(ERROR, "option \"%s\" not recognized",
defel->defname);
@@ -1179,13 +1208,47 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
return size;
}
+static const char *no_heap_files[] = {
+ "pg_control",
+ "pg_filenode.map",
+ "pg_internal.init",
+ "PG_VERSION",
+ NULL,
+};
+
+static bool
+is_heap_file(const char *pn)
+{
+ const char **f;
+ const char *fn = basename(pstrdup(pn));
+
+ /* Skip current and parent directory */
+ if (strcmp(pn, ".") == 0 ||
+ strcmp(pn, "..") == 0)
+ return false;
+
+ /* Check that the file is in a tablespace */
+ if (strncmp(pn, "./global/", 9) == 0 ||
+ strncmp(pn, "./base/", 7) == 0 ||
+ strncmp(pn, "/", 1) == 0)
+ {
+ /* Compare file against no_heap_files skiplist */
+ for (f = no_heap_files; *f; f++)
+ if (strcmp(*f, fn) == 0)
+ return false;
+
+ return true;
+ }
+ else
+ return false;
+}
+
/*****
* Functions for handling tar file format
*
* Copied from pg_dump, but modified to work with libpq for sending
*/
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1199,10 +1262,19 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
bool missing_ok)
{
FILE *fp;
+ BlockNumber blkno = 0;
char buf[TAR_SEND_SIZE];
+ uint16 checksum;
size_t cnt;
+ char *filename;
+ int i;
pgoff_t len = 0;
+ char page[BLCKSZ];
size_t pad;
+ PageHeader phdr;
+ int segmentno = 0;
+ char *segmentpath;
+ bool verify_checksum = false;
fp = AllocateFile(readfilename, "rb");
if (fp == NULL)
@@ -1214,10 +1286,86 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
errmsg("could not open file \"%s\": %m", readfilename)));
}
+ /*
+ * Verify checksums only when checksums are enabled, and the file to
+ * send is either in one of the default tablespaces (`./global' or
+ * `./base'), or in an external tablespace with an absolute pathname
+ * (starting with `/') and is not in the list of non-heap files to be
+ * skipped.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ is_heap_file(readfilename))
+ verify_checksum = true;
+
_tarWriteHeader(tarfilename, NULL, statbuf, false);
+ /*
+ * Cut off at the segment boundary (".") to get the segment number in order
+ * to mix it into the checksum.
+ */
+ filename = basename(pstrdup(readfilename));
+ segmentpath = strstr(filename, ".");
+ if (verify_checksum && segmentpath != NULL)
+ {
+ *segmentpath++ = '\0';
+ segmentno = atoi(segmentpath);
+ if (segmentno == 0)
+ ereport(ERROR,
+ (errmsg("invalid segment number %d in filename %s\n",
+ segmentno, filename)));
+ }
+
while ((cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
{
+ if (verify_checksum)
+ {
+ /*
+ * The checksums are verified at block level, so we iterate over
+ * the buffer in chunks of BLCKSZ, after making sure that
+ * TAR_SEND_SIZE/buf is divisible by BLCKSZ and we read
+ * a multiple of BLCKSZ bytes.
+ */
+ Assert(TAR_SEND_SIZE % BLCKSZ == 0);
+ if (cnt % BLCKSZ != 0)
+ {
+ ereport(WARNING,
+ (errmsg("cannot verify checksum in file \"%s\", block "
+ "%d: read buffer size %d and page size %d "
+ "differ",
+ readfilename, blkno, (int)cnt, BLCKSZ)));
+ checksum_failure = true;
+ verify_checksum = false;
+ continue;
+ }
+ for (i = 0; i < cnt / BLCKSZ; i++)
+ {
+ memcpy(page, (buf + BLCKSZ * i), BLCKSZ);
+ /*
+ * Only check pages which have not been modified since the
+ * start of the base backup. Otherwise, they might have been
+ * written only halfway and the checksum would not be valid.
+ * However, replaying WAL would reinstate the correct page in
+ * this case.
+ */
+ if (PageGetLSN(page) < startptr)
+ {
+ checksum = pg_checksum_page((char *) page, blkno + segmentno*RELSEG_SIZE);
+ phdr = (PageHeader) page;
+ if (phdr->pd_checksum != checksum)
+ {
+ ereport(WARNING,
+ (errmsg("checksum verification failed in file "
+ "\"%s\", block %d: calculated %X but "
+ "expected %X",
+ readfilename, blkno, checksum,
+ phdr->pd_checksum)));
+ checksum_failure = true;
+ }
+ }
+ blkno++;
+ }
+ }
+
/* Send the chunk as a CopyData message */
if (pq_putmessage('d', buf, cnt))
ereport(ERROR,
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index beb2c2877b..843a878ff3 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -77,6 +77,7 @@ static SQLCmd *make_sqlcmd(void);
%token K_MAX_RATE
%token K_WAL
%token K_TABLESPACE_MAP
+%token K_NOVERIFY_CHECKSUMS
%token K_TIMELINE
%token K_PHYSICAL
%token K_LOGICAL
@@ -154,7 +155,7 @@ var_name: IDENT { $$ = $1; }
/*
* BASE_BACKUP [LABEL '<label>'] [PROGRESS] [FAST] [WAL] [NOWAIT]
- * [MAX_RATE %d] [TABLESPACE_MAP]
+ * [MAX_RATE %d] [TABLESPACE_MAP] [NOVERIFY_CHECKSUMS]
*/
base_backup:
K_BASE_BACKUP base_backup_opt_list
@@ -208,6 +209,11 @@ base_backup_opt:
$$ = makeDefElem("tablespace_map",
(Node *)makeInteger(true), -1);
}
+ | K_NOVERIFY_CHECKSUMS
+ {
+ $$ = makeDefElem("noverify_checksums",
+ (Node *)makeInteger(true), -1);
+ }
;
create_replication_slot:
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index fb48241e48..7bb501f594 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -92,6 +92,7 @@ PROGRESS { return K_PROGRESS; }
MAX_RATE { return K_MAX_RATE; }
WAL { return K_WAL; }
TABLESPACE_MAP { return K_TABLESPACE_MAP; }
+NOVERIFY_CHECKSUMS { return K_NOVERIFY_CHECKSUMS; }
TIMELINE { return K_TIMELINE; }
START_REPLICATION { return K_START_REPLICATION; }
CREATE_REPLICATION_SLOT { return K_CREATE_REPLICATION_SLOT; }
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1b32592063..39013ff7e0 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -39,6 +39,7 @@
#include "replication/basebackup.h"
#include "streamutil.h"
+#define ERRCODE_DATA_CORRUPTED "XX001"
typedef struct TablespaceListCell
{
@@ -81,6 +82,7 @@ static char *xlog_dir = NULL;
static char format = 'p'; /* p(lain)/t(ar) */
static char *label = "pg_basebackup base backup";
static bool noclean = false;
+static bool checksum_failure = false;
static bool showprogress = false;
static int verbose = 0;
static int compresslevel = 0;
@@ -95,6 +97,7 @@ static char *replication_slot = NULL;
static bool temp_replication_slot = true;
static bool create_slot = false;
static bool no_slot = false;
+static bool verify_checksums = true;
static bool success = false;
static bool made_new_pgdata = false;
@@ -155,7 +158,7 @@ cleanup_directories_atexit(void)
if (success || in_log_streamer)
return;
- if (!noclean)
+ if (!noclean && !checksum_failure)
{
if (made_new_pgdata)
{
@@ -195,7 +198,7 @@ cleanup_directories_atexit(void)
}
else
{
- if (made_new_pgdata || found_existing_pgdata)
+ if ((made_new_pgdata || found_existing_pgdata) && !checksum_failure)
fprintf(stderr,
_("%s: data directory \"%s\" not removed at user's request\n"),
progname, basedir);
@@ -206,7 +209,7 @@ cleanup_directories_atexit(void)
progname, xlog_dir);
}
- if (made_tablespace_dirs || found_tablespace_dirs)
+ if ((made_tablespace_dirs || found_tablespace_dirs) && !checksum_failure)
fprintf(stderr,
_("%s: changes to tablespace directories will not be undone\n"),
progname);
@@ -360,6 +363,8 @@ usage(void)
printf(_(" -P, --progress show progress information\n"));
printf(_(" -S, --slot=SLOTNAME replication slot to use\n"));
printf(_(" --no-slot prevent creation of temporary replication slot\n"));
+ printf(_(" -k, --no-verify-checksums\n"
+ " do not verify checksums\n"));
printf(_(" -v, --verbose output verbose messages\n"));
printf(_(" -V, --version output version information, then exit\n"));
printf(_(" -?, --help show this help, then exit\n"));
@@ -1808,14 +1813,15 @@ BaseBackup(void)
}
basebkp =
- psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s",
+ psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s %s",
escaped_label,
showprogress ? "PROGRESS" : "",
includewal == FETCH_WAL ? "WAL" : "",
fastcheckpoint ? "FAST" : "",
includewal == NO_WAL ? "" : "NOWAIT",
maxrate_clause ? maxrate_clause : "",
- format == 't' ? "TABLESPACE_MAP" : "");
+ format == 't' ? "TABLESPACE_MAP" : "",
+ verify_checksums ? "" : "NOVERIFY_CHECKSUMS");
if (PQsendQuery(conn, basebkp) == 0)
{
@@ -1970,8 +1976,17 @@ BaseBackup(void)
res = PQgetResult(conn);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
{
- fprintf(stderr, _("%s: final receive failed: %s"),
- progname, PQerrorMessage(conn));
+ const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+ if (sqlstate &&
+ strcmp(sqlstate, ERRCODE_DATA_CORRUPTED) == 0)
+ {
+ fprintf(stderr, _("%s: checksum error occured\n"),
+ progname);
+ checksum_failure = true;
+ } else {
+ fprintf(stderr, _("%s: final receive failed: %s"),
+ progname, PQerrorMessage(conn));
+ }
disconnect_and_exit(1);
}
@@ -2140,6 +2155,7 @@ main(int argc, char **argv)
{"progress", no_argument, NULL, 'P'},
{"waldir", required_argument, NULL, 1},
{"no-slot", no_argument, NULL, 2},
+ {"no-verify-checksums", no_argument, NULL, 'k'},
{NULL, 0, NULL, 0}
};
int c;
@@ -2166,7 +2182,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "CD:F:r:RS:T:X:l:nNzZ:d:c:h:p:U:s:wWvP",
+ while ((c = getopt_long(argc, argv, "CD:F:r:RS:T:X:l:nNzZ:d:c:h:p:U:s:wWkvP",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2308,6 +2324,9 @@ main(int argc, char **argv)
case 'P':
showprogress = true;
break;
+ case 'k':
+ verify_checksums = false;
+ break;
default:
/*
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index cdf4f5be37..d4c47f1377 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -4,7 +4,7 @@ use Cwd;
use Config;
use PostgresNode;
use TestLib;
-use Test::More tests => 79;
+use Test::More tests => 84;
program_help_ok('pg_basebackup');
program_version_ok('pg_basebackup');
@@ -15,7 +15,7 @@ my $tempdir = TestLib::tempdir;
my $node = get_new_node('main');
# Initialize node without replication settings
-$node->init;
+$node->init(extra => [ '--data-checksums' ]);
$node->start;
my $pgdata = $node->data_dir;
@@ -317,3 +317,27 @@ like(
slurp_file("$tempdir/backupxs_sl_R/recovery.conf"),
qr/^primary_slot_name = 'slot1'\n/m,
'recovery.conf sets primary_slot_name');
+
+my $checksum = $node->safe_psql('postgres', 'SHOW data_checksums;');
+is($checksum, 'on', 'checksums are enabled');
+
+# induce corruption
+my $pg_class = $node->safe_psql('postgres',
+ q{SELECT pg_relation_filepath('pg_class')}
+);
+open my $file, '+<', "$pgdata/$pg_class";
+seek($file, 4000, 0);
+syswrite($file, '\0\0\0\0\0\0\0\0\0');
+close $file;
+
+$node->command_checks_all([ 'pg_basebackup', '-D', "$tempdir/backup_corrupt"],
+ 1,
+ [qr{^$}],
+ [qr/^WARNING.*checksum.verification.failed/s],
+ 'pg_basebackup reports checksum mismatch'
+);
+
+# do not verify checksums, should return ok
+$node->command_ok(
+ [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt2", '-k' ],
+ 'pg_basebackup with -k does not report checksum mismatch');