Re: Standby trying "restore_command" before local WAL

Tomas Vondra Tue, 07 Aug 2018 05:26:46 -0700



On 08/06/2018 09:32 PM, Stephen Frost wrote:

Greetings,

* Tomas Vondra ([email protected]) wrote:

On 08/06/2018 06:11 PM, Stephen Frost wrote:

* Tomas Vondra ([email protected]) wrote:

On 08/06/2018 05:19 PM, Stephen Frost wrote:

* David Steele ([email protected]) wrote:

I think for the stated scenario (known good standby that has been
shutdown gracefully) it makes perfect sense to trust the contents of
pg_wal.  Call this scenario #1.

An alternate scenario (#2) is that the data directory was copied using a
basic copy tool and the pg_wal directory was not excluded from the copy.
  This means the contents of pg_wal will be in an inconsistent state.
The files that are there might be partials (not with the extension,
though) and you can easily have multiple partials.  You will almost
certainly not have everything you need to get to consistency.


Yeah. But as Simon said, we do have fairly strong protections about applying
corrupted WAL - every record is CRC-checked. So why not to fall-back to the
restore_command only if the locally available WAL is not fully consistent?


"Corrupted" doesn't necessairly only mean "the data file was munged by
the storage somehow."  In this case, corrupted could be an old and only
partial WAL file, in which case we'd possibly be missing WAL that needs
to be replayed to bring the cluster back to a valid state, no?


Why wouldn't that be detected by checksums? Also, why wouldn't this be
handled correctly by the logic I proposed, i.e. falling-back to remote WAL
segment if the file is incomplete/broken in some sense?


WAL checksums are per WAL record, not across the whole file...  And,
yes, this seems like something we could probably write code to ensure
we're doing correctly, possibly even without changing the existing WAL
format or maybe we would have to, but either way, seems like additional
code that would need to be written and some serious thought put into it
to make sure it's correct.  I'm all for that, just to be clear, but what
I don't think we can do is move forward with a change to just prefer
pg_wal over restore command.

While the checksums are per-record, the record also contains xl_prev, soit's effectively a chain of checksums covering the whole file. The otherthing you can do is look at the first record of the next segment and usethe xl_prev to detect if the previous segment was not partial.

But I do agree doing this properly may require writing some new code andchecking that those cases are detected correctly.

(Note: There was a discussion about replacing xl_prev with LSN of thecurrent record, and IMHO that would work just as well, although it mightbe a bit more expensive for some operations.)

But there's another good scenario (#3): where the pg_wal directory was
preloaded with all the WAL required to make the cluster consistent or
all the WAL that was available at restore time.  In this case, it would
be make sense to prefer the contents of pg_wal and only switch to
restore_command after that has been exhausted.

So, the choice of whether to prefer locally-stored or
restore_command-fetched WAL is context-dependent, in my mind.


Agreed.


Maybe, not sure.


The argument that David makes above in scenario #2 certainly looks
entirely likely to me and I don't think we've got any real protections
against that.  The current common use-cases happen to work around the
risk because tools like pg_basebackup ignore the existing pg_wal
directory when doing the backup and instead populate it with exactly the
correct WAL that's needed, and in cases where a restore command is
specified will always pull back only valid WAL, but I don't think we can
decide that this scenario (#2 from above):


I'm probably missing something, but why couldn't we detect that using CRC.
Or make sure we can detect that, e.g. by adding some additional info into
each WAL segment?


CRC's are per WAL record, and possibly some WAL records might not be ok
to replay, or at least we need to make sure that we replay the right set
of WAL in the right order even when there are partial WAL files being
given to PG (that aren't named that way...).  The more I think about
this, I think we really need to avoid partial WAL files entirely- what
are we going to do when we get to the end of one?  We'd need to request
the full one from the restore command anyway, so seems like we should
just go ahead and get it from the archive, the question is if there's an
easy/cheap way to detect partial WAL files in pg_wal.

As explained above, I don't think this is actually a problem. Thechecksums do cover the whole file thanks to chaining, and there are waysto detect partial segments. IMHO it's fine if we replay a segment andthen find out it was partial and that we need to fetch it from archiveanyway and re-apply it - it should not be very common case, except whenthe user does something silly.

Ideally we could have a default that is safe in each scenario with
perhaps an override if the user knows better.  Scenario #1 would allow
WAL to be read from pg_wal by default, scenario #2 would prefer fetched
WAL, and scenario #3 could use a GUC to override the default fetch behavior.


Not sure how we'd be able to automatically realize which scenario we're
in though..?


But do we need to know it? I mean, can't we try the local WAL first, use it
if it passes the CRC checks (and possibly some other checks), and only
fallback to the remote WAL if it's identified as broken?


Maybe- but I think we need to be quite sure about that and I don't
believe that just checking the CRCs is enough.


Sure. So what else would we need to add to each WAL segment to make that
possible? It needs to be cheap enough not to cause perf regression, but e.g.
a CRC of each segment + amount of data actually stored in it should be
sufficient.


As I mention above, seems like what we should really be doing is having
a way to know when a WAL file is full but still in pg_wal for some
reason (hasn't been replayed yet due to intential lag on the replica, or
unintentional lag on the replica, etc), and perhaps we even only do that
on replicas or have tools like pg_basebackup and pgbackrest do that when
they're doing a restore, so that it's clear to PG that all these files
are full WAL and they can replay them completely.

IMHO we can deduce if from first xl_prev of the next segment (assumingwe have the next segment available locally, which we likely have exceptfor the last one).

Another idea - what if we instead allow fetching additional information of
the archived WAL, and make decision based on that? For example, before
restore_command we could request md5 checksum from the archive, compare it
to the local WAL, and if they match then use the local one. Otherwise
request the remote one. That'd still be a win.


I was actually just thinking about having pgbackrest do exactly that. :)
We already have the full sha256 checksum of every WAL file in the
pgbackrest repository, it seems like it wouldn't be too hard to
calculate the checksum of the files in pg_wal and ask the repo what the
checksums are for those WAL files and then use the files in pg_wal if
they checksums match.  I can already imagine the argument coming back
though- that'd require additional testing to ensure it's done correctly
and I'm really not sure that it'd end up being all that much better
except in some very special circumstances where there's a low bandwidth
connection to the repo and often a lot of WAL left over in pg_wal.

I don't think it can be done in an external tool entirely, at least notusing restore_command alone. We remove the local segment before eveninvoking restore_command, so it's too late to check checksums etc.

We'd need invoking some other command before restore_command, whichwould do this comparison and decide whether to use local or remote WAL.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Standby trying "restore_command" before local WAL

Reply via email to