On 08/07/2018 05:42 PM, Stephen Frost wrote:
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
On 08/06/2018 09:32 PM, Stephen Frost wrote:
* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
On 08/06/2018 06:11 PM, Stephen Frost wrote:
WAL checksums are per WAL record, not across the whole file...  And,
yes, this seems like something we could probably write code to ensure
we're doing correctly, possibly even without changing the existing WAL
format or maybe we would have to, but either way, seems like additional
code that would need to be written and some serious thought put into it
to make sure it's correct.  I'm all for that, just to be clear, but what
I don't think we can do is move forward with a change to just prefer
pg_wal over restore command.

While the checksums are per-record, the record also contains xl_prev, so
it's effectively a chain of checksums covering the whole file. The other
thing you can do is look at the first record of the next segment and use the
xl_prev to detect if the previous segment was not partial.

But I do agree doing this properly may require writing some new code and
checking that those cases are detected correctly.

Right, might be possible but isn't something we necessairly guarantee
will always work correctly today in this situation where we've got old
WAL files that were pulled over a period of time (imagine that we copy
WAL file ABC before PG was done with it, but we don't get around to
copying WAL file DEF until much later after ABC has been completed and
DEF is only partial, or other such scenarios).  Remember, the scenario
we're talking about here is where someone did a pg_start_backup, took
their time copying all the files in PGDATA, including pg_wal, and then
at some later point ran pg_stop_backup.  We have no idea when those WAL
files were copied and they could have been anywhere between the
pg_start_backup point and the pg_stop_backup point.

(Note: There was a discussion about replacing xl_prev with LSN of the
current record, and IMHO that would work just as well, although it might be
a bit more expensive for some operations.)

I haven't thought very much about this so don't have much of an opinion
on it one way or the other at this point.

CRC's are per WAL record, and possibly some WAL records might not be ok
to replay, or at least we need to make sure that we replay the right set
of WAL in the right order even when there are partial WAL files being
given to PG (that aren't named that way...).  The more I think about
this, I think we really need to avoid partial WAL files entirely- what
are we going to do when we get to the end of one?  We'd need to request
the full one from the restore command anyway, so seems like we should
just go ahead and get it from the archive, the question is if there's an
easy/cheap way to detect partial WAL files in pg_wal.

As explained above, I don't think this is actually a problem. The checksums
do cover the whole file thanks to chaining, and there are ways to detect
partial segments. IMHO it's fine if we replay a segment and then find out it
was partial and that we need to fetch it from archive anyway and re-apply it
- it should not be very common case, except when the user does something
silly.

As long as we *do* go off and try to fetch that WAL file and replay it,
and don't assume that the end of that partial WAL file means the end of
WAL replay, then I think you may be right and that it'd be fine, but it
does seem a bit risky to me.

As I mention above, seems like what we should really be doing is having
a way to know when a WAL file is full but still in pg_wal for some
reason (hasn't been replayed yet due to intential lag on the replica, or
unintentional lag on the replica, etc), and perhaps we even only do that
on replicas or have tools like pg_basebackup and pgbackrest do that when
they're doing a restore, so that it's clear to PG that all these files
are full WAL and they can replay them completely.

IMHO we can deduce if from first xl_prev of the next segment (assuming we
have the next segment available locally, which we likely have except for the
last one).

See above for why I don't think it's actually that simple..

I was actually just thinking about having pgbackrest do exactly that. :)
We already have the full sha256 checksum of every WAL file in the
pgbackrest repository, it seems like it wouldn't be too hard to
calculate the checksum of the files in pg_wal and ask the repo what the
checksums are for those WAL files and then use the files in pg_wal if
they checksums match.  I can already imagine the argument coming back
though- that'd require additional testing to ensure it's done correctly
and I'm really not sure that it'd end up being all that much better
except in some very special circumstances where there's a low bandwidth
connection to the repo and often a lot of WAL left over in pg_wal.

I don't think it can be done in an external tool entirely, at least not
using restore_command alone. We remove the local segment before even
invoking restore_command, so it's too late to check checksums etc.

Wasn't this entire discussion started because we were calling
restore_command first instead of reading from pg_wal..?  Or do we
actively go wipe out pg_wal before doing that (I didn't think we did,
but I could certainly be wrong on that point).


That's how I read this part of RestoreArchivedFile:

https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogarchive.c#L110

The very first thing it does is checking if the local file exists, and if it does then calling unlink().

We'd need invoking some other command before restore_command, which would do
this comparison and decide whether to use local or remote WAL.

Ugh, that sounds like a lot of additional complication that I'm not sure
would be all that useful or sensible for this particular case, if that's
what it would require.  I'd rather we try to figure out some way to get
rid of restore_command entirely instead of bolting on more things around
it.


The simpler the better of course.

All I'm saying is that (assuming my understanding of RestoreArchivedFile is correct) we can't just do that in the current restore_command. We do need a way to ask the archive for some metadata/checksums, and restore_command is too late.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply via email to