On Fri, Feb 23, 2018 at 11:26:31AM +0900, Michael Paquier wrote:
> An other, evil, idea that I have on top of all those things is to
> directly hexedit the WAL segment of the standby just at the limit where
> it would receive a record from the primary and insert in it garbage
> data which would make the validation tests to blow up in xlogreader.c
> for the record allocation.
OK, I have been playing with such methods and finally I have been able to check the theory of Tsunakawa-san here.

At first I played with hexedit to pollute the data after the last record received by a standby, a record which is not at the end of a WAL page. This can easily happen by stopping the primary, which will most likely only send data up to the middle of a page. Using that I have easily faced errors like this:
LOG: invalid resource manager ID 255 at 0/75353B8
And I have been able to see that garbage could be read as the length of a record before the validation of the header is done. With the patch attached you can easily see a collection of garbage:
LOG: length of 77913214 fetched for record 0/30009D8
This happens when a standby is reading a page whose last record ends in the middle of it.

At this stage, the XLOG reader is dealing with random data, so it will most likely fail in ValidXLogRecordHeader(), which is what happened with the invalid rmgr above. However, if one is really unlucky, and the probability of facing that is really low, the random data being read *can* pass the validation checks of the header, which causes a set of problems:
- In the backend, this means a failure in palloc_extended if the garbage length read is higher than 1GB. This causes the standby to stop immediately, while it should normally continue to poll the primary by spawning a new WAL receiver and resuming streaming from the beginning of the segment where it needs its data (assuming that there is no archive).
- In the frontend, this can cause problems for pg_waldump and pg_rewind. For pg_waldump this is not actually a big deal, because a failure just means the end of the data that can be decoded. The problem is more serious for pg_rewind, though. In its initial phase, pg_rewind scans the WAL segments of the target server to look for the blocks modified since the last checkpoint before promotion, up to the end of the stream. If at the end of the stream it finds garbage data, there is a risk that it tries to allocate a couple of GBs, likely ending with the process being killed by the OOM killer and preventing the rewind from completing :(

After some thought, I have also come up with a simpler way to test the stability of the XLOG reader:
1) Create a primary/standby cluster.
2) Stop the standby.
3) Add to the standby's pg_wal a set of junk WAL segments generated with dd if=/dev/urandom of=junk_walseg bs=16MB count=1. Note that segment recycling reuses past segments as-is, so when a standby starts writing to a recycled segment it writes on top of whatever stale data is already there. The problem can therefore show up without any junk segments; they just make it show up faster.
4) Start the standby again, and leave it running.
5) On the primary, generate more or less 8kB worth of records, enough to fill a complete WAL page.
6) Restart the primary cleanly, sleep something like 5s to give the standby time to stream the new partial page, and return to 5).
After looping over 5) and 6) a couple of times, you will see the monitoring log added by the attached patch complain.

Tsunakawa-san has proposed upthread to fix the problem by zeroing the page read in the WAL receiver. While I agree that zeroing the page is the way to go, doing so in the WAL receiver does not take care of the problems with the frontends.
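To make the failure mode concrete, here is a small standalone sketch. This is not PostgreSQL code (the helper and the page handling are invented for illustration), but it mirrors the pattern in XLogReadRecord(), which reads xl_tot_len from the page buffer before the full header has been validated:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define XLOG_BLCKSZ 8192

static uint32_t
read_xl_tot_len(const char *page)
{
	uint32_t	len;

	/* xl_tot_len is the first field of a WAL record header */
	memcpy(&len, page, sizeof(len));
	return len;
}

int
main(void)
{
	char		page[XLOG_BLCKSZ];

	/* Simulate a recycled segment: bytes past the last record are junk. */
	for (int i = 0; i < XLOG_BLCKSZ; i++)
		page[i] = (char) (rand() & 0xFF);
	printf("garbage xl_tot_len: %u bytes\n", (unsigned) read_xl_tot_len(page));

	/*
	 * A zeroed tail turns the bogus length into 0, which the header
	 * sanity checks reject before any allocation is attempted.
	 */
	memset(page, 0, sizeof(page));
	printf("zeroed xl_tot_len: %u bytes\n", (unsigned) read_xl_tot_len(page));
	return 0;
}

With garbage the printed length can be anything up to roughly 4GB, which is exactly the number a frontend reader would then try to allocate; with zeroes the length is 0 and validation fails cleanly.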
I don't have the time to test that yet as it is late here, but based on my code lookups, tweaking ReadPageInternal() so that the page is zeroed correctly should do it for all the cases. This way, the checks happening after a page read would fail consistently on the zeroed fields instead of on the accumulated garbage.
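Here is a minimal sketch of the idea (untested; readLen and state->readBuf follow the names in xlogreader.c, but the exact placement inside ReadPageInternal() is an assumption):

/*
 * Sketch only: once the read_page callback has returned readLen valid
 * bytes for the requested page, wipe the remainder of the block so
 * that the validation routines trip on zeroes rather than leftover
 * garbage.
 */
if (readLen < XLOG_BLCKSZ)
	memset(state->readBuf + readLen, 0, XLOG_BLCKSZ - readLen);

Since this lives in xlogreader.c, both the backend and the frontend readers would get the same protection.
--
Michael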
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3a86f3497e..13b25e5236 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -303,6 +303,16 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
 	total_len = record->xl_tot_len;
 
+#ifndef FRONTEND
+	/*
+	 * XXX: remove from final patch, just here to check that garbage
+	 * data can be fetched from blocks read.
+	 */
+	if (total_len > 5 * XLOG_BLCKSZ)
+		elog(LOG, "length of %u fetched for record %X/%X", total_len,
+			 (uint32) (RecPtr >> 32), (uint32) RecPtr);
+#endif
+
 	/*
 	 * If the whole record header is on this page, validate it immediately.
 	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the