On Fri, Feb 23, 2018 at 11:26:31AM +0900, Michael Paquier wrote:
> An other, evil, idea that I have on top of all those things is to
> directly hexedit the WAL segment of the standby just at the limit where
> it would receive a record from the primary and insert in it garbage
> data which would make the validation tests to blow up in xlogreader.c
> for the record allocation.
OK, I have been playing with such methods and finally I have been able to check the theory of Tsunakawa-san here.

At first I played with hexedit to pollute the data after the last record received by a standby, a record which is not at the end of a WAL page. This can easily happen by stopping the primary, which will most likely only send data up to the middle of a page. Using that I have easily faced errors like this:
LOG: invalid resource manager ID 255 at 0/75353B8
And I have been able to see that garbage could be read as the length of a record before the validation of the header is done. With the patch attached you can easily see a collection of garbage:
LOG: length of 77913214 fetched for record 0/30009D8
This happens when a standby is reading a page whose last record ends in the middle of it.

At this stage, the XLOG reader is dealing with random data, so it will most likely fail in ValidXLogRecordHeader(), which is what happened with the invalid rmgr above. However, if one is really unlucky, and the probability of facing that is really low, the random data being read *can* pass the validation checks of the header, which causes a set of problems:
- In the backend, this means a failure in palloc_extended if the garbage length read is higher than 1GB. This causes the standby to stop immediately, while it should normally continue to poll the primary by spawning a new WAL receiver and resuming streaming from the beginning of the segment where it needs its data (assuming that there is no archive).
- In the frontend, this can cause problems for pg_waldump and pg_rewind. For pg_waldump this is not actually a big deal, because a failure just means the end of the data that can be decoded. The problem is more serious for pg_rewind, though. In its initial phase, pg_rewind scans the WAL segments of the target server to look for the blocks modified since the last checkpoint before promotion, up to the end of the stream. If at the end of the stream it finds garbage data, there is a risk that it tries to allocate a couple of GBs, likely ending with the process being killed by the OOM killer and preventing the rewind from completing :(

After some thought, I have also come up with a simpler way to test the stability of the XLOG reader:
1) Create a primary/standby cluster.
2) Stop the standby.
3) Add to the standby's pg_wal a set of junk WAL segments generated with dd if=/dev/urandom of=junk_walseg bs=16MB count=1. Note that segment recycling reuses past segments as-is, so when a standby starts writing to a recycled segment it writes on top of whatever stale data is already there. The problem can therefore show up without any junk segments; they just make it show up faster.
4) Start the standby again, and leave it running.
5) On the primary, generate more or less 8kB worth of records, enough to fill a complete WAL page.
6) Restart the primary cleanly, sleep something like 5s to give the standby time to stream the new partial page, and return to 5).
After looping over 5) and 6) a couple of times, you will see the monitoring log added by the attached patch complain.

Tsunakawa-san has proposed upthread to fix the problem by zeroing the page read in the WAL receiver. While I agree that zeroing the page is the way to go, doing so in the WAL receiver does not take care of the problems with the frontends.
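To make the failure mode concrete, here is a small standalone sketch. This is not PostgreSQL code (the helper and the page handling are invented for illustration), but it mirrors the pattern in XLogReadRecord(), which reads xl_tot_len from the page buffer before the full header has been validated:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define XLOG_BLCKSZ 8192

static uint32_t
read_xl_tot_len(const char *page)
{
	uint32_t	len;

	/* xl_tot_len is the first field of a WAL record header */
	memcpy(&len, page, sizeof(len));
	return len;
}

int
main(void)
{
	char		page[XLOG_BLCKSZ];

	/* Simulate a recycled segment: bytes past the last record are junk. */
	for (int i = 0; i < XLOG_BLCKSZ; i++)
		page[i] = (char) (rand() & 0xFF);
	printf("garbage xl_tot_len: %u bytes\n", (unsigned) read_xl_tot_len(page));

	/*
	 * A zeroed tail turns the bogus length into 0, which the header
	 * sanity checks reject before any allocation is attempted.
	 */
	memset(page, 0, sizeof(page));
	printf("zeroed xl_tot_len: %u bytes\n", (unsigned) read_xl_tot_len(page));
	return 0;
}

With garbage the printed length can be anything up to roughly 4GB, which is exactly the number a frontend reader would then try to allocate; with zeroes the length is 0 and validation fails cleanly.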
I don't have the time to test that yet as it is late here, but based on my code lookups, tweaking ReadPageInternal() so that the page is zeroed correctly should do it for all the cases. This way, the checks happening after a page read would fail consistently on the zeroed fields instead of on the accumulated garbage.
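Here is a minimal sketch of the idea (untested; readLen and state->readBuf follow the names in xlogreader.c, but the exact placement inside ReadPageInternal() is an assumption):

/*
 * Sketch only: once the read_page callback has returned readLen valid
 * bytes for the requested page, wipe the remainder of the block so
 * that the validation routines trip on zeroes rather than leftover
 * garbage.
 */
if (readLen < XLOG_BLCKSZ)
	memset(state->readBuf + readLen, 0, XLOG_BLCKSZ - readLen);

Since this lives in xlogreader.c, both the backend and the frontend readers would get the same protection.
--
Michael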
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3a86f3497e..13b25e5236 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -303,6 +303,16 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
 	total_len = record->xl_tot_len;
 
+#ifndef FRONTEND
+	/*
+	 * XXX: remove from final patch, just here to check that garbage
+	 * data can be fetched from blocks read.
+	 */
+	if (total_len > 5 * XLOG_BLCKSZ)
+		elog(LOG, "length of %u fetched for record %X/%X", total_len,
+			 (uint32) (RecPtr >> 32), (uint32) RecPtr);
+#endif
+
 	/*
 	 * If the whole record header is on this page, validate it immediately.
 	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the