Hi,

In 
https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg%40alap3.anarazel.de
I commented that we have a number of hacky workarounds to deal with the fact
that walreceiver writes partial WAL pages into reycled segments.

The problem with that practice is that within a page we cannot reliably detect
invalid record headers. This is especially true, when the record header spans
across a page boundary - currently the only check in that case is if the
record length is smaller than 1GB, and even that is just checked in the
frontend. Note that we cannot rely on the CRC checksum here - it can only be
validated once the whole record has been read.

On a primary we *do* pad partial pages, see AdvanceXLInsertBuffer():
                /*
                 * Be sure to re-zero the buffer so that bytes beyond what we've
                 * written will look like zeroes and not valid XLOG records...
                 */
                MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

Particularly the logic in allocate_recordbuf() is scary: In a completely
working setup we'll regularly try to allocate large buffers that we'll never
need - and the record buffer is not freed until the startup process exits. And
we have no corresponding size check in the frontend (which doesn't make any
sense to me). In the case of a record header across a page boundary, this
check will pass in roughly 1/4 of the cases!


As an example of the difference this makes, I ran a primary/standby setup with
continuously running regression tests, and had a psql \watch terminate
walsender every 1.5 s.

Decoding failures without zero-padding:
2021-05-10 16:52:51.448 PDT [2481446][1/0] LOG:  record with incorrect 
prev-link 103FF/73 at 4/C154FD50
2021-05-10 16:52:53.001 PDT [2481446][1/0] LOG:  record with incorrect 
prev-link 0/FFFF at 4/C3531A88
2021-05-10 16:52:57.848 PDT [2481446][1/0] LOG:  invalid resource manager ID 32 
at 4/C3B67AD8
2021-05-10 16:52:58.773 PDT [2481446][1/0] LOG:  record with incorrect 
prev-link 403FF/12 at 4/C47F35E8
2021-05-10 16:53:03.771 PDT [2481446][1/0] LOG:  invalid page at 4/C562E000
2021-05-10 16:53:04.945 PDT [2481446][1/0] LOG:  invalid record length at 
4/C6E1C1E8: wanted 24, got 0
2021-05-10 16:53:06.176 PDT [2481446][1/0] LOG:  invalid page at 4/C7040000
2021-05-10 16:53:07.624 PDT [2481446][1/0] LOG:  record with incorrect 
prev-link 2FF/64 at 4/C7475078
...


With zero-padding:
2021-05-10 16:58:20.186 PDT [2489042][1/0] LOG:  invalid record length at 
5/7049A40: wanted 24, got 0
2021-05-10 16:58:22.832 PDT [2489042][1/0] LOG:  invalid record length at 
5/801AD70: wanted 24, got 0
2021-05-10 16:58:27.548 PDT [2489042][1/0] LOG:  invalid record length at 
5/8319908: wanted 24, got 0
2021-05-10 16:58:30.945 PDT [2489042][1/0] LOG:  invalid record length at 
5/AFDC770: wanted 24, got 0
...
2021-05-10 16:59:24.546 PDT [2489042][1/0] LOG:  invalid page at 5/19284000

The "invalid page" cases are a lot rarer - previously we would hit them
whenever the record header itself passed [minimal] muster, even though it was
just padding passing as e.g. a valid record length. Now it's only when the end
of WAL actually is at the page boundary.


On 13+ we could do a bit better than the current approach, and use
pg_pwritev() to append the zeroed data. However, I'm not convinced it is a
good idea - when pg_pwritev is emulated, we'd always do the zeroing as part of
a separate write, which does seem like it increases the likelihood of
encountering such partially written pages a bit. But perhaps it's too
insignificant to matter.

Greetings,

Andres Freund
diff --git c/src/backend/replication/walreceiver.c i/src/backend/replication/walreceiver.c
index 9a0e3806fcf..3188743e59b 100644
--- c/src/backend/replication/walreceiver.c
+++ i/src/backend/replication/walreceiver.c
@@ -865,16 +865,38 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 
 /*
  * Write XLOG data to disk.
+ *
+ * This is a bit more complicated than immediately obvious: WAL files are
+ * recycled, which means they can contain almost arbitrary data. To ensure
+ * that we can reliably detect the end of the WAL, we need to pad partial
+ * pages with zeroes. If we didn't do so, the data beyond the end of the WAL
+ * is interpreted as a record header - while we can do some checks on the
+ * record, they are not bulletproof. Particularly because the record's CRC
+ * checksum can only be validated once the entire record has been read.
+ *
+ * There is no corresponding risk at the start of a page, because
+ * XLogPageHeader.xlp_pageaddr inside recycled segment data will always differ
+ * from the expected pageaddr.
+ *
+ * We cannot just pad 'buf' by whatever amount is necessary, as it is
+ * allocated outside of our control, preventing us from resizing it to be big
+ * enough. Copying the entire incoming data into a temporary buffer would be a
+ * noticeable overhead. Instead we separately write pages that need to be
+ * padded, and copy just that partial data to the temporary buffer (padbuf).
  */
 static void
 XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 {
 	int			startoff;
 	int			byteswritten;
+	XLogRecPtr	orig_recptr = recptr PG_USED_FOR_ASSERTS_ONLY;
+	Size		orig_nbytes = nbytes PG_USED_FOR_ASSERTS_ONLY;
+	PGAlignedXLogBlock padbuf;
 
 	while (nbytes > 0)
 	{
-		int			segbytes;
+		int			segbytes; /* write amount, valid WAL data only */
+		int			writebytes; /* write amount, including padding */
 
 		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
@@ -925,42 +947,95 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		startoff = XLogSegmentOffset(recptr, wal_segment_size);
 
 		if (startoff + nbytes > wal_segment_size)
-			segbytes = wal_segment_size - startoff;
-		else
-			segbytes = nbytes;
-
-		/* OK to write the logs */
-		errno = 0;
-
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
-		if (byteswritten <= 0)
 		{
-			char		xlogfname[MAXFNAMELEN];
-			int			save_errno;
+			/* only write bytes in the current segment */
+			segbytes = wal_segment_size - startoff;
+			writebytes = segbytes;
+		}
+		else if (recptr / XLOG_BLCKSZ == (recptr + nbytes) / XLOG_BLCKSZ &&
+				 (recptr % XLOG_BLCKSZ) == 0)
+		{
+			/*
+			 * Writing a partial page. As explained in the function header, we
+			 * need to pad the trailing space with zeroes.
+			 */
+			Assert(nbytes <= XLOG_BLCKSZ);
 
-			/* if write didn't set errno, assume no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
+			segbytes = nbytes;
+			writebytes = XLOG_BLCKSZ;
 
-			save_errno = errno;
-			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-			errno = save_errno;
-			ereport(PANIC,
-					(errcode_for_file_access(),
-					 errmsg("could not write to log segment %s "
-							"at offset %u, length %lu: %m",
-							xlogfname, startoff, (unsigned long) segbytes)));
+			memcpy(padbuf.data, buf, nbytes);
+			memset(padbuf.data + nbytes, 0, XLOG_BLCKSZ - nbytes);
+			buf = padbuf.data;
+		}
+		else if (recptr / XLOG_BLCKSZ != (recptr + nbytes) / XLOG_BLCKSZ &&
+				 (recptr + nbytes) % XLOG_BLCKSZ != 0)
+		{
+			/*
+			 * Write partial pages separately from the rest, to allow to pad,
+			 * without needing to reallocate the whole, potentially large,
+			 * incoming buffer.
+			 */
+			segbytes = nbytes - (recptr + nbytes) % XLOG_BLCKSZ;
+			writebytes = segbytes;
+		}
+		else
+		{
+			segbytes = nbytes;
+			writebytes = segbytes;
 		}
 
-		/* Update state for write */
-		recptr += byteswritten;
+		/* OK to write the logs */
+		while (writebytes > 0)
+		{
+			errno = 0;
 
-		nbytes -= byteswritten;
-		buf += byteswritten;
+			byteswritten = pg_pwrite(recvFile, buf, writebytes, (off_t) startoff);
+			if (byteswritten <= 0)
+			{
+				char		xlogfname[MAXFNAMELEN];
+				int			save_errno;
+
+				/* if write didn't set errno, assume no disk space */
+				if (errno == 0)
+					errno = ENOSPC;
+
+				save_errno = errno;
+				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+				errno = save_errno;
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not write to log segment %s "
+								"at offset %u, length %lu: %m",
+								xlogfname, startoff, (unsigned long) writebytes)));
+			}
+
+			/* XXX: Useful for testing partial writes */
+#if 0
+			if (byteswritten > 1023)
+				byteswritten = 1023;
+#endif
+
+			/* Update state for write */
+
+			writebytes -= byteswritten;
+			buf += byteswritten;
+			startoff += byteswritten;
+
+			/* the recptr / nbytes may not include padded data */
+			if (byteswritten > segbytes)
+				byteswritten = segbytes;
+
+			segbytes -= byteswritten;
+			recptr += byteswritten;
+			nbytes -= byteswritten;
+		}
 
 		LogstreamResult.Write = recptr;
 	}
 
+	Assert(LogstreamResult.Write == orig_recptr + orig_nbytes);
+
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 }

Reply via email to