Hi, In https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg%40alap3.anarazel.de I commented that we have a number of hacky workarounds to deal with the fact that walreceiver writes partial WAL pages into reycled segments.
The problem with that practice is that within a page we cannot reliably detect invalid record headers. This is especially true, when the record header spans across a page boundary - currently the only check in that case is if the record length is smaller than 1GB, and even that is just checked in the frontend. Note that we cannot rely on the CRC checksum here - it can only be validated once the whole record has been read. On a primary we *do* pad partial pages, see AdvanceXLInsertBuffer(): /* * Be sure to re-zero the buffer so that bytes beyond what we've * written will look like zeroes and not valid XLOG records... */ MemSet((char *) NewPage, 0, XLOG_BLCKSZ); Particularly the logic in allocate_recordbuf() is scary: In a completely working setup we'll regularly try to allocate large buffers that we'll never need - and the record buffer is not freed until the startup process exits. And we have no corresponding size check in the frontend (which doesn't make any sense to me). In the case of a record header across a page boundary, this check will pass in roughly 1/4 of the cases! As an example of the difference this makes, I ran a primary/standby setup with continuously running regression tests, and had a psql \watch terminate walsender every 1.5 s. Decoding failures without zero-padding: 2021-05-10 16:52:51.448 PDT [2481446][1/0] LOG: record with incorrect prev-link 103FF/73 at 4/C154FD50 2021-05-10 16:52:53.001 PDT [2481446][1/0] LOG: record with incorrect prev-link 0/FFFF at 4/C3531A88 2021-05-10 16:52:57.848 PDT [2481446][1/0] LOG: invalid resource manager ID 32 at 4/C3B67AD8 2021-05-10 16:52:58.773 PDT [2481446][1/0] LOG: record with incorrect prev-link 403FF/12 at 4/C47F35E8 2021-05-10 16:53:03.771 PDT [2481446][1/0] LOG: invalid page at 4/C562E000 2021-05-10 16:53:04.945 PDT [2481446][1/0] LOG: invalid record length at 4/C6E1C1E8: wanted 24, got 0 2021-05-10 16:53:06.176 PDT [2481446][1/0] LOG: invalid page at 4/C7040000 2021-05-10 16:53:07.624 PDT [2481446][1/0] LOG: record with incorrect prev-link 2FF/64 at 4/C7475078 ... With zero-padding: 2021-05-10 16:58:20.186 PDT [2489042][1/0] LOG: invalid record length at 5/7049A40: wanted 24, got 0 2021-05-10 16:58:22.832 PDT [2489042][1/0] LOG: invalid record length at 5/801AD70: wanted 24, got 0 2021-05-10 16:58:27.548 PDT [2489042][1/0] LOG: invalid record length at 5/8319908: wanted 24, got 0 2021-05-10 16:58:30.945 PDT [2489042][1/0] LOG: invalid record length at 5/AFDC770: wanted 24, got 0 ... 2021-05-10 16:59:24.546 PDT [2489042][1/0] LOG: invalid page at 5/19284000 The "invalid page" cases are a lot rarer - previously we would hit them whenever the record header itself passed [minimal] muster, even though it was just padding passing as e.g. a valid record length. Now it's only when the end of WAL actually is at the page boundary. On 13+ we could do a bit better than the current approach, and use pg_pwritev() to append the zeroed data. However, I'm not convinced it is a good idea - when pg_pwritev is emulated, we'd always do the zeroing as part of a separate write, which does seem like it increases the likelihood of encountering such partially written pages a bit. But perhaps it's too insignificant to matter. Greetings, Andres Freund
diff --git c/src/backend/replication/walreceiver.c i/src/backend/replication/walreceiver.c index 9a0e3806fcf..3188743e59b 100644 --- c/src/backend/replication/walreceiver.c +++ i/src/backend/replication/walreceiver.c @@ -865,16 +865,38 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len) /* * Write XLOG data to disk. + * + * This is a bit more complicated than immediately obvious: WAL files are + * recycled, which means they can contain almost arbitrary data. To ensure + * that we can reliably detect the end of the WAL, we need to pad partial + * pages with zeroes. If we didn't do so, the data beyond the end of the WAL + * is interpreted as a record header - while we can do some checks on the + * record, they are not bulletproof. Particularly because the record's CRC + * checksum can only be validated once the entire record has been read. + * + * There is no corresponding risk at the start of a page, because + * XLogPageHeader.xlp_pageaddr inside recycled segment data will always differ + * from the expected pageaddr. + * + * We cannot just pad 'buf' by whatever amount is necessary, as it is + * allocated outside of our control, preventing us from resizing it to be big + * enough. Copying the entire incoming data into a temporary buffer would be a + * noticeable overhead. Instead we separately write pages that need to be + * padded, and copy just that partial data to the temporary buffer (padbuf). */ static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) { int startoff; int byteswritten; + XLogRecPtr orig_recptr = recptr PG_USED_FOR_ASSERTS_ONLY; + Size orig_nbytes = nbytes PG_USED_FOR_ASSERTS_ONLY; + PGAlignedXLogBlock padbuf; while (nbytes > 0) { - int segbytes; + int segbytes; /* write amount, valid WAL data only */ + int writebytes; /* write amount, including padding */ if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size)) { @@ -925,42 +947,95 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) startoff = XLogSegmentOffset(recptr, wal_segment_size); if (startoff + nbytes > wal_segment_size) - segbytes = wal_segment_size - startoff; - else - segbytes = nbytes; - - /* OK to write the logs */ - errno = 0; - - byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff); - if (byteswritten <= 0) { - char xlogfname[MAXFNAMELEN]; - int save_errno; + /* only write bytes in the current segment */ + segbytes = wal_segment_size - startoff; + writebytes = segbytes; + } + else if (recptr / XLOG_BLCKSZ == (recptr + nbytes) / XLOG_BLCKSZ && + (recptr % XLOG_BLCKSZ) == 0) + { + /* + * Writing a partial page. As explained in the function header, we + * need to pad the trailing space with zeroes. + */ + Assert(nbytes <= XLOG_BLCKSZ); - /* if write didn't set errno, assume no disk space */ - if (errno == 0) - errno = ENOSPC; + segbytes = nbytes; + writebytes = XLOG_BLCKSZ; - save_errno = errno; - XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size); - errno = save_errno; - ereport(PANIC, - (errcode_for_file_access(), - errmsg("could not write to log segment %s " - "at offset %u, length %lu: %m", - xlogfname, startoff, (unsigned long) segbytes))); + memcpy(padbuf.data, buf, nbytes); + memset(padbuf.data + nbytes, 0, XLOG_BLCKSZ - nbytes); + buf = padbuf.data; + } + else if (recptr / XLOG_BLCKSZ != (recptr + nbytes) / XLOG_BLCKSZ && + (recptr + nbytes) % XLOG_BLCKSZ != 0) + { + /* + * Write partial pages separately from the rest, to allow to pad, + * without needing to reallocate the whole, potentially large, + * incoming buffer. + */ + segbytes = nbytes - (recptr + nbytes) % XLOG_BLCKSZ; + writebytes = segbytes; + } + else + { + segbytes = nbytes; + writebytes = segbytes; } - /* Update state for write */ - recptr += byteswritten; + /* OK to write the logs */ + while (writebytes > 0) + { + errno = 0; - nbytes -= byteswritten; - buf += byteswritten; + byteswritten = pg_pwrite(recvFile, buf, writebytes, (off_t) startoff); + if (byteswritten <= 0) + { + char xlogfname[MAXFNAMELEN]; + int save_errno; + + /* if write didn't set errno, assume no disk space */ + if (errno == 0) + errno = ENOSPC; + + save_errno = errno; + XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size); + errno = save_errno; + ereport(PANIC, + (errcode_for_file_access(), + errmsg("could not write to log segment %s " + "at offset %u, length %lu: %m", + xlogfname, startoff, (unsigned long) writebytes))); + } + + /* XXX: Useful for testing partial writes */ +#if 0 + if (byteswritten > 1023) + byteswritten = 1023; +#endif + + /* Update state for write */ + + writebytes -= byteswritten; + buf += byteswritten; + startoff += byteswritten; + + /* the recptr / nbytes may not include padded data */ + if (byteswritten > segbytes) + byteswritten = segbytes; + + segbytes -= byteswritten; + recptr += byteswritten; + nbytes -= byteswritten; + } LogstreamResult.Write = recptr; } + Assert(LogstreamResult.Write == orig_recptr + orig_nbytes); + /* Update shared-memory status */ pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write); }