Hi all, (Thomas in CC.) Now that becfbdd6c1c9 has improved the situation to detect the difference between out-of-memory and invalid WAL data in WAL, I guess that it is time to tackle the problem of what we should do when reading WAL records bit fail on out-of-memory.
To summarize, currently the WAL reader APIs fail the same way if we detect some incorrect WAL record or if a memory allocation fails: an error is generated and returned back to the caller to consume. For WAL replay, not being able to make the difference between an OOM and the end-of-wal is a problem in some cases. For example, in crash recovery, failing an internal allocation will be detected as the end-of-wal, causing recovery to stop prematurely. In the worst cases, this silently corrupts clusters because not all the records generated in the local pg_wal/ have been replayed. Oops. When in standby mode, things are a bit better, because we'd just loop and wait for the next record. But, even in this case, if the startup process does a crash recovery while standby is set up, we may finish by attempting recovery from a different source than the local pg_wal/. Not strictly critical, but less optimal in some cases as we could switch to archive recovery earlier than necessary. In a different thread, I have proposed to extend the WAL reader facility so as an error code is returned to make the difference between an OOM or the end of WAL with an incorrect record: https://www.postgresql.org/message-id/ZRJ-p1dLUY0uoChc%40paquier.xyz However this requires some ABI changes, so that's not backpatchable. This leaves out what we can do for the existing back-branches, and one option is to do the simplest thing I can think of: if an allocation fails, just fail *hard*. The allocations of the WAL reader rely on palloc_extended(), so I'd like to suggest that we switch to palloc() instead. If we do so, an ERROR is promoted to a FATAL during WAL replay, which makes sure that we will never stop recovery earlier than we should, FATAL-ing before things go wrong. Note that the WAL prefetching already relies on a palloc() on HEAD in XLogReadRecordAlloc(), which would fail hard the same way on OOM. So, attached is a proposal of patch to do something down to 12. Thoughts and/or comments are welcome. -- Michael
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index a17263df20..a1363e3b8f 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -43,7 +43,7 @@ static void report_invalid_record(XLogReaderState *state, const char *fmt,...) pg_attribute_printf(2, 3); -static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength); +static void allocate_recordbuf(XLogReaderState *state, uint32 reclength); static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen); static void XLogReaderInvalReadState(XLogReaderState *state); @@ -155,14 +155,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, * Allocate an initial readRecordBuf of minimal size, which can later be * enlarged if necessary. */ - if (!allocate_recordbuf(state, 0)) - { - pfree(state->errormsg_buf); - pfree(state->readBuf); - pfree(state); - return NULL; - } - + allocate_recordbuf(state, 0); return state; } @@ -184,7 +177,6 @@ XLogReaderFree(XLogReaderState *state) /* * Allocate readRecordBuf to fit a record of at least the given length. - * Returns true if successful, false if out of memory. * * readRecordBufSize is set to the new buffer size. * @@ -196,7 +188,7 @@ XLogReaderFree(XLogReaderState *state) * Note: This routine should *never* be called for xl_tot_len until the header * of the record has been fully validated. */ -static bool +static void allocate_recordbuf(XLogReaderState *state, uint32 reclength) { uint32 newSize = reclength; @@ -206,15 +198,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength) if (state->readRecordBuf) pfree(state->readRecordBuf); - state->readRecordBuf = - (char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM); - if (state->readRecordBuf == NULL) - { - state->readRecordBufSize = 0; - return false; - } + state->readRecordBuf = (char *) palloc(newSize); state->readRecordBufSize = newSize; - return true; } /* @@ -505,9 +490,7 @@ XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversi /* Not enough space in the decode buffer. Are we allowed to allocate? */ if (allow_oversized) { - decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM); - if (decoded == NULL) - return NULL; + decoded = palloc(required_space); decoded->oversized = true; return decoded; } @@ -815,13 +798,7 @@ restart: Assert(gotlen <= lengthof(save_copy)); Assert(gotlen <= state->readRecordBufSize); memcpy(save_copy, state->readRecordBuf, gotlen); - if (!allocate_recordbuf(state, total_len)) - { - /* We treat this as a "bogus data" condition */ - report_invalid_record(state, "record length %u at %X/%X too long", - total_len, LSN_FORMAT_ARGS(RecPtr)); - goto err; - } + allocate_recordbuf(state, total_len); memcpy(state->readRecordBuf, save_copy, gotlen); buffer = state->readRecordBuf + gotlen; } @@ -877,16 +854,8 @@ restart: decoded = XLogReadRecordAlloc(state, total_len, true /* allow_oversized */ ); - if (decoded == NULL) - { - /* - * We failed to allocate memory for an oversized record. As - * above, we currently treat this as a "bogus data" condition. - */ - report_invalid_record(state, - "out of memory while trying to decode a record of length %u", total_len); - goto err; - } + /* allocation should always happen under allow_oversized */ + Assert(decoded != NULL); } if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
signature.asc
Description: PGP signature