Thanks for the report, reproducer and the patches.

At Fri, 16 Jun 2023 16:27:40 +0200, Julian Markwort 
<julian.markw...@cybertec.at> wrote in 
> - prepare a transaction
> - crash postgresql
> - create standby.signal file
> - start postgresql, wait for recovery to finish
> - promote
..
> The promotion will fail with a FATAL error, stating that "requested WAL 
> segment .* has already been removed".
> The FATAL error causes the startup process to exit, so postmaster shuts down 
> again.
> 
> Here's an exemplary log output, maybe this helps people to find this issue 
> when they search for it online:

> LOG:  redo done at 0/15D89B8
> LOG:  last completed transaction was at log time 2023-06-16 13:09:53.71118+02
> LOG:  selected new timeline ID: 2
> LOG:  archive recovery complete
> FATAL:  requested WAL segment pg_wal/000000010000000000000001 has already 
> been removed
> LOG:  startup process (PID 1650358) exited with exit code 1

Reproduced here.

> The cause of this failure is an oversight (rather obvious in hindsight):
> The renaming of the WAL file (that was last written to before the crash 
> happened) to .partial is done *before* PostgreSQL
> might have to read this very file to recover prepared transactions from it.
> The relevant function calls here are durable_rename() and 
> RecoverPreparedTransactions() in xlog.c .
> This behaviour has - apparently unintentionally - been fixed in PG 15 and 
> upwards (see commit 811051c ), as part of a
> general restructure and reorganization of this portion of xlog.c (see commit 
> 6df1543 ).

I think so, the reordering might have done for some other reasons, though.

> Furthermore, it seems this behaviour does not appear in PG 12 and older, due 
> to another possible bug:

<snip>...

> In PG 13 and newer, the XLogReaderState is reset in XLogBeginRead()
> before reading WAL in XlogReadTwoPhaseData() in twophase.c .

I arraived at the same conclusion.

> In the older releases (PG <= 12), this reset is not done, so the requested 
> LSN containing the prepared transaction can
> (by happy coincidence) be read from in-memory buffers, and PostgreSQL 
> consequently manages to come up just fine (as the
> WAL has already been read into buffers prior to the .partial rename).
> If the older releases also where to properly reset the XLogReaderState, they 
> would also fail to find the LSN on disk, and
> hence PostgreSQL would crash again.

>From the perspective of loading WAL for prepared transactions, the
current code in those versions seems fine. Although I suspect Windows
may not like to rename currently-open segments, it's likely acceptable
as the current test set operates without issue.. (I didn't tested this.)

> I've attached patches for PG 14 and PG 13 that mimic the change in PG15 
> (commit 811051c ) and reorder the crucial events,
> placing the recovery of prepared transactions *before* renaming the file.

It appears to move the correct part of the code to the proper
location, modifying the steps to align with later versions.

> I've also attached recovery test scripts for PG >= 12 and PG <= 11 that can 
> be used to verify that promote after recovery
> with prepared transactions works.

It effectively detects the bug, though it can't be directly used in
the tree as-is. I'm unsure whether we need this in the tree, though.

> My humble opinion is that this fix should be backpatched to PG 14 and PG 13.

I agree with you.

> It's debatable whether the fix needs to be brought back to 12 and older also, 
> as those do not exhibit this issue, but the
> order of renaming is still wrong.
> I'm not sure if there could be cases where the in-memory buffers of the 
> walreader are too small to cover a whole WAL
> file.
> There could also be other issues from operations that require reading WAL 
> that happen after the .partial rename, I
> haven't checked in depth what else happens in the affected codepath.
> Please let me know if you think this should also be fixed in PG 12 and 
> earlier, so I can produce the patches for those
> versions as well.

There's no immediate need to change the versions. However, I would
prefer to backpatch them to the older versions for the following
reasons.

1. Applying this eases future backpatching in this area, if any.

2. I have reservations about renaming possibly-open WAL segments.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Reply via email to