On Wed, Nov 6, 2024 at 8:54 PM Tomas Vondra <to...@vondra.me> wrote:
>
> Hi,
>
> I've been investigating some issues reported by users, related to
> logical replication unexpectedly breaking with messages like:
>
>   LOG: invalidating slot "s" because its restart_lsn X/Y exceeds
>        max_slot_wal_keep_size
>
> which is pretty confusing, because the system has that GUC set to -1 (so
> disabled, there should be no limit). Or a message like this:
>
>   ERROR: requested WAL segment 000...0AA has already been removed
>
> which is a bit less confusing, but still puzzling because replication
> slots are meant to prevent exactly this.
>
> I speculated there's some sort of rare race condition in how we advance
> the slot LSN values. I didn't know where to start, so I gave up on
> trying to understand the whole complex code. Instead, I wrote a simple
> stress test that
>
> 1) sets up a cluster (primary + 1+ logical replicas)
>
> 2) runs pgbench on the primary, checks replicas/slots
>
> 3) randomly restarts the nodes (both primary/replicas) with either fast
>   or immediate mode, with random short pauses
>
> 4) runs checkpoints in a tight while loop
>
> This is based on the observation that in past reports the issues seem
> to happen only with logical replication, shortly after a reconnect (e.g.
> after connection reset etc.). And the slots are invalidated / WAL
> removed during a checkpoint, so frequent checkpoints should make it
> easier to hit ...
>
> Attached are a couple of scripts running this - it's not particularly
> pretty and may need some tweaks to make it work on your machine (run.sh
> is the script to run).
>
> And unfortunately, this started to fail pretty quickly. The short story
> is
> that it's not difficult to get into a situation where restart_lsn for a
> slot moves backwards, so something like this can happen:
>
> 1) slot restart_lsn moves forward to LSN A
>
> 2) checkpoint happens, updates min required LSN for slots, recycles
> segments it considers unnecessary (up to LSN A)
>
> 3) slot restart_lsn moves backwards to LSN B (where B < A)
>
> 4) kaboom
>
> This would perfectly explain all the symptoms I mentioned earlier. The
> max_slot_wal_keep_size reference is confusing, but AFAIK it's just based
> on the (reasonable) belief that the LSN can't move backwards, and so the
> only reason for restart_lsn being before the min required LSN is that
> this GUC kicked in. But the assumption does not hold :-(
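
Interjecting with my understanding of why the message blames
max_slot_wal_keep_size even when that GUC is disabled: the
checkpoint-time check only sees that a slot's restart_lsn has fallen
behind the WAL it decided to keep, and attributes that to the GUC. In
toy form - this is not the actual slot.c code, the names and LSNs are
made up purely for illustration:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

int
main(void)
{
	XLogRecPtr	oldest_kept_lsn = 0x300;	/* WAL before this was recycled */
	XLogRecPtr	slot_restart_lsn = 0x200;	/* slot moved back behind it */

	/*
	 * Since restart_lsn is assumed to never move backwards, falling
	 * behind the kept WAL gets blamed on max_slot_wal_keep_size, even
	 * when that GUC is -1.
	 */
	if (slot_restart_lsn < oldest_kept_lsn)
		printf("invalidating slot: restart_lsn %llX exceeds max_slot_wal_keep_size\n",
			   (unsigned long long) slot_restart_lsn);

	return 0;
}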
>
> Now, why/how can this happen?
>
> I kept adding elog(LOG) messages to all places updating LSNs for a
> slot, and asserts to fail if data.restart_lsn moves backwards. See the
> attached 0001 patch. An example of a failure (for the walsender backend
> that triggered the assert) looks like this:
>
>   344.139 LOG:  starting logical decoding for slot "s1"
>
>   344.139 DETAIL:  Streaming transactions committing after 1/E97EAC30,
>                    reading WAL from 1/E96FB4D0.
>
>   344.140 LOG:  logical decoding found consistent point at 1/E96FB4D0
>
>   344.140 DETAIL:  Logical decoding will begin using saved snapshot.
>
>   344.140 LOG:  LogicalConfirmReceivedLocation 1/E9865398
>
>   344.140 LOG:  LogicalConfirmReceivedLocation updating
>                    data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
>                    candidate_restart_valid 0/0 (from 1/E9865398)
>                    candidate_restart_lsn 0/0 (from 1/E979D4C8)
>
>   344.145 LOG:  LogicalIncreaseRestartDecodingForSlot
>                    restart_decoding_lsn 1/E96FB4D0
>
>   344.145 LOG:  LogicalIncreaseRestartDecodingForSlot s1
>                    candidate_restart_valid_lsn 1/E979D4C8 (0/0)
>                    candidate_restart_lsn 1/E96FB4D0 (0/0)
>
>   344.147 LOG:  LogicalIncreaseRestartDecodingForSlot
>                    restart_decoding_lsn 1/E979D4C8
>
>   344.156 LOG:  LogicalIncreaseXminForSlot
>                    candidate_catalog_xmin 30699
>                    candidate_xmin_lsn 1/E993AD68 (0/0)
>   ...
>   344.235 LOG:  LogicalIncreaseRestartDecodingForSlot
>                    restart_decoding_lsn 1/E9F33AF8
>
>   344.240 LOG:  LogicalConfirmReceivedLocation 1/E9DCCD78
>
>   344.240 LOG:  LogicalConfirmReceivedLocation updating
>                    data.restart_lsn to 1/E96FB4D0 (from 1/E979D4C8)
>                    candidate_restart_valid 0/0 (from 1/E979D4C8)
>                    candidate_restart_lsn 0/0 (from 1/E96FB4D0)
>
>   345.536 LOG:  server process (PID 2518127) was terminated by
>                 signal 6: Aborted
>
> We start decoding at 1/E96FB4D0, and right after startup we get a
> confirmation, and LogicalConfirmReceivedLocation updates restart_lsn to
> 1/E979D4C8.
>
> But then LogicalIncreaseRestartDecodingForSlot comes along, and stores
> the restart_decoding_lsn 1/E96FB4D0 (which is the initial restart_lsn)
> into candidate_restart_lsn.
>
> And then a little bit later we get another confirmation, we call
> LogicalConfirmReceivedLocation which propagates candidate_restart_lsn
> into data.restart_lsn.
>
> This is how restart_lsn moves backwards, causing issues. I've reproduced
> this on PG14 and current master, but I believe the issue exists since
> the introduction of logical replication in 9.4 :-(
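
To make sure I follow the mechanism, here is a minimal self-contained
toy model of the two calls involved. This is NOT the actual logical.c
code - no locking, none of the other branches, made-up LSN values -
just my simplified reading of the sequence your trace shows:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct
{
	XLogRecPtr	restart_lsn;			/* data.restart_lsn */
	XLogRecPtr	confirmed_flush;		/* data.confirmed_flush */
	XLogRecPtr	candidate_restart_lsn;
	XLogRecPtr	candidate_restart_valid;
} ToySlot;

/* LogicalIncreaseRestartDecodingForSlot, reduced to "store a candidate" */
static void
toy_increase_restart(ToySlot *slot, XLogRecPtr current_lsn,
					 XLogRecPtr restart_lsn)
{
	if (slot->candidate_restart_valid == InvalidXLogRecPtr ||
		current_lsn <= slot->confirmed_flush)
	{
		/* note: restart_lsn is never compared against data.restart_lsn */
		slot->candidate_restart_valid = current_lsn;
		slot->candidate_restart_lsn = restart_lsn;
	}
}

/* LogicalConfirmReceivedLocation, reduced to "apply a confirmed candidate" */
static void
toy_confirm_received(ToySlot *slot, XLogRecPtr lsn)
{
	slot->confirmed_flush = lsn;
	if (slot->candidate_restart_valid != InvalidXLogRecPtr &&
		slot->candidate_restart_valid <= lsn)
	{
		/* applied unconditionally, even if older than restart_lsn */
		slot->restart_lsn = slot->candidate_restart_lsn;
		slot->candidate_restart_valid = InvalidXLogRecPtr;
		slot->candidate_restart_lsn = InvalidXLogRecPtr;
	}
}

int
main(void)
{
	ToySlot		s = {.restart_lsn = 100};	/* decoding restarts at 100 */

	/* a candidate of 130, then a confirmation moves restart_lsn forward */
	toy_increase_restart(&s, 150, 130);
	toy_confirm_received(&s, 200);
	printf("restart_lsn = %llu\n", (unsigned long long) s.restart_lsn); /* 130 */

	/* a stale candidate carrying the old restart_lsn (100) is stored ... */
	toy_increase_restart(&s, 180, 100);

	/* ... and the next confirmation moves restart_lsn backwards */
	toy_confirm_received(&s, 250);
	printf("restart_lsn = %llu\n", (unsigned long long) s.restart_lsn); /* 100 */

	return 0;
}

The second candidate here plays the role of the running-transactions
record decoded after the restart (your 344.145 entry), and the second
confirmation plays the role of the 344.240 entry.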
>
> I'm not claiming this is the only way this can happen, but all the
> cases I've seen in my stress testing look like this. Moreover, I'm not
> claiming this is the only LSN field that can move backwards like this.
> It seems to me various other candidate_ fields have the same issue, but
> may have consequences other than discarding "unnecessary" WAL.
>
> I've been reminded of this [1] thread from 2022. I'm 99% sure it's the
> same issue - it happened shortly after a reconnect, etc. And it seems to
> me Masahiko-san was right about the causes in [2]. I don't think the fix
> suggested in that message (changing the branch to "else if") can work,
> though. At least it did not really help in my testing.
>
> And I'm not sure it'd fix all the issues - it only affects restart_lsn,
> but AFAICS the same issue (LSNs moving backwards) can happen for the
> other LSN slot field (candidate_xmin_lsn).

After examining the code before reading [2], I came to the same
conclusion as Masahiko-san in [2]. We install candidate_restart_lsn
based on a running-transactions record whose LSN lies between
restart_lsn and confirmed_flush_lsn, so the candidate_restart_valid of
such a candidate will be less than any confirmed_flush_lsn
subsequently received from downstream, and the stale candidate gets
applied at the next confirmation. I am surprised that the fix
suggested by Masahiko-san didn't work, though. That fix would also
resolve the asymmetry between LogicalIncreaseXminForSlot and
LogicalIncreaseRestartDecodingForSlot that you pointed out in your
next email. What behaviour do you see with that fix applied?
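
Independently of where the real fix ends up, the property we need on
the restart_lsn path is monotonicity. In terms of the toy sketch I
pasted further up (purely illustrative, reusing ToySlot and
InvalidXLogRecPtr from there; this is neither Masahiko-san's patch nor
a concrete proposal), it would amount to something like:

/*
 * Hypothetical guard, for illustration only: a confirmed candidate may
 * only move restart_lsn forward, never backwards.
 */
static void
toy_confirm_received_guarded(ToySlot *slot, XLogRecPtr lsn)
{
	slot->confirmed_flush = lsn;
	if (slot->candidate_restart_valid != InvalidXLogRecPtr &&
		slot->candidate_restart_valid <= lsn)
	{
		/* discard (rather than apply) a candidate that is not newer */
		if (slot->candidate_restart_lsn > slot->restart_lsn)
			slot->restart_lsn = slot->candidate_restart_lsn;
		slot->candidate_restart_valid = InvalidXLogRecPtr;
		slot->candidate_restart_lsn = InvalidXLogRecPtr;
	}
}

With such a guard the replayed sequence in the toy sketch ends with
restart_lsn = 130 instead of 100, i.e. the stale candidate is simply
discarded.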


[1] https://www.postgresql.org/message-id/Yz2hivgyjS1RfMKs%40depesz.com
[2] https://www.postgresql.org/message-id/CAD21AoBVhYnGBuW_o%3DwEGgTp01qiHNAx1a14b1X9kFXmuBe%3Dsg%40mail.gmail.com


-- 
Best Wishes,
Ashutosh Bapat

