On 11/12/24 10:37, Ashutosh Bapat wrote:
> On Tue, Nov 12, 2024 at 4:54 AM Tomas Vondra <to...@vondra.me> wrote:
>>
>>
>>
>> On 11/11/24 23:41, Masahiko Sawada wrote:
>>> On Mon, Nov 11, 2024 at 6:17 AM Tomas Vondra <to...@vondra.me> wrote:
>>>
>>> Which made me re-investigate the issue; I came to think that we
>>> don't necessarily need to clear these candidate values in memory on
>>> releasing a slot, as long as we're carefully updating restart_lsn.
>>
>> Not sure, but maybe it'd be useful to ask the opposite question. Why
>> shouldn't it be correct to reset the fields, which essentially puts the
>> slot into the same state as if it was just read from disk? That also
>> discards all these values, and we can't rely on accidentally keeping
>> some important piece of info in memory (because if the instance
>> restarts we'd lose that).
>>
>> But this reminds me that the patch I shared earlier today resets the
>> slot in the ReplicationSlotAcquire() function, but I guess that's not
>> quite correct. It probably should be in the "release" path.
>>
>>> Which seems a bit more efficient, for example when restarting from a very
>>> old point. Of course, even if we reset them on releasing a slot, it
>>> would work perfectly, since it's the same as restarting logical
>>> decoding with a server restart.
>>
>> I find the "efficiency" argument a bit odd. It'd be fine if we had a
>> correct behavior to start with, but we don't have that ... Also, I'm not
>> quite sure why exactly it would be more efficient.
>>
>> And how likely is this in practice? It seems to me that
>> performance-sensitive cases can't do reconnects very often anyway,
>> that's inherently inefficient. No?
>>
>>> I think
>>> LogicalIncreaseRestartDecodingForSlot() should be fixed, as it seems
>>> not to be working as expected, but I couldn't find proof that we
>>> should either keep or reset them on releasing a slot.
>>>
>>
>> Not sure. Chances are we need both fixes, if only to make
>> LogicalIncreaseRestartDecodingForSlot more like the other function.
>>
>
> Thanks a lot for pointing to Masahiko's analysis. I missed that part
> when I read that thread. Sorry.
>
No need to apologize. The discussion is complex and spread over a rather
long time period.
> A candidate_restart_lsn and candidate_restart_valid pair just tells
> that we may set slot's data.restart_lsn to candidate_restart_lsn when
> the downstream confirms an LSN >= candidate_restart_valid. That pair
> can never get inaccurate. It may get stale but never inaccurate. So
> wiping those fields from ReplicationSlot is unnecessary.
>
Isn't this issue a proof that those fields *can* get inaccurate? Or what
do you mean by "stale but not inaccurate"?
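To make concrete what a "stale" pair could do, here's a minimal model
of the candidate mechanism as I understand it (a hypothetical sketch
with simplified names and a mock struct, not the actual slot.c code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Simplified stand-in for the relevant ReplicationSlot fields. */
typedef struct MockSlot
{
    XLogRecPtr  restart_lsn;              /* durable data.restart_lsn */
    XLogRecPtr  candidate_restart_lsn;    /* proposed new restart_lsn */
    XLogRecPtr  candidate_restart_valid;  /* confirmation needed at */
} MockSlot;

/* Remember a candidate pair if none is pending. */
static void
propose_restart(MockSlot *slot, XLogRecPtr candidate, XLogRecPtr valid_at)
{
    if (slot->candidate_restart_valid == InvalidXLogRecPtr)
    {
        slot->candidate_restart_lsn = candidate;
        slot->candidate_restart_valid = valid_at;
    }
}

/*
 * Apply the candidate once the downstream confirms far enough.  Note
 * there is no guard here: a stale candidate, e.g. one produced from
 * old running-xacts records after a reconnect, can move restart_lsn
 * backwards.
 */
static void
confirm_flush(MockSlot *slot, XLogRecPtr confirmed)
{
    if (slot->candidate_restart_valid != InvalidXLogRecPtr &&
        confirmed >= slot->candidate_restart_valid)
    {
        slot->restart_lsn = slot->candidate_restart_lsn;
        slot->candidate_restart_lsn = InvalidXLogRecPtr;
        slot->candidate_restart_valid = InvalidXLogRecPtr;
    }
}
```

In this model a pair that is merely "stale" under-advances restart_lsn,
but one recorded below the current restart_lsn would actively regress it.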
> What should ideally happen is we should ignore candidates produced by
> older running transactions WAL records after WAL sender restart. This
> is in line with what logical replication does with transactions
> committed before the slot's confirmed_flush_lsn - those are ignored. But
> the criteria for ignoring running transactions records is slightly
> different from that for transactions. If we ignore
> candidate_restart_lsn which has candidate_restart_valid <=
> confirmed_flush_lsn, we might lose some opportunity to advance
> data.restart_lsn. Instead we should ignore any candidate_restart_lsn
> <= data.restart_lsn especially before WAL sender finds first change to
> send downstream. We can do that in SnapBuildProcessRunningXacts() by
> accessing MyReplicationSlot, taking lock on it and then comparing
> data.restart_lsn with txn->restart_decoding_lsn before calling
> LogicalIncreaseRestartDecodingForSlot(). But then
> LogicalIncreaseRestartDecodingForSlot() would be doing the same anyway
> after applying your patch 0004. The only downside of 0004 is that the
> logic to ignore candidates produced by a running-transactions record
> is not clearly visible in SnapBuildProcessRunningXacts(). For an
> ignored transaction, the logic to ignore it is visible in
> DecodeCommit() or DecodeAbort() - where people are likely to look for
> it. We may add a comment to that effect in
> SnapBuildProcessRunningXacts().
>
I have thought about just doing something like:

    slot->data.restart_lsn = Max(slot->data.restart_lsn, new_lsn);

and similar for the other LSN fields. And it did resolve the issue at
hand, of course. But it seems sloppy, and I'm worried it might easily
mask actual issues in other cases.

I'm still of the opinion that (with the exception of a reconnect), these
places should not need to deal with values that go backwards. It should
work just fine without the Max(), and we should add Asserts() to check
that it's always a higher LSN.
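To illustrate the difference between the two options (just a sketch
with mock types, not the actual patch):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Option 1: clamp with Max().  Resolves the symptom, but silently
 * swallows any caller that tries to move the LSN backwards.
 */
static XLogRecPtr
advance_lsn_clamped(XLogRecPtr current, XLogRecPtr new_lsn)
{
    return (new_lsn > current) ? new_lsn : current;
}

/*
 * Option 2: assert monotonicity.  A caller passing a lower LSN
 * (outside the reconnect case) trips the assert and gets noticed
 * in assert-enabled builds instead of being masked.
 */
static XLogRecPtr
advance_lsn_checked(XLogRecPtr current, XLogRecPtr new_lsn)
{
    assert(new_lsn >= current);
    return new_lsn;
}
```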
For the reconnect, I think it's a bit as if the primary restarted right
before the reconnect. That could happen anyway, and we need to handle
that correctly - if not, we have yet another issue, IMHO. And with the
restart it's the same as writing the slot to disk and reading it back,
which also doesn't retain most of the fields. So it seems cleaner to do
the same thing and just reset the various fields.
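Something along these lines, i.e. clearing in the release path the
same in-memory-only fields a save/restore cycle would drop anyway (a
sketch with a mock struct; the field names follow ReplicationSlot, but
this is not the actual patch):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;
#define InvalidXLogRecPtr    ((XLogRecPtr) 0)
#define InvalidTransactionId ((TransactionId) 0)

/* Just the ephemeral (not persisted) candidate fields of a slot. */
typedef struct MockSlot
{
    TransactionId   candidate_catalog_xmin;
    XLogRecPtr      candidate_xmin_lsn;
    XLogRecPtr      candidate_restart_lsn;
    XLogRecPtr      candidate_restart_valid;
} MockSlot;

/*
 * Hypothetical helper for the release path: reset everything that a
 * restart (writing the slot to disk and reading it back) would
 * discard anyway, so both paths leave the slot in the same state.
 */
static void
reset_ephemeral_slot_fields(MockSlot *slot)
{
    slot->candidate_catalog_xmin = InvalidTransactionId;
    slot->candidate_xmin_lsn = InvalidXLogRecPtr;
    slot->candidate_restart_lsn = InvalidXLogRecPtr;
    slot->candidate_restart_valid = InvalidXLogRecPtr;
}
```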
I haven't thought about making SnapBuildProcessRunningXacts() more
complex to consider this stuff. But I doubt we'd like to be accessing
slots from that code - it has nothing to do with slots. If anything,
tweaking LogicalIncreaseRestartDecodingForSlot() seems more appropriate,
but I'm still wondering if the current coding was intentional and we're
just missing why it was written like this.
There's also the question of backpatching - the simpler the better,
and I think just resetting the fields wins in this regard. The main
question is whether it's correct - I think it is. I'm not too worried
about efficiency, on the grounds that this should not matter very
often (only after an unexpected restart). Correctness > efficiency.
regards
--
Tomas Vondra