On Wed, Nov 13, 2024 at 5:23 PM Tomas Vondra <to...@vondra.me> wrote:
>
> On 11/13/24 11:59, Amit Kapila wrote:
> > On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat
> > <ashutosh.bapat....@gmail.com> wrote:
> >>
> >> On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada.m...@gmail.com> 
> >> wrote:
> >>>
> >>> On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <to...@vondra.me> wrote:
> >>>>
> >>>>
> >>>> But neither of those fixes prevents backwards move for confirmed_flush
> >>>> LSN, as enforced by asserts in the 0005 patch. I don't know if this
> >>>> assert is incorrect or now. It seems natural that once we get a
> >>>> confirmation for some LSN, we can't move before that position, but I'm
> >>>> not sure about that. Maybe it's too strict.
> >>>
> >>> Hmm, I'm concerned that it might be another problem. I think there are
> >>> some cases where a subscriber sends a flush position older than slot's
> >>> confirmed_flush as a feedback message.
> >>>
> >
> > Right, it can happen for cases where subscribers doesn't have to do
> > anything (for example DDLs) like I have explained in one of my emails
> > [1]
> >
>
> Thanks. I admit not being entirely familiar with all the details, but
> doesn't that email explain more "Why it currently happens?" rather than
> "Is this what should be happening?"
>

Right.

> Sure, the subscriber needs to confirm changes for which nothing needs to
> be done, like DDL. But isn't there a better way to do that, rather than
> allowing confirmed_lsn to go backwards?
>

Firstly, I agree that we should try to find ways to avoid going
confirmed_lsn going backward. We can try to explore solutions both in
the publisher and subscriber-side.

> >>> But it seems to be dangerous if
> >>> we always accept it as a new confirmed_flush value. It could happen
> >>> that confirm_flush could be set to a LSN older than restart_lsn.
> >>>
> >
> > Possible, though I haven't tried to reproduce such a case. But, will
> > it create any issues? I don't know if there is any benefit in allowing
> > to move confirmed_flush LSN backward. AFAIR, we don't allow such
> > backward values to persist. They will temporarily be in memory. I
> > think as a separate patch we should prevent it from moving backward.
> >
> >>
> >> If confirmed_flush LSN moves backwards, it means the transactions
> >> which were thought to be replicated earlier are no longer considered
> >> to be replicated. This means that the restart_lsn of the slot needs to
> >> be at least far back as the oldest of starting points of those
> >> transactions. Thus restart_lsn of slot has to be pushed further back.
> >>
> >
> > I don't see a reason to move restart_lsn backward. Why do you think so?
> >
>
> I think what Ashutosh is saying that if confirmed_flush is allowed to
> move backwards, that may result in start_lsn moving backwards too. And
> we need to be able to decode all transactions committed since start_lsn,
> so if start_lsn moves backwards, maybe restart_lsn needs to move
> backwards too. I have no idea if confirmed_flush/start_lsn can move
> backwards enough to require restart_lsn to move, though.
>
> Anyway, these discussions are a good illustration why I think allowing
> these LSNs to move backwards is a problem. It either causes bugs (like
> with breaking replication slots) and/or it makes the reasoning about
> correct behavior much harder.
>

Right, I think one needs to come up with some reproducible scenarios
where these cause any kind of problem or inefficiency. Then, we can
discuss the solutions accordingly. I mean to say that someone has to
put effort into making a bit more solid case for changing this code
because it may not be a good idea to change something just based on
some theory unless it is just adding some assertions.

-- 
With Regards,
Amit Kapila.


Reply via email to