On Wed, Nov 13, 2024 at 5:23 PM Tomas Vondra <to...@vondra.me> wrote: > > On 11/13/24 11:59, Amit Kapila wrote: > > On Tue, Nov 12, 2024 at 12:43 PM Ashutosh Bapat > > <ashutosh.bapat....@gmail.com> wrote: > >> > >> On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada <sawada.m...@gmail.com> > >> wrote: > >>> > >>> On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra <to...@vondra.me> wrote: > >>>> > >>>> > >>>> But neither of those fixes prevents backwards move for confirmed_flush > >>>> LSN, as enforced by asserts in the 0005 patch. I don't know if this > >>>> assert is incorrect or now. It seems natural that once we get a > >>>> confirmation for some LSN, we can't move before that position, but I'm > >>>> not sure about that. Maybe it's too strict. > >>> > >>> Hmm, I'm concerned that it might be another problem. I think there are > >>> some cases where a subscriber sends a flush position older than slot's > >>> confirmed_flush as a feedback message. > >>> > > > > Right, it can happen for cases where subscribers doesn't have to do > > anything (for example DDLs) like I have explained in one of my emails > > [1] > > > > Thanks. I admit not being entirely familiar with all the details, but > doesn't that email explain more "Why it currently happens?" rather than > "Is this what should be happening?" >
Right. > Sure, the subscriber needs to confirm changes for which nothing needs to > be done, like DDL. But isn't there a better way to do that, rather than > allowing confirmed_lsn to go backwards? > Firstly, I agree that we should try to find ways to avoid going confirmed_lsn going backward. We can try to explore solutions both in the publisher and subscriber-side. > >>> But it seems to be dangerous if > >>> we always accept it as a new confirmed_flush value. It could happen > >>> that confirm_flush could be set to a LSN older than restart_lsn. > >>> > > > > Possible, though I haven't tried to reproduce such a case. But, will > > it create any issues? I don't know if there is any benefit in allowing > > to move confirmed_flush LSN backward. AFAIR, we don't allow such > > backward values to persist. They will temporarily be in memory. I > > think as a separate patch we should prevent it from moving backward. > > > >> > >> If confirmed_flush LSN moves backwards, it means the transactions > >> which were thought to be replicated earlier are no longer considered > >> to be replicated. This means that the restart_lsn of the slot needs to > >> be at least far back as the oldest of starting points of those > >> transactions. Thus restart_lsn of slot has to be pushed further back. > >> > > > > I don't see a reason to move restart_lsn backward. Why do you think so? > > > > I think what Ashutosh is saying that if confirmed_flush is allowed to > move backwards, that may result in start_lsn moving backwards too. And > we need to be able to decode all transactions committed since start_lsn, > so if start_lsn moves backwards, maybe restart_lsn needs to move > backwards too. I have no idea if confirmed_flush/start_lsn can move > backwards enough to require restart_lsn to move, though. > > Anyway, these discussions are a good illustration why I think allowing > these LSNs to move backwards is a problem. It either causes bugs (like > with breaking replication slots) and/or it makes the reasoning about > correct behavior much harder. > Right, I think one needs to come up with some reproducible scenarios where these cause any kind of problem or inefficiency. Then, we can discuss the solutions accordingly. I mean to say that someone has to put effort into making a bit more solid case for changing this code because it may not be a good idea to change something just based on some theory unless it is just adding some assertions. -- With Regards, Amit Kapila.