On Thu, Aug 31, 2023 at 7:28 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat > <ashutosh.bapat....@gmail.com> wrote: > > > > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > > > I > > > > > > think we should shut down subscriber, restart publisher and then > > > > > > make this > > > > > > check based on the contents of the replication slot instead of > > > > > > server log. > > > > > > Shutting down subscriber will ensure that the subscriber won't send > > > > > > any new > > > > > > confirmed flush location to the publisher after restart. > > > > > > > > > > > > > > > > But if we shutdown the subscriber before the publisher there is no > > > > > guarantee that the publisher has sent all outstanding logs up to the > > > > > shutdown checkpoint record (i.e., the latest record). Such a guarantee > > > > > can only be there if we do a clean shutdown of the publisher before > > > > > the subscriber. > > > > > > > > So the sequence is shutdown publisher node, shutdown subscriber node, > > > > start publisher node and carry out the checks. > > > > > > > > > > This can probably work but I still prefer the current approach as that > > > will be closer to the ideal values on the disk instead of comparison > > > with a later in-memory value of confirmed_flush LSN. Ideally, if we > > > would have a tool like pg_replslotdata which can read the on-disk > > > state of slots that would be better but missing that, the current one > > > sounds like the next best possibility. Do you see any problem with the > > > current approach of test? > > > > > + qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+), > > > reading WAL from ([A-F0-9]+\/[A-F0-9]+)./ > > > > I don't think the LSN reported in this message is guaranteed to be the > > confirmed_flush LSN of the slot. It's usually confirmed_flush but not > > always. It's the LSN that snapshot builder computes based on factors > > including confirmed_flush. There's a chance that this test will fail > > sometimes because of this behaviour. > > > > I think I am missing something here because as per my understanding, > the LOG referred by the test is generated in CreateDecodingContext() > before which we shouldn't be changing the slot's confirmed_flush LSN. > The LOG [1] refers to the slot's persistent value for confirmed_flush, > so how it could be different from what the test is expecting. > > [1] > errdetail("Streaming transactions committing after %X/%X, reading WAL > from %X/%X.", > LSN_FORMAT_ARGS(slot->data.confirmed_flush),
I was afraid that we may move confirmed_flush while creating the snapshot builder when creating the decoding context. But I don't see any code doing that. So may be we are safe. But if the log message changes, this test would fail - depending upon the log message looks a bit fragile, esp. when we have a way to access the data directly reliably. -- Best Wishes, Ashutosh Bapat