On Tue, Jul 30, 2024 at 1:48 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmh...@gmail.com> writes:
> > On Sun, Jun 30, 2024 at 2:40 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> >> ... However, I added a new open item about how the
> >> 040_pg_createsubscriber.pl test is slow and still unstable.
> >
> > But that said, I see no commits in the commit history which purport to
> > improve performance, so I guess the performance is probably still not
> > what you want, though I am not clear on the details.
>
> My concern is described at [1]:
>
> >> I have a different but possibly-related complaint: why is
> >> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
> >> runs for a bit over 19 seconds, which seems completely out of line
> >> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
> >> other test scripts in this directory take much less).  It looks
> >> like most of the blame falls on this step:
> >>
> >> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
> >>
> >> AFAICS the amount of data being replicated is completely trivial,
> >> so that it doesn't make any sense for this to take so long --- and
> >> if it does, that suggests that this tool will be impossibly slow
> >> for production use.  But I suspect there is a logic flaw causing
> >> this.  Speculating wildly, perhaps that is related to the failure
> >> Alexander spotted?
>
> The followup discussion in that thread made it sound like there's
> some fairly fundamental deficiency in how wait_for_end_recovery()
> detects end-of-recovery.  I'm not too conversant with the details
> though, and it's possible that pg_createsubscriber is just falling
> foul of a pre-existing infelicity.
>
> If the problem can be correctly described as "pg_createsubscriber
> takes 10 seconds or so to detect end-of-stream",
>

The problem can be defined as: "pg_createsubscriber waits for an
additional (new) WAL record to be generated on the primary before it
considers the standby ready to become a subscriber". On busy systems
this shouldn't be a problem, but on idle systems the time to detect
end-of-stream can't be easily predicted.

One of the proposed solutions is for pg_createsubscriber to generate a
dummy WAL record on the publisher/primary by using something like
pg_logical_emit_message(), pg_log_standby_snapshot(), etc. This will
fix the problem (BF failures and slow end-of-stream detection) but
sounds more like a hack. The other ideas that we can consider, as
mentioned in [1], require an API/design change, which is not preferable
at this point. So, the only way seems to be to accept the generation
of dummy WAL records to bring predictability to the tests and to other
usage of the tool.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2Bp%2B7Ag6nqdFRdqowK1EmJ6bG-MtZQ_54dnFBi%3D_OO5RQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
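
To make the idle-primary behaviour described above concrete, here is a
toy simulation. It is only a sketch, not pg_createsubscriber's actual
code: ToyPrimary, ToyStandby and polls_until_ready are hypothetical
names, LSNs are plain integers, and the rule "the standby is ready once
it has replayed a record past the target LSN" is an assumed
simplification of the wait described in the message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyPrimary:
    """Hypothetical idle publisher: its WAL position moves only when asked."""
    lsn: int = 100

    def emit_dummy_record(self) -> int:
        # Stand-in for a call such as
        #   SELECT pg_logical_emit_message(true, 'some_prefix', 'dummy');
        # (the prefix and content strings here are illustrative only).
        self.lsn += 1
        return self.lsn


@dataclass
class ToyStandby:
    """Hypothetical standby that replays whatever the primary has written."""
    replayed_lsn: int = 100

    def replay_from(self, primary: ToyPrimary) -> None:
        self.replayed_lsn = primary.lsn


def polls_until_ready(target_lsn: int, max_polls: int,
                      emit_dummy: bool) -> Optional[int]:
    """Return the poll count at which the standby is ready, or None on timeout."""
    primary, standby = ToyPrimary(), ToyStandby()
    if emit_dummy:
        # The proposed workaround: force one WAL record on the primary.
        primary.emit_dummy_record()
    for poll in range(1, max_polls + 1):
        standby.replay_from(primary)
        # Assumed readiness rule: a record beyond the target has been replayed.
        if standby.replayed_lsn > target_lsn:
            return poll
    return None


if __name__ == "__main__":
    print("idle primary, no dummy record:",
          polls_until_ready(100, max_polls=5, emit_dummy=False))
    print("idle primary, dummy record:  ",
          polls_until_ready(100, max_polls=5, emit_dummy=True))
```

In this model the idle primary never produces the extra record, so the
wait runs until the poll limit; emitting one dummy record ends it on the
first poll. A real system has other sources of WAL (checkpoints,
background activity), which the sketch deliberately leaves out.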