On Fri, Jul 28, 2023 at 5:22 PM Peter Smith <smithpb2...@gmail.com> wrote:
>
> Hi Melih,
>
> BACKGROUND
> ----------
>
> We wanted to compare performance for the 2 different reuse-worker
> designs, when the apply worker is already busy handling other
> replications, and then simultaneously the test table tablesyncs are
> occurring.
>
> To test this scenario, some test scripts were written (described
> below). For comparisons, the scripts were then run using a build of
> HEAD; design #1 (v21); design #2 (0718).
>
> HOW THE TEST WORKS
> ------------------
>
> Overview:
> 1. The apply worker is made to subscribe to a 'busy_tbl'.
> 2. After the SUBSCRIPTION is created, the publisher-side then loops
> (forever) doing INSERTS into that busy_tbl.
> 3. While the apply worker is now busy, the subscriber does an ALTER
> SUBSCRIPTION REFRESH PUBLICATION to subscribe to all the other test
> tables.
> 4. We time how long it takes for all tablesyncs to complete.
> 5. Repeat above for different numbers of empty tables (10, 100, 1000,
> 2000) and different numbers of sync workers (2, 4, 8, 16).
>
> Scripts
> -------
>
> (PSA 4 scripts to implement this logic)
>
> testrun script
> - this does common setup (do_one_test_setup) and then the pub/sub
> scripts (do_one_test_PUB and do_one_test_SUB -- see below) are run in
> parallel
> - repeat 10 times
>
> do_one_test_setup script
> - init and start instances
> - ipc setup tables and procedures
>
> do_one_test_PUB script
> - ipc setup pub/sub
> - table setup
> - publishes the "busy_tbl", but then waits for the subscriber to
> subscribe to only this one
> - alters the publication to include all other tables (so subscriber
> will see these only after the ALTER SUBSCRIPTION REFRESH PUBLICATION)
> - enter a busy INSERT loop until it is informed by the subscriber that
> the test is finished
>
> do_one_test_SUB script
> - ipc setup pub/sub
> - table setup
> - subscribes only to "busy_tbl", then informs the publisher when that
> is done (this will cause the publisher to commence the stay_busy loop)
> - after it knows the publishing busy loop has started it does
>   - ALTER SUBSCRIPTION REFRESH PUBLICATION
>   - wait until all the tablesyncs are ready <=== This is the part that
> is timed for the test RESULT
>
> PROBLEM
> -------
>
> Looking at the output files (e.g. *.dat_PUB and *.dat_SUB) they seem
> to confirm the tests are working how we wanted.
>
> Unfortunately, there is some slot problem for the patched builds (both
> designs #1 and #2). e.g. Search "ERROR" in the *.log files and see
> many slot-related errors.
>
> Please note - running these same scripts with HEAD build gave no such
> errors. So it appears to be a patch problem.
>
Hi,

FYI, here is some more information about the ERRORs seen.

The patches were re-tested -- applied in stages (and also against the
different scripts) to identify where the problem was introduced. Below
are the observations:

~~~

Using original test scripts

1. Using only patch v21-0001
- no errors

2. Using only patch v21-0001+0002
- no errors

3. Using patch v21-0001+0002+0003
- no errors

~~~

Using the "busy loop" test scripts for long transactions

1. Using only patch v21-0001
- no errors

2. Using only patch v21-0001+0002
- gives errors for the "no COPY in progress" issue
e.g. ERROR: could not send data to WAL stream: no COPY in progress

3. Using patch v21-0001+0002+0003
- gives the same "no COPY in progress" errors as above
e.g. ERROR: could not send data to WAL stream: no COPY in progress
- and also gives slot consistency point errors
e.g. ERROR: could not create replication slot "pg_16700_sync_16514_7261998170966054867": ERROR: could not find logical decoding starting point
e.g. LOG: could not drop replication slot "pg_16700_sync_16454_7261998170966054867" on publisher: ERROR: replication slot "pg_16700_sync_16454_7261998170966054867" does not exist

------
Kind Regards,
Peter Smith.
Fujitsu Australia
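
PS. In case it helps anyone repeat the log triage, here is a rough sketch
(not part of the attached test scripts) of how the three kinds of message
quoted above could be counted across the *.log files. The category names
and the functions classify() and count_errors() are made up for
illustration; only the message fragments come from the errors shown in
this mail, and the default "*.log" glob is an assumption about where the
server logs end up.

```python
import glob
import re
import sys
from collections import Counter

# Message fragments copied from the errors quoted in this mail.
# The dictionary keys are hypothetical labels, not PostgreSQL terms.
PATTERNS = {
    "no_copy_in_progress": re.compile(
        r"could not send data to WAL stream: no COPY in progress"),
    "no_decoding_start_point": re.compile(
        r"could not find logical decoding starting point"),
    "slot_does_not_exist": re.compile(
        r'replication slot "[^"]+" does not exist'),
}


def classify(line):
    """Return the label of the first pattern found in line, else None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def count_errors(lines):
    """Count how many lines fall into each known category."""
    counts = Counter()
    for line in lines:
        name = classify(line)
        if name is not None:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    # Optional first argument: a glob for the log files (assumed "*.log").
    log_glob = sys.argv[1] if len(sys.argv) > 1 else "*.log"
    totals = Counter()
    for path in glob.glob(log_glob):
        with open(path, errors="replace") as f:
            totals.update(count_errors(f))
    for name, n in totals.most_common():
        print(f"{name}: {n}")
```

Running it once per build (HEAD, v21-0001, v21-0001+0002, and so on) should
make it easy to see at which patch each category first appears. Lines that
match none of the patterns are ignored, so other ERRORs would still need a
manual look.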