Hi Tomas,
> I'm a bit confused by the changes to TAP tests. Per the patch summary,
> some .pl files get renamed (not sure why), a new one is added, etc.
I added a new TAP test case, enabled the streaming = true option inside the old stream_* ones, and shifted the streaming test numbers by +2 because of the collision between 009_matviews.pl / 009_stream_simple.pl and 010_truncate.pl / 010_stream_subxact.pl. At least in the previous version of the patch they were under the same numbers. Nothing special, but for simplicity, please find my new TAP test attached separately.
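In case it helps to see what these stream_* cases boil down to, here is a minimal sketch of such a test. It is illustrative only: the node, table and subscription names are made up, and the streaming subscription option and the decoding work_mem GUC are spelled the way I understand them to be in the patch, so they may differ from the attached file and from the renamed tests.

use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

# Publisher keeps logical WAL; lower the decoding memory threshold so a
# moderately sized transaction is streamed before commit (GUC name as in
# the patch, it may differ between patch versions).
my $node_publisher = get_new_node('publisher');
$node_publisher->init(allows_streaming => 'logical');
$node_publisher->append_conf('postgresql.conf',
	'logical_decoding_work_mem = 64kB');
$node_publisher->start;

my $node_subscriber = get_new_node('subscriber');
$node_subscriber->init;
$node_subscriber->start;

# The same table on both sides.
$node_publisher->safe_psql('postgres',
	"CREATE TABLE test_tab (a int PRIMARY KEY, b text)");
$node_subscriber->safe_psql('postgres',
	"CREATE TABLE test_tab (a int PRIMARY KEY, b text)");

# Publication plus a subscription with streaming enabled.
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
$node_publisher->safe_psql('postgres',
	"CREATE PUBLICATION tap_pub FOR TABLE test_tab");
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' "
	. "PUBLICATION tap_pub WITH (streaming = on)");

# A transaction large enough to exceed the threshold and be streamed.
$node_publisher->safe_psql('postgres',
	"INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(1, 5000) s(i)");
$node_publisher->wait_for_catchup('tap_sub');

my $result = $node_subscriber->safe_psql('postgres',
	"SELECT count(*) FROM test_tab");
is($result, qq(5000), 'streamed transaction applied on subscriber');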
> So I've instead enabled streaming subscriptions in all tests, which with
> this patch produces two failures:
>
> Test Summary Report
> -------------------
> t/004_sync.pl             (Wstat: 7424 Tests: 1 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: Bad plan. You planned 7 tests but ran 1.
> t/011_stream_ddl.pl       (Wstat: 256 Tests: 2 Failed: 1)
>   Failed test: 2
>   Non-zero exit status: 1
>
> So yeah, there's more stuff to fix. But I can't directly apply your fixes
> because the updated patches are somewhat different.
My fixes should apply cleanly to the previous version of your patch. Also, I am not sure it is a good idea to simply enable streaming subscriptions in all tests (e.g. in t/004_sync.pl, which predates the streaming patch), since then they no longer exercise the non-streaming code path.
>>> Interesting. Any idea where the extra overhead in this particular case
>>> comes from? It's hard to deduce that from the single flame graph, when
>>> I don't have anything to compare it with (i.e. the flame graph for the
>>> "normal" case).
>>
>> I guess that the bottleneck is in disk operations. You can check the
>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>> writes (~26%) take around 35% of CPU time combined. To compare, please
>> see the attached flame graph for the following transaction:
>>
>> INSERT INTO large_text
>>     SELECT (SELECT string_agg('x', ',') FROM generate_series(1, 2000))
>>     FROM generate_series(1, 1000000);
>>
>> Execution Time: 44519.816 ms
>> Time: 98333,642 ms (01:38,334)
>>
>> where disk I/O is only ~7-8% in total. So we get very roughly the same
>> ~x4-5 performance drop here. JFYI, I am using a machine with an SSD for
>> these tests.
>>
>> Therefore, you could probably write changes on the receiver in bigger
>> chunks, not each change separately.
>
> Possibly, I/O is certainly a possible culprit, although we should be
> using buffered I/O and there certainly are not any fsyncs here. So I'm
> not sure why it would be cheaper to do the writes in batches.
>
> BTW does this mean you see the overhead on the apply side? Or are you
> running this on a single machine, and it's difficult to decide?
I run this on a single machine, but the walsender and the apply worker each utilize almost 100% of a CPU core all the time, and on the apply side I/O syscalls take about 1/3 of the CPU time. I am still not sure, but to me this result links the performance drop to problems on the receiver side.
Writing in batches was just a hypothesis; to validate it I performed a test with a large transaction consisting of a smaller number of wide rows. That transaction was streamed too, yet it does not exhibit any significant performance drop, so the hypothesis seems to hold. Anyway, I do not have other reasonable ideas besides that right now.
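For what it's worth, a driver along the following lines could be used to repeat both kinds of transactions against a streaming subscription. It is only a sketch: the node and publication names are made up, and the row count and width in the few_wide workload are placeholders rather than the exact values from my runs.

use strict;
use warnings;
use PostgresNode;
use Time::HiRes qw(time);

my $publisher = get_new_node('perf_publisher');
$publisher->init(allows_streaming => 'logical');
$publisher->start;

my $subscriber = get_new_node('perf_subscriber');
$subscriber->init;
$subscriber->start;

# Same table on both sides, as in the flame graph test above.
$publisher->safe_psql('postgres', "CREATE TABLE large_text (t text)");
$subscriber->safe_psql('postgres', "CREATE TABLE large_text (t text)");

my $connstr = $publisher->connstr . ' dbname=postgres';
$publisher->safe_psql('postgres',
	"CREATE PUBLICATION perf_pub FOR TABLE large_text");
$subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION perf_sub CONNECTION '$connstr' "
	. "PUBLICATION perf_pub WITH (streaming = on)");

# many_narrow: 1M rows of ~4 kB each (the case with the big slowdown).
# few_wide: fewer but much wider rows (placeholder sizes).
my %workloads = (
	many_narrow => "INSERT INTO large_text "
		. "SELECT (SELECT string_agg('x', ',') FROM generate_series(1, 2000)) "
		. "FROM generate_series(1, 1000000)",
	few_wide => "INSERT INTO large_text "
		. "SELECT (SELECT string_agg('x', ',') FROM generate_series(1, 200000)) "
		. "FROM generate_series(1, 10000)",
);

foreach my $name (sort keys %workloads)
{
	$publisher->safe_psql('postgres', "TRUNCATE large_text");
	my $t0 = time();
	$publisher->safe_psql('postgres', $workloads{$name});
	# NB: wait_for_catchup polls with a default timeout that very large
	# transactions can exceed.
	$publisher->wait_for_catchup('perf_sub');
	printf "%s: %.1f s until the subscriber caught up\n",
		$name, time() - $t0;
}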
Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Attachment: 0xx_stream_tough_ddl.pl (Perl program)